Recently, one of our clients asked us to migrate a few million documents, together with their metadata, into Alfresco. We assessed different options and tools for doing this efficiently. The first challenge was document assembly and conversion, but I will not go into the details of that step in this blog entry. For those who are interested, I will just mention that Ephesoft was our tool of choice for this purpose.
The second challenge was uploading this large volume of content, with its metadata, into Alfresco with little hassle and within a reasonable time.
The Bulk Import tool
The tool we chose to import the content and data into Alfresco efficiently was the existing Bulk Import tool for Alfresco, developed by Peter Monks and available here: https://github.com/pmonks/alfresco-bulk-import.
In essence, the tool recursively walks a folder structure on disk and creates the corresponding file and folder structure in Alfresco. For each node to be created (folder or file), the Bulk Import tool expects an XML file containing the metadata for that node. Folders are imported even without such a file, but every content file needs a metadata XML file at the same location. The metadata XML can contain the type of the content, its aspects and any standard or custom metadata you need to add to the file.
The challenge at this step was that the files we needed to import sat in two different main locations on Windows drives at the client's site, and their folder structure on disk was completely different from the folder structure we needed to create in Alfresco. Recreating the complete Alfresco structure on disk first would have cost too many resources.
The beauty of open source comes into play here. We needed the Bulk Import tool to behave differently from the standard version. No problem: with the code available, we could customize its behavior to serve our needs.
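To make this concrete, here is a minimal sketch of what such a metadata file can look like and how it can be read. It assumes the files follow the Java XML properties format, which is our understanding of the tool's metadata files; the class name and the sample values (type, aspects, title) are purely illustrative and not taken from the actual migration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Illustrative only: the class name and sample values are assumptions.
class MetadataFileSketch {

    // A sample metadata file, assuming the Java XML properties format.
    static final String SAMPLE_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
      + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
      + "<properties>\n"
      + "  <entry key=\"type\">cm:content</entry>\n"
      + "  <entry key=\"aspects\">cm:titled,cm:author</entry>\n"
      + "  <entry key=\"cm:title\">Invoice 2013-0042</entry>\n"
      + "</properties>\n";

    // Parses the XML text into a Properties object (key/value pairs).
    static Properties load(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE_XML);
        System.out.println("type     = " + props.getProperty("type"));
        System.out.println("aspects  = " + props.getProperty("aspects"));
        System.out.println("cm:title = " + props.getProperty("cm:title"));
    }
}
```

Each entry is a plain key/value pair, so any extra parameter a customized importer needs can be carried in the same file.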
The solution
We therefore opted to customize the Bulk Import tool by making it read two additional parameters from the metadata XML files. These parameters specify the full path on disk where the file can be found (“sourcePath”) and the full path in Alfresco where the file should be placed (“rootNodeRef”).
This way, we do not need to recreate the Alfresco folder structure locally, an operation that would be too costly for millions of documents whatever method is used. We also do not need to touch the original file locations at the client's site.
For the source path, we changed the constructor of the FilesystemBulkImportItemVersion class so that, when no content file is passed in, the file location is taken from the “sourcePath” value in the metadata:
    public FilesystemBulkImportItemVersion(final ServiceRegistry serviceRegistry,
                                           final ContentStore    configuredContentStore,
                                           final MetadataLoader  metadataLoader,
                                           final BigDecimal      versionNumber,
                                           final File            contentFile,
                                           final File            metadataFile)
    {
        super(calculateType(metadataLoader,
                            contentFile,
                            metadataFile,
                            ContentModel.TYPE_FOLDER.toPrefixString(serviceRegistry.getNamespaceService()),
                            ContentModel.TYPE_CONTENT.toPrefixString(serviceRegistry.getNamespaceService())),
              versionNumber);

        this.mimeTypeService        = serviceRegistry.getMimetypeService();
        this.configuredContentStore = configuredContentStore;
        this.metadataLoader         = metadataLoader;
        this.contentReference       = contentFile;
        this.metadataReference      = metadataFile;

        // cntz : set the correct content file reference based on metadata
        if (contentFile == null)
        {
            Map metadataMap = getMetadata();
            if (metadataMap.containsKey("sourcePath"))
            {
                String path = (String)metadataMap.get("sourcePath");
                this.contentReference = new File(path);
            }
        }

        // "stat" the content file then cache the results
        this.isDirectory = serviceRegistry.getDictionaryService().isSubClass(
            createQName(serviceRegistry, getType()), ContentModel.TYPE_FOLDER);

        if (this.contentReference == null || this.contentReference.isDirectory())
        {
            cachedSizeInBytes = 0L;
        }
        else
        {
            cachedSizeInBytes = this.contentReference.length();
        }
    }
The standard behavior remains available: when contentFile is not null, the new logic is not applied.
For the value of the target path in Alfresco, we needed two modifications:
- A change in the importItem method in the BatchImporterImpl class, to use the target path from the metadata xml instead of the default path for the upload, if the “rootNodeRef” parameter is present in the file:
    // cntz : use the destination path from metadata to upload the content (instead of the default path)
    BulkImportItemVersion version = item.getVersions().first();
    Map metadataMap = version.getMetadata();
    NodeRef newTarget = target;

    if (metadataMap.containsKey("rootNodeRef"))
    {
        String alfPath = (String)metadataMap.get("rootNodeRef");
        try
        {
            newTarget = convertPathToNodeRef(serviceRegistry, alfPath.trim());
        }
        catch (FileNotFoundException e)
        {
            // Create the missing path
            newTarget = createPathInAlfresco(alfPath.trim(), dryRun);
        }
    }

    // cntz : replace target with newTarget, unless unable to create valid newTarget
    // (in which case the default location is still used)
    NodeRef nodeRef = null;
    if (newTarget == null)
    {
        nodeRef = findOrCreateNode(target, item, replaceExisting, dryRun);
    }
    else
    {
        nodeRef = findOrCreateNode(newTarget, item, replaceExisting, dryRun);
    }
- A new method in the same class (BatchImporterImpl) to create the path in Alfresco as specified in the “rootNodeRef” parameter, needed for the cases where the path does not already exist:
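    private final NodeRef createPathInAlfresco(String path, final boolean dryRun)
    {
        NodeRef  result        = null;
        String[] pathFragments = path.split("/");
        String   tempPath      = "";

        for (String fragment : pathFragments)
        {
            // Skip the empty fragments produced by leading or doubled slashes
            if (fragment.length() == 0)
            {
                continue;
            }

            tempPath += "/" + fragment;
            try
            {
                // The folder already exists: remember it as the parent for the next fragment
                result = convertPathToNodeRef(serviceRegistry, tempPath);
            }
            catch (FileNotFoundException e)
            {
                // The folder is missing: create it under the parent found in the previous
                // iteration (result still holds that parent at this point).
                // Note: dryRun is not consulted here.
                result = serviceRegistry.getFileFolderService().create(
                    result, fragment, ContentModel.TYPE_FOLDER).getNodeRef();
            }
        }

        return(result);
    }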
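The path-walking idea behind this method can be tried out without a running repository. The sketch below is not the Alfresco code: it replaces the repository with a hypothetical in-memory Folder class and only illustrates the technique of resolving a path fragment by fragment and creating whatever is missing. Starting from a non-null root also means a folder is never created under a null parent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in for the repository: a folder is just a named map of child folders.
class PathCreationSketch {

    static final class Folder {
        final String name;
        final Map<String, Folder> children = new LinkedHashMap<>();

        Folder(String name) {
            this.name = name;
        }
    }

    // Walks the path one fragment at a time, creating any folder that is missing,
    // and returns the deepest folder of the path.
    static Folder findOrCreatePath(Folder root, String path) {
        Folder current = root;
        for (String fragment : path.split("/")) {
            if (fragment.isEmpty()) {
                continue; // skip leading or doubled slashes
            }
            // Reuse the child if it exists, otherwise create it under the current folder.
            current = current.children.computeIfAbsent(fragment, Folder::new);
        }
        return current;
    }

    public static void main(String[] args) {
        Folder root = new Folder("");
        Folder leaf = findOrCreatePath(root, "/Company Home/Archive/2013");
        System.out.println("Deepest folder: " + leaf.name);

        // Asking for the same path again reuses the existing folders.
        Folder again = findOrCreatePath(root, "/Company Home/Archive/2013");
        System.out.println("Same folder reused: " + (leaf == again));
    }
}
```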
In addition to the Bulk Import tool changes, we needed to ensure smooth creation of the metadata files. For this initial step, we developed a simple custom Java tool that reads the XML exports from the old system and generates the metadata XML files for the nodes to be created in Alfresco, including a source path and a target path that the customized Bulk Import tool knows how to use.
Performance-wise, we were very happy with the choice of this multi-threaded bulk import tool. With two CPU cores and 8 GB of RAM, the tool can upload around 25,000 documents per hour (with an average document size of 500 kB).
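As a rough illustration of the generation step, here is a minimal sketch, not our actual generator. It assumes the metadata files use the Java XML properties format and that the values have already been extracted from the old system's export; the class name, method name and sample values are hypothetical. Only the “sourcePath” and “rootNodeRef” keys come from the customization described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: builds the text of one metadata file for one document.
class MetadataGeneratorSketch {

    static String toMetadataXml(String sourcePath, String targetPath, String title) {
        Properties props = new Properties();
        props.setProperty("type", "cm:content");
        props.setProperty("cm:title", title);
        // The two extra parameters read by the customized importer:
        props.setProperty("sourcePath", sourcePath);   // where the file lives on disk
        props.setProperty("rootNodeRef", targetPath);  // where it should go in Alfresco

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            props.storeToXML(out, null, "UTF-8");
            return out.toString("UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = toMetadataXml(
            "D:\\export\\scan_0042.pdf",
            "/Company Home/Archive/2013",
            "Invoice 2013-0042");
        System.out.println(xml);
    }
}
```

In a real generator the resulting text would be written to a metadata file per document rather than printed.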
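To put these numbers in perspective, a quick back-of-the-envelope calculation helps. The sketch below assumes the throughput stays constant over the whole run, which is a simplification, and the one-million-document batch is just an example figure.

```java
// Illustrative estimate based on the figures quoted above.
class ImportEstimateSketch {

    // Hours needed to import the given number of documents at a constant rate.
    static double hoursFor(long documents, long documentsPerHour) {
        return (double) documents / documentsPerHour;
    }

    // Sustained data rate in MB/s (decimal units: 1 MB = 1000 kB).
    static double megabytesPerSecond(long documentsPerHour, long averageKilobytes) {
        return documentsPerHour * averageKilobytes / 1000.0 / 3600.0;
    }

    public static void main(String[] args) {
        long documentsPerHour = 25_000;
        long averageKilobytes = 500;

        System.out.println("Hours per million documents: "
            + hoursFor(1_000_000, documentsPerHour));
        System.out.println("Sustained rate in MB/s: "
            + megabytesPerSecond(documentsPerHour, averageKilobytes));
    }
}
```

At this rate, one million documents take about 40 hours, at a sustained rate of roughly 3.5 MB per second.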
Other customizations
Besides the above changes, we found a couple of other, more generic modifications that we needed to make to the standard Bulk Import tool to suit our needs and those of the client:
- If a file could not be found on disk, the whole import process was aborted because a RuntimeException was thrown. This was far from ideal: with such large amounts of content to import, you want the processes to run for hours without having to check on them regularly. We therefore changed the exception handling so that, when something goes wrong with an item, the exception is no longer rethrown: the node is created without content, the error is logged and the process continues. The change in the BatchImporterImpl class:
    catch (final Exception e)
    {
        // Capture the item that failed, along with the exception
        // throw new ItemImportException(item, e);

        // cntz: Log the issue and continue
        System.out.println("Unexpected exception:\n " +
                           (e == null ? "" : String.valueOf(e.getClass()) + ": " + e.getMessage()) +
                           "\nWhile importing item: " + String.valueOf(item));
        e.printStackTrace();
    }
- We noticed that the cm:created and cm:name metadata values from the XML were not set correctly in Alfresco. Besides leaving the wrong information in the system, the incorrect names also caused issues when running the import with the “replace” option. We added checks in the code for these two parameters to ensure they are set to the values specified in the metadata XML.
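The log-and-continue approach from the first point can be shown in isolation. The following is a generic sketch, not the Bulk Import tool's code: the class and method names are hypothetical, and it uses java.util.logging instead of System.out. It also collects the failed items so they can be reviewed or retried after a long unattended run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative only: process every item, never let one failure stop the batch.
class ContinueOnErrorSketch {

    private static final Logger LOG = Logger.getLogger(ContinueOnErrorSketch.class.getName());

    // Runs the importer on every item. A failing item is logged and recorded,
    // and the loop moves on to the next one. Returns the items that failed.
    static <T> List<T> importAll(List<T> items, Consumer<T> importer) {
        List<T> failed = new ArrayList<>();
        for (T item : items) {
            try {
                importer.accept(item);
            } catch (RuntimeException e) {
                LOG.log(Level.WARNING, "Failed to import item: " + item, e);
                failed.add(item);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a.pdf", "missing.pdf", "b.pdf");
        List<String> failed = importAll(items, name -> {
            if (name.startsWith("missing")) {
                throw new IllegalStateException("File not found: " + name);
            }
            System.out.println("Imported " + name);
        });
        System.out.println("Failed items: " + failed);
    }
}
```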
No universal solution for migrations
In our experience, each migration is different. The basic process is of course essentially the same, but tailoring the tools to the given conditions and limitations will always be necessary. In some situations, you need to extract data directly from the database, while in others you need to build it from extracts in different formats (e.g. XML) from completely different systems.
For one client, we used Pentaho in combination with custom webscripts; for another, we used Pentaho in combination with the standard Bulk Import tool; and this time it was the customized Bulk Import tool in combination with a custom Java metadata file generator that did the job.
We will be happy to read about your experience or thoughts on content migrations around Alfresco or work together to find the smoothest and most efficient process for your unique scenario. Contact us!