Thursday, April 4, 2019

Adobe AEM Content Syncing

Background

Over the last couple of months I've been working on a migration project from AEM 6.2 to 6.4, and one thing we struggled with was maintaining content parity between the existing environments and the newly created ones. The trouble is that the existing tools (vlt rcp, Grabbit, content packages) don't work out so well here; each has limitations that prevented us from relying on any one of them. The main issue was that when someone asked us to mirror the current production environment, it could take on the order of two days. It was a real time-sink, and by the time the sync finished, the production content had already moved on and no longer matched.

At first we looked at vlt rcp, which proved pretty error prone and quite slow (on the order of days for a single full sync of content and images). Delta syncs had other issues too, such as requiring us to drop node ordering in some cases, which was a problem for us: we needed the content to keep the same order within the jcr:content/par nodes.

Then we looked at TWC's Grabbit tool, which seemed great in theory. It was very fast at syncing plain content, since it does RCP-style syncs that serialize the data into compact chunks using Google Protobuf. It was still slow for binary content, which can't really be compressed further. In practice, though, it had a lot of issues keeping content synced correctly. It would skip nodes when using the delta option, mainly because of how the delta sync works: it simply checks the timestamp of the last run and only copies nodes newer than that. So if a job failed to finish, the next run would never copy the content the failed run hadn't gotten to. Full syncs had problems too: the DAM still took a substantial amount of time (12-14 hours, with a DAM of roughly 80-90 GB).

Eureka Moment

Thinking about this problem, I realized that perhaps there is a better way. We already have the ACS Commons Query Packager tool, and AEM provides an HTTP-based tool OOTB: the querydebug/querybuilder interface.
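
To give a flavor of what that looks like, here's a minimal sketch of pulling a list of page paths from an instance through the OOTB QueryBuilder JSON endpoint. The host, credentials, and content root are placeholders, not the actual values we used:

    #!/usr/bin/env python3
    """List page paths under a content root via AEM's /bin/querybuilder.json endpoint."""
    import requests

    AEM_HOST = "http://localhost:4502"   # assumed author host
    AUTH = ("admin", "admin")            # assumed credentials

    def list_page_paths(root="/content/mysite"):
        """Return the jcr:path of every cq:Page below `root`."""
        params = {
            "path": root,
            "type": "cq:Page",
            "p.limit": "-1",          # no limit: return every hit
            "p.hits": "selective",    # only return the properties we ask for
            "p.properties": "jcr:path",
        }
        resp = requests.get(f"{AEM_HOST}/bin/querybuilder.json", params=params, auth=AUTH)
        resp.raise_for_status()
        return [hit["jcr:path"] for hit in resp.json().get("hits", [])]

    if __name__ == "__main__":
        for path in list_page_paths():
            print(path)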

What we ended up doing was an initial sync via a modified Grabbit (we customized it to keep running on errors; otherwise it would stop in the middle of a sync and we had to restart again and again). Then, for subsequent changes, we simply used the ACS Commons Query Packager tool in conjunction with custom Python scripts built on the QueryBuilder API.

This allowed us to figure out exactly what is different between environments, copy over just the missing content in the form of packages, and replicate only those paths without needing to do any tree activations. Each of these phases was made into a separate, modular script, and the scripts can all be chained together.
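
The diff step itself is essentially a set difference between the two path lists. A rough sketch of the idea, using the same kind of QueryBuilder call against two hosts (again, hosts and credentials are placeholders, and the real createContentDiffList.py has more options than this):

    #!/usr/bin/env python3
    """Sketch of the diff: which page paths exist on the source author but not the target."""
    import requests

    AUTH = ("admin", "admin")   # assumed credentials

    def page_paths(host, root="/content/mysite"):
        """Fetch the set of cq:Page paths below `root` from one instance."""
        params = {
            "path": root,
            "type": "cq:Page",
            "p.limit": "-1",
            "p.hits": "selective",
            "p.properties": "jcr:path",
        }
        resp = requests.get(f"{host}/bin/querybuilder.json", params=params, auth=AUTH)
        resp.raise_for_status()
        return {hit["jcr:path"] for hit in resp.json()["hits"]}

    # Hypothetical source and target authors.
    missing = sorted(page_paths("http://source-author:4502")
                     - page_paths("http://target-author:4502"))

    # These paths become the filter for the package built in the next step.
    for path in missing:
        print(path)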

One tool we didn't get a chance to explore is oak-upgrade (crx2oak), which can also be used to copy content between environments.

Scenarios where using the sync scripts could be useful

• Automated syncs from prod to lower environments on a scheduled basis
• Migration from one AEM author to another and making sure the content is kept in sync until launch time
• Finding any orphaned pages in publishers and removing them

How these scripts can be chained together

Sync between Source Author and Target Author

Find the differences between what exists on the source author and the target author, and generate a package from those differences for any particular path. The package can then be installed via curl, and we can replicate just the paths that were added instead of doing a tree activation, which is a costly operation.

  1. createContentDiffList.py
  2. createPackageFromPaths.py
  3. toggleWorkflows.py - disable the workflows
  4. toggleComponents.py - disable any pre-processor component
  5. curl to upload to the author server (maybe this should be folded into another script or made into a separate one; see the sketch after this list)
  6. replicatePaths.py or unzipAndReplicateQueryPackagePaths.py
  7. toggleComponents.py - enable any pre-processor component
  8. toggleWorkflows.py - enable the workflows
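
For reference, this is roughly what the curl upload/install in step 5 and the replication in step 6 boil down to, sketched here in Python against AEM's stock Package Manager and replication endpoints. The host, credentials, package name, and paths are placeholders, and the real scripts do more error handling than this:

    #!/usr/bin/env python3
    """Upload and install a content package, then replicate individual paths."""
    import requests

    AEM_HOST = "http://target-author:4502"   # assumed target author
    AUTH = ("admin", "admin")                # assumed credentials

    def upload_and_install(zip_name, group="my_packages"):
        """Push a package zip via the Package Manager HTTP API and install it."""
        with open(zip_name, "rb") as f:
            resp = requests.post(
                f"{AEM_HOST}/crx/packmgr/service.jsp",
                auth=AUTH,
                data={"cmd": "upload", "force": "true"},
                files={"package": f},
            )
        resp.raise_for_status()
        resp = requests.post(
            f"{AEM_HOST}/crx/packmgr/service/.json/etc/packages/{group}/{zip_name}",
            auth=AUTH,
            data={"cmd": "install"},
        )
        resp.raise_for_status()

    def replicate(path, deactivate=False):
        """Activate (or deactivate) a single path instead of a whole tree."""
        resp = requests.post(
            f"{AEM_HOST}/bin/replicate.json",
            auth=AUTH,
            data={"path": path, "cmd": "deactivate" if deactivate else "activate"},
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        upload_and_install("content-diff.zip")        # placeholder package name
        for p in ["/content/mysite/en/new-page"]:     # paths from the diff step
            replicate(p)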

Find and publish any pages on author that were published but are not on publisher

Sometimes there are pages on author that are marked as published but are missing from the publisher. They may have been skipped somehow, but since they are marked published on author, they should exist on the publishers.
  1. createContentDiffList.py using the --source_published True flag (see the query sketch after this list)
  2. toggleWorkflows.py - disable the workflows
  3. toggleComponents.py - disable any pre-processor component
  4. replicatePaths.py
  5. toggleComponents.py - enable any pre-processor component
  6. toggleWorkflows.py - enable the workflows
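
I'm assuming here that "published on author" can be detected through the page's cq:lastReplicationAction property; whether or not the --source_published flag works exactly that way, the underlying query would look something like this:

    import requests

    AUTH = ("admin", "admin")   # assumed credentials

    # Hypothetical QueryBuilder query for "pages marked as published on author":
    # filter on the page content's cq:lastReplicationAction property.
    params = {
        "path": "/content/mysite",
        "type": "cq:Page",
        "property": "jcr:content/cq:lastReplicationAction",
        "property.value": "Activate",
        "p.limit": "-1",
        "p.hits": "selective",
        "p.properties": "jcr:path",
    }
    resp = requests.get("http://source-author:4502/bin/querybuilder.json",
                        params=params, auth=AUTH)
    resp.raise_for_status()
    published_on_author = {hit["jcr:path"] for hit in resp.json()["hits"]}
    # Anything in this set that is missing from the publisher gets replicated in step 4.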

Find and unpublish any pages that are marked unpublished on author but still exist on the publisher (orphaned pages)

  1. createContentDiffList.py using the --source_unpublished True flag (a sketch of the publisher-side check follows this list)
  2. toggleWorkflows.py - disable the workflows
  3. toggleComponents.py - disable any pre-processor component
  4. replicatePaths.py - use the --deactivate flag to unpublish instead of publishing
  5. toggleComponents.py - enable any pre-processor component
  6. toggleWorkflows.py - enable the workflows
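
To round this out, checking whether a given page actually exists on a publisher can be as simple as probing its Sling JSON rendering. A minimal sketch, with the publisher host, credentials, and paths as placeholders:

    import requests

    PUBLISHER = "http://publisher:4503"   # assumed publisher host
    AUTH = ("admin", "admin")             # assumed credentials

    def exists_on_publisher(path):
        """Probe the node's Sling JSON rendering: 200 means it exists on the publisher."""
        return requests.get(f"{PUBLISHER}{path}.json", auth=AUTH).status_code == 200

    # Pages marked unpublished on author (from the --source_unpublished diff) that still
    # resolve on the publisher are the orphans to feed into replicatePaths.py --deactivate.
    candidates = ["/content/mysite/en/old-page"]   # placeholder paths
    orphans = [p for p in candidates if exists_on_publisher(p)]
    print(orphans)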