Tuesday 12 January 2010

Amazon S3 Provider for Microsoft Sync Framework

This post has moved to http://wishfulcode.com/2010/01/12/amazon-s3-provider-for-microsoft-sync-framework/

The Microsoft Sync Framework 2.0 provides an advanced way to synchronise data between different kinds of stores. The framework consists of classes which perform the synchronisation (Orchestrators) and classes which know about particular kinds of data store (Sync Providers), be that a SQL database or a file system. It handles complex scenarios, including conflict detection and concurrency.

I'm interested at the moment in the best way to keep a single (but redundant) repository of code for a web application, and to have any number of nodes pull everything necessary to run the application from this store, whilst also receiving updates in a manageable way. Obviously, this will eventually require a more complex application to manage the synchronisation process, but the first step is to build a synchronisation provider for Amazon S3 storage.

That, and I wanted to experiment with the new Sync Framework 2.0...

Whilst it's not a completely trivial task to create a new fully functional Sync Framework provider, there are a couple of options that simplify the experience.

Because our repository (S3) cannot detect or notify us of changes, and is in fact architecturally quite simple, we can get away with using the FullEnumerationSimpleSyncProvider as a base class, as opposed to the AnchorEnumerationSimpleSyncProvider. This reduces our task to providing metadata about each object in our store, plus the ability to insert, update and delete objects (only needed if we want two-way synchronisation, which is fairly trivial to add but which I haven't implemented for this provider yet).
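As a sketch, the provider class then simply derives from this base (the class name matches the one referred to later in this post; the metadata field constant values are illustrative and just need to be unique):

```csharp
using Microsoft.Synchronization.SimpleProviders;

// Sketch of the provider skeleton. FullEnumerationSimpleSyncProvider
// re-enumerates the whole store on every sync session and detects
// changes by comparing each item's reported metadata.
public class AWSS3SyncProvider : FullEnumerationSimpleSyncProvider
{
    //ids for the custom metadata fields used in the MetadataSchema
    //(illustrative values)
    public const uint CUSTOM_FIELD_ID = 1;
    public const uint CUSTOM_FIELD_NAME = 2;
    public const uint CUSTOM_FIELD_TIMESTAMP = 3;
    public const uint CUSTOM_FIELD_SIZE = 4;

    //the overrides listed below go here
}
```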

The list of methods we have to implement is:
/// <summary>
/// Retrieves metadata about each item in the store
/// </summary>
/// <param name="context"></param>
public override void EnumerateItems(FullEnumerationContext context)
{}

/// <summary>
/// Where the actual data for objects is retrieved in order to copy them to destinations.
/// Since this provider concerns files, and is intended to be compatible with the FileSyncProvider, we must return an instance of IFileDataRetriever.
/// </summary>
/// <param name="keyAndExpectedVersion">Metadata of the object to retrieve data for</param>
/// <param name="changeUnitsToLoad"></param>
/// <param name="recoverableErrorReportingContext"></param>
/// <returns></returns>
public override object LoadChangeData(ItemFieldDictionary keyAndExpectedVersion, IEnumerable<Microsoft.Synchronization.SyncId> changeUnitsToLoad, RecoverableErrorReportingContext recoverableErrorReportingContext)
{}


/// <summary>
/// Instructs the provider to delete an object from the store
/// </summary>
/// <param name="keyAndExpectedVersion">Metadata associated with item</param>
/// <param name="recoverableErrorReportingContext"></param>
/// <param name="commitKnowledgeAfterThisItem"></param>
public override void DeleteItem(ItemFieldDictionary keyAndExpectedVersion, RecoverableErrorReportingContext recoverableErrorReportingContext, out bool commitKnowledgeAfterThisItem)
{}

/// <summary>
/// Adds a new object to the repository when this provider is the sync destination.
/// Since this provider is concerned with files, the data will contain both the key and the data content for the file.
/// </summary>
/// <param name="itemData">Data for file</param>
/// <param name="changeUnitsToCreate"></param>
/// <param name="recoverableErrorReportingContext"></param>
/// <param name="keyAndUpdatedVersion">Metadata of new Item</param>
/// <param name="commitKnowledgeAfterThisItem"></param>
public override void InsertItem(object itemData, IEnumerable<Microsoft.Synchronization.SyncId> changeUnitsToCreate, RecoverableErrorReportingContext recoverableErrorReportingContext, out ItemFieldDictionary keyAndUpdatedVersion, out bool commitKnowledgeAfterThisItem)
{}

/// <summary>
/// Returns a basic version identifier for this synchronisation provider, so that in the future we can make allowances when synchronising between different versions of this code.
/// </summary>
public override short ProviderVersion
{}

/// <summary>
/// Defines our ID schema for each object
/// </summary>
public override Microsoft.Synchronization.SyncIdFormatGroup IdFormats
{}

/// <summary>
/// Defines our Metadata Schema, and how it arranges ID, Name, Timestamp and Size of each object.
/// </summary>
public override ItemMetadataSchema MetadataSchema
{}

/// <summary>
/// Fetches an initialised MetadataStore. In most cases this is a SqlMetadataStore, stored as a file.
/// </summary>
/// <param name="replicaId"></param>
/// <param name="culture"></param>
/// <returns></returns>
public override MetadataStore GetMetadataStore(out Microsoft.Synchronization.SyncId replicaId, out System.Globalization.CultureInfo culture)
{}

/// <summary>
/// Anything we want to achieve at the beginning of each sync session
/// </summary>
public override void BeginSession()
{}

/// <summary>
/// Anything we want to achieve at the end of each sync session. Usually closes handles on the metadata repository.
/// </summary>
public override void EndSession()
{}
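To give a feel for how these fit together, here is a hedged sketch of EnumerateItems: one ItemFieldDictionary is reported per S3 object, with fields matching the metadata schema described below. GetBucketListing and S3ObjectInfo are hypothetical helpers standing in for the LitS3 listing call, and HashKey stands in for whatever derives a stable ulong ID from the object key.

```csharp
public override void EnumerateItems(FullEnumerationContext context)
{
    List<ItemFieldDictionary> items = new List<ItemFieldDictionary>();

    //GetBucketListing is a hypothetical helper wrapping the LitS3
    //listing call; it yields one entry per object in the bucket
    foreach (S3ObjectInfo obj in GetBucketListing())
    {
        ItemFieldDictionary item = new ItemFieldDictionary();
        item.Add(new ItemField(CUSTOM_FIELD_ID, typeof(ulong), HashKey(obj.Key)));
        item.Add(new ItemField(CUSTOM_FIELD_NAME, typeof(string), obj.Key));
        item.Add(new ItemField(CUSTOM_FIELD_TIMESTAMP, typeof(ulong), (ulong)obj.LastModified.ToFileTimeUtc()));
        item.Add(new ItemField(CUSTOM_FIELD_SIZE, typeof(ulong), (ulong)obj.Size));
        items.Add(item);
    }

    //hand the complete listing to the framework, which compares it
    //against the saved metadata store to detect adds, updates and deletes
    context.ReportItems(items);
}
```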


A couple of tricky parts:

Working out how to handle metadata.
For the metadata, you need to build up a dictionary of key/value fields describing each of your objects. It's fine to be specific to your provider here, since the framework never compares metadata between the two providers; it only uses the metadata to detect changes within the current provider. The IdFormats schema is a little trickier, and can only really be built by looking at other examples!

Metadata Schema
CustomFieldDefinition[] customFields = new CustomFieldDefinition[4];
customFields[0] = new CustomFieldDefinition(CUSTOM_FIELD_ID, typeof(ulong));
customFields[1] = new CustomFieldDefinition(CUSTOM_FIELD_NAME, typeof(string), 1024);
customFields[2] = new CustomFieldDefinition(CUSTOM_FIELD_TIMESTAMP, typeof(ulong));
customFields[3] = new CustomFieldDefinition(CUSTOM_FIELD_SIZE, typeof(ulong));

IdentityRule[] identityRule = new IdentityRule[1];
identityRule[0] = new IdentityRule(new uint[] { CUSTOM_FIELD_ID });

return new ItemMetadataSchema(customFields, identityRule);

Id Format Schema
//set id format
// Set ReplicaIdFormat to use a GUID as an ID, and ItemIdFormat to use a GUID plus
// an 8-byte prefix.
_idFormats = new SyncIdFormatGroup();
_idFormats.ItemIdFormat.IsVariableLength = false;
_idFormats.ItemIdFormat.Length = 24;
_idFormats.ReplicaIdFormat.IsVariableLength = false;
_idFormats.ReplicaIdFormat.Length = 16;
_idFormats.ChangeUnitIdFormat.Length = 4;
_idFormats.ChangeUnitIdFormat.IsVariableLength = false;
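For reference, the 24-byte ItemIdFormat above corresponds to the framework's SyncGlobalId shape: an 8-byte ulong prefix followed by a 16-byte GUID. As a sketch (assuming the SyncGlobalId and SyncId constructors from Microsoft.Synchronization), IDs matching these formats can be minted like this:

```csharp
using Microsoft.Synchronization;

//8-byte prefix + 16-byte GUID = the 24-byte ItemIdFormat declared above;
//the prefix can be used for ordering or grouping, and here is simply zero
SyncId itemId = new SyncId(new SyncGlobalId(0, Guid.NewGuid()));

//the 16-byte ReplicaIdFormat is just a GUID identifying the store itself
SyncId replicaId = new SyncId(Guid.NewGuid());
```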


Consideration for files.
The framework can be used to synchronise many different types of data. I wanted this provider to treat its data as file-based (hierarchical objects, grouped by a key) and to be compatible with the existing File System provider. To accomplish this, the LoadChangeData method needs to return an instance of IFileDataRetriever which, as the name implies, knows how to retrieve the file described by the metadata passed to LoadChangeData. I therefore created the following class, AWSS3FileDataRetriever:
/// <summary>
/// Implements the IFileDataRetriever members necessary to use the Microsoft Sync Framework with a FileSyncProvider and the AWSS3SyncProvider
/// as explained at: http://msdn.microsoft.com/en-us/library/microsoft.synchronization.files.ifiledataretriever%28SQL.105%29.aspx
/// </summary>
public class AWSS3FileDataRetriever : IFileDataRetriever
{
    #region Object Variables
    S3BucketLoginData LoginData { get; set; }
    S3Service Service { get; set; }
    string ObjectKey { get; set; }
    DateTime LastModified { get; set; }
    long ObjectDataSize { get; set; }
    #endregion Object Variables

    /// <summary>
    /// Constructor
    /// </summary>
    /// <param name="loginData">Credentials and bucket details for the S3 service</param>
    /// <param name="objectKey">Key of the object on S3</param>
    /// <param name="lastModified">Last-modified timestamp of the object</param>
    /// <param name="objectDataSize">Size of the object's data in bytes</param>
    public AWSS3FileDataRetriever(S3BucketLoginData loginData, string objectKey, DateTime lastModified, long objectDataSize)
    {
        //set the object details
        LoginData = loginData;
        ObjectKey = objectKey;
        LastModified = lastModified;
        ObjectDataSize = objectDataSize;

        //initialise the S3 service from the login details
        Service = LoginData.GetService("~");
    }

    #region IFileDataRetriever Members

    /// <summary>
    /// Not implemented. Use the FileStream property to gain access to the object's data.
    /// </summary>
    public string AbsoluteSourceFilePath
    {
        get
        {
            //as explained at: http://social.microsoft.com/Forums/en-US/synctechnicaldiscussion/thread/1b37437c-7644-4723-87a6-835f3044b9c2
            //the FileSyncProvider will fall back to the FileStream (System.IO.Stream) property if this throws a NotImplementedException
            throw new NotImplementedException();
        }
    }

    /// <summary>
    /// Returns metadata associated with the file as it is represented on Amazon S3
    /// </summary>
    public FileData FileData
    {
        get
        {
            return new FileData(
                ObjectKey.Replace('/', '\\'),
                FileAttributes.Normal,
                LastModified,
                LastModified,
                LastModified,
                ObjectDataSize
            );
        }
    }

    /// <summary>
    /// Gets a stream of data from the object on Amazon S3. It is the consumer's responsibility to close the stream.
    /// </summary>
    public System.IO.Stream FileStream
    {
        get
        {
            //buffer the object's data into memory so the returned stream is seekable
            MemoryStream ms = new MemoryStream();
            using (var str = Service.GetObjectStream(LoginData.BucketName, ObjectKey))
            {
                S3Helper.CopyStream(str, ms, this.ObjectDataSize, null);
            }
            ms.Position = 0;
            return ms;
        }
    }

    /// <summary>
    /// Gets the relative directory path of the object, without the filename
    /// </summary>
    public string RelativeDirectoryPath
    {
        get
        {
            return Path.GetDirectoryName(ObjectKey.Replace('/', '\\'));
        }
    }

    #endregion
}


That's all that is required to get our provider working with the Microsoft Sync Framework. Full source code is available on CodePlex, where enhancements may also be contributed. The solution includes a console application which uses this provider to synchronise an entire S3 bucket to a local location. Please note that the project uses the excellent LitS3 project to handle communication with Amazon's S3 web services.
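For completeness, the core of that console application can be sketched roughly as follows (the AWSS3SyncProvider constructor arguments and the local path are illustrative):

```csharp
using Microsoft.Synchronization;
using Microsoft.Synchronization.Files;

//sketch: pull everything in the bucket down to a local folder
AWSS3SyncProvider s3Provider = new AWSS3SyncProvider(/* bucket login details */);
FileSyncProvider localProvider = new FileSyncProvider(@"C:\SyncedBucket");

SyncOrchestrator orchestrator = new SyncOrchestrator();
orchestrator.RemoteProvider = s3Provider;
orchestrator.LocalProvider = localProvider;
orchestrator.Direction = SyncDirectionOrder.Download; //one-way: S3 -> disk
orchestrator.Synchronize();
```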