Thursday, May 05, 2005

Version Tolerant Serialization

Sorry, I haven't posted anything to my blog in a while (not that anyone was holding their breath... well, except for maybe Jason :-) )...

I've done a thing or two with .NET Serialization and Remoting in the past, and presently I'm doing some work where we're considering all of the transport options that come with .NET: Remoting, Web Services, Enterprise Services, and the soon-to-be new kid on the block, Indigo. Recently, I watched an MSDN TV show with Matt Tavis on the new Remoting features coming up in .NET 2.0. I think it's great that they're expanding the functionality of Remoting (especially since many were claiming that it was dead back when Indigo was first announced). The reality is that they're adding some really cool features. At a high level they are:
  • A new channel called the IPC Channel which is optimized for same-machine inter-process Remoting. It uses named pipes instead of TCP/IP for communication.
  • An option to secure the TCP Channel using SSPI-based security that is offered in the new System.Net.Security namespace.
  • Socket cache control
  • Generics support
  • IPv6 support
  • Version tolerant serialization
That last feature is the one I'd like to comment on. Technically this new feature isn't specific to Remoting, but is a part of .NET Serialization, which is a broader technology used by Remoting.

VTS and Remoting
In .NET 1.x, if you serialized an object of a given type and then deserialized it using a newer version of that type, and the new type had additional (or removed) fields, you would get serialization errors. The only workaround was to implement the ISerializable interface. However, this approach was crude, as you were responsible for reading and writing directly to the serialization stream, and things got worse when you wanted to serialize an object hierarchy. Microsoft's answer to this issue is Version Tolerant Serialization. VTS essentially adds 5 new attributes to the System.Runtime.Serialization namespace: OptionalField lets you mark fields as optional (it would be used on the new fields of the new type), while OnSerializing, OnSerialized, OnDeserializing, and OnDeserialized let you mark methods in your class for handling specific events in the serialization/deserialization process. In a Remoting scenario, what this lets you do is update a type on the server side to a new version and let the client-side versions remain the same, without throwing serialization errors.
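To make the attributes a little more concrete, here is a minimal sketch of what a "version 2" type might look like. The Customer type, its fields, and the "unknown" default are all made up for illustration; only the attributes themselves come from System.Runtime.Serialization.

```csharp
using System;
using System.Runtime.Serialization;

// Hypothetical "version 2" of a serializable type. Version 1 is assumed
// to have contained only the name field.
[Serializable]
public class Customer
{
    private string name;

    // New in version 2. [OptionalField] tells the formatter not to throw
    // when a version 1 stream has no value for this field.
    [OptionalField(VersionAdded = 2)]
    private string email;

    public Customer(string name, string email)
    {
        this.name = name;
        this.email = email;
    }

    public string Name { get { return name; } }
    public string Email { get { return email; } }

    // Runs before the fields are populated from the stream, so it is a
    // convenient place to give an optional field a default. If the stream
    // does carry a value for the field, that value replaces the default.
    [OnDeserializing]
    private void SetDefaults(StreamingContext context)
    {
        email = "unknown";
    }
}
```

Keep in mind that constructors and field initializers do not run during deserialization, which is why the default is set in a callback rather than in the constructor.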

Now, in and of itself, this is pretty cool. But I think it's worth taking a moment to offer up a word of caution. There's a reason why the original serialization mechanism was version intolerant (and it wasn't just because the Microsoft developers were too lazy to add tolerance support). If the object you're serializing across the wire is nothing more than data, then this new feature is fairly safe. But if you're actually remoting objects that are more object-oriented (both data and logic together), then caution needs to be exercised. While VTS may prevent serialization (i.e. schema) errors from occurring, it won't stop differences in logic between the two type versions from totally hosing up your application. In other words, when you code up version 2 of the serializable object, you'd better be thinking about version 1, version 1.5, and any other version between what the client was compiled against and what you've got sitting in your IDE. The question to ask when considering this new feature is this: is it more work to force clients to always get the latest binaries, or to add backward-compatibility support to my new type? I think in many cases it's more complex to do the latter, especially when you consider the many tools we have today that allow client applications to automatically update their binaries.

So, in a Remoting scenario where you allow the client and server to version independently, I think one needs to exercise caution (and do a lot of good unit testing) before hailing this feature as the answer to all our versioning problems. Throwing a new version of a type onto a server with minimal effort to support backward compatibility will lead to a lot of very hard-to-debug application errors and may even lead to bad data getting persisted to the database. Rocky Lhotka talks more about this problem in a more generic sense in this article on TheServerSide.net.

However...

VTS in General
I think there is a scenario where this new feature shines without the aforementioned caveat. I was on a project once where we used serialization to persist objects to disk instead of persisting them to a traditional RDBMS. The application was a desktop app similar to Microsoft Word where the user worked with "documents" and could save and load them as files. It seemed like a natural fit to have the document be some kind of object model that was persisted to and from disk via serialization. Everything worked fine until we started thinking about version 2 of our app. While there were ways to relax the versioning restrictions in .NET 1.x serialization when the schema didn't change, the moment we added new fields to our types, the fated serialization errors occurred.

But this scenario was different from Remoting in that we only had to worry about the difference in type data (schema), not logic. This is because we were not interoperating with the older type. We just had to figure out a way to transform data serialized by an old type so that it would fit the schema of a new type. In the end, we wrote an elaborate upgrade system that would parse the serialized data itself (which in our case was SOAP XML) and transform it to a newer version, adding fields as necessary. This was not an ideal solution. First, it was very complex. Second, it essentially broke the rule of encapsulation, since the upgrader tool had to have intimate knowledge of the fields of the type it was upgrading. It would have been better to place the responsibility of managing new fields of a type within the type itself. I believe the new VTS features of .NET 2.0 allow us to do just that, and with minimal effort.
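As a rough illustration of what "within the type itself" could look like, here is a sketch of a document type that upgrades data saved by an older version once deserialization has finished. This is not the code from that project: the Document type, its fields, and the upgrade rules are all hypothetical.

```csharp
using System;
using System.Runtime.Serialization;

// Hypothetical document model; the field names are illustrative only.
[Serializable]
public class Document
{
    // Present since the hypothetical version 1.
    private string title;
    private string body;

    // Added in the hypothetical version 2. Files saved by version 1
    // have no data for these, so they are marked optional.
    [OptionalField(VersionAdded = 2)]
    private string author;

    [OptionalField(VersionAdded = 2)]
    private int wordCount;

    public Document(string title, string body, string author)
    {
        this.title = title;
        this.body = body;
        this.author = author;
        this.wordCount = CountWords(body);
    }

    public string Title { get { return title; } }
    public string Author { get { return author; } }
    public int WordCount { get { return wordCount; } }

    // Runs after all fields have been read from the stream. Optional
    // fields that were missing are still null or zero at this point,
    // so the type can fill them in from the data it does have.
    [OnDeserialized]
    private void UpgradeFromOlderVersion(StreamingContext context)
    {
        if (author == null)
        {
            author = "Unknown";
        }
        if (wordCount == 0)
        {
            wordCount = CountWords(body);
        }
    }

    private static int CountWords(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }
        char[] separators = new char[] { ' ', '\t', '\r', '\n' };
        return text.Split(separators, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}
```

The point of the sketch is that the upgrade knowledge sits next to the fields it concerns, so no external tool needs to know the type's internals.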