Sunday, April 01, 2007

Converting the primary key type in a SQL Server database from int to GUID

I maintain and develop a database for a web application that I've worked on for several years now. The application and the database have evolved significantly, and as things got larger and more complex, I've had to employ more advanced techniques and tools to perform database updates from development to production.

One of the "mistakes" I made when I originally designed the database was to use primary keys with the int data type (along with setting the primary key column as an identity column) for all of the tables. I have nothing against int's or identity columns and, in fact, that simple, tried-and-true configuration has worked well for this application... except for one, unfortunately very painful side-effect. When it comes time for me to push a major update from development to production, sorting out duplicate columns between both development and production database can be a real pain in the undercarriage.

The reason for this is often times in development I'm added new rows to tables that exist in both development and production to test and/or enable a new feature. If the primary key for that table is an int that is auto-generated by the database (since its an identity column) then there's a VERY good chance that same ID has been duplicated in a row in the production database. Things get even more complicated when that primary key ID is referenced by other columns in other tables via foreign key relationships. I tried a few production push sessions where I pretty much took the production database offline (which meant I pretty much took the website offline), pulled down the entire production database, integrated the development data into production by hand (I use RedGate products), and finally pushed the whole thing back up to production. Not pretty, but it worked... until recently. Within the last few months, for various reasons the production database has nearly tripled in size, making the whole pulling of the entire production database, and, worse yet, the pushing of it back up, ridiculously impractical. My last push took over 12 hours! Now, granted, I'm restricted by the limits of a cable Internet connection, but I came to the realization that I needed another solution.

Well, the solution is simple: switch to GUID primary keys. They never duplicate (in theory) so I can create rows in development that will never interfere with new rows in production. Unfortunately the road to this solution is not so simple. However, I was able to finally get there, so I thought I would share how I did it.

How to do the conversion

The main issue with a conversion of primary key columns from int/identity to GUID is the fact that you can't simply do a TABLE ALTER to remove the identity status of a column. You actually have to drop the column and re-add it!

Following are the steps I took. In this example, we'll call the starting database db and the final database db_new-schema.
  1. Create db_new-schema as an empty database with new schema (i.e. everything the same as the old schema except the primary and foreign key columns are all of type UNIQUEIDENTIFIER).
  2. Create another empty database called db_load whose schema is identical to db's except all the primary and foreign key columns are of type NVARCHAR(36) and contains only tables (views, stored procedures, functions are unnecessary). Also, there should be no foreign key constraints between tables.
  3. Use an ETL tool (like DTS or SSIS) to copy the raw table data from db to db_load.
  4. Create a SQL script that uses a cursor to enumerate the rows of a given table in db_load, updating each row's primary key with a new GUID value (using the NEWID function), and updates the rows in all foreign key tables with the same GUID value (selecting them using the old INT id). This script needed to modified and run for each table in the database until all of them were updated. In the end, all INT values were converted to new GUID's (although all primary and foreign key columns were still typed as VARCHAR).
  5. Convert all primary and foreign key columns from VARCHAR to UNIQUEIDENTIFIER.
  6. Use an ETL tool (or RedGate Data Compare) to load the db_new-schema database with all the data in db_load.

Good luck!