Sunday, April 01, 2007

Converting the primary key type in a SQL Server database from int to GUID

I maintain and develop a database for a web application that I've worked on for several years now. The application and the database have evolved significantly, and as things got larger and more complex, I've had to employ more advanced techniques and tools to perform database updates from development to production.

One of the "mistakes" I made when I originally designed the database was to use primary keys with the int data type (along with setting the primary key column as an identity column) for all of the tables. I have nothing against int's or identity columns and, in fact, that simple, tried-and-true configuration has worked well for this application... except for one, unfortunately very painful side-effect. When it comes time for me to push a major update from development to production, sorting out duplicate columns between both development and production database can be a real pain in the undercarriage.

The reason for this is often times in development I'm added new rows to tables that exist in both development and production to test and/or enable a new feature. If the primary key for that table is an int that is auto-generated by the database (since its an identity column) then there's a VERY good chance that same ID has been duplicated in a row in the production database. Things get even more complicated when that primary key ID is referenced by other columns in other tables via foreign key relationships. I tried a few production push sessions where I pretty much took the production database offline (which meant I pretty much took the website offline), pulled down the entire production database, integrated the development data into production by hand (I use RedGate products), and finally pushed the whole thing back up to production. Not pretty, but it worked... until recently. Within the last few months, for various reasons the production database has nearly tripled in size, making the whole pulling of the entire production database, and, worse yet, the pushing of it back up, ridiculously impractical. My last push took over 12 hours! Now, granted, I'm restricted by the limits of a cable Internet connection, but I came to the realization that I needed another solution.

Well, the solution is simple: switch to GUID primary keys. They never duplicate (in theory) so I can create rows in development that will never interfere with new rows in production. Unfortunately the road to this solution is not so simple. However, I was able to finally get there, so I thought I would share how I did it.

How to do the conversion

The main issue with a conversion of primary key columns from int/identity to GUID is the fact that you can't simply do a TABLE ALTER to remove the identity status of a column. You actually have to drop the column and re-add it!

Following are the steps I took. In this example, we'll call the starting database db and the final database db_new-schema.
  1. Create db_new-schema as an empty database with new schema (i.e. everything the same as the old schema except the primary and foreign key columns are all of type UNIQUEIDENTIFIER).
  2. Create another empty database called db_load whose schema is identical to db's except all the primary and foreign key columns are of type NVARCHAR(36) and contains only tables (views, stored procedures, functions are unnecessary). Also, there should be no foreign key constraints between tables.
  3. Use an ETL tool (like DTS or SSIS) to copy the raw table data from db to db_load.
  4. Create a SQL script that uses a cursor to enumerate the rows of a given table in db_load, updating each row's primary key with a new GUID value (using the NEWID function), and updates the rows in all foreign key tables with the same GUID value (selecting them using the old INT id). This script needed to modified and run for each table in the database until all of them were updated. In the end, all INT values were converted to new GUID's (although all primary and foreign key columns were still typed as VARCHAR).
  5. Convert all primary and foreign key columns from VARCHAR to UNIQUEIDENTIFIER.
  6. Use an ETL tool (or RedGate Data Compare) to load the db_new-schema database with all the data in db_load.

Good luck!

6 comments:

Jeff House said...

I think that Dr. Seuss wrote a book about guid

Watmann said...

Okay, we are in the exact opposite situation. All PKs are GUIDs...very bad performance due to heavy page spliting and huge indexes...how are you going to manage this from the performance side? How about NEWSEQUENTIALID ( ) instead to reduce page spliting and index fragmentation?

Pete said...

I've heard mixed reactions about NEWSEQUENTIALID, but I believe the consensus is that it does indeed solve most of the performance problem. The MSDN documentation is confusing, making it sound like you only get unique GUID's if your computer doesn't have a network card. I think the reality is they will be unique between machines as long as all servers have different MAC addresses on their network cards.

At any rate, I would definitely use NEWSEQUENTIALID except for the fact that my database server is still running SQL Server 2000; as far as I know, NEWSEQUENTIALID, is a new 2005 feature. Fortunately, my table sizes are relatively small (the largest table only has a few thousand rows) so I'm currently not being hit with too much of a performance issue. However, as soon as my provider upgrades the server 2005 I think I will consider using NEWSEQUENTIALID.

In your case (and in my future case when I upgrade to 2005), if you choose to begin using NEWSEQUENTIALID, I think you still have a bit of a conversion process to go through, although it won’t be as grueling as mine was (int to GUID). You’re still going to need to replace all the old GUID’s generated by NEWID with sequential GUID’s via NEWSEQUENTIALID. Simply setting the default value of your PK column to NEWSEQUENTIALID() will not help your clustered index since all the old row values are still random GUID’s. This is similar to step 4 in my post, but would require a slightly different approach since, from what I understand, you can’t use the NEWSEQUENTIALID function in general scripts; it can only be used as the default value for a table column.

Mark said...

Sounds like you needed SET IDENTITY_INSERT ON. The problems you face with managing rows across environments are like nothing compared to the pitiful performance you'll no doubt experience if you have to scale with GUIDs. Inserts and Joins will crawl.

Don't get me wrong, they work okay for small systems, but scale they do not.

Just think you found the wrong solution to your problem. The simplicity of GUIDs is not free.

Pete said...

I agree, GUID's are SLOW. And luckily for me, the database is not not that big. SET IDENTITY_INSERT ON, while it lets you manually insert identity values, doesn't solve the problem of duplicate rows. You still need some way of matching rows between both databases.

Per a previous comment, if I ever upgrade to SQL Server 2005, I think a nice hybrid solution would be to use NEWSEQUENTIALID.

Anonymous said...

Just stumbled across this article from a google search and would like to say thanks - this is exactly what I was after. I had to convert a system (operational for a number of years) over to use GUID's because the end-users wanted a 'portable' version that could be installed on laptops, and sync'ed up remotely at random times of the day. In other words, I was forced to go down the route of GUID primary keys.

A very time consuming process but one that you simplified greatly for me. Many thanks.

Stephen