DaveWentzel.com            All Things Data

Everything you need to know about GUIDs

I thought it was time to repost some information on GUIDs since I'm seeing them being proposed more and more for solutions that don't require them, such as ORMs.  

What's a GUID?

It's short for Globally Unique Identifier, which is Microsoft's implementation of the UUID (Universally Unique Identifier).  The UUID concept is an ISO standard that enables distributed systems to identify a piece of generated information uniquely without coordination from a central entity or piece of software.  This means that I can generate  a UUID using SQL Server on my machine at the same time as you generate a UUID using Sybase and we can rest assured that our identifiers will not clash if we ever merged the data elements together.  

A UUID (from here on out referred to as a GUID to ease confusion), is a 16byte number (or 36 characters when viewed more commonly as a string) that does have some degree of logic built into the number...it's not just monotonically increasing.  In fact, some vendors have allocated certain values to mean certain things which is totally acceptable and akin to the fact that certain IP address ranges are set aside for specific purposes, 127.x.x.x and 192.168.x.x for example.  

Here is a GUID I just generated on my machine using SQL Server and SELECT NEWID().


Notice it is split into 5 sections.  

GUID "versions"

There are actually different "versions" of GUIDs.  And just by looking at a generated GUID you can tell it's version by looking at the first number of the 3rd section, in our case above it is a 4, meaning it is a Version4 GUID.  In MS-land you'll usually only see Version 1 and 4 GUIDs so I'll focus solely on those two.  

A Version 1 GUID guarantees uniqueness by, and I'm oversimplifying here, concatenating your machine's network card's MAC address with a current timestamp.  This method was criticized since it actually reveals the provenance of the data "too much".  As a side note, if you remember the Melissa virus, they actually caught the writer because his code had an embedded Version 1 GUID in it, which means they could link it to his machine.  

Version 4 GUIDs are generated in the latest releases of SQL Server.  Version 4 GUIDs use only random numbers to guarantee uniqueness.  Many people think that there is a timestamp component to GUIDs based on the fact that the generated numbers tend to look "similar" to each other.  They look similar because of the Version and reserved use bits built into the GUID, not because of a timestamp component.  

Are GUIDs really unique?

This depends on your definition of unique.  A Version 1 GUID is actually fairly easy to demonstrate duplicates.  Assume you build two VMs.  On many virtualization platforms, such as MS Virtual Server, you can monkey around with the MAC address of the guest.  It is not inconceivable that you could build 2 guests with the same MACs, synch their times, install a Version 1 GUID generator on both, and start generating GUIDs.  You'll eventually get a clash.  

As for Version 4, well, I guess for all intents and purposes a GUID is unique, but in theory, you could get a clash since no random number generator is really and truly random.  

When *MUST* I use a GUID?

  1. If you plan on using SQL Server FILESTREAM storage then you must have a GUID col with the ROWGUIDCOL property set.  
  2. Merge Replication has the exact same requirement.  

Where else are they commonly used?

Many data architects insist on using them when designing distributed systems.  Rarely is this necessary however.  Using an Oracle sequence is a better choice where a separate SEQUENCE, using a different seed perhaps, can be assigned to each system in the distributed design.  We don't have SEQUENCEs in SQL Server (yet, hopefully) so we can create the equivalent using a "KeyTable" which is simply a table with at least 2 cols, one called TableName, and one called NextID.  A distributed system makes a call to KeyTable and says, as an example, "Give me 1,000 IDs for me to use locally" and KeyTable responds with the IDs and then reserves those entries by incrementing NextID.  The distributed system then caches and uses those 1,000 IDs until more are required and the process repeats.  Many distributed systems can also utilize the concept of a "smart key" which is an IDENTITY with a SystemID appended to it.  This is useful when a KeyTable cannot be used because there is no central system for *any* coordination.  This is very rare though.  There are many other schemes that you can use as well, based on your requirements.

Another use is by ORM Tools.  They like GUIDs because they can generate parent/child table row values entirely locally without the db specifying the ParentTableID first, for use on the children rows.  In other words, I can batch all of my INSERTs into one db call roundtrip.  An ORM can't do this with IDENTITIES.  Again, a KeyTable is a much better solution.  As a side note, although the roundtrip avoidance in an ORM is nice, I've never seen an ORM where they actually do batch the calls.  

When are GUIDs good and bad when used as a key? 
Pros Cons
Easy to create.  There are tons of GUID generators out there. HUGE.  16 bytes which is 2^128 unique values.  Compare this to a 4 byte INT.  
No intelligence built into the key (we shouldn't really ever do this anyway, this ensures we can't). Performance can be poor (see below for specifics)
  Due to their length they are not easy for users to remember (like an SSN).  You'll probably still want a natural key.  
Special SQL Server Concerns
Be careful.  GUIDs can cause fragmentation in your indexes/tables.  Obviously a clustered index based on a GUID will relieve hotspots (since they are random) but you will need to think carefully about your FILLFACTOR, or face terrible fragmentation problems.  A good rule of thumb is to NEVER base your clustered indexes off of a GUID col.  Here's another reason...if non-clustered non-unique indexes are built off a clustered index based on a GUID, even if the non-clustered index does not contain the GUID, the non-clustered, non-unique index will fragment.  Why?  There *must* be something that makes each non-clustered index key unique internally to SQL Server.  Even non-unique indexes have this requirement.  How does SQL Server find this uniqueness in this scenario?  By appending the GUID from the clustered index to the new index, hence even more fragmentation.  
Many proponents of GUIDs insist that the NEWSEQUENTIALID() function in recent SQL Server releases eliminates the fragmentation issues with GUIDs and NEWID().  I disagree.  This function creates a GUID that is greater than any previously created GUID, but only since your last reboot.  The proponents fail to remember that last part.  So, yes, fragmentation is a relieved a little bit, but less so if you reboot often (but Windows never requires that thankfully).  One additional wrinkle...you can't call this directly like NEWID()...it can only be used as a DEFAULT constraint on your table, which will limit its effectiveness in ORM applications.  

Add new comment