SQL notes I

  • Design a table with a primary key of varchar or int?

  • In gereral, VARCHAR or INT has no big differency in query. What matter is the access pattern.

Narrowness:
On absolute terms, a wider key will always be worse than a narrow key. The type carries no importance, is the width of the key that matters. When compared with INT though, few types can beat INT in narrowness, so INT usually wins that argument just by the fact that it is only 4 bytes wide.

More on this aspect:
In most cases, The Primary Key (PK) is defined as INT NOT NULL IDENTITY(1,1) field in each table.
On one side, with an IDENTITY field, the database handles all the details of making sure it’s really unique, and the INT datatype is just 4 bytes, and fixed, so it’s easier and more suited to be used for the primary (and clustering) key in the table.
On the other side, using a VARCHAR(10) or (20) just uses up too much space - 10 or 20 bytes instead of 4.

  • Another fact is when choosing the PK usually one also selects the clustered key. Those two are often confused. By default, SQL server chooses PK as one of clustering indexes.

Clustering Index:
The idea of clustered indexes is to store a complete table in a B-tree structure. If a table has a clustered index, it basically means the index is the table. A clustered index has a strict row order like any other B-tree index: it sorts the rows according to the index definition.

Pros
The upside of clustered indexes is that they can deliver subsequent rows quickly when accessed directly (not via a non-clustered index). In other words, they are fast if you use the clustering key to fetch several rows. Remember that the primary key is the clustering key per default. In that case, it means fetching several rows via primary key—with a single query.

Cons
How bad is the cluster index in query?

Rows Fetched | Time (Heap) | Time (Clustered)

10.000 | 31 | 78

20.000 | 47 |130

50.000 | 109 |297

100.000 | 203 | 624

200.000 | 390 | 1232

Ref: Unreasonable Defaults: Primary Key as Clustering Key

The clustering key value will be repeated on every single index entry on every single non-clustered index on the table, so potentially, it is wasting a lot of space (not just on disk - that’s cheap - but also in SQL Server’s main memory). Also, since it’s variable (might be 4, might be 20 chars) it’s harder to SQL server to properly maintain a good index structure.

Conclusion, it is always the right choice to start with INT NOT NULL IDENTITY(1,1)as the PK. However if considering VARCHAR type PK, it is better to avoid to use it as one of the clustering indexes.

  • Description of primary keys and clustered keys

Primary keys are logical business elements. The primary key is used by your application to identify an entity, and the discussion about primary keys is largely wether to use natural keys or surrogate key. The links go into much more detail, but the basic idea is that natural keys are derived from an existing entity property like ssn or phone number, while surrogate keys have no meaning whatsoever with regard to the business entity, like id or rowid and they are usually of type IDENTITY or some sort of uuid. My personal opinion is that surrogate keys are superior to natural keys, and the choice should be always identity values for local only applicaitons, guids for any sort of distributed data. A primary key never changes during the lifetime of the entity.

Clustered keys are the key that defines the physical storage of rows in the table. Most times they overlap with the primary key (the logical entity identifier), but that is not actually enforced nor required. When the two are different it means there is a non-clustered unique index on the table that implements the primary key. Clustered key values can actualy change during the lifetime of the row, resulting in the row being physically moved in the table to a new location. If you have to separate the primary key from the clustered key (and sometimes you do), choosing a good clustered key is significantly harder than choosing a primary key.

  • SQL scripts to set non-cluster index

-- Enforces the Primary Key constraint with a Unique Non-Clustered Index
CREATE TABLE Foo1
(
Col1 INT NOT NULL PRIMARY KEY NONCLUSTERED,
Col2 INT NOT NULL,
Col3 INT NOT NULL
)
GO

A table without clustered index uses Heap structure to query data.

-- Create the Primary Key constraint on an ever-increasing
-- key column
CREATE TABLE Foo2
(
Col1 INT NOT NULL PRIMARY KEY NONCLUSTERED,
Col2 UNIQUEIDENTIFIER NOT NULL,
Col3 INT NOT NULL
)
GO

-- Create the Clustered Index on a random key column
CREATE UNIQUE CLUSTERED INDEX ci_Col2 ON Foo2(Col2)
GO

A table has both clustered and non-clustered indexes.

Following are the advantages of focusing on index-only scans instead of clustered indexes:

  • You are not limited to one index. Any index can be as fast as a clustered index.

  • Adding INCLUDE columns to a non-clustered index doesn’t affect anything else than this particular index. There is no penalty that hurts all other indexes!

  • You don’t need to add all table columns to a non-clustered index to enable an index-only scan. Just add the columns that are relevant for the query you’d like to tune. That keeps the index small and can thus become even faster than a clustered index.

  • And the best part is: there is no mutual exclusion of index-only scans and clustered indexes. Index-only scans work irrespective of the table storage. You can extend non-clustered indexes for index-only scans even if there is a clustered index. That’s also an easy way to avoid paying the “clustered index penalty” on non-clustered indexes.

-- Enforces the Primary Key constraint with a Unique Non-Clustered Index
CREATE TABLE Foo1
(
Col1 INT NOT NULL PRIMARY KEY NONCLUSTERED,
Col2 INT NOT NULL,
Col3 INT NOT NULL
)
GO

-- Create the Primary Key constraint on an ever-increasing
-- key column
CREATE TABLE Foo2
(
Col1 INT NOT NULL PRIMARY KEY NONCLUSTERED,
Col2 UNIQUEIDENTIFIER NOT NULL,
Col3 INT NOT NULL
)
GO

-- Create the Clustered Index on a random key column
CREATE UNIQUE CLUSTERED INDEX ci_Col2 ON Foo2(Col2)
GO
  • What is a GUID?

GUID (or UUID) is an acronym for ‘Globally Unique Identifier’ (or ‘Universally Unique Identifier’). It is a 128-bit integer number used to identify resources. The term GUID is generally used by developers working with Microsoft technologies, while UUID is used everywhere else.

  • How unique is a GUID?

128-bits is big enough and the generation algorithm is unique enough that if 1,000,000,000 GUIDs per second were generated for 1 year the probability of a duplicate would be only 50%. Or if every human on Earth generated 600,000,000 GUIDs there would only be a 50% probability of a duplicate.

  • How are GUIDs used?

GUIDs are used in enterprise software development in C#, Java, and C++ as database keys, component identifiers, or just about anywhere else a truly unique identifier is required. GUIDs are also used to identify all interfaces and objects in COM programming.

SQL Server has the creation of GUID’s built in. There is no need to write a separate stored procedure for this.
You can use

- `NEWID()`
- `NEWSEQUENTIALID()`

The key difference between both procedures would be that the sequential GUID should be used if it is for a primary clustered key.