Using Declarative Referential Integrity in SQL Server

Relational databases are the "standard" for most business applications, and the massive installed base they enjoy means they aren’t going away any time soon. Other types of databases, such as hierarchical, object-oriented, and hash-file systems, have made a bid for the business applications market over the years, but relational databases continue to grow in popularity.

One of the reasons for this popularity is that relational databases are relatively easy for most developers to understand, build, and maintain. That isn’t to say there aren’t a lot of bad database designs out there, but that’s more the fault of the designer than the technology.

Key Technologies and Concepts

Microsoft SQL Server

Referential Integrity

Cascade Delete

Cascade Update

Foreign Key Constraints

Intersection Tables

One-to-many Relationships

Many-to-many Relationships

CREATE TABLE Statement

ON DELETE Clause

ON UPDATE Clause

What Is DRI?

The blessing and the curse of relational databases is that they are, well, relational. The key to a great design is in how you model the business data entities you need to store, and how you relate them to one another. Modern relational databases give you features that help you enforce those relationships to prevent the insertion of data that violates your business rules. Collectively, these features are known as "declarative referential integrity" (or "DRI") because you can declare the referential integrity rules as part of the database definition.

Microsoft SQL Server didn’t always do such a hot job with DRI. Before SQL Server version 7, you had to program triggers into the database to enforce referential integrity rules. At that time, Microsoft Access was actually a "better" relational database from the standpoint of built-in DRI features.

Fortunately, those days are gone, and SQL Server has supported syntax for declaring relational integrity for quite some time. However, as they say, you can lead a horse to water, but you can’t make him drink. Just because the features are available, that doesn’t mean developers are using them.

Why Use DRI?

Not all developers are convinced that DRI is a good idea in the first place. It places limits on how you can work with the data in your tables. Some developers would rather code the rules into the data layer of the application because it gives them the perception of more control and avoids the hassle of dealing with errors raised from the database system.

My feeling is that data integrity rules should be coded at the lowest level possible so all applications that access the data are consistently subject to the same rules. If you can’t do what you need in the database, then at least make sure every application accesses the database through a single data interface. That interface may consist of a series of stored procedures or it may be a web service, but whatever it is, all applications are required to use it. But I digress.

In a database application that uses SQL Server, the lowest level is the database definition. SQL Server offers relational integrity features that you code directly into your table definitions (thus the "declarative" aspect).

So what does DRI really buy you? I think the best way to answer that question is to point out some of the problems DRI solves for you.

Consistency. If you declare the rules in the database itself, all code that access the database is subject to the same rules. Letting two different developers write code that implements the rules means you will end up with two different sets of integrity rules.
Integrity. Coding the integrity rules into the application is begging for trouble. If you fail to enforce the rules just one time anywhere in your application (and I guarantee someone will), you can end up with "orphaned" rows which inevitably lead to bugs that are very difficult to troubleshoot. You’ll run circles in the application code until you realize the problem is with the data, not the code.
Simplicity. Developers who think it is easy to enforce relational integrity in the application almost always forget about nested relationships. It is easy to simulate a cascade delete from a parent table to a single child table, but much more difficult when you introduce grandchildren in multiple tables. I say, let the database worry about that.

Implementing Relationships

In general, DRI describes the relationship between the primary key of a parent table and a foreign key in a child table. However, you have several options for how you enforce that relationship. How you declare the referential integrity rules in your database naturally depends upon the business rules you want to apply.

For example, suppose your application has a Product table and a Category table. Each product can belong to one category, so you put a foreign key column (CategoryId) into the Product table.

Now you have some questions to ask yourself about that relationship:

Is the relationship required? In other words, is it okay for a product to not be assigned to a category, or must all products be assigned to a category?
What should happen if you delete a category? Do you want to delete all of the products in that category as well? Should you set the CategoryId in the related products to null? Or is it an error to delete a category that has products assigned to it?
If it is possible to change the key of the category, do you want to update the products to use the new key value?

These questions help you decide which relational integrity features you need to apply to your tables.

Optional vs. Required Relationships

I’m going to take on the concept of optional versus required relationships first because it is something you need to think about at the time you create the foreign key column in the child table.

If the relationship between the parent and child is required, that means the foreign key column in a child row must always include the primary key value from an existing parent row. If the relationship is optional, then the child row’s foreign key may be set to NULL (indicating "no relationship").

Going back to our product example, assume that the relationship between the Product table and the Category table is required. Your column definition might look something like this:

[CategoryId] INT NOT NULL

If the relationship is optional, you’d write this instead:

[CategoryId] INT NULL

For the record, let me just say that optional relationships are going to give you a pain. I personally avoid them whenever I can, but that isn’t always possible. Optional relationships are a drag because they often force you to use outer joins in your queries (an inner join will filter out the rows with a null key) and your application has to deal with the possibility of null values in the outer-joined columns when processing retrieved data.

Declaring a Relationship

Once your foreign key is defined, you can declare the relationship between the parent and child tables. For this discussion, it will help to have a couple of sample table definitions for the Product and Category tables:

CREATE TABLE [Category] (
   [CategoryId] INT NOT NULL
      IDENTITY(1,1)
      CONSTRAINT [PK_Category] PRIMARY KEY,
   [Title] VARCHAR (100) NULL
)
 
CREATE TABLE [Product] (
   [ProductId] INT NOT NULL
      IDENTITY(1,1)
      CONSTRAINT [PK_Product] PRIMARY KEY,
   [Title] VARCHAR (100) NULL,
   [CategoryId] INT NOT NULL
)

At this point, you suspect there’s a relationship between the two tables because the CategoryId column name matches between the two tables. You also know that the suspected relationship is required because I declared the column in the Product table as NOT NULL.

However, nothing stops you from putting whatever value you want into the Product’s CategoryId column. You know it can’t be null, but it also doesn’t have to be the identifier for an existing Category row.

Another bad thing that can happen is that you can delete a Category row, and nothing will update the related Product rows you may have. The Product rows will be "orphaned" with an invalid Category reference. You basically just killed the integrity of your data, and it will result in some of the most insidious bugs in your application’s behavior.

To fix that problem and ensure that all products belong to an existing category, you can create a relationship between the tables using SQL Server’s FOREIGN KEY constraint:

ALTER TABLE [Product]
   ADD CONSTRAINT [FK_Product_Category]
   FOREIGN KEY ([CategoryId])
   REFERENCES [Category] ([CategoryId])

By adding this foreign key constraint, you are telling SQL Server that all rows in the Product table require a corresponding row in the Category table. If you try to insert a product row that includes a non-existing key value in the CategoryId column, SQL Server will raise an error back to the application.

I’d like to point out that you declare the relationship as shown above whether you declared the foreign key column as nullable or not. Allowing NULL just means SQL Server allows you to insert either a NULL value or a valid value into the column. Anything else still raises an error.

Cascading Actions

In the relational database world, cascade doesn’t refer to a dishwashing soap or a mountain range in the Pacific Northwest. It refers to what happens when a parent row involved in a relationship is deleted or updated.

In virtually every database application I’ve ever designed, you realistically have three main choices for what happens when you delete a parent row:

Don’t allow the application to delete a parent that has children.
When the application deletes a parent, null out the parent reference in the related children.
When the application deletes a parent, delete its children as well.

The first option is the default behavior if you declare a relationship as I did between the Product and Category tables. It is also the most likely choice whenever you define a required one-to-many relationship between two tables. The reason you usually go this route is that, in a one-to-many relationship, the child table generally has information you don’t want to lose. If the relationship is optional, the second option is a good choice for the same reason.

For example, if I delete a category, does that mean I want to delete all of the products in that category? Perhaps, but probably not. It is more likely that I’ll want to move those products to another category first so I don’t completely lose the product data.

However, the situation is entirely different if the relationship between the two tables is many-to-many. Just for the sake of illustration, suppose that my business rule has changed to say that each product can belong to multiple categories.

In a relational database, there’s really no such thing as a many-to-many relationship. Yep, you read that right. The way you establish a many-to-many relationship is by creating an intersection (a.k.a. "relationship") table. That intersection table establishes two separate one-to-many relationships between the two tables involved in your so-called "many-to-many" relationship.

To implement the new business rule between product categories and products, I need to change the way the tables are defined. There’s no change to the Category table, but I do need to change the Product table and create an intersection table to express the many-to-many relationship:

CREATE TABLE [Product] (
   [ProductId] INT NOT NULL
      IDENTITY(1,1)
      CONSTRAINT [PK_Product] PRIMARY KEY,
   [Title] VARCHAR (100) NULL
)
 
CREATE TABLE [CategoryProduct] (
   [CategoryId] INT NOT NULL,
   [ProductId] INT NOT NULL
)
 
ALTER TABLE [CategoryProduct]
   ADD CONSTRAINT [PK_CategoryProduct]
   PRIMARY KEY ([CategoryId], [ProductId])
 
ALTER TABLE [CategoryProduct]
   ADD CONSTRAINT [FK_CategoryProduct_Category]
   FOREIGN KEY ([CategoryId])
   REFERENCES [Category] ([CategoryId])
 
ALTER TABLE [CategoryProduct]
   ADD CONSTRAINT [FK_CategoryProduct_Product]
   FOREIGN KEY ([ProductId])
   REFERENCES [Product] ([ProductId])

I removed the CategoryId foreign key from the Product table and created a new intersection table named CategoryProduct, which consists of nothing but the primary keys from the related tables. Those two keys together form the primary key of the intersection table, and then the many-to-many relationship is expressed by the two foreign key references to the related tables.

The whole point of this exercise is to show you how your thinking changes when you reconsider what you want to do when you delete a category. Now the relationship you are considering is between the Category "parent" table and the CategoryProduct "child" table.

When you delete a Category, it is probably fine to delete all of the related rows in CategoryProduct because all those rows do is establish the relationship: you aren’t going to lose any product data as a result of the operation.

At the same time, you’ll want to delete the related CategoryProduct rows if you delete a row from the Product table. Once again, you aren’t really deleting any "entity" data, you are just deleting the relationship.

There’s one other interesting thing you might notice about a many-to-many relationship: from the database’s perspective, they are always optional. Even though the foreign key columns in CategoryProduct are declared NOT NULL, you can create products and product categories independently of one another, and nothing says you have to create a row in the CategoryProduct table that ties them together. If a relationship is required, you’ll have to implement that rule in the application.

So now the question becomes how do you set up the cascade delete of children in the CategoryProduct table? The answer is you add a CASCADE rule to the foreign key constraint, like this:

ALTER TABLE [CategoryProduct]
   ADD CONSTRAINT [FK_CategoryProduct_Category]
   FOREIGN KEY ([CategoryId])
   REFERENCES [Category] ([CategoryId])
   ON DELETE CASCADE
 
ALTER TABLE [CategoryProduct]
   ADD CONSTRAINT [FK_CategoryProduct_Product]
   FOREIGN KEY ([ProductId])
   REFERENCES [Product] ([ProductId])
   ON DELETE CASCADE

The ON DELETE CASCADE clause tells SQL Server to delete all of the child rows when the parent row is deleted. Your application will no longer get an error when you delete the parent, SQL Server will just silently wipe out the children.

The more complex your database architecture, the more you will appreciate cascading deletes. The feature isn’t terribly impressive when you have only a couple of tables involved, but when several tables are involved, and when you have deletes that cascade down multiple levels in your data hierarchy, this feature can save you a lot of complicated application code.

More Alternatives

If you look up the ON DELETE and ON UPDATE clauses in the SQL Server documentation, you’ll find you can do other things with your relationships. For example, you can cascade update the foreign key references in the child table when the primary key of the parent is updated, and you can null out the foreign key references in the children when the parent is deleted.

I haven’t seen very many database applications that allow you to update a primary key value, but if you had reason to do so, ON UPDATE CASCADE would be a useful option. However, I often use an IDENTITY field for the primary key, and SQL Server doesn’t allow you to change the value it generates (nor have I ever had reason to change a primary key value in general).

If you create optional relationships, you might have a use for ON DELETE SET NULL. When you delete the parent row using this option, SQL Server sets the foreign key references in the related children to NULL.

Use DRI for Integrity and Consistency

The next time you are designing a new application database or adding new tables to an existing database, take a few moments to think about how you can leverage the DRI features of SQL Server to make your development life a little easier.

Using foreign key constraints helps maintain the integrity of your data and it contributes to more consistent application behavior.

Musings on writing, technology, and rural life