Version Control, Part 2: SQL Databases

In Part 1 we saw how Alice and Bob used a VCS to manage changes to a file of movie information.  In this post, we’ll examine how Carol and Dave, database specialists, deal with similar changes while developing the same application against a SQL-based DBMS.

Version 0.1

When using a relational DBMS, the database designer, administrator (DBA) or programmer defines the data structures, i.e., the tables and columns, using SQL data definition language (DDL), such as this:

CREATE TABLE film (
    id INTEGER NOT NULL PRIMARY KEY,
    title VARCHAR(32) NOT NULL,
    release_year INTEGER NOT NULL CHECK (release_year >= 1888)
);

By adding PRIMARY KEY on the id line, the responsibility for ensuring unique identifiers no longer rests with the application but with the DBMS (but the application needs to be prepared to deal with duplicate ID errors).

Similarly, the validation of release_year is now performed by the DBMS as specified by the CHECK constraint. The application may still implement validations, but the DBMS ensures data integrity even when records are updated using tools external to the primary application.

The DDL can be added to a VCS, say, as a script named film.sql. The application may still use a file like the film.h seen previously to represent the collection of data retrieved from the DBMS or input by a user.  For example, Alice can surround the typedef struct film by EXEC SQL BEGIN|END DECLARE SECTION statements to use it in an ECPG program.

Version 0.2

For the second version, Dave wants to add the movie length column, just as Bob did earlier. In his test environment, he can simply edit film.sql and add length SMALLINT at the end. Similarly, Carol can add rating CHAR(5) to her copy of film.sql and recreate the film table to test her changes. They could even merge their changes using the VCS and store a version 0.2 as follows:

CREATE TABLE film (
    id INTEGER NOT NULL PRIMARY KEY,
    title VARCHAR(32) NOT NULL,
    release_year INTEGER NOT NULL CHECK (release_year >= 1888),
    length SMALLINT CHECK (length > 0 AND length < 10000),
    rating CHAR(5)
);

This approach may be OK for testing (and new customers), but is not satisfactory for production environments with numerous records already in the database. For the flat-file scenario, there was no choice but to create a migration program to reformat or reload the file into the new structure. For a SQL database, you can use the following DDL statements:

ALTER TABLE film ADD COLUMN length SMALLINT
    CHECK (length > 0 AND length < 10000);
ALTER TABLE film ADD COLUMN rating CHAR(5);

Unfortunately, while the ALTER statement allows easy modification of the DBMS storage structures—even in tables with millions of rows, it is not orthogonal with respect to version control. Carol and Dave can commit the CREATE TABLE film.sql version 0.2 to the VCS for use with new installations, but they also have to put the ALTER TABLE statements under version control, probably as part of a script to upgrade existing databases.

The ideal solution may be a CREATE OR REPLACE TABLE statement, but that does not exist.

Version 0.3

For the next release, Dave adds the following:

CREATE TABLE language (
    language_id INTEGER NOT NULL PRIMARY KEY,
    name VARCHAR(20) NOT NULL,
    last_update TIMESTAMP WITH TIME ZONE NOT NULL
);
ALTER TABLE film ADD COLUMN language_id INTEGER NOT NULL;
ALTER TABLE film ADD FOREIGN KEY (language_id)
    REFERENCES language (language_id);

And Carol adds this:

CREATE TABLE category (
    category_id SERIAL NOT NULL PRIMARY KEY,
    name VARCHAR(25) NOT NULL,
    last_update TIMESTAMP WITH TIME ZONE NOT NULL
);
CREATE TABLE film_category (
    film_id INTEGER NOT NULL REFERENCES film (id),
    category_id INTEGER NOT NULL
        REFERENCES category (category_id),
    last_update TIMESTAMP WITH TIME ZONE NOT NULL,
    PRIMARY KEY (film_id, category_id)
);

We again have to deal with storing in the VCS both the CREATE and ALTER statements in order to accomodate new as well as existing customers at different release levels.  However, these changes also exemplify two other issues. The first is that some default DBMS actions are not visible in the DDL.  For example, PostgreSQL will give default names to PRIMARY and FOREIGN KEYS, e.g., film_language_id_fkey. It will also create a SEQUENCE in response to the SERIAL data type specification and add a column DEFAULT based on that sequence.  It may be useful to record these names and other objects in the VCS.

More importantly, there are now dependencies between the various tables. Previously, issuing a certain ALTER TABLE ADD COLUMN statement before another did not (or should not) have much impact on the database or application. Now, film_category can only be created after category and film exist, and language must be created before film (or before the FOREIGN KEY is added to it).

How can projects using PostgreSQL or other RDBMSs deal with the complexities of version control over the database? We’ll examine some solutions in forthcoming posts.

About these ads

16 thoughts on “Version Control, Part 2: SQL Databases”

  1. Thanks for tackling these issues! I’m looking forward to seeing your ideas on how to handle upgrades without making new instance creation horribly painful.

    Can I ask you one favor, though, since this is a tutorial? Can you give your tables natural keys as well as surrogate keys? We should teach people good DB habits. ;-)

    1. I understand your concern. When I tried using these tables with Django, to explore another area of Pyrseas design, I found I couldn’t specify the film_id, category_id pair as the primary key and had to accept Django’s “id” surrogate for the primary.

      I would have preferred to use ISO 639-1 codes (not entirely “natural” but at least an accepted standard) instead of arbitrary integer codes for language_id, but I didn’t want to modify the Pagila design too much. The category table could probably use the category name itself, perhaps all caps and with some length restriction, as the natural primary key.

      However, what would you use as the natural key for film: the title and the release year? What do you do when you find there were two movies titled “Hero” in 1983, 1992, 2009 and 2010 (very popular title!)? You could add an integer to disambiguate, as IMDb does, and create a UNIQUE constraint over title/year/instance, but in this case I believe you have to use the surrogate as the primary.

  2. You may want to have a look at a solution we have been developing, called ChronicDB, for managing schema version changes.

    The idea is to both capture snapshots of the schema definitions so that moving forward or backward between any schema version is possible, but additionally with the capability of introducing no downtime.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s