SQL Database Version Control – apgdiff

In “Database Development: Table Structure Changes,” Kenneth Downs argued that build scripts to modify table structures “are usually the least tested part of the system.” He suggested using a single metadata file that describes the database and can be placed under version control, and a difference engine as the tool of choice to perform database upgrades.

Another PostgreSQL Diff Tool (aka apgdiff) is a utility that implements such an engine.  For its metadata, apgdiff takes two pg_dump -s output files, representing previous and new database schemas, and generates ALTER TABLE and other SQL statements needed to upgrade from previous to new.

In this installment of our series on using VCS tools while developing a SQL-based application, we’ll see how Carol and Dave can use apgdiff to control the changes described in Version Control, Part 2: SQL Databases.

Version 0.1

To start off, Carol creates the film table in her database, caroldev, using the CREATE TABLE script, version 0.1, which we saw in the previous article. She also creates a general development database, moviesdev, and issues the following commands:

$ pg_dump -s moviesdev > film0.sql
$ pg_dump -s caroldev > film.sql

The film0.sql file is essentially empty, consisting only of some SET, REVOKE and GRANT statements. The film.sql file includes statements to CREATE and ALTER the film table, as can be seen in the following diff:

$ diff -u film0.sql film.sql
--- film0.sql
+++ film.sql
@@ -9,6 +9,34 @@
 SET client_min_messages = warning;
 SET escape_string_warning = off;

+SET search_path = public, pg_catalog;
+
+SET default_tablespace = '';
+
+SET default_with_oids = false;
+
+--
+-- Name: film; Type: TABLE; Schema: public; Owner: carol; Tablespace:
+--
+
+CREATE TABLE film (
+    id integer NOT NULL,
+    title character varying(32) NOT NULL,
+    release_year integer NOT NULL,
+    CONSTRAINT film_release_year_check CHECK ((release_year >= 1888))
+);
+
+
+ALTER TABLE public.film OWNER TO carol;
+
+--
+-- Name: film_pkey; Type: CONSTRAINT; Schema: public; Owner: carol; Tablespace:
+--
+
+ALTER TABLE ONLY film
+    ADD CONSTRAINT film_pkey PRIMARY KEY (id);
+
+
 --
 -- Name: public; Type: ACL; Schema: -; Owner: postgres
 --

Then Carol can use apgdiff to generate the SQL statements needed to create the table in the development and production databases, e.g.,

$ apgdiff film0.sql film.sql
CREATE TABLE film (
 id integer NOT NULL,
 title character varying(32) NOT NULL,
 release_year integer NOT NULL
);

ALTER TABLE film
 ADD CONSTRAINT film_pkey PRIMARY KEY (id);

ALTER TABLE film
 ADD CONSTRAINT film_release_year_check CHECK ((release_year >= 1888));

In this initial phase, the diff and apgdiff outputs are not that … different. The value of apgdiff will be more obvious in subsequent versions.

As part of the 0.1 release process, Carol commits film.sql to the VCS (she could also commit film0.sql as the film.sql base version, if desired). She may also want to store the apgdiff output in the VCS, since the latter is what will actually be used to upgrade the production database(s).

Version 0.2

For the next version, Dave clones the VCS repository and uses the film.sql script or the saved apgdiff output to create the film table in his own database. He then issues an ALTER TABLE to add the length column, invokes pg_dump -s to create a new film.sql and commits that to his repository. Carol uses another ALTER TABLE to add the rating column in her database, and stores the new pg_dump output to her repository.

Manny, the integration manager, merges the new film.sql versions into the project-wide repository (he has to deal with table ownership statements but perhaps the pg_dump outputs can be stripped of such statements prior to committing them to the VCS). Manny then uses apgdiff to produce the statements needed to upgrade to version 0.2:

$ git diff 0.1 0.2
diff --git a/film.sql b/film.sql
index 799b418..db820bd 100644
--- a/film.sql
+++ b/film.sql
@@ -23,6 +23,9 @@ CREATE TABLE film (
 id integer NOT NULL,
 title character varying(32) NOT NULL,
 release_year integer NOT NULL,
+    length smallint,
+    rating character(5),
+    CONSTRAINT film_length_check CHECK (((length > 0) AND (length < 10000))),
 CONSTRAINT film_release_year_check CHECK ((release_year >= 1888))
 );

$ git checkout 0.1
$ cp film.sql film.sql-0.1
$ git checkout 0.2
$ apgdiff film.sql-0.1 film.sql

ALTER TABLE film
 ADD COLUMN length smallint,
 ADD COLUMN rating character(5);

ALTER TABLE film
 ADD CONSTRAINT film_length_check CHECK (((length > 0) AND (length < 10000)));

Version 0.3

For the next round, Carol and Dave issue the statements shown under Version 0.3 in the previous article and commit newer versions of film.sql which Manny then integrates into the project repository. The apgdiff output comparing the two versions is as follows:

$ apgdiff film.sql-0.2 film.sql

CREATE SEQUENCE category_category_id_seq
 START WITH 1
 INCREMENT BY 1
 NO MAXVALUE
 NO MINVALUE
 CACHE 1;

CREATE TABLE category (
 category_id integer DEFAULT nextval('category_category_id_seq'::regclass) NOT NULL,
 name character varying(25) NOT NULL,
 last_update timestamp with time zone NOT NULL
);

CREATE TABLE film_category (
 film_id integer NOT NULL,
 category_id integer NOT NULL,
 last_update timestamp with time zone NOT NULL
);

CREATE TABLE "language" (
 language_id integer NOT NULL,
 name character varying(20) NOT NULL,
 last_update timestamp with time zone NOT NULL
);

ALTER TABLE film
 ADD COLUMN language_id integer NOT NULL;

ALTER TABLE category
 ADD CONSTRAINT category_pkey PRIMARY KEY (category_id);

ALTER TABLE film_category
 ADD CONSTRAINT film_category_pkey PRIMARY KEY (film_id, category_id);

ALTER TABLE "language"
 ADD CONSTRAINT language_pkey PRIMARY KEY (language_id);

ALTER TABLE film
 ADD CONSTRAINT film_language_id_fkey FOREIGN KEY (language_id) REFERENCES language(language_id);

ALTER TABLE film_category
 ADD CONSTRAINT film_category_category_id_fkey FOREIGN KEY (category_id) REFERENCES category(category_id);

ALTER TABLE film_category
 ADD CONSTRAINT film_category_film_id_fkey FOREIGN KEY (film_id) REFERENCES film(id);

You may notice a small discrepancy when running the above (tested with PostgreSQL 8.4.5 and apgdiff 2.3) to upgrade a second database and then compare the output from the pg_dump -s commands. In the original database, the category_category_id_seq sequence is owned by category.category_id, but this is missing from the upgraded database.

Review

Apgdiff appears to be a useful tool to manage changes to PostgreSQL databases in conjunction with a VCS. It does require that you run pg_dump -s and that you store its output in the VCS. However, apgdiff could be used against a set of manually-maintained SQL scripts.

Apgdiff requires Java 1.6 and Debian, Ubuntu and Gentoo have native apgdiff packages. It supports several (but not all) PostgreSQL features. For example, it omits the TYPEs, DOMAINs and AGGREGATEs used in the Pagila sample database (it was able to recreate 14 out of 21 Pagila tables). It depends on parsing SQL DDL syntax —albeit not fully, so it is susceptible to syntax changes between PostgreSQL releases. It cannot deal with RENAMEs, e.g., if the rating column is renamed to mpaa_rating, apgdiff will drop the first and add the second—not ideal in a populated database.

Apgdiff does not operate on the actual database being upgraded but merely on textual representations. On the plus side, it can be used to generate backout scripts, e.g., by diffing film.sql version 0.3 vs. 0.2 you can generate statements to undo the changes from 0.2 to 0.3. The next tool we’ll review is particularly concerned with this aspect of the upgrade process.

Version Control, Part 2: SQL Databases

In Part 1 we saw how Alice and Bob used a VCS to manage changes to a file of movie information.  In this post, we’ll examine how Carol and Dave, database specialists, deal with similar changes while developing the same application against a SQL-based DBMS.

Version 0.1

When using a relational DBMS, the database designer, administrator (DBA) or programmer defines the data structures, i.e., the tables and columns, using SQL data definition language (DDL), such as this:

CREATE TABLE film (
    id INTEGER NOT NULL PRIMARY KEY,
    title VARCHAR(32) NOT NULL,
    release_year INTEGER NOT NULL CHECK (release_year >= 1888)
);

By adding PRIMARY KEY on the id line, the responsibility for ensuring unique identifiers no longer rests with the application but with the DBMS (but the application needs to be prepared to deal with duplicate ID errors).

Similarly, the validation of release_year is now performed by the DBMS as specified by the CHECK constraint. The application may still implement validations, but the DBMS ensures data integrity even when records are updated using tools external to the primary application.

The DDL can be added to a VCS, say, as a script named film.sql. The application may still use a file like the film.h seen previously to represent the collection of data retrieved from the DBMS or input by a user.  For example, Alice can surround the typedef struct film by EXEC SQL BEGIN|END DECLARE SECTION statements to use it in an ECPG program.

Version 0.2

For the second version, Dave wants to add the movie length column, just as Bob did earlier. In his test environment, he can simply edit film.sql and add length SMALLINT at the end. Similarly, Carol can add rating CHAR(5) to her copy of film.sql and recreate the film table to test her changes. They could even merge their changes using the VCS and store a version 0.2 as follows:

CREATE TABLE film (
    id INTEGER NOT NULL PRIMARY KEY,
    title VARCHAR(32) NOT NULL,
    release_year INTEGER NOT NULL CHECK (release_year >= 1888),
    length SMALLINT CHECK (length > 0 AND length < 10000),
    rating CHAR(5)
);

This approach may be OK for testing (and new customers), but is not satisfactory for production environments with numerous records already in the database. For the flat-file scenario, there was no choice but to create a migration program to reformat or reload the file into the new structure. For a SQL database, you can use the following DDL statements:

ALTER TABLE film ADD COLUMN length SMALLINT
    CHECK (length > 0 AND length < 10000);
ALTER TABLE film ADD COLUMN rating CHAR(5);

Unfortunately, while the ALTER statement allows easy modification of the DBMS storage structures—even in tables with millions of rows, it is not orthogonal with respect to version control. Carol and Dave can commit the CREATE TABLE film.sql version 0.2 to the VCS for use with new installations, but they also have to put the ALTER TABLE statements under version control, probably as part of a script to upgrade existing databases.

The ideal solution may be a CREATE OR REPLACE TABLE statement, but that does not exist.

Version 0.3

For the next release, Dave adds the following:

CREATE TABLE language (
    language_id INTEGER NOT NULL PRIMARY KEY,
    name VARCHAR(20) NOT NULL,
    last_update TIMESTAMP WITH TIME ZONE NOT NULL
);
ALTER TABLE film ADD COLUMN language_id INTEGER NOT NULL;
ALTER TABLE film ADD FOREIGN KEY (language_id)
    REFERENCES language (language_id);

And Carol adds this:

CREATE TABLE category (
    category_id SERIAL NOT NULL PRIMARY KEY,
    name VARCHAR(25) NOT NULL,
    last_update TIMESTAMP WITH TIME ZONE NOT NULL
);
CREATE TABLE film_category (
    film_id INTEGER NOT NULL REFERENCES film (id),
    category_id INTEGER NOT NULL
        REFERENCES category (category_id),
    last_update TIMESTAMP WITH TIME ZONE NOT NULL,
    PRIMARY KEY (film_id, category_id)
);

We again have to deal with storing in the VCS both the CREATE and ALTER statements in order to accomodate new as well as existing customers at different release levels.  However, these changes also exemplify two other issues. The first is that some default DBMS actions are not visible in the DDL.  For example, PostgreSQL will give default names to PRIMARY and FOREIGN KEYS, e.g., film_language_id_fkey. It will also create a SEQUENCE in response to the SERIAL data type specification and add a column DEFAULT based on that sequence.  It may be useful to record these names and other objects in the VCS.

More importantly, there are now dependencies between the various tables. Previously, issuing a certain ALTER TABLE ADD COLUMN statement before another did not (or should not) have much impact on the database or application. Now, film_category can only be created after category and film exist, and language must be created before film (or before the FOREIGN KEY is added to it).

How can projects using PostgreSQL or other RDBMSs deal with the complexities of version control over the database? We’ll examine some solutions in forthcoming posts.

Version Control, Part I: Pre-SQL

A version control system (VCS) such as Git or Subversion is an indispensable component of modern software projects. Many applications use a relational database management system (DBMS) such as PostgreSQL. While a DBMS facilitates an application’s data management tasks, it presents some version control challenges.

To understand the version control issues we’ll examine a project with two programmers, Alice and Bob, developing an application loosely based on the Pagila sample database. We’ll start our tour with an application predating the use of SQL.

When developing a file-based application in C, C++ or other programming languages, you typically define the data structures in your source code, e.g., in header files that are included or otherwise referenced by the application code.

Version 0.1

For example, a C application to maintain information about movies may start off by defining the following in a file named film.h:

typedef struct
{
    int id;
    char title[32];
    int release_year;
} FILM;

The typedef may represent the actual layout of the file records as well as the application buffers used to hold the data input by users or to be displayed on screen.  In addition, the application will need code to ensure the uniqueness of the film identifiers, e.g., to maintain a file index, and to validate the release year upon input, e.g., release_year >= 1888.

The film.h file and the rest of the application are submitted to a VCS and are released as version 0.1.

Version 0.2

For the next version, Bob adds short length; to the end of the FILM structure.  He also adds help text to tell users the new field is to capture the length of the movie in minutes and code to validate that length > 0 and length < 10000.

Separately, Alice adds char rating[6]; to the end of the struct, application help text explaining this is for the MPAA rating, and code to validate the input against a list of known values.

Finally, Bob and Alice submit their changes to the VCS and either he or she (or an integration manager) merge the changes and settle on the following structure:

typedef struct
{
    int id;
    char title[32];
    int release_year;
    short length;
    char rating[6];
} FILM;

The VCS allows Bob and Alice to keep track of changes to the FILM structure, e.g., to diff the tagged revisions of film.h:

$ git diff 0.1 0.2
diff --git a/film.h b/film.h
index 8670425..1df5aaa 100644
--- a/film.h
+++ b/film.h
@@ -5,4 +5,6 @@ typedef struct
     int id;
     char title[32];
     int release_year;
+    short length;
+    char rating[6];
 } FILM;

To allow users to migrate existing applications, a conversion program is also needed to reformat the data file.

Version 0.3

For the third revision, Alice adds a genre array to FILM, and a CATEGORY structure to allow users to maintain the allowed values for genres.  In the meantime, Bob adds a language field to the FILM structure, together with an application-maintained array of language codes and names.  Once merged and released, the structures look as follows (with comments):

typedef struct
{
    int id;            /* genre/category ID */
    char name[26];     /* genre name */
} CATEGORY;

typedef struct
{
    int id;            /* film ID */
    char title[32];    /* film title */
    int release_year;  /* release year */
    short length;      /* length in minutes */
    char rating[6];    /* MPAA rating */
    int categories[5]; /* film genres */
    int language;      /* spoken language code */
} FILM;

The conversion program has to be updated to migrate to the expanded structure, but the VCS still supports comparison of different revisions.  You can see it also allows each programmer to work independently on unrelated features and a distributed VCS such as Git can keep track of the merge actions, as seen below—version 0.2 was merged by the project manager whereas for version 0.3 Bob pulled Alice’s change before committing:

$ git log --pretty=format:"%h %s" --graph
* eb0a385 Add language.
* e6568c5 Add film genres.
*   f404fe2 Merge Alice's add rating.
|\ 
| * bdc8247 Add MPAA rating.
* | 5772f38 Add length.
|/ 
* 76ecd04 First version

With this preliminary flat-file scenario established, we’re ready to move to our next installment where we will contrast the above to the situation using a SQL DBMS.

What is Pyrseas?

“Taming Serpents and Pachyderms” will explore the thinking behind the Pyrseas project, a free and open source software project to develop a framework and utilities to create, upgrade and maintain a PostgreSQL database.

The goal of the project is to follow through and enhance on the concepts of the Andromeda project. The name Pyrseas comes from Python, the programming language, and Perséas1, the Greek mythological hero who rescued Andromeda from a sea monster2.

Some may ask why are new utilities or a framework needed: Aren’t SQL CREATE and ALTER TABLE statements and tools such as pgAdmin or phpPgAdmin sufficient? I hope to provide a complete answer through these series of posts. However, for now consider that while the SQL statements and tools may suffice for the simpler use cases, they pose challenges when dealing with multiple developers, version control, multiple table updates, and denormalized databases.

We’ll start the exploration with the issues posed by agile software development and version control on application projects relying on a DBMS to store and manage their data.  Then we’ll address the problems of automating basic CRUD database operations so that they can be performed by naive users, e.g., without exposing internal identifiers.


1. The common English name for Perséas is Perseus and in Ancient Greek it’s Perseos. However, in modern Greek Περσέας is the more common spelling for the mythical hero.

2. Perséas is perhaps better known for having killed Medusa.

Musings on Python, Postgres and other species

%d bloggers like this: