Tag Archives: git

Testing Python and PostgreSQL on Windows – Basics

In my previous post, I wrote:

Although I have not yet personally run the [Pyrseas] unit tests on Windows …, I believe the tox setup should be quite portable …, since the tests only depend on Python and psycopg2 being able to connect to Postgres, i.e., they do not depend on running any PG utilities from the command line.

Several moons ago, I had done a cursory test of the Pyrseas utilities on Windows from a source zip file, but now I wanted to set up a full development environment (well, almost full—I used Notepad for minor editing) and run through all the unit tests on as many Python/PostgreSQL combinations as possible and ideally using tox.

This post describes what I found out during the install/test process. Hopefully others will find it useful.

Operating System

I chose to use Windows XP Home Edition running under VirtualBox. It’s not a professional solution, but I wasn’t prepared to pay for the “privilege” of using Windows and it’s likely others also have a home edition CD or similar media from an earlier hardware purchase.

Version Control

Pyrseas sources are stored on GitHub. Installing Git and cloning the repository was probably the most uneventful step. The Git download page gives you an installer which offers three options. I chose “Use Git Bash only” as this appears to be the most friendly to someone coming from a Linux/Unix environment. It doesn’t change nor does it require you to change the PATH, all you need to do is select “Git Bash” from the Start menu and a Bash shell is opened for you.

DBMS

Installing PostgreSQL was fairly straightforward. The Windows download page leads to EnterpriseDB one-click installers for multiple platforms and for the more recent versions you have to choose between 32-bit and 64-bit systems. The installer asks for an installation directory, data directory, postgres user password, port number and locale, offering defaults except for the password.

The installer installs both the DBMS and pgAdmin III. If you’re more comfortable with psql, you can select “SQL Shell (psql)” from the Start menu. With either the latter or pgAdmin, you won’t have to change PATH, unless you want to run psql or some other PostgreSQL utility from a Command Prompt window.

Python

Installing Python can be done from Windows MSI installers available from Python.org for the latest releases of Python 2.7 and 3.2. Aside from specifying the installation directory, you’re given the choice of additional components to install, e.g., Tcl/Tk, documentation.

The installers provide a “Python (command line)” option from the Start Menu, but for testing or development, you’ll probably want to open your own Command Prompt window, in order to customize your setup. This requires that you add, e.g., C:\Python27 and C:\Python27\Scripts, to your PATH. Alternatively, you could use the Git Bash window to stay within a Unix-like environment (in which case you’ll still have to add the equivalent directories, e.g., /c/Python27, to PATH).

So far so good. A forthcoming post will cover more, shall we say, entertaining topics.

Dueling Frameworks

Cue the music

In this corner, the time-tested CherryPy, standing at version 3.2, six years old, enabled for Python 3, sporting an HTTP 1.1-compliant WSGI webserver, support for other WSGI servers or adapters, a plugin mechanism, built-in tools and much more.

In the other corner, the newcomer Flask, at version 0.7.2, about one year old (but with older ancestry), claiming to be a “microframework … with good intentions” and “fun.”

In my post A Funny Thing Happened on the Way to the Webserver I mentioned I took a look at the second contender and liked what I saw. First impressions, as they say, can be deceiving. What may seem appealing is sometimes harder to use or disappointing when you get down to writing code and testing it.

Coincidentally, after I had decided on the “Dueling Banjos” theme, Audrey Roy pointed me1 to Richard Jones’ “Web micro-framework Battle!” which provided additional food for thought.

As you may expect, my selection criteria don’t exactly match Richard’s. Perhaps the most striking difference is how he implemented routing (URL mapping, page 14 of the presentation) for the baseline cgi+wsgiref implementation vs. my choice for doing dispatching in my minimalist WSGI database UI (see in particular, the dispatch methods in dbapp.py and film.py).

Like Richard, I don’t appreciate “magic” in a framework or application. but I’m less concerned with strictly RESTful URLs. Richard appears to prefer the decorator approach to URL mapping (as seen by his Bottle, Flask and Pesto examples, among others), but also used URL mapping tables (seen in the web.py, Werkzeug and baseline examples).

The latter approach reminds me of a generic menu interface I wrote eons ago in C. It may be OK for a vtable in C++, but I usually find it inappropriate for application level code. In fact, the mapping table seems “magical” since it doesn’t encapsulate the code with the corresponding URL.

Although Flask looked attractive due to its URL routing decorator, it now appears limiting because it can only be used on functions but not on class methods. (Note: I haven’t explored all of Flask, e.g., haven’t looked at its Blueprints).

As a result, refactoring the WSGI application to use CherryPy was quite straightforward, whereas changing it to accomodate Flask involved quite a bit of surgery. This is exemplified by these Git stats:

 cherrypy/film.py |   73 ++++++++++++++++-------------------------------------
 1 files changed, 22 insertions(+), 51 deletions(-)

 flask/film.py |  235 ++++++++++++++++++++++++++-------------------------------
 1 files changed, 107 insertions(+), 128 deletions(-)

While many modifications simply involved indentation and removing the self parameter, other changes were necessary. For example, as there can only be one mapping to a function named ‘index’, I had to rename the film.py ‘index’ function/method to ‘list’ to avoid conflict with dbapp.py. This does not bode well for extensibility: a real application would support multiple entities—films, actors, customers, etc.—with similar function/methods. Maybe this is what Blueprints are for, but why invent another wheel when Python classes are available?

There were other minor annoyances with Flask, like how to instantiate its application object without adding a subdirectory, or how to pass the database connection to the film.py functions without using a global variable.

I may continue to explore Flask, although at the moment CherryPy seems more suitable to my purposes. I’m also intrigued by some concepts of Richard’s winners—Pesto and Bottle. A Pesto-like dispatcher with support for class methods on top of CherryPy would probably make my day.

The code implementing the CherryPy database UI interface is on GitHub tagged as version v0.3.0 and the corresponding Flask code is also there tagged as v0.3.1.


1 BTW, +1 for Audrey’s wish list item: Python in the Browser (see pages 16-18 of her presentation).

A Minimalist Command Line Database User Interface

To begin exploring the design of a generic database user interface as mentioned in my previous post, I’ve written a minimalist command line program. To try it, do this:

$ git clone git://github.com/jmafc/database-ui-tutorial.git dbui
$ cd dbui
$ createdb moviesdev
$ yamltodb moviesdev film.yaml | psql moviesdev

Of course, there are some pre-requisites: Git, PostgreSQL, Python, Psycopg2 and Pyrseas. If you don’t want to install the latter, create the film table according to the 0.1 version here. If you simply want to look at the code, you can find dbapp.py at GitHub here.

The following is a sample usage session:

$ cmdline/dbapp.py moviesdev
   A - Add
   L - List
   U - Update
   D - Delete
   Q - Quit
Command? l
    Id Title                            Year
 19777 The Cocoanuts                    1929
 20629 All Quiet on the Western Front   1930
 20640 Animal Crackers                  1930
   A - Add
   L - List
   U - Update
   D - Delete
   Q - Quit
Command? a
Id [0]: 1234
Title []: Testing
Release year [0]: 2010
Film 'Testing - 2010' added
   A - Add
   L - List
   U - Update
   D - Delete
   Q - Quit
Command? u
Id: 1234
Updating 'Testing - 2010'
Title [Testing]: A Test Movie
Release year [2010]:
Film 'A Test Movie - 2010' updated
   A - Add
   L - List
   U - Update
   D - Delete
   Q - Quit
Command? d
Id: 1234
Delete film 'A Test Movie - 2010' (y/n) [n]: y
Film 'A Test Movie - 2010' deleted
   A - Add
   L - List
   U - Update
   D - Delete
   Q - Quit
Command? q
Done

Caveat emptor: The code is very succint and has limited error checking.

Database User Interface Basics

Examining the sample application leads us to identify the following essential features:

  1. The application needs to present data both as single records and in list or tabular form.
  2. The data is entered and displayed as characters but possibly stored or interfaced to the DBMS in binary formats, hence conversion procedures between external and internal formats are necessary.
  3. Data items need an initialization or default value.
  4. Many or most data items need validation procedures, including a failure message. Some validations are generic to the data type, others are specific to the item.
  5. Depending on the operation mode, modification of some items may be prevented, e.g., primary key cannot be entered or changed in Update mode.
  6. One or more facilities are needed to select a record for Update or Delete, e.g., by the primary key, from a list, etc.
  7. When reporting actions taken, a short external representation of a record is desirable1.

Some of these may seem tautological, but they apply regardless of whether the user interface is line-oriented, character-oriented (like ncurses), graphical or web-based.

A program to develop database applications needs to assist the developer in specifying these features. For example, the program has to provide a means to design a single record or tabular format. The advantage of having access to the database catalogs is that the program can facilitate this by supplying useful defaults at various stages in the design process.

Once the design is finalized, we’ll probably want to store it in the database itself, but –mindful of version control– it’s recommended we keep the design in a YAML specification file that can be stored in a VCS.


1 The primary key for the film table is an integer, and in the sample session, the IMDb identifier was used as it were a “natural” key. In most applications, such an id would be auto-generated and possibly not shown to the user.

Python Development Tools of the Trade

Last week, Bruce Momjian commented that pgindent is one of the reasons for the clarity of the PostgreSQL source code. Today, Andrew Dunstan remarked on “how cleanly it compiles” and no doubt this is due in great part to the buildfarm. So I thought I’d reflect on the tools I’ve been using to ensure the clarity and quality of the Pyrseas code.

Editors

Early in my career, I had a brief encounter with TECO. I also encountered Brief (but for a longer time). Perhaps that explains why, after many years of using vi and vim, in the past few years I’ve gone back to my roots, if you will, and I’m using Emacs for programming.

Some of the Emacs features I find helpful or convenient in developing quality code include (in no particular order):

  • Syntax highlighting: Whether it’s Python, ReStructured Text for the Sphinx documentation, or C in the PostgreSQL code, syntax coloring is nearly always on, by default based on the filename extension.
  • Parentheses matching: Not only for parens, but also for brackets and braces (particularly helpful when creating or editing Python dicts and JSON/YAML maps).
  • which-func-mode: Being able to tell which function/method you’re editing, on the status line, when the head of the function may not be in view.
  • Git branch: Shows in what branch you’re editing, again on the status line. For Subversion, it shows SVN-nnn where nnn is the revision number that last affected the current file.

Unit Testing

I have used the Python unittest testing framework, almost religiously, to develop tests before writing code. I initially also looked at py.test and briefly at nose, but decided the standard library module was good enough. The unit tests have been invaluable when adding features or refactoring code.

Version Control

Git was my choice for a VCS. Two Git features are worth mentioning. One is the change coloring in git diff, but also the context information, e.g., it shows the Python class in which the difference occurs. Second is the ability to stash away work in progress.

Code Quality

I’ve been using pep8 to ensure the Pyrseas code follows as much as possible the Python Style Guide. This is the tool closest to pgindent, but it doesn’t reformat the code.

Lastly, before making the code available publicly, I ran pylint against it. Then I edited it accordingly to improve readability and to eliminate some warning categories.

SQL Database Version Control – Pyrseas

Now that Pyrseas has been released, it’s time to see how the DBAs Carol and Dave would use it to manage the changes described in Version Control, Part 2: SQL Databases.

Version 0.1

Carol starts by creating the film table in her database, caroldev, using the CREATE TABLE script, version 0.1, which we saw in the original article. She then uses the Pyrseas dbtoyaml tool to ouput the database specification:

$ dbtoyaml caroldev
schema public:
  table film:
    check_constraints:
      film_release_year_check:
        columns:
        - release_year
        expression: (release_year >= 1888)
    columns:
    - id:
        not_null: true
        type: integer
    - title:
        not_null: true
        type: character varying(32)
    - release_year:
        not_null: true
        type: integer
    primary_key:
      film_pkey:
        access_method: btree
        columns:
        - id

This should be mostly self-explanatory, but let’s go over it. Indentation indicates structure (nesting) and hyphens are used for each element in a list. Thus, there is one schema (public) and one table (film) defined within it. The table has three columns (id, title and release_year) with the given data types and none of them are nullable. The PRIMARY KEY (film_pkey) is on the id column and there is a CHECK constraint on the release_year column. By the way, the alphabetical order within each level is simply the default output format of PyYAML.

Carol can redirect the output of dbtoyaml to a file, say, film.yaml and store it in a Git repository or some other VCS. Dave will then be able to pull the file from the repository and use it to generate the SQL statements needed to create the table in his database, or the development, test and production databases, using the yamltodb tool as follows:

$ yamltodb -1 davedev film.yaml
BEGIN;
CREATE TABLE film (
    id integer NOT NULL,
    title character varying(32) NOT NULL,
    release_year integer NOT NULL);
ALTER TABLE film ADD CONSTRAINT film_pkey PRIMARY KEY (id);
ALTER TABLE film ADD CONSTRAINT film_release_year_check CHECK (release_year >= 1888);
COMMIT;

If he is satisfied with the SQL, he can pipe the output to psql, e.g.,

$ yamltodb -1 davedev film.yaml | psql davedev

Version 0.2

For the second version, Dave issues an ALTER TABLE to add the length column, invokes dbtoyaml to create a new film.yaml and commits that to his repository. Carol uses another ALTER TABLE to add the rating column in her database, and stores a new film.yaml in her repository.

Manny, the integration manager, merges the new film.yaml versions into the project-wide repository. He uses yamltodb to output the statements needed to upgrade to version 0.2:

$ git diff 0.1 0.2
diff --git a/film.yaml b/film.yaml
index 0633c1b..98f19c4 100644
--- a/film.yaml
+++ b/film.yaml
@@ -1,6 +1,10 @@
 schema public:
   table film:
     check_constraints:
+      film_length_check:
+        columns:
+        - length
+        expression: ((length > 0) AND (length < 10000))
       film_release_year_check:
         columns:
         - release_year
@@ -15,6 +19,10 @@ schema public:
     - release_year:
         not_null: true
         type: integer
+    - length:
+        type: smallint
+    - rating:
+        type: character(5)
     primary_key:
       film_pkey:
         access_method: btree
$ yamltodb moviesdev film.yaml
ALTER TABLE film
    ADD COLUMN length smallint;
ALTER TABLE film
    ADD COLUMN rating character(5);
ALTER TABLE film ADD CONSTRAINT film_length_check CHECK ((length > 0) AND (length < 10000));

Version 0.3

For the next round, Carol and Dave issue the statements shown under Version 0.3 in the previous article and commit newer versions of film.yaml which Manny integrates into the project repository. The yamltodb output against the project database is as follows:

$ yamltodb -1 film.yaml
BEGIN;
CREATE SEQUENCE category_category_id_seq
    START WITH 1
    INCREMENT BY 1
    NO MAXVALUE
    NO MINVALUE
    CACHE 1;
CREATE TABLE film_category (
    film_id integer NOT NULL,
    category_id integer NOT NULL,
    last_update timestamp with time zone NOT NULL);
CREATE TABLE category (
    category_id integer NOT NULL DEFAULT nextval('category_category_id_seq'::regclass),
    name character varying(25) NOT NULL,
    last_update timestamp with time zone NOT NULL);
ALTER SEQUENCE category_category_id_seq OWNED BY category.category_id;
ALTER TABLE category ADD CONSTRAINT category_pkey PRIMARY KEY (category_id);
ALTER TABLE film_category ADD CONSTRAINT film_category_pkey PRIMARY KEY (film_id, category_id);
ALTER TABLE film_category ADD CONSTRAINT film_category_category_id_fkey FOREIGN KEY (category_id) REFERENCES category (category_id);
ALTER TABLE film_category ADD CONSTRAINT film_category_film_id_fkey FOREIGN KEY (film_id) REFERENCES film (id);
COMMIT;

Summary

I cannot objectively review the tools I have created, but sincerely hope others will find them useful in managing changes to PostgreSQL databases in conjunction with a VCS.

Pyrseas dbtoyaml and yamltodb require Python, psycopg2 and the PyYAML library. They currently support the basic PostgreSQL features (schemas, tables, PRIMARY KEYs, FOREIGN KEYs, UNIQUE and CHECK constraints, indexes) but the intention is to expand them to cover all features.

The primary audience of the tools are DBAs and developers, but they can be used in conjunction with something like depesz Versioning to implement a stricter system for control over production databases.

I’d like to take this opportunity to thank Adam Cornett, an early adopter/tester, who found an issue with a FOREIGN KEY across schemas, which has been fixed in the master repository.

First release of Pyrseas

just a quick note to announce that the first release of Pyrseas is now available as follows:

Github

https://github.com/jmafc/Pyrseas: Git repository. You can report any issues there.

Python Package Index

http://pypi.python.org/pypi/Pyrseas:  Source tar archive.

PgFoundry

http://pgfoundry.org/projects/pyrseas/: Source tar archive.

Discussion

You can of course leave comments below, but there is also a general mailing list available at PgFoundry, which also hosts forums.

My deepest thanks to the PgFoundry team for all the support tools.

In addition there is a website which is currently mostly bare, but I hope to upload the documentation soon. Update: The initial documentation is now available here.

Pyrseas was developed on Debian Linux with PostgreSQL 8.4, and also tested against 9.0, using Python 2.6, psycopg 2.2 (also tested with 2.4), and PyYAML 3.09. I intend to test on Windows, but I’ll be thankful if some adventurous soul tries it out first.

SQL Database Version Control – apgdiff

In “Database Development: Table Structure Changes,” Kenneth Downs argued that build scripts to modify table structures “are usually the least tested part of the system.” He suggested using a single metadata file that describes the database and can be placed under version control, and a difference engine as the tool of choice to perform database upgrades.

Another PostgreSQL Diff Tool (aka apgdiff) is a utility that implements such an engine.  For its metadata, apgdiff takes two pg_dump -s output files, representing previous and new database schemas, and generates ALTER TABLE and other SQL statements needed to upgrade from previous to new.

In this installment of our series on using VCS tools while developing a SQL-based application, we’ll see how Carol and Dave can use apgdiff to control the changes described in Version Control, Part 2: SQL Databases.

Version 0.1

To start off, Carol creates the film table in her database, caroldev, using the CREATE TABLE script, version 0.1, which we saw in the previous article. She also creates a general development database, moviesdev, and issues the following commands:

$ pg_dump -s moviesdev > film0.sql
$ pg_dump -s caroldev > film.sql

The film0.sql file is essentially empty, consisting only of some SET, REVOKE and GRANT statements. The film.sql file includes statements to CREATE and ALTER the film table, as can be seen in the following diff:

$ diff -u film0.sql film.sql
--- film0.sql
+++ film.sql
@@ -9,6 +9,34 @@
 SET client_min_messages = warning;
 SET escape_string_warning = off;

+SET search_path = public, pg_catalog;
+
+SET default_tablespace = '';
+
+SET default_with_oids = false;
+
+--
+-- Name: film; Type: TABLE; Schema: public; Owner: carol; Tablespace:
+--
+
+CREATE TABLE film (
+    id integer NOT NULL,
+    title character varying(32) NOT NULL,
+    release_year integer NOT NULL,
+    CONSTRAINT film_release_year_check CHECK ((release_year >= 1888))
+);
+
+
+ALTER TABLE public.film OWNER TO carol;
+
+--
+-- Name: film_pkey; Type: CONSTRAINT; Schema: public; Owner: carol; Tablespace:
+--
+
+ALTER TABLE ONLY film
+    ADD CONSTRAINT film_pkey PRIMARY KEY (id);
+
+
 --
 -- Name: public; Type: ACL; Schema: -; Owner: postgres
 --

Then Carol can use apgdiff to generate the SQL statements needed to create the table in the development and production databases, e.g.,

$ apgdiff film0.sql film.sql
CREATE TABLE film (
 id integer NOT NULL,
 title character varying(32) NOT NULL,
 release_year integer NOT NULL
);

ALTER TABLE film
 ADD CONSTRAINT film_pkey PRIMARY KEY (id);

ALTER TABLE film
 ADD CONSTRAINT film_release_year_check CHECK ((release_year >= 1888));

In this initial phase, the diff and apgdiff outputs are not that … different. The value of apgdiff will be more obvious in subsequent versions.

As part of the 0.1 release process, Carol commits film.sql to the VCS (she could also commit film0.sql as the film.sql base version, if desired). She may also want to store the apgdiff output in the VCS, since the latter is what will actually be used to upgrade the production database(s).

Version 0.2

For the next version, Dave clones the VCS repository and uses the film.sql script or the saved apgdiff output to create the film table in his own database. He then issues an ALTER TABLE to add the length column, invokes pg_dump -s to create a new film.sql and commits that to his repository. Carol uses another ALTER TABLE to add the rating column in her database, and stores the new pg_dump output to her repository.

Manny, the integration manager, merges the new film.sql versions into the project-wide repository (he has to deal with table ownership statements but perhaps the pg_dump outputs can be stripped of such statements prior to committing them to the VCS). Manny then uses apgdiff to produce the statements needed to upgrade to version 0.2:

$ git diff 0.1 0.2
diff --git a/film.sql b/film.sql
index 799b418..db820bd 100644
--- a/film.sql
+++ b/film.sql
@@ -23,6 +23,9 @@ CREATE TABLE film (
 id integer NOT NULL,
 title character varying(32) NOT NULL,
 release_year integer NOT NULL,
+    length smallint,
+    rating character(5),
+    CONSTRAINT film_length_check CHECK (((length > 0) AND (length < 10000))),
 CONSTRAINT film_release_year_check CHECK ((release_year >= 1888))
 );

$ git checkout 0.1
$ cp film.sql film.sql-0.1
$ git checkout 0.2
$ apgdiff film.sql-0.1 film.sql

ALTER TABLE film
 ADD COLUMN length smallint,
 ADD COLUMN rating character(5);

ALTER TABLE film
 ADD CONSTRAINT film_length_check CHECK (((length > 0) AND (length < 10000)));

Version 0.3

For the next round, Carol and Dave issue the statements shown under Version 0.3 in the previous article and commit newer versions of film.sql which Manny then integrates into the project repository. The apgdiff output comparing the two versions is as follows:

$ apgdiff film.sql-0.2 film.sql

CREATE SEQUENCE category_category_id_seq
 START WITH 1
 INCREMENT BY 1
 NO MAXVALUE
 NO MINVALUE
 CACHE 1;

CREATE TABLE category (
 category_id integer DEFAULT nextval('category_category_id_seq'::regclass) NOT NULL,
 name character varying(25) NOT NULL,
 last_update timestamp with time zone NOT NULL
);

CREATE TABLE film_category (
 film_id integer NOT NULL,
 category_id integer NOT NULL,
 last_update timestamp with time zone NOT NULL
);

CREATE TABLE "language" (
 language_id integer NOT NULL,
 name character varying(20) NOT NULL,
 last_update timestamp with time zone NOT NULL
);

ALTER TABLE film
 ADD COLUMN language_id integer NOT NULL;

ALTER TABLE category
 ADD CONSTRAINT category_pkey PRIMARY KEY (category_id);

ALTER TABLE film_category
 ADD CONSTRAINT film_category_pkey PRIMARY KEY (film_id, category_id);

ALTER TABLE "language"
 ADD CONSTRAINT language_pkey PRIMARY KEY (language_id);

ALTER TABLE film
 ADD CONSTRAINT film_language_id_fkey FOREIGN KEY (language_id) REFERENCES language(language_id);

ALTER TABLE film_category
 ADD CONSTRAINT film_category_category_id_fkey FOREIGN KEY (category_id) REFERENCES category(category_id);

ALTER TABLE film_category
 ADD CONSTRAINT film_category_film_id_fkey FOREIGN KEY (film_id) REFERENCES film(id);

You may notice a small discrepancy when running the above (tested with PostgreSQL 8.4.5 and apgdiff 2.3) to upgrade a second database and then compare the output from the pg_dump -s commands. In the original database, the category_category_id_seq sequence is owned by category.category_id, but this is missing from the upgraded database.

Review

Apgdiff appears to be a useful tool to manage changes to PostgreSQL databases in conjunction with a VCS. It does require that you run pg_dump -s and that you store its output in the VCS. However, apgdiff could be used against a set of manually-maintained SQL scripts.

Apgdiff requires Java 1.6 and Debian, Ubuntu and Gentoo have native apgdiff packages. It supports several (but not all) PostgreSQL features. For example, it omits the TYPEs, DOMAINs and AGGREGATEs used in the Pagila sample database (it was able to recreate 14 out of 21 Pagila tables). It depends on parsing SQL DDL syntax —albeit not fully, so it is susceptible to syntax changes between PostgreSQL releases. It cannot deal with RENAMEs, e.g., if the rating column is renamed to mpaa_rating, apgdiff will drop the first and add the second—not ideal in a populated database.

Apgdiff does not operate on the actual database being upgraded but merely on textual representations. On the plus side, it can be used to generate backout scripts, e.g., by diffing film.sql version 0.3 vs. 0.2 you can generate statements to undo the changes from 0.2 to 0.3. The next tool we’ll review is particularly concerned with this aspect of the upgrade process.

Version Control, Part I: Pre-SQL

A version control system (VCS) such as Git or Subversion is an indispensable component of modern software projects. Many applications use a relational database management system (DBMS) such as PostgreSQL. While a DBMS facilitates an application’s data management tasks, it presents some version control challenges.

To understand the version control issues we’ll examine a project with two programmers, Alice and Bob, developing an application loosely based on the Pagila sample database. We’ll start our tour with an application predating the use of SQL.

When developing a file-based application in C, C++ or other programming languages, you typically define the data structures in your source code, e.g., in header files that are included or otherwise referenced by the application code.

Version 0.1

For example, a C application to maintain information about movies may start off by defining the following in a file named film.h:

typedef struct
{
    int id;
    char title[32];
    int release_year;
} FILM;

The typedef may represent the actual layout of the file records as well as the application buffers used to hold the data input by users or to be displayed on screen.  In addition, the application will need code to ensure the uniqueness of the film identifiers, e.g., to maintain a file index, and to validate the release year upon input, e.g., release_year >= 1888.

The film.h file and the rest of the application are submitted to a VCS and are released as version 0.1.

Version 0.2

For the next version, Bob adds short length; to the end of the FILM structure.  He also adds help text to tell users the new field is to capture the length of the movie in minutes and code to validate that length > 0 and length < 10000.

Separately, Alice adds char rating[6]; to the end of the struct, application help text explaining this is for the MPAA rating, and code to validate the input against a list of known values.

Finally, Bob and Alice submit their changes to the VCS and either he or she (or an integration manager) merge the changes and settle on the following structure:

typedef struct
{
    int id;
    char title[32];
    int release_year;
    short length;
    char rating[6];
} FILM;

The VCS allows Bob and Alice to keep track of changes to the FILM structure, e.g., to diff the tagged revisions of film.h:

$ git diff 0.1 0.2
diff --git a/film.h b/film.h
index 8670425..1df5aaa 100644
--- a/film.h
+++ b/film.h
@@ -5,4 +5,6 @@ typedef struct
     int id;
     char title[32];
     int release_year;
+    short length;
+    char rating[6];
 } FILM;

To allow users to migrate existing applications, a conversion program is also needed to reformat the data file.

Version 0.3

For the third revision, Alice adds a genre array to FILM, and a CATEGORY structure to allow users to maintain the allowed values for genres.  In the meantime, Bob adds a language field to the FILM structure, together with an application-maintained array of language codes and names.  Once merged and released, the structures look as follows (with comments):

typedef struct
{
    int id;            /* genre/category ID */
    char name[26];     /* genre name */
} CATEGORY;

typedef struct
{
    int id;            /* film ID */
    char title[32];    /* film title */
    int release_year;  /* release year */
    short length;      /* length in minutes */
    char rating[6];    /* MPAA rating */
    int categories[5]; /* film genres */
    int language;      /* spoken language code */
} FILM;

The conversion program has to be updated to migrate to the expanded structure, but the VCS still supports comparison of different revisions.  You can see it also allows each programmer to work independently on unrelated features and a distributed VCS such as Git can keep track of the merge actions, as seen below—version 0.2 was merged by the project manager whereas for version 0.3 Bob pulled Alice’s change before committing:

$ git log --pretty=format:"%h %s" --graph
* eb0a385 Add language.
* e6568c5 Add film genres.
*   f404fe2 Merge Alice's add rating.
|\ 
| * bdc8247 Add MPAA rating.
* | 5772f38 Add length.
|/ 
* 76ecd04 First version

With this preliminary flat-file scenario established, we’re ready to move to our next installment where we will contrast the above to the situation using a SQL DBMS.