Automated Database Augmentation

Suppose you have a PostgreSQL database like the Pagila sample with 14 tables, each with a last_update timestamp column to record the date and time each row was modified, and it is now a requirement to capture which user effected each change. Or perhaps you have several tables without such audit trail columns and need to add them quickly. Or maybe you have decided to denormalize your design by adding a calculated column, e.g., extended price = unit price times quantity ordered, or a derived column, e.g., carrying the customer name in the invoice table.

If you have some experience as a DBA, the word “drudgery” may have come to mind at the prospect of implementing the above features. It’s possible that, after a while, you’ve developed an approach for dealing with some of them but still wish there’d be some way to automate these thankless tasks.

You may have looked at the Andromeda project’s “automations” which provide some of these capabilities. However, in order to take advantage of the automations, you’ll first have to manually describe your database in a YAML format (and you’ll have to install Apache and PHP). Or you could have tried to use the follow-on project, Triangulum, but essentially you’d still have to create a YAML schema (no need for Apache, but you still need PHP).

Some relief is forthcoming. As a result of discussions resulting from my Business Logic in the Database post, I have been collaborating with Roger Hunwicks on a potential solution to these common DBA needs. The new Pyrseas tool is tentatively named dbextend1 and its initial documentation is available in the Pyrseas extender branch. This is how I envision dbextend being used.

Consider the opening example. The DBA would create a simple YAML file such as the (abbreviated) one below, listing the tables and the needed features:

schema public:
  table actor:
    audit_columns: default
  table category:
    audit_columns: default
  table store:
    audit_columns: default

The DBA would then use this file, say audext.yaml, as input to dbextend, e.g.,

dbextend pagiladb audext.yaml

dbextend reads the PostgreSQL catalogs (using code shared with dbtoyaml and yamltodb), building its internal representation. It also reads the YAML extensions file and builds a parallel (albeit much smaller) structure. Thirdly, it reads extension configuration information, e.g., a definition of what columns need to be added for “audit_columns: default“, for example, modified_timestamp and modified_by_user, what trigger(s) to add, and what function(s) to be created.

The output of dbextend is a YAML schema file, just like the one output by dbtoyaml, which can be piped directly to yamltodb to generate SQL to implement the desired features.

In case you’re wondering, dbextend —like other Pyrseas tools— will require Python, psycopg2 and pyyaml.

What features would you like to see automated? What are your suggested best practices for automating these common needs?

Picture credit: Thanks to Mr. O’Brien, a fourth-grade teacher in Minnesota.

1 We’re still receptive to some other suitable name.


Database Redesign and User Interface Refactoring

In this series about database user interfaces, until now I’ve focused on a single table with just three attributes. As I did with version control, I’ll be introducing additional entities and attributes and exploring how they could be presented or updated via the interface.

Here are some of the attributes we could add to our lonely film table:

  • a single-valued attribute
  • a multi-valued attribute with few values per entity
  • a multi-valued attribute with a potentially large number of values per entity
  • an audit trail attribute

Single valued attributes

Practically every film in existence has a primary spoken language (or intertitle language in the case of silent films). Rather than using the language table I showed before (patterned after Pagila), i.e., a smallint language_id, it is preferable to use a standard coding scheme, i.e., ISO 639. For purposes of this tutorial, ISO 639-1 will suffice. Another attribute we’ll add is the film running time in minutes.

Multi-valued attribute, small fan-out

Films are classified into one or more categories, usually identifying a genre, e.g., action, comedy, science fiction. Although there is no definitive list of genres and there are disagreements on categorizing any given film, there is a limited set of genres and most films fall into one or two categories and rarely more than six. Instead of the serial integer and separate film_category table used before (also as in Pagila), since we are using PostgreSQL, perhaps we’ll experiment with an array of varchar (or text) to list the categories in the film table.

Multi-valued attribute, large fan-out

Making a movie is of course a collective endeavor, employing several individuals as cast and crew (even student films). We’ll need a table to record their names and possibly other information and a second table to connect each film to those involved in it.


Each of these kinds of attributes present different issues for the user interface.

Assuming we limit ourselves to a short list of genres, we could use a drop-down list to allow the user to choose the categories for each film, but we have to cater for multiple items being selected.

We may allow the user to enter the film language using a two-letter code, but we probably want to display the full language name as confirmation.

On the other hand, we don’t want to force the user to remember (or even be aware of) person identifiers to enter the cast and crew, so some search capability seems necessary.

Refactoring the Interface

To explore these additions, we’ll go back to the minimalist command line interface. However, we’ll first refactor it with the knowledge we gained up to this point.

Instead of the bl.py module we had earlier, we’ll copy the bl/film.py module from the CherryPy interface. We’ll also replace the dblib.py module by the more robust version, but adding a DbConnection.close() method.

The major refactoring occurs in dbapp.py. First, we add a top level “menu” (only two options for now: Films and Quit) and split film-related functionality to a separate film.py module. The latter is patterned after the corresponding module in the CherryPy version. A brief review of the result may be useful:

  • The Film class (in bl/film.py) is the part that knows how to fetch from and update the film table
  • The FilmForm class is the component that obtains valid data from the user
  • The FilmHandler, for the most part, acts a controller, sequencing the various operations

The code is available on GitHub, tagged as v0.4.0.


Audit Trails

Back in February, I defined the film table version 0.1 with just three columns:

    title VARCHAR(32) NOT NULL,
    release_year INTEGER NOT NULL CHECK (release_year >= 1888)

It was unclear then how far I’d go in supporting all of PostgreSQL data definition features. Now, Pyrseas is making progress towards covering all those capabilities.

Consequently, I’d like to add one column to the above and tag it as version 0.1.1. This is the column to be added (from the Pagila sample database):

   last_update timestamp with time zone NOT NULL DEFAULT now()

With this new column, I’ll be able to explore another kind of field in the database user interface: one that will not be present on new records (because the DEFAULT will take care of filling in that data), but will be displayed, in readonly mode, for updates. Note: In Pagila, last_update is defined as timestamp without time zone, but that seems too ambiguous for my taste.

To complete this, we need an ON UPDATE trigger together with a function to be called by the trigger. These are also provided by Pagila, but to highlight some of the capabilities of Pyrseas, I’ll first show the output of dbtoyaml against a database with all the necessary objects:

language plpgsql:
  trusted: true
schema public:
  description: standard public schema
  function last_updated():
    language: plpgsql
    returns: trigger
    source: "\nBEGIN\n    NEW.last_update = CURRENT_TIMESTAMP;\n    RETURN NEW;\n\
      END "
  table film:
        - release_year
        expression: (release_year >= 1888)
    - id:
        not_null: true
        type: integer
    - title:
        not_null: true
        type: character varying(32)
    - release_year:
        not_null: true
        type: integer
    - last_update:
        default: now()
        not_null: true
        type: timestamp with time zone
        access_method: btree
        - id
        - update
        level: row
        procedure: last_updated()
        timing: before

If you feed this YAML to yamltodb against a database that already has the film table version 0.1, it generates the following (against a PostgreSQL 8.4 DBMS—vs. 9.0 the creation of PL/pgSQL is bypassed):

SET check_function_bodies = false;
CREATE FUNCTION last_updated() RETURNS trigger
    LANGUAGE plpgsql
    AS $_$
    NEW.last_update = CURRENT_TIMESTAMP;
END $_$;
    ADD COLUMN last_update timestamp with time zone NOT NULL DEFAULT now();
CREATE TRIGGER last_updated
    EXECUTE PROCEDURE last_updated();

Thus we can successfully upgrade the version 0.1 database to version 0.1.1, with our audit trail column.

Happy trails to you, ’till we meet again …


PostgreSQL Version Control Feature Coverage

In a comment to one of my early posts about version control, Peter Eisentraut stated he was “somewhat discouraged because [seeing how fast PostgreSQL develops and adds new features,] I don’t see how a tool like apgdiff can keep up and stay useful unless a lot more resources are put into it.” At the time, the first release of Pyrseas (0.1.0) was well over a month away. Currently, Pyrseas 0.3.0 is nearing release . This seems like a good time to summarize where Pyrseas stands vis-à-vis the list of PostgreSQL data definition features:

DDL Feature 0.1 0.2 0.3

Note that PosgreSQL 9.1 will add EXTENSION to the list above. Note also that although the links in the matrix go to the corresponding CREATE object documentation page, Pyrseas supports CREATE, CREATE OR REPLACE, DROP and ALTER, as applicable. A final note: an “X” in a particular release doesn’t imply that the feature is supported in all its syntactic glory.

Caveats aside, progress has been made. Pyrseas 0.1.0 supported most of the basic relational database objects and concepts present in the SQL Standard, and usually covered by generic version control tools. Pyrseas 0.3.0 will support the features showcased in the Pagila sample database, which I hope is somewhat representative of the average PostgreSQL database (in terms of DDL features, of course).

The goal for the 0.4.0 release will be to cover the remaining features excluding Text Search, SQL/MED and ROLE/GROUP, which will be tackled in 0.5.0 or later. Is there a feature you or your organization is interested in? Let me know in the comments.


Gearing up for a second release

The first release of Pyrseas was aimed at addressing the essential version control issues described in my first posts. The second release, somewhat accidentally, has focused on being able to recreate the PostgreSQL autodoc regression database. In addition, Pyrseas will be released on the PostgreSQL Extension Network (PGXN).

Pyrseas already supports the basic schema, table, column, primary key, foreign key and index CREATE and ALTER operations provided by PostgreSQL (and every relational DBMS worthy of that designation). Here is a summary of PostgreSQL features that will be supported in the upcoming release:

  • COMMENTs on schemas, tables, columns and functions.
  • ALTER TABLE RENAME COLUMN and enhanced support for other ALTER object RENAME statements.
  • VIEWs
  • INHERITed tables, and by extension, partitioned tables.

Support for some of these features is still elementary. However, the Pyrseas utilities are now able to recreate the autodoc database and revert back to an empty database, i.e., they also know how to drop each object in the correct order.

The Pyrseas unit tests have also been enhanced so they can be run against different PostgreSQL installations. For example, using the PYRSEAS_TEST_PORT environment variable I was able to run the tests against the 9.1 Beta 1 release, as well as 9.0 and 8.4.

Looking forward, the subsequent release will tackle being able to recreate the Pagila sample database, thus adding support for TYPEs, DOMAINs, AGGREGATEs, TRIGGERs and RULEs, and improving support for existing features.

Update: Pyrseas 0.2.0 has been released and is now available via PyPI, PGXN, PgFoundry and from the GitHub repository.


Version Control, Part I: Pre-SQL

A version control system (VCS) such as Git or Subversion is an indispensable component of modern software projects. Many applications use a relational database management system (DBMS) such as PostgreSQL. While a DBMS facilitates an application’s data management tasks, it presents some version control challenges.

To understand the version control issues we’ll examine a project with two programmers, Alice and Bob, developing an application loosely based on the Pagila sample database. We’ll start our tour with an application predating the use of SQL.

When developing a file-based application in C, C++ or other programming languages, you typically define the data structures in your source code, e.g., in header files that are included or otherwise referenced by the application code.

Version 0.1

For example, a C application to maintain information about movies may start off by defining the following in a file named film.h:

typedef struct
    int id;
    char title[32];
    int release_year;

The typedef may represent the actual layout of the file records as well as the application buffers used to hold the data input by users or to be displayed on screen.  In addition, the application will need code to ensure the uniqueness of the film identifiers, e.g., to maintain a file index, and to validate the release year upon input, e.g., release_year >= 1888.

The film.h file and the rest of the application are submitted to a VCS and are released as version 0.1.

Version 0.2

For the next version, Bob adds short length; to the end of the FILM structure.  He also adds help text to tell users the new field is to capture the length of the movie in minutes and code to validate that length > 0 and length < 10000.

Separately, Alice adds char rating[6]; to the end of the struct, application help text explaining this is for the MPAA rating, and code to validate the input against a list of known values.

Finally, Bob and Alice submit their changes to the VCS and either he or she (or an integration manager) merge the changes and settle on the following structure:

typedef struct
    int id;
    char title[32];
    int release_year;
    short length;
    char rating[6];

The VCS allows Bob and Alice to keep track of changes to the FILM structure, e.g., to diff the tagged revisions of film.h:

$ git diff 0.1 0.2
diff --git a/film.h b/film.h
index 8670425..1df5aaa 100644
--- a/film.h
+++ b/film.h
@@ -5,4 +5,6 @@ typedef struct
     int id;
     char title[32];
     int release_year;
+    short length;
+    char rating[6];
 } FILM;

To allow users to migrate existing applications, a conversion program is also needed to reformat the data file.

Version 0.3

For the third revision, Alice adds a genre array to FILM, and a CATEGORY structure to allow users to maintain the allowed values for genres.  In the meantime, Bob adds a language field to the FILM structure, together with an application-maintained array of language codes and names.  Once merged and released, the structures look as follows (with comments):

typedef struct
    int id;            /* genre/category ID */
    char name[26];     /* genre name */

typedef struct
    int id;            /* film ID */
    char title[32];    /* film title */
    int release_year;  /* release year */
    short length;      /* length in minutes */
    char rating[6];    /* MPAA rating */
    int categories[5]; /* film genres */
    int language;      /* spoken language code */

The conversion program has to be updated to migrate to the expanded structure, but the VCS still supports comparison of different revisions.  You can see it also allows each programmer to work independently on unrelated features and a distributed VCS such as Git can keep track of the merge actions, as seen below—version 0.2 was merged by the project manager whereas for version 0.3 Bob pulled Alice’s change before committing:

$ git log --pretty=format:"%h %s" --graph
* eb0a385 Add language.
* e6568c5 Add film genres.
*   f404fe2 Merge Alice's add rating.
| * bdc8247 Add MPAA rating.
* | 5772f38 Add length.
* 76ecd04 First version

With this preliminary flat-file scenario established, we’re ready to move to our next installment where we will contrast the above to the situation using a SQL DBMS.