The Future of Pyrseas: Part 2

When I started working on Pyrseas, I reviewed several other products. Robert Brewer’s Post Facto was probably the one with the most unique design. Although it compared database schemas in order to generate SQL to synch them up, it did not store database object definitions in a standard VCS repository. Rather, it used a Postgres database as the repository analog.

While Post Facto’s design certainly influenced Pyrseas, there is one aspect of  the former that, unfortunately, I did not emulate.

The Dependables

As any developer knows, database objects have dependencies on each other: table A has a primary key PK1, table B is declared with a foreign key dependent on PK1, function C is dependent on type X, view D is based on table A and includes a call to function C.

Pyrseas currently deals with these dependencies in an object-specific manner. For example, it does at least two passes through pg_class objects (tables, views, sequences, etc.) in order to create, alter or drop these objects in the correct order. However, this ad hoc approach can result in incorrect sequencing of generated SQL statements in some cases, particularly those like view D above.

The missing feature from Post Facto that avoids this conundrum? If you answered topological sort you were obviously paying attention in your Algorithms class. If you didn’t, may I suggest chapter 15, “Devising and engineering an algorithm: Topological Sort” of Bertrand Meyer’s Touch of Class.

Daniele’s Quest

Over two years ago, someone opened an issue about the need to create primary keys before creating views. Later, Daniele Varrazzo reported another issue with dependencies.

Many of you Postgres users will recognize Daniele as the maintainer of Psycopg, the popular Python PG adapter, which of course is used by Pyrseas.  Daniele and I chatted online, I mentioned Post Facto’s solution and he, fortuitously and generously, started implementing a topological sort on a deptrack branch of Pyrseas.

We then collaborated for about eight months. He did most of the initial coding and I ran tests and fixed some issues. Unfortunately, Daniele is very busy, with a full-time job, Psycopg and other interests, so the work came to a near standstill.

Where We Stand

The last time changes were submitted to the deptrack branch, about six months ago, only four tests failed (out of over 600) running on both Python 2.7 and 3.4 against Postgres 9.3. Regrettably, three of those tests are integration and functional tests, so correcting those is critical to adding this feature.

In addition, although most tests complete successfully, the run times have been impacted severely. This will require an effort at re-optimizing performance before releasing the changes. Last but not least, the implementation needs internal documentation so that it can be properly maintained.

Sadly, I have not had much time or incentives to address these shortcomings. Are there any Pyrseas, Postgres or Python enthusiasts looking for a challenge?


SQL Version Control Implementation Choices II

In the previous post, I discussed two alternative implementations for the yamltodb difference engine: SQL comparisons using extra dictionary tables, and comparing the YAML/JSON maps (Python dicts). Today I’ll review the last option: internal structure comparisons.

Internal Structures – Take 1

The original implementation for yamltodb was based on the implementation of dbtoyaml. Since the YAML maps output by the latter were hierarchical, the internal structures were also organized in layers: a Python class instance (Database) to represent the database, having a dict of classes representing the schemas, each in turn with a dict of “schema objects,” i.e., tables, sequences, etc. Each layer was populated by queries against the corresponding PostgreSQL catalogs.

This organization was also used for the difference engine. A second Database instance represented the YAML input. The two hierarchies were then traversed and compared at each layer, outputting SQL as differences were found.

This approach served to flesh out the SQL generation code and to deal with various feature tests, e.g., creation of a schema, creation of a table, adding columns to a table, and even RENAMEing schemas and tables. However, it couldn’t easily handle FOREIGN KEYs and SEQUENCEs owned by tables (resulting from a SERIAL data type).

Internal Structures – Take 2

To tackle the various dependencies, I refactored the code to use structures somewhat along the lines of Post Facto. The Database class now holds objects derived from class DbObjectDict which corresponds roughly to a single PostgreSQL catalog. Thus, SchemaDict is nearly equivalent to pg_namespace, ClassDict to pg_class and ConstraintDict to pg_constraint. Each one of those classes is also a Python dict and holds specialized objects, e.g., ClassDict holds objects (derived from DbClass) of class Sequence, Table or View (for the time being, class Index is handled separately).

Each XxxDict class is hashed by a Python tuple (an immutable sequence), paralleling the underlying catalog UNIQUE keys. After the dicts are populated, either from the catalogs or from the YAML map, methods are called to link related objects, e.g., columns to tables, foreign keys to the primary key tables, etc.

Note that Pyrseas implements dual sets of classes: collections (dicts) of objects and classes for individual object types. In constrast Post Facto has a single set of objects and uses a Python classmethod to fetch the catalog information. I found the dual sets more convenient to deal with certain implementation issues.

In order to generate the SQL in the correct sequence considering inter-object dependencies, Post Facto does a topological sort of its objects. Influenced by pg_dump, I preferred a lazier approach.  The diff_map methods of Database and lower level classes generate the SQL in a stratified manner with deferred actions.

For example, since one cannot drop tables or primary keys if foreign keys still refer to the former, ClassDict‘s diff_map, upon finding a table has been dropped (it doesn’t exist in the input YAML map), only marks it for dropping. Then it has a pass to drop foreign keys and views, followed by another pass to drop other constraints and indexes and finally the tables themselves.

As a second illustration, I recently added support for inherited tables (and by extension, partitioned tables). Since PostgreSQL allows for multiple inheritance, creating (and dropping) tables in an inheritance “network” requires special care. My solution: the first pass stacks children tables and only creates “root” tables, while the second pass iterates over the stack to create the rest in the right order.

I’d love to hear feedback on this or from anyone who has experimented with the Pyrseas utilities.


SQL Database Version Control – Post Facto

In this series on SQL database version control solutions, so far we’ve seen a difference engine and some change control aids. The difference tool, apgdiff, suggests you need to store schema listings, i.e., the output from pg_dump -s, in your VCS repository. The change control tools, depesz Versioning and Liquibase (also dbdeploy), instead imply you keep the database change scripts (or XML specifications) in the repository. Of course, you may also want to store the generated SQL scripts in the first case, as well as the pg_dump -s output in the latter.

Today’s tool is probably unique. It’s a bit like saying: if the database will not come to version control, version control must go to the database. Post Facto builds a version control repository on top of a normal PostgreSQL database.

How can our DBAs Carol and Dave use Post Facto to implement the changes described in Version Control, Part 2: SQL Databases? Note: Unfortunately, the source code for Post Facto has not been updated for 15 months and it is still listed in alpha status. I do have the source, having retrieved it several months ago, but you may get errors checking it out if you use a recent Subversion client (e.g., svn 1.6.12). I briefly tried out Post Facto in the past, but this post may be less definitive than previous ones.

Version 0.1

For the first release, the manager Manny needs to create two databases: a postfacto database and a repodb database. The first name is fixed, but you can choose the second. Post Facto will create these databases at some point if it doesn’t find them, but by creating them ahead of time he can presumably assert some control over the process. Carol has to create her caroldev database and Manny can initialize the repository with the following command:

$ pf import caroldev postfacto://manny@localhost/repodb/movies/trunk
Committed revision 1.

Carol can then issue the CREATE TABLE film statement as shown in the original article, and commit her changes as follows:

$ pf add caroldev "*"
Added 5 objects:
  constraints: public.film
  indexes: public.film_pkey
  tables: public.film
  types: public._film, public.film
$ pf commit  --message "Add film table" caroldev
Committed revision 2.

Dave can check out version 0.1 as follows:

$ pf checkout postfacto:///repodb/movies/trunk davedev
Checked out revision 2.

Behind the scenes, Post Facto has created two additional databases, postfacto_w_caroldev and postfacto_w_davedev, the “working copies” for Carol and Dave, respectively. To install the release in production or in a QA environment, you seemingly need to checkout from the repodb as well, probably requiring remote access to the machine holding the repository.

Technical Note 1: You may need to change max_prepared_transactions, e.g., set it to 3, in postgresql.conf (requires restart) for the checkout to work. Technical Note 2: the pf commands have to be issued under a superuser role (or someone with SELECT permission on pg_authid).

Version 0.2

The next step would be for Dave to issue the ALTER TABLE statement to add the length column, as shown previously. Dave would then add and commit his changes with the following:

$ pf add davedev "*"
Added 0 objects:
$ pf commit -m "Add length column" davedev
Committed revision 3.

Similarly, Carol can issue her ALTER TABLE statement to add the rating column and commit her changes using a similar set of commands. The pf log command shows what transpired so far:

$ pf log caroldev
Rev     Who        Created             Message
4       carol      2011-03-07 10:31:55 Add rating
3       dave       2011-03-07 10:22:42 Add length column
2       carol      2011-03-07 09:23:41 Add film table
1       manny      2011-03-07 09:19:58 Initial import.

If Manny were to check out postfacto:///repodb/movies/trunk into a new database, say for QA purposes, he would see the table with the two new columns. However, I have not been able to merge or update the changes so that Carol and Dave can see the merged table. The command:

$ pf merge postfacto:///repodb/movies/trunk@4 davedev
Merging postfacto:///repodb/movies/trunk@4 into davedev... done.

says it has merged the change at revision 4 into davedev but psql shows that is not the case. The command pf update, which takes a database name and an optional revision number, fails with a Python error: ValueError: too many values to unpack (which I have not investigated further). It appears the only alternative is to use pf checkout --force to force each database to be recreated with the complete new schema.

Version 0.3

The process for subsequent releases is essentially the same. Each developer or DBA makes changes to his or her database and commits them to the repository. Ideally, changes would then be merged so they could continue working with an up-to-date schema.


Post Facto was developed by Robert Brewer while he was at Etsy. He gave a presentation about it at PGCon 2009. In the first 15 minutes of the audio he gives a very useful summary of the typical problems associated with propagating changes to databases and some common solutions.

Post Facto is written in Python (requires 2.3 or higher, but has not been ported to 3.x) and uses pyPgSQL to access PostgreSQL (up to version 8.3). Since it works directly with the catalogs, AFAICT it supports nearly all PostgreSQL features, up to the supported release. As can be seen above, its user inteface was designed to resemble the centralized VCS Subversion.

Regrettably, as noted earlier, Post Facto appears to be an inactive project. Although pyPgSQL is still available its last release was made in June 2006. Aside from the PgCon presentation, a brief description in the project’s home page, and pf‘s usage message its documentation is non-existent.

Post Facto’s concept of using the database as the VC repository is novel. However, it may be argued that doing so separates the database “code,” i.e., the logical structures, from the related application code which still resides in a conventional VCS.