2

The Future of Pyrseas: Part 2

When I started working on Pyrseas, I reviewed several other products. Robert Brewer’s Post Facto was probably the one with the most unique design. Although it compared database schemas in order to generate SQL to synch them up, it did not store database object definitions in a standard VCS repository. Rather, it used a Postgres database as the repository analog.

While Post Facto’s design certainly influenced Pyrseas, there is one aspect of  the former that, unfortunately, I did not emulate.

The Dependables

As any developer knows, database objects have dependencies on each other: table A has a primary key PK1, table B is declared with a foreign key dependent on PK1, function C is dependent on type X, view D is based on table A and includes a call to function C.

Pyrseas currently deals with these dependencies in an object-specific manner. For example, it does at least two passes through pg_class objects (tables, views, sequences, etc.) in order to create, alter or drop these objects in the correct order. However, this ad hoc approach can result in incorrect sequencing of generated SQL statements in some cases, particularly those like view D above.

The missing feature from Post Facto that avoids this conundrum? If you answered topological sort you were obviously paying attention in your Algorithms class. If you didn’t, may I suggest chapter 15, “Devising and engineering an algorithm: Topological Sort” of Bertrand Meyer’s Touch of Class.

Daniele’s Quest

Over two years ago, someone opened an issue about the need to create primary keys before creating views. Later, Daniele Varrazzo reported another issue with dependencies.

Many of you Postgres users will recognize Daniele as the maintainer of Psycopg, the popular Python PG adapter, which of course is used by Pyrseas.  Daniele and I chatted online, I mentioned Post Facto’s solution and he, fortuitously and generously, started implementing a topological sort on a deptrack branch of Pyrseas.

We then collaborated for about eight months. He did most of the initial coding and I ran tests and fixed some issues. Unfortunately, Daniele is very busy, with a full-time job, Psycopg and other interests, so the work came to a near standstill.

Where We Stand

The last time changes were submitted to the deptrack branch, about six months ago, only four tests failed (out of over 600) running on both Python 2.7 and 3.4 against Postgres 9.3. Regrettably, three of those tests are integration and functional tests, so correcting those is critical to adding this feature.

In addition, although most tests complete successfully, the run times have been impacted severely. This will require an effort at re-optimizing performance before releasing the changes. Last but not least, the implementation needs internal documentation so that it can be properly maintained.

Sadly, I have not had much time or incentives to address these shortcomings. Are there any Pyrseas, Postgres or Python enthusiasts looking for a challenge?

2

The Future of Pyrseas: Part 1

In the early days of the Pyrseas project, I read about some open source projects being abandoned due to their developers losing interest or perhaps becoming involved elsewhere and thought to myself “That could never happen to me.”

Sadly, this blog has remained silent for over two years, and I haven’t done substantive development work since last September.

Still, some remain interested in Pyrseas, if gauged by the latest two issues: inquiring about consolidating/updating our documentation and porting dbtoyaml/yamltodb to other databases. So it’s appropriate that we discuss in what directions the project may turn.

Some background first.

Pyrseas: Origins

Pyrseas was born from my experience with making structural changes to SQL databases, version control systems, the typical incompatibility between the two and the solution suggested by Andromeda.

Andromeda had a fairly simple concept: describe your database using YAML and use that to drive the process of generating SQL to make structural modifications (also, use the YAML spec to produce a basic CRUD app to make content changes—see Part 3).

The Pyrseas innovation was: rather than manually edit a YAML spec, why not create it from the Postgres catalogs? In addition, instead of using the information_schema catalog views so that the process could be nominally portable to a few other SQL databases, we opted to query the internal catalogs directly.

The Imaginarium of Peter Eisentraut

In response to a post regarding another tool that diff’d database versions, Core Team member Peter Eisentraut commented that something that “doesn’t support all PostgreSQL features” is not helpful “to help you manage your database upgrades, because it might miss important details.”

That made us reconsider the scope of Pyrseas which initially was to be limited to tables, views and perhaps functions. We decided to address the vast array of Postgres features and some releases later we managed to achieve that goal, for the most part.

A post about the proper place of business logic then led to a collaboration with Roger Hunwicks to augment the Pyrseas tools. Another discussion with Josep Martínez resulted in a preliminary capability to copy predefined data into the database.

Lilo & Sqitch (or Is diffing database versions sufficient?)

Although my Pyrseas development activity has been limited recently, I’ve continued to use Pyrseas to assist in my DBA and database development tasks. I’ve thus come to the conclusion that: no, in the most general case, diffing database versions is not sufficient to manage structural changes.

A simple example based on my original proof-of-concept schema can serve to validate this assertion. Suppose we wanted to modify the release_year column to release_date and use a DATE datetype. The RENAME may be made to work but the datatype change will require some specialized handling (e.g., an external script) that cannot be codified in a generalized “difference” approach.

When I first reviewed the other SQL schema versioning tools, Liquibase was the main exponent of the non-differencing camp (and it still seems to be going strong). About a year after that, David Wheeler came out with the first version of Sqitch and thanks to David I was able to experiment with it.

My main objection to Sqitch and Liquibase is that for the most common use cases, e.g., add a table, add a column, rewrite a view or function, etc., it seems unproductive for a DBA or developer to do the work more than once, i.e., in a development database, either via psql, using a design tool such as PgAdmin or, in some simple cases, even by editing the YAML spec directly. Recreating the SQL DDL and applying it –in the correct order– to another development, test or production database should be automated. The generated SQL should take into consideration the state of the target and, if it becomes necessary, should include generating SQL to backout changes, e.g., drop table, alter table drop column, etc., without any extra DBA or developer intervention.

Conclusion

My proposal to address the insufficiency of diffing database versions is to incorporate some of the ideas of the non-differencing approaches into yamltodb. The solution may be similar to what was done for dbaugment, i.e., add specialized scripts or configuration files that can control the additional processing. Admittedly, this is still very vague and will probably be third in terms of priorities, although I chose to discuss it first.

1

Pyrseas contributions solicited

Do you use PostgreSQL and truly believe it’s “the world’s most advanced open source database” and that its upcoming 9.3 release will make it even more awesome?

Do you also use Python and believe it’s “an easy to learn, powerful programming language” with “elegant syntax” that makes it an ideal language for developing applications and tools around PostgreSQL, such as Pyrseas?

Then we could use your help. For starters, we want to add support for the MATERIALIZED VIEWs and EVENT TRIGGERs coming up in PG 9.3.

We have also been requested to add the capability to load and maintain “static data” (relatively small, unchanging tables) as part of yamltodb, so that it can be integrated more easily into database version control workflows.

And for the next release, Pyrseas 0.7, we’d like to include the first version of the database augmentation tool which will support declarative implementation of business logic in the database–starting off with audit trail columns. Some work has been done on this already, but it needs integration with the current code and tests.

Or perhaps coding is not your forte, but you’re really good at explaining and documenting technical “stuff”. Then you could give us a hand with revamping the docs, maybe writing a tutorial so that users have a smooth ride using our tools.

Or maybe you have your own ideas as to how improve the PostgreSQL version control experience. We’d love to hear those too.

If you’d like to help, you can fork the code on GitHub, join the mailing list and introduce yourself, or leave a comment below.

0

A couple of Pyrseas enhancements

Based on feedback from users and contributors, Pyrseas now sports two enhancements.

Multi-line String Formatting

Up to Pyrseas 0.6, long textual elements such as view definitions, function source text and long object comments, would usually be shown in the YAML output as quoted strings with embedded newlines. Here are two examples from the autodoc database:

schema product:
  description: "This schema stores a list of products and information\n about the\
    \ product"
...
schema warehouse:
  view products:
    definition: " SELECT DISTINCT product.product_id, product.product_code, product.product_description\n\
      \   FROM warehouse.inventory\n   JOIN product.product USING (product_id);"

As you can imagine, this was particularly unsatisfactory for complex functions and views. Thanks to preliminary work by Andrey Popp, Pyrseas 0.7 will be able to format these elements in YAML block style. The above elements will be shown as follows:

schema product:
  description: |-
    This schema stores a list of products and information
     about the product
...
schema warehouse:
  view products:
    definition: |2-
       SELECT DISTINCT product.product_id, product.product_code, product.product_description
         FROM warehouse.inventory
         JOIN product.product USING (product_id);

Thanks to testing by Josep Martínez, 0.7 will also properly display and handle such strings even when they include non-ASCII characters such as accented characters. For example, in 0.6, “Martínez” would be shown as “Mart\xEDnez”. In 0.7, the output will be the original UTF-8 string.

Directory of Database Objects

Pyrseas 0.6 has a single format for output by dbtoyaml or input into yamltodb: a single YAML-formatted file. This becomes a problem when your database has hundreds or more tables, functions, etc (let alone 409,994 tables and counting!). Furthermore, as dbtoyaml and yamltodb are intended to assist with database version control, your team may want to store individual object specifications in your version control system, or you may want to diff individual objects.

The 0.7 --directory option to dbtoyaml and yamltodb allows you to split the YAML spec into multiple files in a directory (or folder) tree. For example, using the dbtoyaml -d option on the autodoc database results in the following tree (shown under Linux using ls -RF):

.:
schema.inherit/      schema.public/      schema.warehouse/
schema.inherit.yaml  schema.public.yaml  schema.warehouse.yaml
schema.product/      schema.store/
schema.product.yaml  schema.store.yaml

./schema.inherit:
table.tab1b.yaml  table.tab1.yaml  table.taba.yaml  table.tabb.yaml

./schema.product:
function.worker.yaml  sequence.product_product_id_seq.yaml  table.product.yaml

./schema.public:

./schema.store:
sequence.store_store_id_seq.yaml  table.inventory.yaml  table.store.yaml

./schema.warehouse:
function.worker.yaml                      table.warehouse.yaml
sequence.warehouse_warehouse_id_seq.yaml  view.products.yaml
table.inventory.yaml

As can be seen, each schema gets its own directory wherein are stored each object belonging to that schema. In addition to schemas, the root level also stores non-schema owned objects such as foreign data wrappers and extensions (the latter can be placed in a schema, but are not owned by it).

The directory tree and multi-line string formats are still under review, so I’d like to encourage you to test both enhancements and provide feedback.

0

PostgreSQL Indexes on Expressions

Pyrseas had its first release a little over a year ago and we now have our first backward compatibility issue. The first release included basic support for traditional indexes, i.e., one or more key columns. For example, given a table test1 with columns col1, col2 and col3, and an index on the last two, dbtoyaml would show (some details omitted):

table test1:
  columns:
  - col1
  - col2
  - col3
  indexes:
    test1_idx:
      columns:
      - col2
      - col3

One of the first issues reported the lack of support for “functional” indexes. I added that but unfortunately, didn’t realize that one can have more than one function or expression and even mix regular columns with expressions. Thus the support was limited to a single expression. Given the first example in the “Indexes on Expressions” documentation, dbtoyaml would show:

table test1:
  columns:
  ...
  indexes:
    test1_lower_col1_idx:
      expression:
        lower(col1)

The original issue was recently re-opened (thanks, Roger) to point out the deficiencies. A fix has been pushed. Thus in the next release, dbtoyaml will support indexes with multiple expressions and even combinations of functions and regular columns. Here is a weird example using the first table. Given CREATE INDEX test1_idx ON test1 (btrim(col3, 'x') NULLS FIRST, col1, lower(col2) DESC), dbtoyaml now outputs:

table test1:
  indexes:
    test1_idx:
      access_method: btree
      keys:
      - btrim(c3, 'x'::text):
          nulls: first
          type: expression
      - col1
      - lower(col2):
          order: desc
          type: expression

So instead of ‘columns’ (or ‘expression’), dbtoyaml outputs ‘keys’. Any key that is an expression is marked with the ‘type’ qualifier.To allow for backward compatibility, yamltodb will continue to accept ‘columns’, so existing YAML specs  with traditional indexes won’t need to be changed. However, if you have an index using an expression, you’ll have to edit as seen above.

Do you have an unusual index?  Try dbtoyaml (from GitHub) on it and let us know if it works (or not).

0

PostgreSQL Extensions and Pyrseas

Prompted by Peter Eisentraut’s blog post, I’ve finished adding support for PG 9.1 EXTENSIONs to the Pyrseas dbtoyaml and yamltodb utilities. For now, this is only available on GitHub.

In order to deal with procedural languages, which are now created as extensions, the utilities now fetch the pg_catalog schema (previously deemed uninteresting for the purpose of version control).  The output of dbtoyaml from a freshly created 9.1 database (assuming no customizations via template1) is now:

schema pg_catalog:
  description: system catalog schema
  extension plpgsql:
    description: PL/pgSQL procedural language
    version: '1.0'
schema public:
  description: standard public schema

This could be changed easily to exclude pg_catalog (which will now also appear against 8.4 and 9.0 databases) before the next Pyrseas release. Update: The pg_catalog schema will now only be shown if it has something other than a description.

I’m hoping some brave, adventurous or simply interested souls will help test the additions.  Please report any issues on GitHub.

0

Pyrseas PostgreSQL features: feedback requested

I’ve been considering the missing features of dbtoyaml/yamltodb.  Two of those are PG 9.1 features:  COLLATIONs and EXTENSIONs.  I plan to cover them eventually, but I think I ought to deal first with the remaining pre-9.1 features.

ROLEs (as well as USERs and GROUPs) and TABLESPACEs are not output by pg_dump (the equivalent of dbtoyaml), only by pg_dumpall.  I’m thinking that if I were to add support for ROLEs and TABLESPACEs I’d probably do it with a --cluster option to dbtoyaml, and the output would be something like the following:

database postgres:
  role one:
    createdb: true
    login: true
  role grp:
    roles:
      - one
 tablespace dataspace:
    location: /data/db

This approach could, in theory, produce output for all databases in a cluster, i.e., the databases would be the top nodes in the YAML spec, rather than the schemas as is normally the case. In other words, it would be the equivalent of pg_dumpall --schema-only. However, I suspect that few persons would be interested in that, at least for version control purposes—since different databases may belong to different projects.

On the other hand, I believe DBAs may want dbtoyaml to include “owner” and privlege (GRANT) information. David Fetter specifically asked for GRANTs saying they would be “handy for deployments.”

Owner and privilege information could be shown as follows:

schema public:
  table film:
    owner: jma
    privileges:
      admin:
        - insert
        - update
      jma:
        - all
      viewer:
        - select

An open question is whether some list of roles is necessary, aside from the object-level information.

I’d appreciate readers taking a couple of minutes to leave feedback on any of the above points, particularly on how important they think the features are in their day-to-day work.

1

Pyrseas/PostgreSQL Feature Matrix

In my last post, I wrote—referring to the state of Pyrseas after version 0.5 is released:

The only gaps left are TABLESPACE, GROUP/ROLE and the EXTENSIONs added in PG 9.1.

I’m afraid I should’ve double checked the list of 9.1 SQL CREATE statements. I missed COLLATIONs. I’ve created a new page, Feature Matrix, that shows the correct picture, which will be updated as subsequent releases are made.

4

More Database Tools?

It’s been over year since I started blogging on these pages about Pyrseas and version control. In a month it will also be the first anniversary of the initial commit to GitHub. Much code and many words have flown under these “bridges,” so this seems an appropriate time to reflect.

When I discovered Andromeda, I was looking for a framework to do simple (CRUD-type) database updates

  • with more flexibility (read, programability) than pgAdmin or phpPgAdmin
  • without being tied to an object-relational mapper, either built-in as in Django or external as SQLAlchemy (Pylons/Pyramid)
  • without having to write repetitive code, either SQL or ORM.

Andromeda appeared to satisfy these objectives (although I wasn’t thrilled about having to customize it in PHP).

When I conceived dbtoyaml, I was being lazy: reacting to Andromeda’s requirement to handcraft a YAML description of a database before I could use it to manage SQL changes to it. I thought: why not create the YAML from the database catalogs?

Since my concept for a YAML database specification didn’t match well to Andromeda’s, that led to yamltodb, my attempt to recreate the SQL “diff’ing” features of Andromeda in Python. Andromeda did it using the information_schema catalogs, which made it portable to other DBMSs that had those. Andromeda also did the comparisons by issuing SQL queries (which didn’t perform well). I chose to use the pg_catalog tables and did the comparisons directly on Python structures.

At first, I had intended to only diff schemas and tables and not much more, since that sufficed for my purposes. However, Peter Eisentraut’s comment eventually convinced me that Pyrseas had to support ALL PostgreSQL DDL features1. I’m very pleased with what was accomplished. Pyrseas 0.5, to be released shortly2, will add support for TEXTSEARCH and FOREIGN DATA WRAPPER related objects. The only gaps left are TABLESPACE, GROUP/ROLE and the EXTENSIONs added in PG 9.1.

2012 brought another turn of events. My post on the controversy between Chris Travers and Tony Marston on whether business logic ought to reside in the database led to collaboration with Roger Hunwicks to create dbextend, a tool to automate database augmentation. A first submission was made and work continues on that front.

The latter effort raises other possibilities. For example, since yamltodb already knows how to create nearly all PG objects, it would be trivial to create a schemadump utility (equivalent to pg_dump -s). Another potential tool of interest to PostgreSQL advocates: dbtoyaml for other databases (mytoyaml, oratoyaml anyone?) together with a conversion utility that operates on the YAML specification so it can be accepted by the PG-only yamltodb (the YAML converter seems should be easier than editing SQL statements). The YAML/JSON output from dbtoyaml is amenable to other analysis or automation tasks.

I hope to get back to my database user interface “dream” … one of these days, but in the meantime, I’m glad for having taken these detours. I’d like to thank those who helped along the way: Josh Berkus, Robert Brewer, Adam Cornett, Ronan Dunklau, Peter Eisentraut, David Fetter, Dickson Guedes, Matthias Howell, Roger Hunwicks, Toon Koppelaars, Marko Kreen, Fabrízio Mello, Regina Obe, Filip Rembialkowski, Dariusz Suchojad, Daniele Varrazzo, Evgeni Vasilev, David Wheeler and others I may have missed.


1 Actually, Josh Berkus was the first one who mentioned (in a private email) that I ought to support all PG objects.
2 And just in time for PyCon, I’m happy to announce that it will support Python 3.

4

The Phantom of the Database – Part 1

Scenario: A row with “Seven  Samurai” (notice two spaces between the words) as the movie title and 1956 as the release year, in the film table.

Plot: User Alice updates the row to remove the extra space. In the meantime, user Bob updates it to change the title to its phonetic Japanese equivalent “Shichinin no Samurai” and user Carol updates the release date to 1954, the year the movie was first shown in Japan. Whose updates will survive?

If the updates take place sequentially and each user fetches the previously updated row prior to saving his or her change, we should end up with a row with “Shichinin no Samurai” as the title and 1954 as the year.

However, if each user starts viewing the original row in a web browser and the application is incorrectly designed, one or two changes may be overwritten.

For example, my minimalist user interface application even in its more robust incarnation, issues an UPDATE with both the title and the year as received from the user. This was done intentionally, for simplicity, to be corrected later. As a result, if the users press the “Save” button in the sequence shown, both Alice’s correction to the English title and Bob’s change to a Japanese title will be lost because they will be overwritten by Carol’s change.

If you’ve used a centralized VCS, like Subversion, you’re probably familiar with the issue. If our users were developers updating a source file with the movie information, they would (typically) be prevented from overwriting because the VCS would inform the second and third users that they had to merge the changes in prior commits.

So, is it enough if the application only updates the columns that were changed by the users, somewhat like a VCS automatically merging non-conflicting changed lines?

To be continued …