Tag Archives: andromeda

More Database Tools?

It’s been over year since I started blogging on these pages about Pyrseas and version control. In a month it will also be the first anniversary of the initial commit to GitHub. Much code and many words have flown under these “bridges,” so this seems an appropriate time to reflect.

When I discovered Andromeda, I was looking for a framework to do simple (CRUD-type) database updates

  • with more flexibility (read, programability) than pgAdmin or phpPgAdmin
  • without being tied to an object-relational mapper, either built-in as in Django or external as SQLAlchemy (Pylons/Pyramid)
  • without having to write repetitive code, either SQL or ORM.

Andromeda appeared to satisfy these objectives (although I wasn’t thrilled about having to customize it in PHP).

When I conceived dbtoyaml, I was being lazy: reacting to Andromeda’s requirement to handcraft a YAML description of a database before I could use it to manage SQL changes to it. I thought: why not create the YAML from the database catalogs?

Since my concept for a YAML database specification didn’t match well to Andromeda’s, that led to yamltodb, my attempt to recreate the SQL “diff’ing” features of Andromeda in Python. Andromeda did it using the information_schema catalogs, which made it portable to other DBMSs that had those. Andromeda also did the comparisons by issuing SQL queries (which didn’t perform well). I chose to use the pg_catalog tables and did the comparisons directly on Python structures.

At first, I had intended to only diff schemas and tables and not much more, since that sufficed for my purposes. However, Peter Eisentraut’s comment eventually convinced me that Pyrseas had to support ALL PostgreSQL DDL features1. I’m very pleased with what was accomplished. Pyrseas 0.5, to be released shortly2, will add support for TEXTSEARCH and FOREIGN DATA WRAPPER related objects. The only gaps left are TABLESPACE, GROUP/ROLE and the EXTENSIONs added in PG 9.1.

2012 brought another turn of events. My post on the controversy between Chris Travers and Tony Marston on whether business logic ought to reside in the database led to collaboration with Roger Hunwicks to create dbextend, a tool to automate database augmentation. A first submission was made and work continues on that front.

The latter effort raises other possibilities. For example, since yamltodb already knows how to create nearly all PG objects, it would be trivial to create a schemadump utility (equivalent to pg_dump -s). Another potential tool of interest to PostgreSQL advocates: dbtoyaml for other databases (mytoyaml, oratoyaml anyone?) together with a conversion utility that operates on the YAML specification so it can be accepted by the PG-only yamltodb (the YAML converter seems should be easier than editing SQL statements). The YAML/JSON output from dbtoyaml is amenable to other analysis or automation tasks.

I hope to get back to my database user interface “dream” … one of these days, but in the meantime, I’m glad for having taken these detours. I’d like to thank those who helped along the way: Josh Berkus, Robert Brewer, Adam Cornett, Ronan Dunklau, Peter Eisentraut, David Fetter, Dickson Guedes, Matthias Howell, Roger Hunwicks, Toon Koppelaars, Marko Kreen, Fabrízio Mello, Regina Obe, Filip Rembialkowski, Dariusz Suchojad, Daniele Varrazzo, Evgeni Vasilev, David Wheeler and others I may have missed.


1 Actually, Josh Berkus was the first one who mentioned (in a private email) that I ought to support all PG objects.
2 And just in time for PyCon, I’m happy to announce that it will support Python 3.

Automated Database Augmentation

Suppose you have a PostgreSQL database like the Pagila sample with 14 tables, each with a last_update timestamp column to record the date and time each row was modified, and it is now a requirement to capture which user effected each change. Or perhaps you have several tables without such audit trail columns and need to add them quickly. Or maybe you have decided to denormalize your design by adding a calculated column, e.g., extended price = unit price times quantity ordered, or a derived column, e.g., carrying the customer name in the invoice table.

If you have some experience as a DBA, the word “drudgery” may have come to mind at the prospect of implementing the above features. It’s possible that, after a while, you’ve developed an approach for dealing with some of them but still wish there’d be some way to automate these thankless tasks.

You may have looked at the Andromeda project’s “automations” which provide some of these capabilities. However, in order to take advantage of the automations, you’ll first have to manually describe your database in a YAML format (and you’ll have to install Apache and PHP). Or you could have tried to use the follow-on project, Triangulum, but essentially you’d still have to create a YAML schema (no need for Apache, but you still need PHP).

Some relief is forthcoming. As a result of discussions resulting from my Business Logic in the Database post, I have been collaborating with Roger Hunwicks on a potential solution to these common DBA needs. The new Pyrseas tool is tentatively named dbextend1 and its initial documentation is available in the Pyrseas extender branch. This is how I envision dbextend being used.

Consider the opening example. The DBA would create a simple YAML file such as the (abbreviated) one below, listing the tables and the needed features:

schema public:
  table actor:
    audit_columns: default
  table category:
    audit_columns: default
...
  table store:
    audit_columns: default

The DBA would then use this file, say audext.yaml, as input to dbextend, e.g.,

dbextend pagiladb audext.yaml

dbextend reads the PostgreSQL catalogs (using code shared with dbtoyaml and yamltodb), building its internal representation. It also reads the YAML extensions file and builds a parallel (albeit much smaller) structure. Thirdly, it reads extension configuration information, e.g., a definition of what columns need to be added for “audit_columns: default“, for example, modified_timestamp and modified_by_user, what trigger(s) to add, and what function(s) to be created.

The output of dbextend is a YAML schema file, just like the one output by dbtoyaml, which can be piped directly to yamltodb to generate SQL to implement the desired features.

In case you’re wondering, dbextend —like other Pyrseas tools— will require Python, psycopg2 and pyyaml.

What features would you like to see automated? What are your suggested best practices for automating these common needs?


Picture credit: Thanks to Mr. O’Brien, a fourth-grade teacher in Minnesota.

1 We’re still receptive to some other suitable name.

Quo vadis, Pyrseas?

When I found Andromeda, over a year ago, I was searching for a tool that would help me maintain PostgreSQL tables supporting web sites, somewhat like a Django admin app, but without Django itself since that was not part of the stack. Ideally, this tool would also facilitate table redesign. I had already written a maintenance app, using CherryPy, Mako and SQLAlchemy, but it was cumbersome to add or make changes to it, and database version control was limited to storing the latest CREATE TABLE statements in a Subversion repository.

Andromeda looked attractive because of its version control features and the capability of automatically generating admin apps. However, after experimenting, I found it lacked the ability to instropect an existing database, requiring you to input the schema in its own YAML format. Pyrseas was born as a result of thinking on how to improve this.

Fast forward to the present: Pyrseas 0.4.0 has been released, supporting about 70% of the PostgreSQL DDL features. This seems like a good time to reflect on where to go next, both short term and long term.

Database Version Control

On the version control front, the following tasks come to mind:

  • Add support for missing DDL features: TEXT SEARCH objects, USERs/ROLEs, TABLESPACEs, FDWs and EXTENSIONs.
  • Ensure existing DDL features are covered fully, i.e., all syntax options
  • Add further options to dbtoyaml/yamltodb for selective output or SQL generation
  • Package Pyrseas as a PG 9.1 EXTENSION, allowing access from other languages and tools
  • Support migration of data, i.e., when ALTER TABLE is not enough
  • Your favorite version control feature

Application Generation

Progress on this front has been slow. Pyrseas could generate minimalist web admin apps with CherryPy, Jinja2 and perhaps something like WTForms, but how many users would want to use them? Conversely, how many would also want support for HTML5, JQuery, or some other WSGI server or templating engine?

I’m not decided, but I still need that admin app, so perhaps I’ll build the capability and blog about it, for tutorial purposes, as I have done so far.

What do you think? What is your favorite database version control feature? What would you like to see in a database web app generation tool? Let me know …

P.S. I’ll be attending PGBR 2011, so hopefully I can hear from some of you personally, in English, em Português o en español.

Version Control for PostGIS

A year ago, I was only trying to “rescue” Andromeda

When I started the Pyrseas project, I was mainly interested in improving on the basic table/column version control of Andromeda. I never thought I’d be adding support for “exotic” features such as base types, operators or operator classes. However, the initial feedback led me to set my sights much higher: Pyrseas was to offer support for all PostgreSQL data definition features.

Over the past two and a half months, I’ve added coverage for those advanced features and more. I’m glad to report that, with the help of Leo Hsu and Regina Obe, the Pyrseas tools are now able to output all the objects1 in a database loaded with postgis.sql, spatial_ref_sys.sql and postgis_comments.sql from PostGIS 1.5, in YAML format. The tools are also able to take in that YAML and generate the SQL necessary to recreate all those objects, in the correct order, on an empty database, as well as to reverse the process.

The following is an update to the feature coverage table presented earlier. Although Pyrseas 0.4 has not been released yet, all these capabilities are present in the current HEAD GitHub repository.

DDL Feature 0.1 0.2 0.3 0.4
AGGREGATE  X  X
CAST  X
CONSTRAINT TRIGGER  X
CONVERSION  X
DOMAIN  X  X
FOREIGN DATA WRAPPER
FUNCTION  X  X  X
GROUP
INDEX  X  X  X  X
LANGUAGE  X  X  X
OPERATOR  X
OPERATOR CLASS  X
OPERATOR FAMILY  X
ROLE
RULE  X  X
SCHEMA  X  X  X  X
SEQUENCE  X  X  X  X
SERVER
TABLE  X  X  X  X
– CHECK CONSTRAINT  X  X  X  X
– PRIMARY KEY  X  X  X  X
– FOREIGN KEY  X  X  X  X
– UNIQUE CONSTRAINT  X  X  X  X
– INHERIT  X  X  X
TABLESPACE
TEXT SEARCH CONFIGURATION
TEXT SEARCH DICTIONARY
TEXT SEARCH PARSER
TEXT SEARCH TEMPLATE
TRIGGER  X  X
TYPE  X  X
– ENUM  X  X
– composite  X
– base type  X
USER
USER MAPPING
VIEW  X  X  X
COMMENT  X  X  X

I’d like to thank Leo and Regina for their help, as well aquilax, mhow and acornett at GitHub for being early adopters and raising issues to improve Pyrseas functionality.

If you have a challenging PostgreSQL database, I’d like to encourage you to give Pyrseas a try, and let me know how it works out.


1 For those interested, this includes 2 tables, 1 view, 9 base types, 1 composite type, 777 functions, 17 aggregate functions, 19 casts, 4 operator classes/families, 23 operators, and 315 comments.

User Interfaces for Databases

Pyrseas was started to improve on Andromeda. In addition to the schema version control I’ve been discussing in these posts, Andromeda provides two other capabilities: automations and web application programming.

Andromeda’s web programming allows you to generate a database application with very little actual programming, since Andromeda takes into account the primary key, foreign key and other information present in the YAML specification to generate much of the application code for you. This is another area that I’d like to explore for implementation in Pyrseas.

Administrative Applications

Nearly any database needs a user interface to add, retrieve, update and delete records, known by the unattractive acronym CRUD. There are a number of admin apps that provide such generic facilties. For example, for PostgreSQL we have pgAdmin which offers a GUI, and phpPgAdmin and Adminer which offer PHP web interfaces. These have a couple of limitations.

Admin apps almost always operate on a single table at a time. Interfaces for end users often require interacting with or viewing two or more tables on a single screen or page, e.g., customers with their invoices, accounts with their transactions.

By definition, admin apps are “raw” and cater to the needs of DBAs and programmers. After retrieving a customer record searching by name, the admin user can update any column, including the primary key and other columns that end users would normally be prevented from updating.

Django Admin

Django includes an appealing admin site. With a moderate dose of programming, you can develop an application suitable for end users. However, you’re at the mercy of Django’s constraints on “modeling” a database. Take for example the film_category table I used in my second post on version control:

CREATE TABLE film_category (
    film_id INTEGER NOT NULL REFERENCES film (id),
    category_id INTEGER NOT NULL
        REFERENCES category (category_id),
    last_update TIMESTAMP WITH TIME ZONE NOT NULL,
    PRIMARY KEY (film_id, category_id)
);

Django cannot use this table “as is” because it insists in having a single column as the primary key. So you end up adding an unnecessary “id” column together with a SEQUENCE (and needless to say, you incur the wrath of the SSSKA).

We could turn to SQLAlchemy, a Python-based ORM that understands that a primary key –albeit singular– can encompass more than one column, and some other web framework or even Django itself, and attempt to recreate the nice Django admin app. However, aside from the additional effort that would represent, we wouldn’t be taking advantage of the YAML database specification we can now store in our VCS.

This is a very broad, and some may say, ambitious topic, but as I did with version control, in future posts I’ll survey existing “art” and progress towards designing and implementing a generic end user interface for PostgreSQL. Reader feedback will be much appreciated.

SQL Version Control Implementation Choices I

The Andromeda “data dictionary” is organized as multiple hierarchies, e.g., modules, generic columns, and tables with columns. Here is a very simple example:

module movies:
    description: Movies

column id:
    type_id: int

column title:
    type_id: vchar
    colprec: 32

column release_year:
    type_id: int

table film:
    module: movies
    description: Film information
    column film_id:
        primary_key: Y
    column title:
    column release_year:

This has to be created and maintained manually which may be OK if you’re starting off on a new database project, but not so cool if you have an existing database with a hundred tables (or hundreds of thousands of tables–see page 6).

Thus for Pyrseas I decided to create first a tool (dbtoyaml) to output an existing database schema in YAML format. Structurally, a single hierarchy, e.g., database → schemas → tables (somewhat like pgAdmin’s left panel), appeared preferable to Andromeda’s design. After a couple of iterations, the Pyrseas YAML specification ended as follows:

schema public:
  table film:
    columns:
    - id:
        not_null: true
        type: integer
    - title:
        type: character varying(32)
    - release_year:
        type: integer
    description: Film information
    primary_key:
      film_pkey:
        columns:
        - id

The above is the output from PyYAML’s dump function with default_flow_style=False. Internally, the structure is a Python dict or map which looks just like JSON, e.g.,

{
  'schema public':
  {
    'table film':
    {
      'primary_key':
      {
        'film_pkey':
        {
          'columns': ['id']
        }
      },
      'columns':
      [
        {
          'id':
          {
            'not_null': True,
            'type': 'integer'
          }
        },
        {
          'title':
          {
            'type': 'character varying(32)'
          }
        },
        {
          'release_year':
          {
            'type': 'integer'
          }
        }
      ],
      'description':
        'Film information'
    }
  }
}

PyYAML’s load function can read the YAML file and turn it into the Python dict, so the second tool (yamltodb) can deal with the structure above directly.

The most important implementation decision for yamltodb was how to compare the input map to the map created from the current database’s catalogs. A few options were possible:

  1. SQL-based comparisons
  2. Map comparisons
  3. Internal structure comparisons

SQL Comparisons

This is the approach taken by Andromeda. Its essence is described in step 4 of Dictionary Based Database Upgrades. It consists of storing the input specification into a set of tables and the existing catalogs into a parallel set (potentially one could use the information_schema views), and then issuing SQL queries to compare the two sets.

The presumed advantage of this method is that some of the operations can be done –as SQL set operations– in the DBMS. However, the input spec first has to be inserted into the tables, a row-at-a-time, thus negating the former benefit. In addition, the affected database has to carry at least one set of extraneous catalog-like tables.

Map Comparisons

This is the naive approach consisting of “walking the trees,” i.e., comparing the input map to an internal map created from the current database. This works, up to a point. For example, if a table in the input map is not present in the internal map, it’s easy enough to call a function to generate a CREATE TABLE statement.

However, as most readers know, a PostgreSQL database usually has many dependencies between the various system entities, e.g., foreign keys on primary keys, views on tables, etc. So comparing hierarchies isn’t ideal.

Internal Comparisons

In a forthcoming post, I’ll explore this method as well as other issues such as generating the SQL statements in the correct order.

SQL Database Version Control – Andromeda

In my first review of database VC tools, I mentioned Kenneth Down’s article where he recommended storing a single metadata file in a VCS and using a difference engine to do database upgrades. In a subsequent post, Kenneth expanded on this, providing a step-by-step description of how to implement such a tool. In fact, he had already gone further: he created an open source project named Andromeda that attempted to deliver on that vision.

Let’s see how our DBAs Carol and Dave can use Andromeda to implement the database changes specified in Version Control, Part 2: SQL Databases.

Version 0.1

Andromeda is much more than a version control tool. Its SourceForge page advertises it as a “database development system” and a “complete tool for building database-centric” applications. So, to start off, Carol, Dave and Manny will have to install Apache, PHP and other dependencies before installing Andromeda. Then they’ll be able to access it via web browser at a URL such as http://localhost/~manny/andro_root/andro. After an initialization step, which includes creating a PostgreSQL database named andro, Manny can create the “movies” application as described here.

In order to create the first version of the film table, Carol has to create the YAML spec file movies.dd.yaml as follows (this is the file that will be placed under version control):

group admin:
    description: Administrators
    module movies:
        permsel: Y
        permins: Y
        permupd: Y
        permdel: Y

group users:
    description:  Employees
    module movies:
        permsel: Y

module movies:
    description: Film Database
    uisort: 100

column id:
    description: Unique film ID
    type_id: int

column film_title:
    description: Film title
    type_id: vchar
    colprec: 32

column release_year:
    description: Release year
    type_id: int
    value_min: 1888

table film:
    description: Films
    module: movies
    uisort: 30
    spaceafter: Y

    column id:
        uisearch: Y
        primary_key: Y

    column film_title:
        uisearch: Y

    column release_year:

The group definitions are so that Andromeda can implement security constraints. The module definition is used to group tables logically, e.g., in the full Pagila database we may group tables into movies, customers and accounting. The column definitions precede the table definitions since the former can be used in multiple tables. The YAML spec is processed by selecting a “Build” step from Andromeda. If successful, it will create a “movies” database including the film table.

Note that Carol had to use film_title instead of title because the latter conflicts with an undocumented predefined column. Using title causes the following error:

ERROR >> Duplicate Column definition: 

ERROR >>    column_id => title

Notice also that she used value_min instead of valuemin for the release_year constraint. Andromeda documents the latter. However, no CHECK constraint is actually created and the trigger functions (see below) don’t implement a validation.

If you look at the created table, you may be surprised:

               Table "public.film"
    Column    |         Type          | Modifiers
--------------+-----------------------+-----------
 _agg         | character(1)          |
 skey_quiet   | character(1)          |
 film_title   | character varying(32) |
 skey         | integer               |
 release_year | integer               |
 id           | integer               |
Indexes:
    "film_idx_film_skey_idx" btree (skey)
    "film_pk" btree (id)
Triggers:
    film_ins_bef_r_t BEFORE INSERT ON film FOR EACH ROW EXECUTE PROCEDURE film_ins_bef_r_f()
    film_upd_bef_r_t BEFORE UPDATE ON film FOR EACH ROW EXECUTE PROCEDURE film_upd_bef_r_f()

As you can see, the order of the columns is not what you’d expect from the spec1 and three internal columns were added. Note that although there is an film_pk index, PostgreSQL doesn’t recognize it as the PRIMARY KEY, and all columns are nullable. The “magic” is all in the generated trigger functions.


1 Of course, according to the relational model, the order of the attributes is not significant, but in practice most people examining a relation expect some order.

Version 0.2

Carol and Dave implement the second set of changes by editing movies.dd.yaml. The combined edits are as follows:

@@ -30,4 +30,15 @@
 value_min: 1888

+column length:
+    description: Length in minutes
+    type_id: int
+    value_min: 1
+    value_max: 9999
+
+column rating:
+    description: Film rating
+    type_id: char
+    colprec: 5
+   
 table film:
     description: Films
@@ -44,2 +55,6 @@

 column release_year:
+
+    column length:
+
+    column rating:

The departure from the original is minor: length was defined as int instead of smallint because Andromeda only allows a limited set of types. And although Dave specified the CHECK constraint on length, it’s not functional.

Version 0.3

For the third iteration, Carol and Dave again edit movies.dd.yaml as shown below:

@@ -16,4 +16,44 @@
 uisort: 100

+column language_id:
+    description: Language ID
+    type_id: int
+
+table language:
+    description: Spoken language
+    module: movies
+    uisort: 10
+
+    column language_id:
+        primary_key: Y
+        uisearch: Y
+
+    column name:
+        description: Language name
+        uisearch: Y
+
+    column ts_upd:
+
+column category_id:
+    description: Film category ID
+    type_id: int
+
+table category:
+    description: Categories
+    module: movies
+    uisort: 20
+    spaceafter: Y
+
+    column category_id:
+        auto: sequence
+        primary_key: Y
+        uisearch: Y
+
+    column name:
+        description: Film category
+        uisearch: Y
+
+    column ts_upd:
+
 column id:
     description: Unique film ID
@@ -59,2 +99,17 @@

 column rating:
+
+    foreign_key language:
+
+table film_category:
+    description: Film categories
+    module: movies
+    uisort: 40
+
+    foreign_key film:
+        primary_key: Y
+
+    foreign_key category:
+        primary_key: Y
+
+    column ts_upd:

Notes:

  • The foreign_key specification names the table only. Generated column names match those of the referenced table, which may be confusing.
  • Each column of a PRIMARY KEY needs a primary_key: Y line.
  • The predefined column name was used instead of table-specific columns as in the original. The former creates varchar(100) columns (although the documentation says varchar(40)).
  • The predefined column ts_upd creates timestamp without time zone rather than with time zone as in the original. The latter is not supported by Andromeda.
  • The sequence category_seq_category_id is created, as a result of auto: sequence on category_id but is not linked directly to the category table (it’s used in the trigger functions).

Review

Andromeda is written in PHP (comments indicate it was earlier implemented in Java) and supports PostgreSQL databases. It also requires Apache, PEAR and PL/Perl.

Andromeda implements its difference engine by querying the information_schema views. It saves the data from the views and from the YAML spec into its own tables (it creates 105 tables in a schema named zdd and 77 tables in the public schema –not all of these are for version control or database automation tasks). It then uses SQL queries to determine what actions are needed for the upgrade. It applies the changes immediately.

As pointed out previously, Andromeda supports a limited set of data types. For example, booleans are only implemented as CHAR(1)’s that accept ‘Y’ or ‘N’. A follow-on project, Triangulum, appears to expand on data type support.

Andromeda can load data from CSV files and is able to retain data in an existing table whose structure is changing. However, it cannot deal with RENAMEs of columns or tables and it doesn’t seem to be able to DROP a column.

The last official release of Andromeda occurred in 2009. The guide for its successor, which made a third alpha release available in Feb. 2011, implies Andromeda will be superseded by Triangulum.

The concept of using a YAML specification for SQL database version control is very appealing (in fact, it’s what led to the Pyrseas project) because it disassociates SQL syntax from the definition and is not verbose like the XML used by other tools. Andromeda (and Triangulum) thus implement a useful mechanism for schema versioning.