In the previous post, I discussed two alternative implementations for the yamltodb difference engine: SQL comparisons using extra dictionary tables, and comparing the YAML/JSON maps (Python dicts). Today I’ll review the last option: internal structure comparisons.
Internal Structures – Take 1
The original implementation for yamltodb was based on the implementation of dbtoyaml. Since the YAML maps output by the latter were hierarchical, the internal structures were also organized in layers: a Python class instance (Database) to represent the database, having a dict of classes representing the schemas, each in turn with a dict of “schema objects,” i.e., tables, sequences, etc. Each layer was populated by queries against the corresponding PostgreSQL catalogs.
This organization was also used for the difference engine. A second Database instance represented the YAML input. The two hierarchies were then traversed and compared at each layer, outputting SQL as differences were found.
This approach served to flesh out the SQL generation code and to deal with various feature tests, e.g., creation of a schema, creation of a table, adding columns to a table, and even RENAMEing schemas and tables. However, it couldn’t easily handle FOREIGN KEYs and SEQUENCEs owned by tables (resulting from a SERIAL data type).
Internal Structures – Take 2
To tackle the various dependencies, I refactored the code to use structures somewhat along the lines of Post Facto. The Database class now holds objects derived from class DbObjectDict which corresponds roughly to a single PostgreSQL catalog. Thus, SchemaDict is nearly equivalent to pg_namespace, ClassDict to pg_class and ConstraintDict to pg_constraint. Each one of those classes is also a Python dict and holds specialized objects, e.g., ClassDict holds objects (derived from DbClass) of class Sequence, Table or View (for the time being, class Index is handled separately).
Each XxxDict class is hashed by a Python tuple (an immutable sequence), paralleling the underlying catalog UNIQUE keys. After the dicts are populated, either from the catalogs or from the YAML map, methods are called to link related objects, e.g., columns to tables, foreign keys to the primary key tables, etc.
Note that Pyrseas implements dual sets of classes: collections (dicts) of objects and classes for individual object types. In constrast Post Facto has a single set of objects and uses a Python classmethod to fetch the catalog information. I found the dual sets more convenient to deal with certain implementation issues.
In order to generate the SQL in the correct sequence considering inter-object dependencies, Post Facto does a topological sort of its objects. Influenced by pg_dump, I preferred a lazier approach. The diff_map methods of Database and lower level classes generate the SQL in a stratified manner with deferred actions.
For example, since one cannot drop tables or primary keys if foreign keys still refer to the former, ClassDict‘s diff_map, upon finding a table has been dropped (it doesn’t exist in the input YAML map), only marks it for dropping. Then it has a pass to drop foreign keys and views, followed by another pass to drop other constraints and indexes and finally the tables themselves.
As a second illustration, I recently added support for inherited tables (and by extension, partitioned tables). Since PostgreSQL allows for multiple inheritance, creating (and dropping) tables in an inheritance “network” requires special care. My solution: the first pass stacks children tables and only creates “root” tables, while the second pass iterates over the stack to create the rest in the right order.
I’d love to hear feedback on this or from anyone who has experimented with the Pyrseas utilities.