Tag Archives: ttm

Multisets and the Relational Model

In a comment to my previous post, David Fetter challenged me to “find a case for multisets. That we’re stuck with them doesn’t mean they’re useless.” My response was that I couldn’t help him because multisets (or bags) are not part of the relational model (which was the point of my post) and asked David to show me an example of a multiset he’s stuck with so that we could discuss it.

While waiting for his response, I read an article titled “Toil and Trouble” by Chris Date, which was originally published in Database Programming and Design, January 19941, where he tackled the issue of duplicate rows and multisets. Chris opened by stating that duplicates “are, and always were, a mistake in SQL” (and nearly 20 years later the mistake has not been corrected).

In the article, Date makes a number of points against duplicates and multisets but perhaps two of the best are the following:

  1. When considering the collection (3, 6, 6, 8, 8, 8, 11) versus the set {3, 6, 8, 11} we have to distinguish between the two 6’s by saying “the first 6″ or “the second.” Date then points out that “we have now introduced a totally new concept, one that is quite deliberately omitted from the relational model: positional addressing. … we have moved quite outside the cozy framework of relational theory … [and] there is no guarantee whatsoever that any results that hold within that framework still apply.”
  2. In response to a claim by David Beech that “mathematicians deal with such collections, called multisets or … bags” and therefore that a model with duplicate rows is at least mathematically respectable, Date says:

“… all of the mathematical ‘bag theory’ treatments I’ve seen start off by assuming that there is a way to count duplicates! And that assumption, I contend, effectively means that bags are defined in terms of sets—each bag element has a hidden identifying tag that distinguishes it somehow, and the bag is really a set of tag/element pairs.”

I believe that as programmers it becomes second nature to deal with duplicate items in lists and sequences. Since it is so easy to code a loop to visit each item in turn and apply some processing—in Python you can even use built-ins or functions from itertools, that we frown on a system that, at least in theory, insists on removing duplicates and dealing only with proper (mathematical) sets. However, we should realize that the theory, as Date says, is practical: by keeping the duplicates we lose, for example, the benefits of relational normal forms and certain optimization techniques.

In closing, Date presents the following parts and shipments database:

P  pno │ pname        SP   sno │ pno 
  ─────┼────────          ─────┼─────
   P1  │ Screw             S1  │ P1 
   P1  │ Screw             S1  │ P1  
   P1  │ Screw             S1  │ P2 
   P2  │ Screw

And considers the query “List part numbers for parts that either are screws or are supplied by supplier S1, or both.” He then presents 12 candidate SQL formulations, which someone ran for him against SQL Server 4.2 on OS/2.  I thought it would be instructive to run them against Postgres 9.3, so here they are.

SELECT pno
FROM   p
WHERE  pname = 'Screw'
OR     pno IN
     ( SELECT pno
       FROM   sp
       WHERE  sno = 'S1');

Result: 3 P1, 1 P2

SELECT pno
FROM   sp
WHERE  sno = 'S1'
OR     pno IN
     ( SELECT pno
       FROM   p
       WHERE  pname = 'Screw');

Result: 2 P1, 1 P2

SELECT p.pno
FROM   p, sp
WHERE  ( sno = 'S1' AND
         p.pno = sp.pno)
OR       pname = 'Screw';

Result: 9 P1, 3 P2

SELECT sp.pno
FROM   p, sp
WHERE  ( sno = 'S1' AND
         p.pno = sp.pno)
OR       pname = 'Screw';

Result: 8 P1, 4 P2

SELECT pno
FROM   p
WHERE  pname = 'Screw'
UNION  ALL
SELECT pno
FROM   sp
WHERE  sno = 'S1';

Result: 5 P1, 2 P2

SELECT DISTINCT pno
FROM   p
WHERE  pname = 'Screw'
UNION  ALL
SELECT pno
FROM   sp
WHERE  sno = 'S1';

Result: 3 P1, 2 P2

SELECT pno
FROM   p
WHERE  pname = 'Screw'
UNION  ALL
SELECT DISTINCT pno
FROM   sp
WHERE  sno = 'S1';

Result: 4 P1, 2 P2

SELECT DISTINCT pno
FROM   p
WHERE  pname = 'Screw'
OR     pno IN
     ( SELECT pno
       FROM   sp
       WHERE  sno = 'S1');

Result: 1 P1, 1 P2

SELECT DISTINCT pno
FROM   sp
WHERE  sno = 'S1'
OR     pno IN
     ( SELECT pno
       FROM   p
       WHERE  pname = 'Screw');

Result: 1 P1, 1 P2

SELECT pno
FROM   p
GROUP  BY pno, pname
HAVING pname = 'Screw'
OR     pno IN
     ( SELECT pno
       FROM   sp
       WHERE  sno = 'S1');

Result: 1 P1, 1 P2

SELECT p.pno
FROM   p, sp
GROUP  BY p.pno, p.pname, sno, sp.pno
HAVING ( sno = 'S1' AND
         p.pno = sp.pno)
OR       pname = 'Screw';

Result: 2 P1, 2 P2

SELECT pno
FROM   p
WHERE  pname = 'Screw'
UNION
SELECT pno
FROM   sp
WHERE  sno = 'S1';

Result: 1 P1, 1 P2

As Date points out, 12 different formulations produce 9 different results!  And as he further states, those are not all the possible formulations. For example, a modern revision of the third query may be:

SELECT pno
FROM   p NATURAL JOIN sp
WHERE  sno = 'S1'
OR     pname = 'Screw';

and the result is yet again different (6 P1 parts and 1 P2).

The bottom line is to be very, very careful when dealing with multisets in SQL.


1 The article was republished in Relational Database Writings, 1991-1994, in Part I, “Theory Is Practical!”

Is This Relational?

This post was prompted by Hans-Juergen Schoenig’s Common mistakes: UNION vs. UNION ALL because it touches on one of my pet peeves: the claim that some feature of SQL exemplifies or conforms to the relational model. Schoenig does not make that claim explicitly, but he does state “What [most] people in many cases really want is UNION ALL” and shows the following query and result:

test=# SELECT 1 UNION ALL SELECT 1;
 ?column? 
----------
        1
        1

(2 rows)

There are two relational faults above*. First, UNION ALL is not a relational operator. This is an area where both Ted Codd and Chris Date (and Hugh Darwen), are fully in agreement. In the “Serious Flaws in SQL” chapter of The Relational Model for Database Management: Version 2 (1990) Codd listed duplicate rows as the first flaw and characterized “relations in which duplicate rows are permitted as corrupted relations.” Date concurs and wrote the essay “Why Duplicate Rows Are Prohibited”(in Relational Database Writings, 1985-1989) and (with Darwen) included RM Proscription 3: No Duplicate Tuples in their Third Manifesto, which reads:

D shall include no concept of a “relation” containing two distinct tuples t1 and t2 such that the comparison “t1 = t2” evaluates to TRUE. It follows that (as already stated in RM Proscription 2), for every relation r expressible in D, the tuples of r shall be distinguishable by value.

Needless to say, those two “1”s are not distinguishable unless you talk about “the first 1″ and “the last 1,” i.e., ordering, which is also proscribed by the relational model because relations are sets.

Now, the example given is synthetic so I’ll present a more realistic example. Suppose a manager asks “which employees are in department 51 or work on the Skunk Works project?” Let’s assume we have a projects table with columns proj_no (primary key) and proj_name, an emp table with columns emp_no (primary key), last_name, first_name, and dept_no, and aassignments table with columns proj_no and emp_no (both forming the primary key and each referencing the previous two tables, respectively). We’ll first emulate this with a CTE, so we won’t have to create or populate any tables:

WITH emp AS (SELECT 'Ben Rich'::text AS emp_name,
                     51 AS dept_no),
     assignments AS (SELECT 'Ben Rich'::text AS emp_name,
                           'Skunk Works'::text AS proj_name)
SELECT emp_name
  FROM emp
 WHERE dept_no = 51
UNION ALL
SELECT emp_name 
  FROM assignments
 WHERE proj_name = 'Skunk Works';

If you run this in psql, you’ll see two rows with identical values and the manager is going to ask “Do we have two employees named Ben Rich?”  However, in practice the real query will be:

SELECT first_name, last_name
  FROM emp
 WHERE dept_no = 51
UNION ALL
SELECT first_name, last_name
  FROM emp JOIN assignments USING (emp_no)
           JOIN projects p USING (proj_no)
 WHERE p.proj_name = 'Skunk Works';

Unless you change UNION ALL to UNION your result wil contain duplicate rows for employees that satisfy both predicates. However, an alternative formulation without UNION would be

SELECT first_name, last_name
  FROM emp LEFT JOIN assignments USING (emp_no)
           LEFT JOIN projects p USING (proj_no)
 WHERE dept_no = 51
    OR p.proj_name = 'Skunk Works';

This query correctly returns only one row per employee. Admittedly, the query is still somewhat synthetic. In reality, the query may include multiple other columns and several hundred rows may be retrieved and thus the duplicate tuples and the logical error may not be so obvious.

UPDATE: Changed last query to use LEFT JOINs as correctly suggested by RobJ below.


* The second relational fault? The result column is unnamed (something Date and Darwen insist on much more than Codd).

Tuples in the Pythonic, TTM-inspired interface to PostgreSQL

The Third Manifesto formally describes tuple types (RM prescription 6), tuple values (prescription 9), tuple variables (prescription 12) as well as other tuple-related elements. As mentioned in the previous post, a tuple value is a set of ordered triples each consisting of attribute name, type and value.

Class Tuple of the TTM-inspired interface to PostgreSQL models TTM tuples as Python lists of TTM Attribute objects. Lists were used rather than sets because for many practical purposes the order of the attributes is useful (or has “meaning”), e.g., the first attribute listed is most often –even in purist relational theory presentations– the primary key or part of the primary key.

The interface stores the Tuple heading as a (Python) n-tuple of name-type tuples, in the “internal use” _heading attribute. The n-tuple was chosen over a list due to its immutability. The interface also sets each Attribute as a Python attribute of the Tuple object. Thus, if you define a Tuple variable as follows:

film = Tuple([
    Attribute('id', int, sysdefault=True),
    Attribute('title'),
    Attribute('release_year', int)])

You can then assign or access an Attribute using simple Python syntax:

film.title = "Seven Samurai"
if film.year == 1954:
    do something

The interface also stores two other internal use lists, one for nullable attributes and another for attributes that allow default values. These are to be used by upstream classes such as RelVar.

Class Tuple has a __setattr__ method tailored to deal with assignment to TTM Attributes. It disallows assignment to internal attributes after initialization, with one exception: the _tuple_version attribute (used by RelVar for optimistic concurrency). It also doesn’t allow assignment to undefined Attributes, e.g., given the film variable above, attempting to assign to film.length will raise an AttributeError. Finally, the assignment is “filtered” through class Attribute, so that an attempt such as film.title = 8.5 will result in a ValueError from that class.

The pyrseas.relation.tuple module defines a standalone function: tuple_values_dict. This is used to generate a dictionary of attribute values suitable for passing to Psycopg’s cursor.execute method. For INSERT, a single currtuple argument is expected. For UPDATE, the modified Tuple is passed as a second argument and tuple_values_dict will return a dictionary solely for the attribute and values that have changed.

Attributes in the Pythonic, TTM-inspired interface to PostgreSQL

The Third Manifesto‘s Relational Model (RM) prescription 9 defines a relation heading as “a set of ordered pairs or attributes of the form <A,T>,” where A is the name of the attribute and T is the name of the declared type of the attribute. It then defines a tuple value (or tuple for short) as a set of ordered triples of form <A,T,v> where v is an arbitrary value of type T, called the attribute value.

Class Attribute of the TTM-inspired interface to PostgreSQL models the latter ordered triple rather than the ordered pair. I considered implementing the pair, say as AttribType, and deriving Attribute from it, but opted for the leaner, no-hierarchy solution for now.

An Attribute object should be initialized with a name, Python type and a value. However, only the name is required. The default type is str and for Python 2, unicode is treated almost as a synonym for str (I realize this goes against everything that your mother taught you, but I’m hoping that widespread Python 3 adoption will make this moot soon).

I wholehearteadly agree with TTM RM proscription 5 banning attributes that do not have a value, i.e., like SQL NULLs. However, I hope the interface is useful on existing databases, so in the interest of practicality, class Attribute has two additional, optional arguments: nullable and sysdefault. The former is to specify that an attribute allows NULLs (or None in Python). The latter can be used to indicate the corresponding table column has an SQL DEFAULT specification (this includes those defined as SERIAL or BIGSERIAL).

If an attribute is not nullable and not sysdefault, a value must be specified. If omitted, a suitable default is created based on the type (if possible), i.e., an empty string for str, 0 for int, 0.0 for float. If the attribute is nullable, empty string, 0 or 0.0 are converted to None. This approach is to facilitate dealing with empty HTML form fields, i.e., if the user skips a nullable field, the attribute should end up as a NULL in the database.

If a value is provided, the code raises ValueError if the value does not agree with the specified (or defaulted) Python type. The only exceptions are that an int value is cast to float, and a unicode value is allowed if the type is str and we’re working in Python 2.x.

A Pythonic, TTM-inspired interface to PostgreSQL – Requirements

Several moons ago, I started a series of posts about “designing and implementing a generic end user interface for PostgreSQL.” After a while, the series got sidetracked by other issues.

More recently, I have returned to the original endeavor. Partly from reading Database Explorations: Essays on The Third Manifesto and related topics by C.J. Date and Hugh Darwen, I decided to use relational concepts as presented in The Third Manifesto (TTM) in my implementation. This post provides an overview of the requirements.

Limited Scope

The interface is not a full-blown replacement for an object-relational mapper (ORM) (although in theory it could eventually grow in that direction). The interface is intended to assist with two typical needs of a database “admin” application: browsing and CRUD.

Browsing refers to presenting a subset of rows (tuples) of a table (relation variable or relvar) for subsequent editing. The relvar will typically be normalized so it may be necessary to join it to other relvars. Browsing will usually display a limited number of columns (attributes) so relational projection will be needed.

CRUD refers to the ability to create, read, update and delete single tuples in a relvar. The interface should only support relvars with a properly defined, possibly composite primary key.

Simplicity

The user (developer) should have to define only the attributes of each relvar together with the key, and for browsing, the projected attributes plus a JOIN specification if multiple relvars are involved. The definitions should be simple enough so that most of them could be (at a later date) derived automatically from the database catalogs.

From the definitions, the interface should generate all necessary SQL commands to INSERT a single tuple (possibly returning a generated key value), retrieve, UPDATE or DELETE a single tuple using the key, and fetch subsets of projected/joined tuples in a given order.

Optimistic Concurrency Control

The interface should take advantage of PostgreSQL features to implement optimistic locking when handling updates or deletes, as described in a previous series of posts.

Query by Example Support

The interface should facilitate querying of the browsed tuples using something similar to Query-By-Example. For example, when browsing movies if the argument release_year is passed as “>= 1969“, the results should only include films released on that year or later. This feature was not discussed in a post but had been committed to the tutorial repository.

TTM and SQL

The interface should follow the TTM guidelines when possible. For example, although implemented in Python, assignment to a relvar attribute defined as int should not be allowed if the value is of type str, and duplicate attribute names in a join expression should not be permitted. However, since the interface ought to be usable against existing SQL databases, allowance should be made for certain SQL features such as nullable attributes.

The implementation has been committed to the Pyrseas repository and changes were made to the DBUI tutorial to use the new interface. Subsequent posts will cover the interface in more detail.