Is This Relational?

This post was prompted by Hans-Juergen Schoenig’s Common mistakes: UNION vs. UNION ALL because it touches on one of my pet peeves: the claim that some feature of SQL exemplifies or conforms to the relational model. Schoenig does not make that claim explicitly, but he does state “What [most] people in many cases really want is UNION ALL” and shows the following query and result:

test=# SELECT 1 UNION ALL SELECT 1;
 ?column? 
----------
        1
        1

(2 rows)

There are two relational faults above*. First, UNION ALL is not a relational operator. This is an area where both Ted Codd and Chris Date (and Hugh Darwen), are fully in agreement. In the “Serious Flaws in SQL” chapter of The Relational Model for Database Management: Version 2 (1990) Codd listed duplicate rows as the first flaw and characterized “relations in which duplicate rows are permitted as corrupted relations.” Date concurs and wrote the essay “Why Duplicate Rows Are Prohibited”(in Relational Database Writings, 1985-1989) and (with Darwen) included RM Proscription 3: No Duplicate Tuples in their Third Manifesto, which reads:

D shall include no concept of a “relation” containing two distinct tuples t1 and t2 such that the comparison “t1 = t2” evaluates to TRUE. It follows that (as already stated in RM Proscription 2), for every relation r expressible in D, the tuples of r shall be distinguishable by value.

Needless to say, those two “1”s are not distinguishable unless you talk about “the first 1″ and “the last 1,” i.e., ordering, which is also proscribed by the relational model because relations are sets.

Now, the example given is synthetic so I’ll present a more realistic example. Suppose a manager asks “which employees are in department 51 or work on the Skunk Works project?” Let’s assume we have a projects table with columns proj_no (primary key) and proj_name, an emp table with columns emp_no (primary key), last_name, first_name, and dept_no, and aassignments table with columns proj_no and emp_no (both forming the primary key and each referencing the previous two tables, respectively). We’ll first emulate this with a CTE, so we won’t have to create or populate any tables:

WITH emp AS (SELECT 'Ben Rich'::text AS emp_name,
                     51 AS dept_no),
     assignments AS (SELECT 'Ben Rich'::text AS emp_name,
                           'Skunk Works'::text AS proj_name)
SELECT emp_name
  FROM emp
 WHERE dept_no = 51
UNION ALL
SELECT emp_name 
  FROM assignments
 WHERE proj_name = 'Skunk Works';

If you run this in psql, you’ll see two rows with identical values and the manager is going to ask “Do we have two employees named Ben Rich?”  However, in practice the real query will be:

SELECT first_name, last_name
  FROM emp
 WHERE dept_no = 51
UNION ALL
SELECT first_name, last_name
  FROM emp JOIN assignments USING (emp_no)
           JOIN projects p USING (proj_no)
 WHERE p.proj_name = 'Skunk Works';

Unless you change UNION ALL to UNION your result wil contain duplicate rows for employees that satisfy both predicates. However, an alternative formulation without UNION would be

SELECT first_name, last_name
  FROM emp LEFT JOIN assignments USING (emp_no)
           LEFT JOIN projects p USING (proj_no)
 WHERE dept_no = 51
    OR p.proj_name = 'Skunk Works';

This query correctly returns only one row per employee. Admittedly, the query is still somewhat synthetic. In reality, the query may include multiple other columns and several hundred rows may be retrieved and thus the duplicate tuples and the logical error may not be so obvious.

UPDATE: Changed last query to use LEFT JOINs as correctly suggested by RobJ below.


* The second relational fault? The result column is unnamed (something Date and Darwen insist on much more than Codd).

About these ads

9 thoughts on “Is This Relational?”

  1. Your last query will omit any employees who have not been assigned to any projects, even if they belong to department 51. If that is a hypothetical problem in this hypothetical case, you could fix that by using LEFT JOINs instead of JOINs.

    1. Is that “UNION ALL implements CTE” statement applicable to Postgres? Although the docs use UNION ALL in the examples, the text indicates both UNION or UNION ALL can be used, depending on what is needed. A bit of research appears to suggest that the UNION ALL-CTE tie-in is a MSSQL Server “feature”.

      Aside: Are there any IMS zealots still around? :-)

  2. IBM/DB2 initiated the syntax, which was folded into SQL 99. IBM didn’t want to use CONNECT BY, from Oracle, for reasons which should be obvious. The DB2 engine differs in that it doesn’t need to be told RECURSIVE in the WITH definition . UNION ALL is SQL standard, again, courtesy DB2. Here’s the Wiki page: http://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL

    Lots of IMS zealots; many don’t know that they are, unfortunately. IMS, itself, is still a money spinner for IBM.

    1. By multiset, I assume you mean a bag, i.e., a collection of something (tuples, presumably) where one item (tuple) may appear more than once. By definition, a multiset or bag is not a set, and hence we’re not talking about the Relational Model. One detail I meant to point out in the post is that the RM is based on first-order predicate logic. Each tuple can be viewed as an assertion, e.g., there exists an employee with employee number 101 and first name Ben and last name Rich and he works in department 51. What good is it to repeat an assertion n times (where n > 1)? Does it make it any truer?

      Anyway, you ask me to “find a case for multisets” because apparently you’re “stuck with them”. I can’t help you find a case for them in the RM, however, if you provide an example of a multiset that you’re stuck with, maybe we can discuss its worthiness.

  3. The final formulation works the specific question: give any employee who works on any single project or is a member of any single department (this later enforced by the model). In this formulation the least common multiple of the two joins is 1 and so at most you would get a single record per employee. If you wanted to check on multiple projects (a plausible condition and allowed by the model) then you’d get a duplicate for each project the person works on beyond the first.

    Note that creating the CTE is nice but it is not usable for the later queries since projects is undefined.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s