In a comment to my previous post, David Fetter challenged me to “find a case for multisets. That we’re stuck with them doesn’t mean they’re useless.” My response was that I couldn’t help him because multisets (or bags) are not part of the relational model (which was *the point* of my post) and asked David to show me an example of a multiset he’s stuck with so that we could discuss it.

While waiting for his response, I read an article titled “Toil and Trouble” by Chris Date, which was originally published in *Database Programming and Design*, January 1994^{1}, where he tackled the issue of duplicate rows and multisets. Chris opened by stating that duplicates “are, and always were, a mistake in SQL” (and nearly 20 years later the mistake has not been corrected).

In the article, Date makes a number of points against duplicates and multisets but perhaps two of the best are the following:

- When considering the collection (3, 6, 6, 8, 8, 8, 11) versus the set {3, 6, 8, 11} we have to distinguish between the two 6’s by saying “the
**first** 6″ or “the **second**.” Date then points out that “we have now introduced a totally new concept, one that is quite deliberately omitted from the relational model: *positional addressing*. … we have moved quite outside the cozy framework of relational theory … [and] there is *no guarantee whatsoever* that any results that hold within that framework still apply.”
- In response to a claim by David Beech that “mathematicians deal with such collections, called
*multisets* or … *bags*” and therefore that a model with duplicate rows is at least mathematically respectable, Date says:

“… all of the mathematical ‘bag theory’ treatments I’ve seen start off by assuming that there is a way to count duplicates! And that assumption, I contend, effectively means that *bags* are defined in terms of *sets*—each bag element has a hidden identifying tag that distinguishes it somehow, and the bag is really a *set* of tag/element pairs.”

I believe that as programmers it becomes second nature to deal with duplicate items in lists and sequences. Since it is so easy to code a loop to visit each item in turn and apply some processing—in Python you can even use built-ins or functions from itertools, that we frown on a system that, at least in theory, insists on removing duplicates and dealing only with proper (mathematical) sets. However, we should realize that the theory, as Date says, is practical: by keeping the duplicates we lose, for example, the benefits of relational normal forms and certain optimization techniques.

In closing, Date presents the following parts and shipments database:

P pno │ pname SP sno │ pno
─────┼──────── ─────┼─────
P1 │ Screw S1 │ P1
P1 │ Screw S1 │ P1
P1 │ Screw S1 │ P2
P2 │ Screw

And considers the query “List part numbers for parts that either are screws or are supplied by supplier S1, or both.” He then presents 12 candidate SQL formulations, which someone ran for him against SQL Server 4.2 on OS/2. I thought it would be instructive to run them against Postgres 9.3, so here they are.

SELECT pno
FROM p
WHERE pname = 'Screw'
OR pno IN
( SELECT pno
FROM sp
WHERE sno = 'S1');

Result: 3 P1, 1 P2

SELECT pno
FROM sp
WHERE sno = 'S1'
OR pno IN
( SELECT pno
FROM p
WHERE pname = 'Screw');

Result: 2 P1, 1 P2

SELECT p.pno
FROM p, sp
WHERE ( sno = 'S1' AND
p.pno = sp.pno)
OR pname = 'Screw';

Result: 9 P1, 3 P2

SELECT sp.pno
FROM p, sp
WHERE ( sno = 'S1' AND
p.pno = sp.pno)
OR pname = 'Screw';

Result: 8 P1, 4 P2

SELECT pno
FROM p
WHERE pname = 'Screw'
UNION ALL
SELECT pno
FROM sp
WHERE sno = 'S1';

Result: 5 P1, 2 P2

SELECT DISTINCT pno
FROM p
WHERE pname = 'Screw'
UNION ALL
SELECT pno
FROM sp
WHERE sno = 'S1';

Result: 3 P1, 2 P2

SELECT pno
FROM p
WHERE pname = 'Screw'
UNION ALL
SELECT DISTINCT pno
FROM sp
WHERE sno = 'S1';

Result: 4 P1, 2 P2

SELECT DISTINCT pno
FROM p
WHERE pname = 'Screw'
OR pno IN
( SELECT pno
FROM sp
WHERE sno = 'S1');

Result: 1 P1, 1 P2

SELECT DISTINCT pno
FROM sp
WHERE sno = 'S1'
OR pno IN
( SELECT pno
FROM p
WHERE pname = 'Screw');

Result: 1 P1, 1 P2

SELECT pno
FROM p
GROUP BY pno, pname
HAVING pname = 'Screw'
OR pno IN
( SELECT pno
FROM sp
WHERE sno = 'S1');

Result: 1 P1, 1 P2

SELECT p.pno
FROM p, sp
GROUP BY p.pno, p.pname, sno, sp.pno
HAVING ( sno = 'S1' AND
p.pno = sp.pno)
OR pname = 'Screw';

Result: 2 P1, 2 P2

SELECT pno
FROM p
WHERE pname = 'Screw'
UNION
SELECT pno
FROM sp
WHERE sno = 'S1';

Result: 1 P1, 1 P2

As Date points out, 12 different formulations produce 9 different results! And as he further states, those are not all the possible formulations. For example, a modern revision of the third query may be:

SELECT pno
FROM p NATURAL JOIN sp
WHERE sno = 'S1'
OR pname = 'Screw';

and the result is yet again different (6 P1 parts and 1 P2).

The bottom line is to be very, very careful when dealing with multisets in SQL.

^{1} The article was republished in *Relational Database Writings, 1991-1994*, in Part I, “Theory Is Practical!”

### Like this:

Like Loading...