Multisets and the Relational Model

In a comment to my previous post, David Fetter challenged me to “find a case for multisets. That we’re stuck with them doesn’t mean they’re useless.” My response was that I couldn’t help him because multisets (or bags) are not part of the relational model (which was the point of my post) and asked David to show me an example of a multiset he’s stuck with so that we could discuss it.

While waiting for his response, I read an article titled “Toil and Trouble” by Chris Date, originally published in Database Programming and Design, January 1994,¹ in which he tackled the issue of duplicate rows and multisets. Chris opened by stating that duplicates “are, and always were, a mistake in SQL” (and nearly 20 years later the mistake has not been corrected).

In the article, Date makes a number of points against duplicates and multisets but perhaps two of the best are the following:

  1. When considering the collection (3, 6, 6, 8, 8, 8, 11) versus the set {3, 6, 8, 11}, we have to distinguish between the two 6s by saying “the first 6” or “the second.” Date then points out that “we have now introduced a totally new concept, one that is quite deliberately omitted from the relational model: positional addressing. … we have moved quite outside the cozy framework of relational theory … [and] there is no guarantee whatsoever that any results that hold within that framework still apply.”
  2. In response to a claim by David Beech that “mathematicians deal with such collections, called multisets or … bags” and therefore that a model with duplicate rows is at least mathematically respectable, Date says:

“… all of the mathematical ‘bag theory’ treatments I’ve seen start off by assuming that there is a way to count duplicates! And that assumption, I contend, effectively means that bags are defined in terms of sets—each bag element has a hidden identifying tag that distinguishes it somehow, and the bag is really a set of tag/element pairs.”
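Date’s construction can be made concrete. Here is a minimal sketch in Python (the function names are mine, not Date’s) that represents a bag as a set of tag/element pairs, so that counting duplicates falls out of ordinary set machinery:

```python
# A bag modeled as a set of (tag, element) pairs: the hidden tag makes
# each occurrence a distinct set member, exactly as Date describes.
def as_tagged_set(items):
    """Attach a distinguishing tag to every occurrence."""
    return {(tag, element) for tag, element in enumerate(items)}

def occurrences(bag, element):
    """Count duplicates by counting matching tagged pairs."""
    return sum(1 for _, e in bag if e == element)

bag = as_tagged_set([3, 6, 6, 8, 8, 8, 11])
```

The bag is now a genuine set of seven distinct members, yet the duplicate 6s and 8s remain countable, which is precisely Date’s point: the ability to count presupposes something set-like underneath.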

I believe that as programmers it becomes second nature to deal with duplicate items in lists and sequences. It is so easy to code a loop that visits each item in turn and applies some processing (in Python you can even use built-ins or functions from itertools) that we frown on a system that, at least in theory, insists on removing duplicates and dealing only with proper (mathematical) sets. However, we should realize that the theory, as Date says, is practical: by keeping duplicates we lose, for example, the benefits of the relational normal forms and of certain optimization techniques.
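For instance, a quick Python illustration of the habit (mine, not tied to any particular database system): a loop happily visits every duplicate, a set silently collapses them, and itertools can recover the multiplicities:

```python
from itertools import groupby

pnos = ["P1", "P1", "P1", "P2"]          # a bag: duplicates preserved

# A loop visits every occurrence, duplicates included.
visited = [pno for pno in pnos]

# A set collapses the duplicates, as the relational model insists.
distinct = set(pnos)

# itertools.groupby (on sorted input) recovers the multiplicities.
counts = {key: len(list(group)) for key, group in groupby(sorted(pnos))}
```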

In closing, Date presents the following parts and shipments database:

P  pno │ pname        SP   sno │ pno 
  ─────┼────────          ─────┼─────
   P1  │ Screw             S1  │ P1 
   P1  │ Screw             S1  │ P1  
   P1  │ Screw             S1  │ P2 
   P2  │ Screw

He then considers the query “List part numbers for parts that either are screws or are supplied by supplier S1, or both” and presents 12 candidate SQL formulations, which someone ran for him against SQL Server 4.2 on OS/2. I thought it would be instructive to run them against Postgres 9.3, so here they are.
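Date’s tables are easy to recreate if you want to replay the formulations yourself. A minimal sketch using Python’s built-in sqlite3 (SQLite, like Postgres, happily stores the duplicate rows when no key is declared) that loads the sample data and runs the first formulation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Deliberately no PRIMARY KEY or UNIQUE constraint, so duplicates survive.
cur.execute("CREATE TABLE p (pno TEXT, pname TEXT)")
cur.execute("CREATE TABLE sp (sno TEXT, pno TEXT)")
cur.executemany("INSERT INTO p VALUES (?, ?)",
                [("P1", "Screw"), ("P1", "Screw"),
                 ("P1", "Screw"), ("P2", "Screw")])
cur.executemany("INSERT INTO sp VALUES (?, ?)",
                [("S1", "P1"), ("S1", "P1"), ("S1", "P2")])

# The first formulation: parts that are screws or are supplied by S1.
rows = cur.execute("""
    SELECT pno FROM p
    WHERE  pname = 'Screw'
    OR     pno IN (SELECT pno FROM sp WHERE sno = 'S1')
""").fetchall()
# SQLite agrees with Date here: 3 P1 and 1 P2.
```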

SELECT pno
FROM   p
WHERE  pname = 'Screw'
OR     pno IN
     ( SELECT pno
       FROM   sp
       WHERE  sno = 'S1');

Result: 3 P1, 1 P2

SELECT pno
FROM   sp
WHERE  sno = 'S1'
OR     pno IN
     ( SELECT pno
       FROM   p
       WHERE  pname = 'Screw');

Result: 2 P1, 1 P2

SELECT p.pno
FROM   p, sp
WHERE  ( sno = 'S1' AND
         p.pno = sp.pno)
OR       pname = 'Screw';

Result: 9 P1, 3 P2

SELECT sp.pno
FROM   p, sp
WHERE  ( sno = 'S1' AND
         p.pno = sp.pno)
OR       pname = 'Screw';

Result: 8 P1, 4 P2

SELECT pno
FROM   p
WHERE  pname = 'Screw'
UNION  ALL
SELECT pno
FROM   sp
WHERE  sno = 'S1';

Result: 5 P1, 2 P2

SELECT DISTINCT pno
FROM   p
WHERE  pname = 'Screw'
UNION  ALL
SELECT pno
FROM   sp
WHERE  sno = 'S1';

Result: 3 P1, 2 P2

SELECT pno
FROM   p
WHERE  pname = 'Screw'
UNION  ALL
SELECT DISTINCT pno
FROM   sp
WHERE  sno = 'S1';

Result: 4 P1, 2 P2

SELECT DISTINCT pno
FROM   p
WHERE  pname = 'Screw'
OR     pno IN
     ( SELECT pno
       FROM   sp
       WHERE  sno = 'S1');

Result: 1 P1, 1 P2

SELECT DISTINCT pno
FROM   sp
WHERE  sno = 'S1'
OR     pno IN
     ( SELECT pno
       FROM   p
       WHERE  pname = 'Screw');

Result: 1 P1, 1 P2

SELECT pno
FROM   p
GROUP  BY pno, pname
HAVING pname = 'Screw'
OR     pno IN
     ( SELECT pno
       FROM   sp
       WHERE  sno = 'S1');

Result: 1 P1, 1 P2

SELECT p.pno
FROM   p, sp
GROUP  BY p.pno, p.pname, sno, sp.pno
HAVING ( sno = 'S1' AND
         p.pno = sp.pno)
OR       pname = 'Screw';

Result: 2 P1, 2 P2

SELECT pno
FROM   p
WHERE  pname = 'Screw'
UNION
SELECT pno
FROM   sp
WHERE  sno = 'S1';

Result: 1 P1, 1 P2

As Date points out, 12 different formulations produce 9 different results!  And as he further states, those are not all the possible formulations. For example, a modern revision of the third query may be:

SELECT pno
FROM   p NATURAL JOIN sp
WHERE  sno = 'S1'
OR     pname = 'Screw';

and the result is yet again different (6 P1 parts and 1 P2).

The bottom line is to be very, very careful when dealing with multisets in SQL.


¹ The article was republished in Relational Database Writings, 1991–1994, in Part I, “Theory Is Practical!”

16 thoughts on “Multisets and the Relational Model”

  1. It seems that the current behavior has the distinct value of letting you use a bag result when needed while still, via DISTINCT, obtaining the set-oriented result when required. I guess it could be designed so that you need a keyword, like “SELECT BAG * FROM t NATURAL JOIN u”, to reverse the default, but if the engines had been written to output sets, the reverse behavior would likely be impossible.

    As with the inclusion of NULL, this may be mathematically impure, but it comes across as reasonably practical. It makes the language harder to use, but in doing so provides added functionality for those who have mastered it, especially when writing quick, interactive queries.

    1. “… for those who have mastered it.”

      I challenge anyone to explain and/or predict the results of *all* of the formulations above (and other variations) that give a result other than 1 P1 and 1 P2, and to do it in a way that holds across the different SQL implementations. Even if someone can, I wonder what the benefit is of knowing how all these inconsistent results are derived?

      1. Not gonna take up the challenge. In all those cases you add the DISTINCT and you answer the original question just fine. But those different formulations are capable of answering many other questions as well; though I’ll concur that largely either the question of set behavior is immaterial (because you are pushing through keys anyway) or that the set behavior is indeed desired.

  2. Just as a note: If position really matters, you are *not* talking about multisets, you’re talking about sequences.

    If you ignore ordering and consider only the properties of multisets (items can be members multiple times), things behave exactly as you would expect with multisets, and you don’t need to worry about “the first 2” or “the second 2.” They’re both 2; it’s just a member twice. You don’t need to worry about whether something is “the first join of 2 with A” or “the second join of 2 with A”; that item just appears in the join twice.

    When ordering matters and you are using the properties of a sequence instead of a set or multiset, things do get more complicated, and this is particularly the case in SQL, where many primitive operations do not produce their results in a defined order. That’s not a problem for sets or multisets, which have no order, but when you start including operations that create or preserve order, and to depend on the order of results, things can get very messy.

    It’s primarily the mix of set operations and sequence operations that complicates things. A calculus based purely on set theory would do better in a world consisting only of sets; a calculus based on manipulating sequences would do better in a world where ordering is involved. Instead, we have SQL, which is a bastard child of the two: based on Codd’s relational model of sets, but making concessions to practicality and including order without truly embracing sequences.

    1. It doesn’t matter whether you use position or not. The crucial concept, as stated by Codd, is “The DBMS requires that all database information … is cast explicitly in terms of values in relations, and in no other way in the base relations.” Date calls this the Information Principle or the Principle of Uniform Representation and states it this way: “The entire information content of the database is represented in one and only one way, as explicit values in attribute positions within tuples within relations.” And the Third Manifesto has at least two proscriptions about “no ordinal positions”.

      With multisets or bags (or sequences, which to me are simply multisets/bags with some known ordering), you have to come up with some way of distinguishing between duplicate items. Suppose you’re given a bag of differently colored marbles and are told to count how many there are of each color. If you can remove them and line them up by color, you’ve applied some ordering. If you can remove only one at a time and have to put it back, you have to use some other means of distinguishing them, e.g., some mark on each marble.

      1. Not at all. If there is an intrinsic order, that’s a further behavior that needs to be covered. Just like a set doesn’t have order, a multiset doesn’t have order.

        Items in sets have identity, and are either present or not. Items in multisets have identity and multiplicity, and are present a natural number of times. Items in sequences have a position in the sequence.

        When you speak of “taking items out to count them”, you’re presuming a specific implementation of multisets, much like SQL’s inclusion of order presumes a specific implementation of sets. The mathematical definition of a multiset doesn’t require that, any more than the mathematical definition of a set requires the extra sequence-based features of SQL.

      2. Items in sets have identity by virtue of each item’s value. And that is all that matters as far as the relational model is concerned, because, as both Codd and Date state, information in the model is represented solely by explicit values, not by positioning or by some multiplicity counter. And the relational model is based only on sets: relations are sets of tuples and tuples are sets of attributes (or, as Codd originally wrote, “The ordering of [tuples] is immaterial” and “All [tuples] are distinct”). The table P shown above is not a relation because the row (‘P1’, ‘Screw’) appears three times, so by definition that table is not a set of tuples.

        AFAIK no one has come up with a mathematical definition of a data model akin to the relational model but with multisets of tuples. Until someone does, it seems futile to argue about this “feature” of SQL, except to point out its inconsistencies. In other words, as far as I’m concerned, the ball is in the SQL-multiset defenders’ court. And to quote Date once more:

        Anyone who still believes in duplicates should be asked to write out one googol times, “There’s no such thing as a duplicate.”

        My example regarding colored marbles didn’t presume anything about implementations. It was only meant as a real world analogy.

      3. AFAIK no one has come up with a real implementation of a database engine with pure set theory (including the disallowing of NULL) as its primary constraint. Until someone does, it seems prudent to discuss the actually implemented features of SQL, the language of the database engines that we do have.

        SQL is based upon but is not fully true to the relational model. Fine; at least I can use SQL to get data out of my database, something that relational theory by definition cannot do for me. The fact that there isn’t some pure mathematical model that directly maps onto SQL doesn’t bother me. Sure, SQL has some rough edges because of its decision to eschew a pure relational model, but since SQL is actually useful, lambasting it for implementation decisions on the basis of being contrary to theory strikes me as the tail wagging the dog and ivory-tower lecturing. As right as Date/Codd and their followers may be, if they do not give us something we can actually touch, then their words are inspiring and provoking but, for the vast majority of us who are not going to try and write a new language/engine, simply come across as pointing out the flaws in SQL without offering alternatives. This is valuable but limited.

      4. I guess it may depend on what your definition of “real implementation” means, but there is at least one I’ve used and it’s real enough. It’s called Rel and, although intended mainly for teaching, it satisfies the Third Manifesto, which is based on relational theory (and which Date and Hugh Darwen have been offering as the alternative for almost 20 years). Furthermore, there is Dataphor, which also meets most of TTM and has commercial support. In addition, you can find information about others at The Third Manifesto site. Historically, there was also the Peterlee Relational Test Vehicle, which was closer to relational theory than System R.

        I’m not sure if the System R implementors consciously eschewed a pure relational model. They built a prototype, which was then rushed into various implementations, outside and inside of IBM, without considering whether it truly reflected the theory on which it was based (and good language design practices, such as orthogonality).

      5. I forgot to mention that Codd also suggested a strategy for gracefully dropping duplicate row support in his book The Relational Model for Database Management: Version 2 (1990), to be handled in three stages:

        1. warn users that support for duplicate rows is going to be phased out in about two years’ time;
        2. within the first year, install in some new release a “two-position switch” (i.e., a DBA-controlled bit) that permits the DBMS to operate in two modes with respect to duplicate rows: (1) accepting them and (2) rejecting them;
        3. drop support for duplicate rows within a relation altogether, and improve the optimizer accordingly.

        Unfortunately, to my knowledge, none of the vendors or the SQL committee took up Codd’s suggestion.

      6. And to be clear, I think the key confusion here is that the sample question that Date asks doesn’t concern cardinality—and the “different” results differ only in cardinality. Since cardinality doesn’t matter, you ignore it, and treating these as sets, duplicates don’t matter at all.

        For example, if you used the “different” results of all of these queries as the set of an IN clause, they would all behave identically.

        These queries all mean the same thing… in a set calculus. But in a multiset calculus, they’re *different* questions, and demand different answers. The first two queries are equivalent with sets, but not with multisets.

        However, there is a behavior which holds here: if you take the input multisets and convert them to sets, and run the set-based query on them, you get the same result as if you run the multiset query on the original multisets and then convert the final result to a set. (And that’s what makes these multisets and sets, rather than something else.)

        So the confusion only arises if you expect multiset operations to throw away information that must be maintained by multisets (cardinality under certain operations). If you expect the result to be equivalent to the same set (which it is in all of these cases: the set {P1, P2}), you see that they’re all in fact the same.

        Different multisets projecting into the same sets.
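That projection property is easy to check mechanically. A sketch in Python (my own modeling, using Date’s data and the UNION ALL formulation as the query): converting the inputs to sets before the query and converting the result to a set after the query meet at the same set.

```python
# Date's tables as multisets (lists of tuples, duplicates preserved).
P  = [("P1", "Screw"), ("P1", "Screw"), ("P1", "Screw"), ("P2", "Screw")]
SP = [("S1", "P1"), ("S1", "P1"), ("S1", "P2")]

def query(p, sp):
    """UNION ALL: part numbers of screws, plus part numbers supplied by S1."""
    return [pno for pno, pname in p if pname == "Screw"] + \
           [pno for sno, pno in sp if sno == "S1"]

# Run the multiset query, then project the result down to a set ...
projected_after = set(query(P, SP))

# ... or project the inputs to sets first, then run the same query.
projected_before = set(query(set(P), set(SP)))
# Both roads lead to the same set: {'P1', 'P2'}.
```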

  3. This argument seems poorly informed. As I understand it, you’re arguing against duplicate rows… so are you arguing against GROUP BY? A simple query of, e.g., a people table consisting of {name: string, gender: m | f | _, zipcode: int} to find out how many men are in each zipcode (SELECT zipcode, count(*) FROM people WHERE gender = 'm' GROUP BY zipcode) necessarily invokes a multiset under the hood. There’s a bijection between multiset[T] and set[T x N > 0]. I don’t understand why you’re so against them.

    Kat Prevost seems to have a better grasp of things.

    1. GROUP BY has no real bearing on the issue of duplicate rows. Duplicate rows are not part of the relational model. As I mentioned, both Ted Codd and Chris Date have pointed this out, the latter in many, many of his writings (and Hugh Darwen too). Every tuple (in vulgar SQL-speak, every row) in a relation variable (in SQL-speak, table) expresses a first-order predicate that is supposed to be true for the enterprise in question. For example, in the P (parts) table, the predicate may be: there exists a part, identified by part number pno, which has part name pname. In the sample table above, what purpose does it serve to repeat three times the proposition “there exists a part ‘P1’ which has part name ‘Screw’”?

      The relational model and all the theory stemming from it is founded on “relations are sets of tuples, which in turn are sets of attributes.” The relational model is not founded on multisets. I don’t know how else to explain it, but if you need more, maybe the free book An Introduction to Relational Database Theory, by Hugh Darwen, can help.

      As an aside, how can you talk about a bijection, which is a one-to-one correspondence between elements of two sets, when talking about multisets?

      1. I can’t speak to your example regarding table P above, because it was designed by you to be nonsensical. An example of how such a situation might arise naturally, though, is a database where we have Orders (Id, …), OrderLineItems (OrderId x Quantity x ProductId), Products (Id, CategoryId, …), and Categories (Id, …).

        The reason group-by or multisets are relevant is so you can ask questions like, ‘in the last month, how many screws were ordered?’, where Screw is a Category.

        My point about there being a bijection between multisets and sets was referring to the bijection between the set of multiset[T] and the set of set[T x N > 0], probably more clear in the following example:

        If T is the set of the first three letters of the alphabet (A, B, C), then the following are equivalent elements, first in multiset[T] and then in set[T x N > 0]; the bijection is this correspondence.

        {A, B} ↔ {(A, 1), (B, 1)}
        {A, A, A, B} ↔ {(A, 3), (B, 1)}
        {A, A, A, A, B, B, B, C} ↔ {(A, 4), (B, 3), (C, 1)}

        Does that make more sense?
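That correspondence is essentially what Python’s collections.Counter implements. A small sketch (the function names are mine) of the round trip between multiset[T] and set[T x N > 0]:

```python
from collections import Counter

def to_pairs(multiset):
    """multiset[T] -> set[T x N > 0]: pair each element with its count."""
    return set(Counter(multiset).items())

def from_pairs(pairs):
    """set[T x N > 0] -> multiset[T], represented as a sorted list."""
    return sorted(elem for elem, n in pairs for _ in range(n))

pairs = to_pairs("AAAABBBC")  # {('A', 4), ('B', 3), ('C', 1)}
```

Going one way and then back recovers the original multiset (up to ordering, which multisets don’t have anyway), which is what makes the mapping a bijection.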

      2. I did not design the table P example. As stated in my post, it was Chris Date who presented that example in his DBP&D article. However, if you agree that it’s nonsensical, then perhaps you can explain why? Is it because there are three repeated tuples? That’s the whole point that Codd, Date, Darwen, myself and many others are trying to make. If it’s not because of the repeated tuples, then I’d love to hear your explanation. Or perhaps you can come up with an example of your own that has duplicate tuples/rows and that you postulate is not nonsensical.

        In the relational model you can ask questions like “how many screws were ordered in the last month?” and get the correct answer, yet the base relations (tables) do not have any duplicate tuples (rows); i.e., each tuple is uniquely identified by a set of its attributes, say, order_number, order_date, product_number, quantity_ordered, product_type, most likely spread over several base relations. The answer can be obtained with the relational operator SUMMARIZE, which is more or less equivalent to SQL’s GROUP BY. In case you’re interested, SUMMARIZE is derivable from other more primitive relational operators, as discussed in Appendix A of Databases, Types, and the Relational Model.

      3. I should also point out that in practice you rarely see such an obviously nonsensical example as three (‘P1’, ‘Screw’) tuples, because no intelligent database designer will purposely create a parts table with such an obvious identifier as part_number and not add a PRIMARY KEY or UNIQUE constraint on it. What typically happens is that there is a table with 50+ columns and it’s unclear which set of attributes uniquely identifies a row; this table perhaps joins to one or more others, and then a user complains that “some of the reports don’t look right,” because she is sure that they sold x widgets to customer y last month. And when the DBA looks at the joined tables, it’s not immediately apparent that duplicate rows in one of the tables are the cause of the problem.
