The Phantom of the Database – Part 2

In the previous episode: Alice, Bob and Carol were trying to simultaneously update the following row of the film table:

  id   |     title      | release_year
-------+----------------+--------------
 47478 | Seven  Samurai |         1956

Alice wanted to remove the extra space in the title, Bob was trying to change the title to the phonetic Japanese version and Carol was correcting the release year to 1954. To simplify the analysis we’ll now limit ourselves to Alice’s and Bob’s updates.

The Lost Update Problem

If Alice’s statement executes first, Bob’s change will overwrite her update. Similarly, if Bob’s statement takes precedence, his change will be overwritten. Appropriately, this is conventionally known as the lost update problem. The updates are known as blind writes or dirty writes. How can our Python user interface prevent this problem?

The traditional solution to the lost update problem is to use two-phase locking. You can use PostgreSQL’s psql application to verify how this works (I used the PROMPT1 variable to show which user is issuing the statements).

Alice’s session starts as follows:

alice@moviesdb=> begin;
BEGIN
alice@moviesdb=> select title from film where id = 47478;
     title      
----------------
 Seven  Samurai
(1 row)

Bob’s session is identical but then he issues the UPDATE statement:

bob@moviesdb=> begin;
BEGIN
bob@moviesdb=> select title from film where id = 47478;
     title      
----------------
 Seven  Samurai
(1 row)

bob@moviesdb=> update film set title = 'Sichinin no Samurai' where id = 47478;
UPDATE 1

When Alice tries her UPDATE, her session hangs:

alice@moviesdb=> update film set title = 'Seven Samurai' where id = 47478;

You can examine the situation from another psql session (I cheated and excluded that session’s data). I won’t try to explain (or understand) all this but you can see that Alice’s session is waiting due to an ungranted lock.

moviesdb=# select procpid, waiting, current_query from pg_stat_activity;
 procpid | waiting |                       current_query
---------+---------+-----------------------------------------------------------
   25747 | t       | update film set title = 'Seven Samurai' where id = 47478;
   25900 | f       | <IDLE> in transaction
(2 rows)

moviesdb=# select pid, relation::regclass, locktype, page, tuple, mode, granted
moviesdb-# from pg_locks order by pid, relation, locktype;
  pid  | relation  |   locktype    | page | tuple |       mode       | granted
-------+-----------+---------------+------+-------+------------------+---------
 25747 | film      | relation      |      |       | AccessShareLock  | t
 25747 | film      | relation      |      |       | RowExclusiveLock | t
 25747 | film      | tuple         |    0 |    37 | ExclusiveLock    | t
 25747 | film_pkey | relation      |      |       | AccessShareLock  | t
 25747 | film_pkey | relation      |      |       | RowExclusiveLock | t
 25747 |           | transactionid |      |       | ShareLock        | f
 25747 |           | transactionid |      |       | ExclusiveLock    | t
 25747 |           | virtualxid    |      |       | ExclusiveLock    | t
 25900 | film      | relation      |      |       | RowExclusiveLock | t
 25900 | film      | relation      |      |       | AccessShareLock  | t
 25900 | film_pkey | relation      |      |       | RowExclusiveLock | t
 25900 | film_pkey | relation      |      |       | AccessShareLock  | t
 25900 |           | transactionid |      |       | ExclusiveLock    | t
 25900 |           | virtualxid    |      |       | ExclusiveLock    | t
(14 rows)

Bob’s COMMIT releases his locks …

bob@moviesdb=> commit;
COMMIT

and Alice’s UPDATE now goes through:

UPDATE 1
alice@moviesdb=> commit;
COMMIT
alice@moviesdb=> select title from film where id = 47478;
     title     
---------------
 Seven Samurai
(1 row)

Hey, what happened? Alice’s UPDATE overwrote Bob’s! Wasn’t that supposed to be prevented?

Here is the rub: if it is important for the application to update the row as was presented to the user, then we need to add another qualification to the UPDATE, i.e., we need something like “and title = 'Seven  Samurai'“. We’ll discuss this in a future installment.

About these ads

16 thoughts on “The Phantom of the Database – Part 2”

  1. Your suggestion “we need to add another qualification to the UPDATE” is rather misinforming. Dirty write is a well-known defect in transaction isolation, which is handled on a REPEATABLE READS isolation level, which is supported by a great number of DBMSs. This article should have a mention that “RDBMS X doesn’t have this problem by default, replacing it with transaction conflict problem”.

    1. First off, this is about PostgreSQL. Second, although I didn’t mention it. I tried the example with both SERIALIZABLE and READ COMMITTED isolation levels, in PG 8.4, and the behavior –including pending locks– was the same. Briefly, the extra qualification is needed because the primary key is insufficient. Alice wants to effect the change ‘Seven Samurai’ => ‘Seven Samurai’, and Bob ‘Seven Samurai’ => ‘Sichinin no Samurai’. So the title (or something else –as we shall see) needs to become part of the transaction. Otherwise, it’s still a blind write.

    2. The misleading part is the use of the term “dirty write”. While there is a write/write conflict (with respect to the desired behavior) it occurs due to the application’s behavior, not the database’s behavior. The database did exactly what the application requested, but the application failed to instruct the database to do what the application desired. The only reason an update was lost because of how the application executed its update operations.

      1. You’re right, Adam. It’s the application that is the issue, not the DBMS. I’m presenting these examples to examine the correct (or best possible) way to design the application transactions, in a tutorial fashion.

  2. Now you see the problem with all existing (whether Locker or MVCC) concurrency schemes: the granularity is the row. What this example wishes to do is update *independent* columns within a row, and there is no harm in doing so. How do we know whether columns are independent? We look for them in unique indexes/constraints. That’s not a guarantee, but if the schema is intelligent, it will be. From Codd’s logic, independent column updates are permitted.

    Implementing column level granularity is said to be supported by column datastores, http://en.wikipedia.org/wiki/Column-oriented_DBMS

    1. I don’t think row granularity is the issue here. The problem would still exist if –as I did in the original application– I included all other columns as part of the UPDATE. Semantically, the title is not independent, it is functionally dependent –together with the release year– on the id, and it could even be argued that title-release_year are a candidate key and any change should consider both columns as a whole. From my perspective, the issue is defining the “correct” user transaction.

  3. The real problem with your application is it doesn’t use transactions correctly: realistically you’re displaying the data in one commit then updating in a separate commit. You’re not guaranteed atomicity in this situation. Trying to cover up this problem by specifying the entire old row during the UPDATE is permissible but has obvious consequences.

    If you’re using a single transaction per user, you’d need to take a lock implicitly (SELECT FOR UPDATE) or explicitly (LOCK TABLE) to resolve the issue. This would prevent the SELECT operation from completing until the other side finished their updates. This has obvious (and a few not so obvious) consequences as well. Run both transactions with SELECT FOR UPDATE and one should hang until the other one completes.

    1. As I explained above, the application is indeed incorrect and the goal is to examine how it should be improved. However, using LOCK TABLE or SELECT FOR UPDATE isn’t going to change the results. I’m already getting a hang in one session until the other commits.

      1. Yes, it will change the results because the SELECT is where the second transaction will block, instead of at the UPDATE. The second transaction is given a chance to see the update before it overwrites the first transaction with its own update. Of course, the caller making the second transaction has to look at the results before just blindly updating. Nevertheless, it’s still an appropriate way to solve the problem at hand for many applications.

      2. Yes, Adam, if the SELECTs are SELECT FOR UPDATE, the second SELECT will hang. However, I’m ultimately designing a web app. I don’t think it would be a good idea to use a SELECT FOR UPDATE (or a LOCK TABLE) on a web page designed to display a record, since it would block all others wanting to view that same record. Even if the app were designed with a “display for update” page, allowing open-ended “think time” is not a good strategy. The SELECT FOR UPDATE could be used just before the UPDATE, at which point we could check that the record hasn’t changed in the meantime, but I believe there are better options than that.

  4. As stated earlier the issue really is the application. Should the application be using a pessimistic or an optimistic locking technique? That is where some fun discussions can originate from.

  5. I suppose you could have a last_updated time stamp on the record, or a last update counter.

    So when a web page reads & displays the record, it also gets the last time stamp or counter. Then when the web user hits Save, you have “WHERE id=X AND last_updated=”

    So we have record R with client number ABC1234 and DOB 01/01/01 and last_updated 2011-11-28 13:14:15.

    Alice loads the page to edit record R, changes the client number to DEF5678 … but goes off to get a cup of tea BEFORE hitting Save.

    Bob loads the page to edit record R, changes the DOB, hits Save. The last_updated stamp is now updated – to say 2011-11-28 14:00:00.

    Alice comes back with her cuppa, hits Save – because her WHERE now includes AND last_updated=’2011-11-28 13:14:15′ – her Save will fail – so she won’t lose Bob’s change.

    But now the web page has to cope with that – let her know that her changes can’t be saved because something else has updated the record. Then it will have to be smart enough to work out if her changes are safe … could be “interesting”

    1. Yes, Richard, that’s what I was referring to at the end of Part 3 when I mentioned adding “a version number for each row or a sufficiently granular timestamp.” And indeed, if we implement any scheme to prevent a lost update, we’ll have to take care of informing the user in the client when a conflict has occurred with another user updating the same record.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s