The Phantom of the Database – Part 3

In the previous episode: Alice and Bob were trying to simultaneously update the title of the following row of the film table:

  id   |     title      | release_year
-------+----------------+--------------
 47478 | Seven  Samurai |         1956

User Transaction Correctness

In the first episode, Carol also wanted to concurrently change the release_year to 1954. Let us now consider her update in conjunction with one of the others.

If Bob modifies the title and Carol changes the year, there will be no problem, provided our SQL statements only include the updated fields. In other words, the changes can be merged, just as when Alice updates one section of a Wikipedia page and Bob changes another section of the same page (Note to self: have to talk to them about this moonlighting).
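For instance, using the values from the scenario, the two statements might look like this (a sketch; the exact SQL the interface generates is not shown here):

-- Bob's statement touches only the title ...
UPDATE film SET title = 'Shichinin no Samurai' WHERE id = 47478;
-- ... and Carol's only the release year, so the two changes do not collide.
UPDATE film SET release_year = 1954 WHERE id = 47478;

Whichever statement runs first, the final row ends up with both the new title and the corrected year.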

So the problem with Alice’s and Bob’s updates to the title (one to remove the extra space, the other to change it to Japanese) is that they cannot be merged. How can we implement a correct user transaction without incurring the penalty of locking the row in question?

Optimistic Concurrency Control

Richard Stephan was on the right track when he asked in a comment to the previous post: Should the application be using a pessimistic or an optimistic locking technique?

Optimistic concurrency control (aka optimistic “locking”) is a transaction design approach that avoids the overhead of locking. As stated in the Wikipedia article, the “stateless nature of HTTP makes locking infeasible for web user interfaces,” so it’s almost a given that we should use OCC for our interface.

OCC involves validating that the row we update is still the one we originally read, i.e., that nobody else changed it in the meantime. A simple way to do this is to add a qualification to the original UPDATE that does the validation for us. For example, the DB-API cursor execute call could be coded as follows:

cursor.execute("UPDATE film SET title = %s "
               "WHERE id = %s AND title = %s",
               (newrec.title, oldrec.id, oldrec.title))

Above, newrec refers to the new record instance and oldrec to the old one. So both Alice and Bob would send UPDATEs with “AND title = 'Seven  Samurai'”. Naturally, since they’re both modifying the title, only one of the statements can succeed.
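To act on the outcome, the application only has to look at cursor.rowcount, which the DB-API provides and which tells us how many rows the UPDATE actually matched. Here is a minimal sketch, assuming psycopg2 as the driver; the connection setup and error handling are mine, for illustration, and newrec/oldrec are as above:

import psycopg2

conn = psycopg2.connect("dbname=moviesdb")  # assumed connection string
cursor = conn.cursor()

cursor.execute("UPDATE film SET title = %s "
               "WHERE id = %s AND title = %s",
               (newrec.title, oldrec.id, oldrec.title))

if cursor.rowcount == 0:
    # No row satisfied the WHERE clause: the title (or the row itself)
    # changed after we read it, so give up and let the user retry.
    conn.rollback()
    raise RuntimeError("film %d was updated by someone else" % oldrec.id)

conn.commit()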

The downside of having to add the extra qualification is that, in the general case, the user could have changed several fields, so each UPDATE would have to be customized accordingly. Among other things, this has a negative impact on the query planner, since every combination of changed fields produces a different statement. In addition, the web interface would have to carry two values for each field: the one originally fetched and the one entered by the user.

Alternatives to the extra qualification are to add either a version number for each row or a sufficiently granular timestamp that also acts as a versioning scheme. However, since we’re using PostgreSQL, perhaps we can use something else …
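For the record, here is what the version-number variant might look like, continuing the sketch above and assuming a hypothetical integer row_version column on film (the column name and the DDL in the comment are mine, not part of the sample schema):

# Assumes the column was added with something like:
#   ALTER TABLE film ADD COLUMN row_version integer NOT NULL DEFAULT 0;
cursor.execute("UPDATE film SET title = %s, row_version = row_version + 1 "
               "WHERE id = %s AND row_version = %s",
               (newrec.title, oldrec.id, oldrec.row_version))
if cursor.rowcount == 0:
    conn.rollback()   # another transaction bumped row_version first
else:
    conn.commit()

The qualification is now the same no matter which fields the user changed, and the interface only has to carry the version number along with the row.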

Stay tuned.

The Phantom of the Database – Part 2

In the previous episode: Alice, Bob and Carol were trying to simultaneously update the following row of the film table:

  id   |     title      | release_year
-------+----------------+--------------
 47478 | Seven  Samurai |         1956

Alice wanted to remove the extra space in the title, Bob was trying to change the title to the phonetic Japanese version and Carol was correcting the release year to 1954. To simplify the analysis we’ll now limit ourselves to Alice’s and Bob’s updates.

The Lost Update Problem

If Alice’s statement executes first, Bob’s change will overwrite her update. Similarly, if Bob’s statement takes precedence, his change will be overwritten. Appropriately, this is known as the lost update problem, and the overwriting statements are blind writes: writes issued without first re-reading the row in the same transaction. How can our Python user interface prevent this problem?

The traditional solution to the lost update problem is to use two-phase locking. You can use PostgreSQL’s psql application to verify how this works (I used the PROMPT1 variable to show which user is issuing the statements).

Alice’s session starts as follows:

alice@moviesdb=> begin;
BEGIN
alice@moviesdb=> select title from film where id = 47478;
     title      
----------------
 Seven  Samurai
(1 row)

Bob’s session starts the same way, but then he issues the UPDATE statement:

bob@moviesdb=> begin;
BEGIN
bob@moviesdb=> select title from film where id = 47478;
     title      
----------------
 Seven  Samurai
(1 row)

bob@moviesdb=> update film set title = 'Shichinin no Samurai' where id = 47478;
UPDATE 1

When Alice tries her UPDATE, her session hangs:

alice@moviesdb=> update film set title = 'Seven Samurai' where id = 47478;

You can examine the situation from another psql session (I cheated and excluded that session’s data). I won’t try to explain (or understand) all this, but you can see that Alice’s session is waiting due to an ungranted lock.

moviesdb=# select procpid, waiting, current_query from pg_stat_activity;
 procpid | waiting |                       current_query
---------+---------+-----------------------------------------------------------
   25747 | t       | update film set title = 'Seven Samurai' where id = 47478;
   25900 | f       | <IDLE> in transaction
(2 rows)

moviesdb=# select pid, relation::regclass, locktype, page, tuple, mode, granted
moviesdb-# from pg_locks order by pid, relation, locktype;
  pid  | relation  |   locktype    | page | tuple |       mode       | granted
-------+-----------+---------------+------+-------+------------------+---------
 25747 | film      | relation      |      |       | AccessShareLock  | t
 25747 | film      | relation      |      |       | RowExclusiveLock | t
 25747 | film      | tuple         |    0 |    37 | ExclusiveLock    | t
 25747 | film_pkey | relation      |      |       | AccessShareLock  | t
 25747 | film_pkey | relation      |      |       | RowExclusiveLock | t
 25747 |           | transactionid |      |       | ShareLock        | f
 25747 |           | transactionid |      |       | ExclusiveLock    | t
 25747 |           | virtualxid    |      |       | ExclusiveLock    | t
 25900 | film      | relation      |      |       | RowExclusiveLock | t
 25900 | film      | relation      |      |       | AccessShareLock  | t
 25900 | film_pkey | relation      |      |       | RowExclusiveLock | t
 25900 | film_pkey | relation      |      |       | AccessShareLock  | t
 25900 |           | transactionid |      |       | ExclusiveLock    | t
 25900 |           | virtualxid    |      |       | ExclusiveLock    | t
(14 rows)

Bob’s COMMIT releases his locks …

bob@moviesdb=> commit;
COMMIT

and Alice’s UPDATE now goes through:

UPDATE 1
alice@moviesdb=> commit;
COMMIT
alice@moviesdb=> select title from film where id = 47478;
     title     
---------------
 Seven Samurai
(1 row)

Hey, what happened? Alice’s UPDATE overwrote Bob’s! Wasn’t that supposed to be prevented?

Here is the rub: if it is important for the application to update the row as it was presented to the user, then we need to add another qualification to the UPDATE, i.e., we need something like “and title = 'Seven  Samurai'”. We’ll discuss this in a future installment.
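For example, had Alice’s UPDATE carried that qualification, it would have matched no rows once Bob’s commit changed the title, and psql would have reported something like:

alice@moviesdb=> update film set title = 'Seven Samurai' where id = 47478 and title = 'Seven  Samurai';
UPDATE 0

The zero row count is the signal that someone else got there first.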

PGBR 2011 recap

[PGBR 2011 group photo]

This is a summary of what I saw and heard at PGBR 2011.

The night before the conference I had the opportunity to greet Alvaro Herrera, Bruce Momjian, Dave Page and Greg Smith.

The conference opened on Thursday 3rd November with introductory remarks by Fábio Telles, Flavio Gurgel, Bruce and Euler Taveira. I then attended Dave’s talk, which covered the new foreign data wrapper functionality in PG 9.1. The presentation was given with “simultaneous” translation into Portuguese, which meant I listened to almost every point twice. Next I went to Rogerio Bassete’s talk, which dealt with PostgreSQL’s window functions. Rogerio gave a multitude of examples, but unfortunately the time was relatively short.

At lunch, I talked with Rogerio, Fabrízio de Royes Mello and Dickson Guedes. It was personally satisfying to find out that Fabrizio had downloaded and tried Pyrseas, and was possibly going to “adopt” some of it. Dickson also mentioned that some years ago he had tried to implement something similar, so he was glad when he found out about Pyrseas.

The first slot after lunch was taken by the round table on the Brazilian market for PostgreSQL-based services. The participants included Euler (representing Timbira), Rodolfo Gobbi (for 4Linux; I hope I have the name right), Charly Batista (for XPort), Nabucodonosor Coutinho (for Mondrian Tecnologia) and Luis Dosso (for Dextra), with Fernando Ike acting as moderator. The overall take was that the Brazilian PostgreSQL market is mature yet still challenging, supporting variously sized provider companies serving a variety of public and private organizations. After a snack break, where I chatted with Coutinho, I attended Greg’s benchmarking talk (without translation).

The first day was topped off by the group photo, followed by a “happy hour” that consisted mostly of Brazilian “chope” (draft beer; allegedly 200 litres of it).

On Friday I had breakfast with Jaime Casanova and mentioned I had earlier been responsible for another product named “repmgr” (for another DBMS). I later attended Greg’s performance pitfalls talk, translated by Flavio, and Dickson’s presentation covering the PG 9.1 EXTENSION capabilities and PGXN. At lunch I spoke with Charly.

Probably the most interesting session was Diogo Biazus’ PL/pgSQL programming dojo. Diogo sported a mohawk (the Brazilians called him a “mohicano”), but it wasn’t his hairstyle that was interesting: it was the concept of a “dojo,” as in “learn by doing.” Rather than giving a tutorial on PL/pgSQL, Diogo presented a programming challenge (the hard-of-hearing granny) and then assisted the attendees in developing a solution in PL/pgSQL using TDD and pair programming. Participants took five-minute turns as the “pilot” and “co-pilot” of the pair. A most interesting experience!

For the last formal talk, I heard Flavio’s talk about myths and realities of Postgres DBA projects. To close, the Brazilian PG group presented their first ever awards to community members, followed by quite informal lightning talks, closing remarks and another group picture.

Overall, a very useful and enjoyable event. Thanks to all the organizers, speakers and others with whom I spoke! Thanks also to those who invited me back (no promises, but we’ll see).

The Phantom of the Database – Part 1

Scenario: A row in the film table with “Seven  Samurai” (notice the two spaces between the words) as the movie title and 1956 as the release year.

Plot: User Alice updates the row to remove the extra space. In the meantime, user Bob updates it to change the title to its phonetic Japanese equivalent “Shichinin no Samurai” and user Carol updates the release year to 1954, the year the movie was first shown in Japan. Whose updates will survive?

If the updates take place sequentially and each user fetches the previously updated row prior to saving his or her change, we should end up with a row with “Shichinin no Samurai” as the title and 1954 as the year.

However, if each user starts viewing the original row in a web browser and the application is incorrectly designed, one or two changes may be overwritten.

For example, my minimalist user interface application, even in its more robust incarnation, issues an UPDATE with both the title and the year as received from the user. This was done intentionally, for simplicity, and is to be corrected later. As a result, if the users press the “Save” button in the sequence shown, both Alice’s correction to the English title and Bob’s change to the Japanese title will be lost, because they will be overwritten by Carol’s change.
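In DB-API terms, the statement it issues is roughly the following (a sketch of the behaviour, not the application’s exact code; the record object name is illustrative):

# Both columns are written back, so whatever values the user last saw
# win, even for the column he or she never touched.
cursor.execute("UPDATE film SET title = %s, release_year = %s "
               "WHERE id = %s",
               (newrec.title, newrec.release_year, newrec.id))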

If you’ve used a centralized VCS, like Subversion, you’re probably familiar with the issue. If our users were developers updating a source file with the movie information, they would (typically) be prevented from overwriting each other’s work, because the VCS would inform the second and third users that they had to merge the changes committed before theirs.

So, is it enough if the application only updates the columns that were changed by the users, somewhat like a VCS automatically merging non-conflicting changed lines?
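Here is a minimal sketch of that idea, assuming we keep both the row as originally fetched (oldrec) and the values returned from the form (newrec); the helper function and the column list are mine, for illustration:

def changed_columns(oldrec, newrec, columns=("title", "release_year")):
    """Return the columns whose values differ, paired with their new values."""
    cols, vals = [], []
    for col in columns:
        if getattr(newrec, col) != getattr(oldrec, col):
            cols.append(col)
            vals.append(getattr(newrec, col))
    return cols, vals

cols, vals = changed_columns(oldrec, newrec)
if cols:
    # Only the columns the user actually changed appear in the SET clause.
    sets = ", ".join("%s = %%s" % col for col in cols)
    cursor.execute("UPDATE film SET " + sets + " WHERE id = %s",
                   vals + [oldrec.id])

This would merge Carol’s year with either title change, but it still cannot merge Alice’s and Bob’s conflicting edits to the same column; that is the case the later installments tackle.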

To be continued …