The Phantom of the Database – Part 3

In the previous episode: Alice and Bob were trying to simultaneously update the title of the following row of the film table:

  id   |     title      | release_year
-------+----------------+--------------
 47478 | Seven  Samurai |         1956

User Transaction Correctness

In the first episode, Carol also wanted to concurrently change the release_year to 1954. Let us now consider her update in conjunction with one of the others.

If Bob modifies the title and Carol changes the year, there will be no problem, provided our SQL statements only include the updated fields. In other words, the changes can be merged, just as when Alice updates one section of a Wikipedia page and Bob changes another section of the same page (Note to self: have to talk to them about this moonlighting).

So the problem with Alice’s and Bob’s updates to the title (one to remove the extra space, the other to change it to Japanese) is that they cannot be merged. How can we implement a correct user transaction without incurring the penalty of locking the row in question?

Optimistic Concurrency Control

Richard Stephan was on the right track when he asked in a comment to the previous post: Should the application be using a pessimistic or an optimistic locking technique?

Optimistic concurrency control (aka optimistic “locking”) is a transaction design approach that avoids the overhead of locking. As stated in the Wikipedia article, the “stateless nature of HTTP makes locking infeasible for web user interfaces,” so it’s almost a given that we should use OCC for our interface.

OCC involves validating, at update time, that the record we are about to change is still the one we originally read. A simple way to do this is to add a qualification to the original UPDATE that does the validation for us. For example, the DB-API cursor execute call could be coded as follows:

cursor.execute("UPDATE film SET title = %s "
               "WHERE id = %s AND title = %s",
               (newrec.title, oldrec.id, oldrec.title))

Above, newrec refers to the new record instance and oldrec to the old one. So both Alice and Bob would send UPDATEs with “AND title = ‘Seven  Samurai’”. Naturally, since they’re both modifying the title, only one of the statements can succeed.
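By itself, the UPDATE doesn’t tell anyone who lost the race; after the execute call, the application can check the DB-API cursor’s rowcount attribute to find out whether the qualification matched. A minimal sketch (conn is the DB-API connection; report_conflict is a hypothetical handler, not part of any library):

if cursor.rowcount == 0:
    # No row matched the qualification: someone else changed (or deleted)
    # the row since we read it, so undo and let the user retry.
    conn.rollback()
    report_conflict(oldrec.id)
else:
    conn.commit()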

The downside of having to add the extra qualification is that, in the general case, the user could have changed several fields, so each UPDATE would have to be customized accordingly. Among other things, this has a negative impact on the query planner. In addition, the web interface would have to deal with two values for each field: the old one and the new one.

Alternatives to the extra qualification are to add either a version number column to each row, or a sufficiently granular timestamp that also acts as a versioning scheme. However, since we’re using PostgreSQL, perhaps we can use something else …
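For reference, a version number scheme might look like this; the version column is an assumption for illustration, not part of the film table shown above:

ALTER TABLE film ADD COLUMN version integer NOT NULL DEFAULT 1;

cursor.execute("UPDATE film SET title = %s, version = version + 1 "
               "WHERE id = %s AND version = %s",
               (newrec.title, oldrec.id, oldrec.version))

Every successful UPDATE bumps the version, so a stale UPDATE (one carrying an old version value) matches no rows and can be caught with the same rowcount check.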

Stay tuned.


9 thoughts on “The Phantom of the Database – Part 3”

  1. The extra qualification does not put any added stress on the query planner: an UPDATE statement for two columns is always a different statement than an UPDATE for one column. The ‘WHERE’ clause of the UPDATE statement must always uniquely identify the desired version and therefore could always involve the same columns for all updates. Optimizing the WHERE clause for independent columns is an optional step.

    The real problem is that the web interface must now be stateful, which is not given sufficient emphasis here.

    1. It isn’t the extra qualification that has a negative impact on the planner, but the custom WHERE clauses, because the plans can’t be re-used. We could use a WHERE clause with all the table’s columns, but that becomes unwieldy and inefficient once you have a dozen or more columns.

      Any web interface that is going to correctly update a persistent store is going to have to carry and/or exchange some “state.” After all, REST stands for Representational State Transfer. The trick is in how to limit the state information passed between the client and the server. Consider, for example, CouchDB: each document is identified not only by its id but by a revision.

      1. Is there some documentation that shows that PostgreSQL will use the exact same query plan for two UPDATEs with different SET clauses but the same WHERE clause? Certainly, I can’t create the same prepared statement for them. As far as I know, PostgreSQL does no caching of query plans outside of prepared statements.

        It’s perfectly possible to write a web application that allows you to update your persistent store, correctly, without requiring the web application itself to store additional state between requests. REST has nothing to do with this whatsoever, and I’m not sure why you believe it does. Many applications lack this sort of conflicting-updates problem altogether. Consider an ATM: it doesn’t even see my bank balance (unless I explicitly request it), yet I still end up with less money in my account when the transaction is complete.

        The state stored in the web application is the real design problem, and I’m disappointed you don’t give it any consideration whatsoever. Decisions must be made on how to store this state, and how to properly update and expire it. It has consequences that reach all the way to the heart of your application: clients now must either be able to deal with cookies (or similar), or submit the old row/revision with every request, or the URL scheme must now include the old row/revision. Each one of these decisions has consequences and tradeoffs; they’re not decisions that can be waved away or ignored as unimportant.

      2. Re planner: I was thinking generically. There are some DBMSs that may re-use plans based on query text analysis.

        Re the rest: I’m not trying to ignore the issues. I’m simply presenting them piecemeal, in a perhaps unorthodox manner.

  2. It would be pretty simple to write a plpgsql function to update film records that first checks whether the record has changed before applying the update.
    For instance, a plpgsql function fn_update_file(id INT, ts TIMESTAMP, param, param…) can take the id of the row and the timestamp of when the page from which the update emanates was last updated. If ts < the timestamp of the row’s last update, the update aborts; otherwise the changes are written. (A sketch of such a function appears after this thread.)

    A cool part of this arrangement is that the various parameters in the function can be NULL, in which case the corresponding columns will not be overwritten, if a little internal "select into" magic is used in the PL function.

    This type of function can often be handily made as a "getter/setter/deleter" function. Why use something outside of the database for these tasks, when the database itself is in charge of data integrity?

    1. Sorry — that should have been “fn_update_film”, not “fn_update_file”.

      Try a few plpgsql functions for these race conditions. You won’t be sorry.

    2. The use of a PL function doesn’t change the problem: you still have to pass the film.id and the timestamp, plus changed columns, from the application to the database function.
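For the record, here is a minimal sketch of the kind of function described in the comment above. It assumes a last_update timestamp column on film (not shown in the three-column excerpt at the top) and uses COALESCE rather than the "select into" trick; the name and signature are only illustrative:

CREATE OR REPLACE FUNCTION fn_update_film(
    p_id           integer,
    p_ts           timestamp,
    p_title        text DEFAULT NULL,
    p_release_year integer DEFAULT NULL)
RETURNS boolean AS $$
DECLARE
    cur_ts timestamp;
BEGIN
    -- Abort if the row is gone or was modified after the caller's
    -- page was generated.
    SELECT last_update INTO cur_ts FROM film WHERE id = p_id;
    IF cur_ts IS NULL OR p_ts < cur_ts THEN
        RETURN false;
    END IF;
    -- NULL parameters leave the corresponding columns untouched.
    UPDATE film
       SET title        = COALESCE(p_title, title),
           release_year = COALESCE(p_release_year, release_year),
           last_update  = now()
     WHERE id = p_id;
    RETURN true;
END;
$$ LANGUAGE plpgsql;

A caller such as Alice would then issue SELECT fn_update_film(47478, <page timestamp>, 'Seven Samurai', NULL) and check the boolean result, much like the rowcount test above.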
