While browsing Hacker News, I stumbled upon a blog post by the famous Martin Kleppmann. He was criticizing Redlock, the distributed locking algorithm built on top of Redis, and arguing that it’s broken. “How can a lock mechanism be broken?” I thought. It should be as simple as locking a resource, CRUDing it, and then releasing the lock. As it turns out, I was thinking in very simple terms. The reason lies with the lease.

Think about a worker (W1) coming into a data store with the intention of updating an entry. W1 makes a request to acquire a lock on the entry, so no other worker can come along and CRUD an outdated version of it. Since there is no lock on the entry at that moment, the data store responds with an OK: W1 locks the entry for a limited amount of time (the lease). Then W1 has to pause due to a stop-the-world event. Kleppmann explains that this may happen because of garbage collection or numerous other reasons:
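To make the lease concrete, here is a minimal in-memory sketch of a lock that expires on its own. Everything in it (the `LeaseLock` class, its method names, the TTL values) is my own illustration, not Redis’s or Kleppmann’s API:

```python
import time

class LeaseLock:
    """Toy in-memory lock with a lease (expiry), for illustration only."""

    def __init__(self):
        self._locks = {}  # key -> (owner, expires_at)

    def acquire(self, key, owner, ttl_seconds):
        now = time.monotonic()
        holder = self._locks.get(key)
        # Grant the lock if nobody holds it or the previous lease has expired.
        if holder is None or holder[1] <= now:
            self._locks[key] = (owner, now + ttl_seconds)
            return True
        return False

    def release(self, key, owner):
        holder = self._locks.get(key)
        if holder is not None and holder[0] == owner:
            del self._locks[key]


lock = LeaseLock()
print(lock.acquire("entry:1", "W1", ttl_seconds=0.05))  # True: W1 gets the lease
print(lock.acquire("entry:1", "W2", ttl_seconds=0.05))  # False: still held by W1
time.sleep(0.06)                                         # the lease expires
print(lock.acquire("entry:1", "W2", ttl_seconds=0.05))  # True: W2 takes it over
```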

And if you’re feeling smug because your programming language runtime doesn’t have long GC pauses, there are many other reasons why your process might get paused. Maybe your process tried to read an address that is not yet loaded into memory, so it gets a page fault and is paused until the page is loaded from disk. Maybe your disk is actually EBS, and so reading a variable unwittingly turned into a synchronous network request over Amazon’s congested network. Maybe there are many other processes contending for CPU, and you hit a black node in your scheduler tree. Maybe someone accidentally sent SIGSTOP to the process. Whatever. Your processes will get paused.

In the case of such an event, the lock’s timeout (lease) kicks in and the data store releases the lock on the entry. At this point, W1 has not finished its job but still thinks it holds the lock. Meanwhile, another worker (W2) comes in for the same entry and acquires the newly released lock. Now what? Two parallel workers have acquired a lock on the same entry. One of them is outdated as far as the data store is concerned, of course; but there goes the lock mechanism: your data may get corrupted, or the workers may cause system errors when W1 tries to update the entry after W2.
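Here is roughly what that failure looks like in code, assuming a toy data store (the `NaiveStore` below is hypothetical) whose writes are not checked against the lease at all:

```python
import time

class NaiveStore:
    """Data store whose writes are not protected once a lease has expired."""

    def __init__(self):
        self.data = {}
        self.lease = {}  # key -> expiry time of the current lease

    def acquire(self, key, ttl_seconds):
        now = time.monotonic()
        if key not in self.lease or self.lease[key] <= now:
            self.lease[key] = now + ttl_seconds
            return True
        return False

    def write(self, key, value):
        # No check at all: whoever calls write last wins.
        self.data[key] = value


store = NaiveStore()

store.acquire("entry", ttl_seconds=0.05)   # W1 gets the lease
time.sleep(0.06)                           # W1 is paused; the lease expires

store.acquire("entry", ttl_seconds=0.05)   # W2 acquires the released lock
store.write("entry", "W2's up-to-date value")

store.write("entry", "W1's stale value")   # W1 wakes up, still thinking it holds the lock
print(store.data["entry"])                 # "W1's stale value": W2's update is lost
```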

Kleppmann’s solution to this broken mechanism is mind-blowingly simple: incremental tokens (he calls them fencing tokens). Whenever a worker acquires a lock, the data store provides a token with it. Following the same example above, let’s assume W1 comes in first to acquire the lock. The data store leases the lock to W1 with the token 41. Then a stop-the-world event hits W1, and the data store releases the lock due to the timeout. Remember that W1 still thinks it has the lock. Then W2 comes in to acquire the lock on the same entry. The data store leases the lock to W2 with the token 42. See? Incremental tokens.

W2 finishes processing and requests a write back to the data store with token 42. The data store accepts, because 42 is the latest token it has granted, and then releases the lock on the entry. In the meantime, W1 also finishes processing and requests a write back. However, W1’s lock carries token 41. The data store, which controls its own state, rejects this request because the token is outdated. W1 cannot fulfill its job, but the data store stays intact and consistent, and the system keeps running without crashing. W1 needs to handle the rejection here, of course, but that’s not this post’s concern. Brilliantly simple.
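A sketch of the same scenario with tokens added, using a hypothetical `FencedStore` that remembers the highest token it has handed out per entry and rejects anything older:

```python
import time

class FencedStore:
    """Data store that hands out an increasing token with every lock grant
    and rejects writes carrying a token older than the newest one issued."""

    def __init__(self):
        self.data = {}
        self.lease = {}         # key -> expiry time of the current lease
        self.latest_token = {}  # key -> highest token granted so far
        self._counter = 40      # start at 40 so the tokens mirror 41 and 42

    def acquire(self, key, ttl_seconds):
        now = time.monotonic()
        if key in self.lease and self.lease[key] > now:
            return None  # the lock is held; no token granted
        self._counter += 1
        self.lease[key] = now + ttl_seconds
        self.latest_token[key] = self._counter
        return self._counter

    def write(self, key, value, token):
        # The store checks its own state: stale tokens are refused outright.
        if token < self.latest_token.get(key, 0):
            raise RuntimeError(f"stale token {token}, write rejected")
        self.data[key] = value


store = FencedStore()

t1 = store.acquire("entry", ttl_seconds=0.05)   # W1 gets token 41
time.sleep(0.06)                                # W1 pauses; its lease expires
t2 = store.acquire("entry", ttl_seconds=0.05)   # W2 gets token 42

store.write("entry", "W2's value", t2)          # accepted: 42 is the latest token
try:
    store.write("entry", "W1's value", t1)      # token 41 is outdated
except RuntimeError as err:
    print(err)                                  # stale token 41, write rejected
```

The essential point, as the walkthrough above describes, is that the check lives in the data store itself, not in the workers: however stale a paused worker’s view is, its write simply bounces off.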

PS: Quoting Kleppmann again on the subject of leasing:

The lock has a timeout (i.e. it is a lease), which is always a good idea (otherwise a crashed client could end up holding a lock forever and never releasing it).