Thursday, April 6, 2017

Distributed consistency at scale: Spanner vs. Calvin


In 2012, two research papers were published that described the design of geographically replicated, consistent, ACID-compliant, transactional database systems. Both papers criticized the proliferation of NoSQL database systems that compromise replication consistency and transactional support, and argued that it is very possible to build extremely scalable, geographically replicated systems without giving up on consistency and transactional support. The first of these papers was the Calvin paper, published in SIGMOD 2012. A few months later, Google published their Spanner paper in OSDI 2012. Both of these papers have been cited many hundreds of times and have influenced the design of several modern “NewSQL” systems, including FaunaDB (where this post is also being published).
Recently, Google released a beta version of their Spanner implementation, available to customers who use Google Cloud Platform. This development has excited many users seeking to build scalable apps on Google’s cloud, since they now have a reliably scalable and consistent transactional database system to use as a foundation. However, the availability of Spanner outside of Google has also brought it more scrutiny --- what are its technical advantages in practice, and what are its costs? Even though it has been five years since the Calvin paper was published, it is only now that the database community is asking me to directly compare and contrast the technical designs of Calvin and Spanner.
The goal of this post is to do exactly that --- compare the architectural design decisions made in these two systems, and specifically focus on the advantages and disadvantages of these decisions against each other as they relate to performance and scalability. This post is focused on the protocols described in the original papers from 2012. Although the publicly available versions of these systems likely have deviated from the original papers, the core architectural distinctions remain the same.

The CAP theorem in context

Before we get started, allow me to suggest the following: Ignore the CAP theorem in the context of this discussion. Just forget about it. It’s not relevant for the type of modern architectural deployments discussed in this post where network partitions are rare.
Both Spanner and Calvin replicate data across independent regions for high availability. And both Spanner and Calvin are technically CP systems from CAP: they guarantee 100% consistency (serializability, linearizability, etc.) across the entire system. Yes, when there is a network partition, both systems make slight compromises on availability, but partitions are rare enough in practice that developers on top of both systems can assume a fully available system with many 9s of availability.
(BTW: If you didn’t believe me in 2010 when I explained the shortfalls of using CAP to understand the practical consistency and availability properties of modern systems, maybe you will believe the author of the CAP theorem himself, Eric Brewer, who recommends against analyzing Spanner through the lens of the CAP theorem.)

Ordering transactions in time

Let us start our comparison of Calvin and Spanner with the most obvious difference between them: Spanner’s use of “TrueTime” vs. Calvin’s use of “preprocessing” (or “sequencing” in the language of the original paper) for transaction ordering. In fact, most of the other differences between Spanner and Calvin stem from this fundamental choice.
A serializable system provides a notion of transactional ordering. Even though many transactions may be executed in parallel across many CPUs and many servers in a large distributed system, the final state (and all observable intermediate states) must be as if each transaction was processed one-by-one. If no transactions touch the same data, it is trivial to process them in parallel and maintain this guarantee. However, if the transactions read or write each other’s data, then they must be ordered against each other, with one considered earlier than the other. The one considered “later” must be processed against a version of the database state that includes the writes of the earlier one. In addition, the one considered “earlier” must be processed against a version of the database state that excludes the writes of the later one.
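As a concrete illustration, consider two conflicting transactions over a single value (a toy Python sketch, not taken from either paper):

```python
# Two transactions that both touch "x". Serializability demands that the
# final state match *some* one-by-one ordering of them.
def t1(db):
    db["x"] += 10   # T1 reads x, then writes x

def t2(db):
    db["x"] *= 2    # T2 reads x, then writes x

db = {"x": 5}
t1(db); t2(db)
assert db["x"] == 30   # legal outcome: T1 ordered before T2

db = {"x": 5}
t2(db); t1(db)
assert db["x"] == 20   # legal outcome: T2 ordered before T1

# An interleaving where both transactions read x=5 before either writes
# would yield 10 or 15 - a "lost update" matching neither serial order,
# which a serializable system must prevent.
```

Either final value is acceptable; what is forbidden is any outcome that corresponds to no serial order at all.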

Locking and logging

Spanner uses TrueTime for this transaction ordering. Google famously uses a combination of GPS and atomic clocks in all of their regions to synchronize time to within a known uncertainty bound. If two transactions are processed during time periods that do not have overlapping uncertainty bounds, Spanner can be certain that the later transaction will see all the writes of the earlier transaction.
Spanner obtains write locks within the data replicas on all the data it will write before performing any write. If it obtains all the locks it needs, it proceeds with all of its writes and then assigns the transaction a timestamp at the end of the uncertainty range of the coordinator server for that transaction. It then waits until this later timestamp has definitely passed for all servers in the system (which is the entire length of the uncertainty range) and then releases locks and commits the transaction. Future transactions will get later timestamps and see all the writes of this earlier transaction. Thus, in Spanner, every transaction receives a timestamp based on the actual time that it committed, and this timestamp is used to order transactions. Transactions with later timestamps see all the writes of transactions with earlier timestamps, with locking used to enforce this guarantee.
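The commit-wait logic can be sketched in a few lines of Python. The interval-returning clock mimics the paper's TT.now() API; the uncertainty bound and timings here are invented for illustration, not Google's actual numbers:

```python
import time

# Assumed uncertainty bound for illustration (real TrueTime bounds are
# typically a few milliseconds).
EPSILON = 0.007  # seconds

def tt_now():
    """Return an (earliest, latest) interval bounding the true time,
    in the style of TrueTime's TT.now()."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait(commit_ts):
    """Block until commit_ts has definitely passed on every server,
    i.e. until TT.now().earliest > commit_ts."""
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)

# Commit path (write locks are assumed to already be held):
_, latest = tt_now()
commit_ts = latest        # timestamp at the end of the uncertainty range
commit_wait(commit_ts)    # wait out the uncertainty, then release locks
```

Any transaction that starts after this point will receive a strictly later timestamp, which is exactly the ordering guarantee described above.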
In contrast, Calvin uses preprocessing to order transactions. All transactions are inserted into a distributed, replicated log before being processed. In more detail: clients submit transactions to the preprocessing layer of their local region, which then submits these transactions to the global log via a cross-region consensus process like Paxos. This is similar to a write-ahead log in a traditional, non-distributed database. The order that the transactions appear in this log is the official transaction ordering. Every replica reads from their local copy of this replicated log and processes transactions in a way that guarantees that their final state is equivalent to what it would have been had every transaction in the log been executed one-by-one.
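The preprocessing approach can be sketched as follows. This is a single-process toy: in the real system the log append is a cross-region Paxos agreement and each replica reads its own local copy of the log:

```python
from collections import defaultdict

global_log = []  # stands in for the Paxos-replicated input log

def sequence(txn):
    """Preprocessing layer: append the transaction to the global log.
    The log position *is* the official serialization order."""
    global_log.append(txn)
    return len(global_log) - 1  # log sequence number

class Replica:
    def __init__(self):
        self.state = defaultdict(int)
        self.applied = 0

    def catch_up(self):
        # Each replica executes transactions in log order, with no
        # cross-replica coordination at execution time.
        while self.applied < len(global_log):
            global_log[self.applied](self.state)
            self.applied += 1

# Two transactions that conflict on key "x":
sequence(lambda st: st.__setitem__("x", st["x"] + 1))
sequence(lambda st: st.__setitem__("x", st["x"] * 10))

r1, r2 = Replica(), Replica()
r1.catch_up(); r2.catch_up()
assert r1.state == r2.state  # deterministic replay => identical replicas
```

Note that the conflicting transactions produce the same result on every replica because the log fixes their order before execution begins.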

Replication overhead

The design difference between TrueTime and preprocessing directly leads to a difference in how the systems perform replication. In Calvin, the replication of transactional input during preprocessing is the only replication that is needed. Calvin uses a deterministic execution framework to avoid *all* cross-replica communication during normal (non-recovery mode) execution aside from preprocessing. Every replica sees the same log of transactions and guarantees not only a final state equivalent to executing the transactions in this log one-by-one, but also a final state equivalent to that of every other replica.
This requires the preprocessor to analyze the transaction code and “pre-execute” any nondeterministic code (e.g. calls to sys.random()). (The implications of this in terms of what types of transactions are supported by Calvin are discussed at the end of this post.) Once all code within a transaction is deterministic, a replica can safely focus on processing the transactions in the log in the correct order without concern for diverging from the other replicas.
In contrast, since Spanner does not do any transaction preprocessing, it can only perform replication after transaction execution. Spanner performs this replication via a cross-region Paxos process.

The cost of two-phase commit

Another key difference between Spanner and Calvin is how they commit multi-partitioned transactions. Both Calvin and Spanner partition data into separate shards that may be stored on separate machines that fail independently from each other. In order to guarantee transaction atomicity and durability, any transaction that accesses data in multiple partitions must go through a commit procedure that ensures that every partition successfully processed the part of the transaction that accessed data in that partition. Since machines may fail at any time, including during the commit procedure, this process generally takes two rounds of communication between the partitions involved in the transaction. This two-round commit protocol is called “two-phase commit” and is used in almost every ACID-compliant distributed database system, including Spanner. Two-phase commit can often consume the majority of the latency of short, simple transactions, since the actual processing time of the transaction is much less than the delays involved in sending and receiving two rounds of messages over the network.
The cost of two-phase commit is particularly high in Spanner because the protocol involves three forced writes to a log that cannot be overlapped with each other. In Spanner, every forced write to a log involves a cross-region Paxos agreement, so the latency of two-phase commit in Spanner is at least three times the latency of cross-region Paxos.
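A back-of-envelope model makes the cost concrete. The 100ms figure is an assumed cross-region Paxos latency chosen only for illustration, not a measurement:

```python
# Spanner's two-phase commit performs three forced log writes that
# cannot be overlapped; each forced write is itself a cross-region
# Paxos round.
cross_region_paxos_ms = 100   # assumed one-round Paxos latency
forced_log_writes = 3         # per the protocol description above

two_phase_commit_ms = forced_log_writes * cross_region_paxos_ms
print(two_phase_commit_ms)    # -> 300: a lower bound on 2PC latency
```

Under this assumption, the commit protocol alone costs roughly 300ms, dwarfing the few milliseconds of actual transaction processing.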

Determinism is durability

In contrast to Spanner, Calvin leverages deterministic execution to avoid two-phase commit. Machine failures do not cause transactional aborts in Calvin. Instead, after a failure, the machine that failed in Calvin re-reads the input transaction log from a checkpoint, and deterministically replays it to recover its state at the time of the failure. It can then continue on from there as if nothing happened. As a result, the commit protocol does not need to worry about machine failures during the protocol, and can be performed in a single round of communication (and in some cases, zero rounds of communication --- see the original paper for more details).
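The recovery path can be sketched like this (toy code: "transactions" are deterministic Python functions, and the checkpoint is just a saved dict plus a log position):

```python
def recover(checkpoint_state, checkpoint_lsn, log):
    """Rebuild a node's state by replaying the input log from its last
    checkpoint. Determinism guarantees the replay reproduces exactly
    the state the node held before it failed."""
    state = dict(checkpoint_state)
    for txn in log[checkpoint_lsn:]:
        txn(state)
    return state

log = [
    lambda st: st.__setitem__("a", 1),
    lambda st: st.__setitem__("b", st["a"] + 1),
    lambda st: st.__setitem__("a", st["b"] * 3),
]

# A live node applies the whole log from the start; a "crashed" node
# checkpointed after the first transaction and replays only the rest.
live = recover({}, 0, log)
crashed = recover({"a": 1}, 1, log)
assert live == crashed == {"a": 6, "b": 2}
```

Because replay always converges to the same state, a participant's failure never forces the other partitions to abort, which is what lets Calvin drop the extra commit round.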


Performance comparison

At this point, I think I have provided enough details to make it possible to present a theoretical comparison of the bottom-line performance of Calvin vs. Spanner for a variety of different types of requests. This comparison assumes a perfectly optimized and implemented version of each system.

Transactional write latency

A transaction that is “non-read-only” writes at least one value to the database state. In Calvin, such a transaction must pay the latency cost of preprocessing, which is roughly the cost of running cross-region Paxos to agree to append the transaction to the log. After this is complete, the remaining latency is the cost of processing the transaction itself, which includes the zero or one-phase commit protocol for distributed transactions.
In Spanner, there is no preprocessing latency, but it still has to pay the cost of cross-region Paxos replication at commit time, which is roughly equivalent to the Calvin preprocessing latency. Spanner also has to pay the commit wait latency discussed above (which is the size of the time uncertainty window), but this can be overlapped with replication. It also pays the latency of two phase commit for multi-partition transactions.
Thus, Spanner and Calvin have roughly equivalent latency for single-partition transactions, but Spanner has worse latency than Calvin for multi-partition transactions due to the extra phases in the transaction commit protocol.

Snapshot read latency

Both Calvin and Spanner keep around older versions of data and can read data at a requested earlier timestamp from a local replica without any Paxos communication with the other replicas.
Thus, both Calvin and Spanner can achieve very low snapshot-read latency.

Transactional read latency

Read-only transactions do not write any data, but they must be linearizable with respect to other transactions that write data. In practice, Calvin accomplishes this by placing the read-only transaction in the preprocessor log. This means that a read-only transaction in Calvin must pay the cross-region replication latency. In contrast, Spanner only needs to submit the read-only transaction to the leader replica(s) for the partition(s) that are accessed in order to get a global timestamp (and therefore be ordered relative to concurrent transactions). Therefore, there is no cross-region Paxos latency --- only the commit wait (uncertainty window) latency.
Thus, Spanner has better latency than Calvin for read-only transactions submitted by clients that are physically close to the location of the leader servers for the partitions accessed by that transaction.


Throughput scalability

Both Spanner and Calvin are (theoretically) roughly linearly scalable for transactional workloads in which it is rare for concurrent transactions to access the same data. However, major differences begin to present themselves as the conflict rate between concurrent transactions starts to increase.
Both Spanner and Calvin, as presented in their respective papers, use locks to prevent concurrent transactions from interfering with each other in impermissible ways. However, the amount of time they hold locks for an identical transaction is substantially different. Both systems need to hold locks during the commit protocol. However, since Calvin’s commit protocol is shorter than Spanner’s, Calvin reduces the lock hold time at the end of the transaction. On the flip side, Calvin acquires all locks that it will need at the beginning of the transaction, whereas Spanner performs all reads for a transaction before acquiring write locks. Therefore, Spanner reduces lock time at the beginning of the transaction.
However, this latter advantage for Spanner is generally outweighed by the former disadvantage (extra lock-hold time at the end of the transaction), since, as discussed above, the latency of two-phase commit in Spanner involves at least three iterations of cross-region Paxos. Furthermore, Spanner has an additional major disadvantage relative to Calvin in lock-hold time: Spanner must also hold locks during replication (which, as mentioned above, is also a cross-region Paxos process). The farther apart the regions, the larger the latency of this replication, and therefore, the longer Spanner must hold locks.
In contrast, Calvin does its replication during preprocessing, and therefore does not need to hold locks during replication. This leads to Calvin holding locks for much shorter periods of time than Spanner, and therefore being able to process more concurrent transactions in parallel that conflict with each other.
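To make the difference concrete, here is a rough lock-hold-time model; all latency numbers are assumptions chosen only to show the shape of the comparison:

```python
exec_ms = 5            # local execution time while holding locks
paxos_ms = 100         # one cross-region Paxos round (assumed)
local_round_ms = 1     # one message round between same-region partitions

# Spanner: locks are held through cross-region replication and through
# a 2PC whose three forced writes are each a cross-region Paxos round.
spanner_multi_partition_ms = exec_ms + 3 * paxos_ms
spanner_single_partition_ms = exec_ms + paxos_ms

# Calvin: replication already happened during preprocessing (before any
# locks were taken), so locks cover execution plus at most one local
# round for the commit protocol.
calvin_ms = exec_ms + local_round_ms

print(spanner_multi_partition_ms, calvin_ms)  # e.g. 305 vs. 6
```

Under these assumed numbers, conflicting transactions block each other for two orders of magnitude longer in Spanner than in Calvin, which is why the throughput gap widens as contention and inter-region distance grow.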
A second difference that can affect scalability is the following: Calvin requires only a single Paxos group for replicating the input log. In contrast, Spanner requires one independent Paxos group per shard, with proportionally higher overhead.
Overall, Calvin has higher throughput scalability than Spanner for transactional workloads where concurrent transactions access the same data. This advantage increases with the distance between datacenters.


Transaction model limitations

In order to implement deterministic transaction processing, Calvin requires the preprocessor to analyze transactions and potentially “pre-execute” any non-deterministic code to ensure that replicas do not diverge. This implies that the preprocessor requires the entire transaction to be submitted at once. This highlights another difference between Calvin and Spanner --- while Spanner theoretically allows arbitrary client-side interactive transactions (that may include external communication), Calvin supports a more limited transaction model.
There are some subtle, but interesting differences between Calvin and Spanner in rare situations where every single replica for a shard is unavailable, or if all but one are unavailable, but these differences are out of scope for this post.


Conclusion

I’m obviously biased in favor of Calvin, but in going through this exercise, I found it very difficult to find cases where an ideal implementation of Spanner theoretically outperforms an ideal implementation of Calvin. The only place where I could find that Spanner has a clear performance advantage over Calvin is for latency of read-only transactions submitted by clients that are physically close to the location of the leader servers for the partitions accessed by that transaction. Since any complex transaction is likely to touch multiple partitions, this is almost impossible to guarantee in a real-world setting.
However, many real-world workloads do not require client-side interactive transactions, and furthermore only need transactional support for writes and are satisfied with performing reads against snapshots (after all, this is the default isolation model of many SQL systems). It seems to me that Calvin is the better fit for a large class of modern applications.

Wednesday, October 28, 2015

Why MongoDB, Cassandra, HBase, DynamoDB, and Riak will only let you perform transactions on a single data item

(This post is co-authored by Daniel Abadi and Jose Faleiro and cross-posted on Jose's blog)

NoSQL systems such as MongoDB, Cassandra, HBase, DynamoDB, and Riak have made many things easier for application developers. They generally have extremely flexible data models that reduce the burden of advance prediction of how an application will change over time. They support a wide variety of data types, allow nesting of data, and permit dynamic addition of new attributes. Furthermore, on the whole, they are relatively easy to install, with far fewer configuration parameters and dependencies than many traditional database systems.

On the other hand, their lack of support for traditional atomic transactions is a major step backwards in terms of ease-of-use for application developers. An atomic transaction enables a group of writes (to different items in the database) to occur in an all-or-nothing fashion --- either they will all succeed and be reflected in the database state, or none of them will. Moreover, in combination with appropriate concurrency control mechanisms, atomicity guarantees that concurrent and subsequent transactions either observe all of the completed writes of an atomic transaction or none of them. Without atomic transactions, application developers have to write corner-case code to account for cases in which a group of writes (that are supposed to occur together) have only partially succeeded or were only partially observed by concurrent processes. This code is error-prone, and requires complex understanding of the semantics of an application.

At first it may seem odd that these NoSQL systems, that are so well-known for their developer-friendly features, should lack such a basic ease-of-use tool as an atomic transaction. One might have thought that this missing feature is a simple matter of maturity --- these systems are relatively new and perhaps they simply haven't yet gotten around to implementing support for atomic transactions. Indeed, Cassandra's "batch update" feature could be viewed as a mini-step in this direction (despite the severe constraints on what types of updates can be placed in a "batch update"). However, as we start to approach a decade since these systems were introduced, it is clear that there is a more fundamental reason for the lack of transactional support in these systems.

Indeed, there is a deeper reason for their lack of transactional support, and it stems from their focus on scalability. Most NoSQL systems are designed to scale horizontally across many different machines, where the data in a database is partitioned across these machines. The writes in a (general) transaction may access data in several different partitions (on several different machines). Such transactions are called "distributed transactions". Guaranteeing atomicity in distributed transactions requires that the machines that participate in the transaction coordinate with each other. Each machine must establish that the transaction can successfully commit on every other machine involved in the transaction. Furthermore, a protocol is used to ensure that no machine involved in the transaction will fail before the writes that it was involved in for that transaction are present in stable storage. This avoids scenarios where one set of nodes commit a transaction's writes, while another set of nodes abort or fail before the transaction is complete (which violates the all-or-nothing guarantee of atomicity).

This coordination process is expensive, both in terms of resources and in terms of adding latency to database requests. However, the bigger issue is that other operations are not allowed to read the writes of a transaction until this coordination is complete, since the all-or-nothing nature of transaction execution implies that these writes may need to be rolled back if the coordination process determines that some of the writes cannot complete and the transaction must be aborted. The delay of concurrent transactions can cause further delay of other transactions that have overlapping read- and write-sets with the delayed transactions, resulting in overall "cloggage" of the system. The distributed coordination that is required for distributed transactions thus has significant drawbacks for overall database system performance --- both in terms of the throughput of transactions per unit time that the system can process, and in terms of the latency of transactions as they get caught up in the cloggage (this cloggage latency often dominates the latency of the transaction coordination protocol itself). Therefore, most NoSQL systems have chosen to disallow general transactions altogether rather than become susceptible to the performance pitfalls that distributed transactions can entail.

MongoDB, Riak, HBase, and Cassandra all provide support for transactions on a single key. This is because all information associated with a single key is stored on a single machine (aside from replicas stored elsewhere). Therefore, transactions on a single key are guaranteed not to involve the types of complicated distributed coordination described above.

Given that distributed transactions necessitate distributed coordination, it would seem that there is a fundamental tradeoff between scalable performance and support for distributed transactions. Indeed, many practitioners assume that this is the case. When they set out to build a scalable system, they immediately assume that they will not be able to support distributed atomic transactions without severe performance degradation.

This is in fact completely false. It is very much possible for a scalable system to support performant distributed atomic transactions.

In a recent paper, we published a new representation of the tradeoffs involved in supporting atomic transactions in scalable systems.  In particular, there exists a three-way tradeoff between fairness, isolation, and throughput (FIT). A scalable database system which supports atomic distributed transactions can achieve at most two out of these three properties. Fairness corresponds to the intuitive notion that the execution of any given transaction is not deliberately delayed in order to benefit other transactions.  Isolation provides each transaction with the illusion that it has the entire database system to itself. In doing so, isolation guarantees that if any pair of transactions conflict, then one transaction in the pair will always observe the writes of the other. As a consequence, it alleviates application developers from the burden of reasoning about complex interleavings of conflicting transactions' reads and writes. Throughput refers to the ability of the database to process many concurrent transactions per unit time (without hiccups in performance due to clogging).

The FIT tradeoff dictates that there exist three classes of systems that support atomic distributed transactions:
  1. Those that guarantee fairness and isolation, but sacrifice throughput, 
  2. Those that guarantee fairness and throughput, but sacrifice isolation, and 
  3. Those that guarantee isolation and throughput, but sacrifice fairness.

In other words, not only is it possible to build scalable systems with high throughput distributed transactions, but there actually exist two classes of systems that can do so: those that sacrifice isolation, and those that sacrifice fairness. We discuss each of these two alternatives below.

(Latency is not explicitly mentioned in the tradeoff, but systems that give up throughput also give up latency due to cloggage, and systems that give up fairness yield increased latency for those transactions treated unfairly.)

Give up on isolation
As described above, the root source of the database system cloggage isn't the distributed coordination itself. Rather, it is the fact that other transactions that want to access the data that a particular transaction wrote have to wait until after the distributed coordination is complete before reading or writing the shared data. This waiting occurs due to strong isolation, which guarantees that one transaction in a pair of conflicting transactions must observe the writes of the other. Since a transaction's writes are not guaranteed to commit until after the distributed coordination process is complete, concurrent conflicting transactions cannot make progress for the duration of this coordination.

However, all of this assumes that it is unacceptable for a transaction's writes to not be immediately observable by concurrent conflicting transactions. If this "isolation" requirement is dropped, there is no need for other transactions to wait until the distributed coordination is complete before executing and committing.

While giving up on strong isolation seemingly implies that distributed databases cannot guarantee correctness (because transactions execute against potentially stale database state), it turns out that there exists a class of database constraints that can be guaranteed to hold despite the use of weak isolation among transactions. For more details on the kinds of guarantees that can hold on constraints despite weak isolation, Peter Bailis's work on Read Atomic Multi-Partition (RAMP) transactions provides some great intuition.

Give up on fairness
The underlying motivation for giving up isolation in systems is that distributed coordination extends the duration for which transactions with overlapping data accesses are unable to make progress. Intuitively, distributed coordination and isolation mechanisms overlap in time.  This suggests that another way to circumvent the interaction between isolation techniques and distributed coordination is to re-order distributed coordination such that its overlap with any isolation mechanism is minimized. This intuition forms the basis of Isolation-Throughput systems (which give up fairness).  In giving up fairness, database systems gain the flexibility to pick the most opportune time to pay the cost of distributed coordination.  For instance, it is possible to perform coordination outside of transaction boundaries so that the additional time required to do the coordination does not increase the time that conflicting transactions cannot run. In general, when the system does not need to guarantee fairness, it can deliberately prioritize or delay specific transactions in order to benefit overall throughput.

G-Store is a good example of an Isolation-Throughput system (which gives up fairness). G-Store extends a (non-transactional) distributed key-value store with support for multi-key transactions. G-Store restricts the scope of transactions to an application-defined set of keys called a KeyGroup. An application defines KeyGroups dynamically based on the set of keys it anticipates will be accessed together over the course of some period of time. Note that the only restriction on transactions is that the keys involved in the transaction be part of a single KeyGroup. G-Store allows KeyGroups to be created and disbanded when needed, and therefore effectively provides arbitrary transactions over any set of keys.

When an application defines a KeyGroup, G-Store moves the constituent keys from their nodes to a single leader node. The leader node copies the corresponding key-value pairs, and all transactions on the KeyGroup are executed on the leader. Since all the key-value pairs involved in a transaction are stored on a single node (the leader node), G-Store transactions do not need to execute a distributed commit protocol during transaction execution.

G-Store pays the cost of distributed coordination prior to executing transactions. In order to create a KeyGroup, G-Store executes an expensive distributed protocol to allow a leader node to take ownership of a KeyGroup, and then move the KeyGroup's constituent keys to the leader node. The KeyGroup creation protocol involves expensive distributed coordination, the cost of which is amortized across the transactions which execute on the KeyGroup.

The key point is that while G-Store still must perform distributed coordination, this coordination is done prior to transaction execution --- before the need to be concerned with isolation from other transactions. Once the distributed coordination is complete (all the relevant data has been moved to a single master node), the transaction completes quickly on a single node without forcing concurrent transactions with overlapping data accesses to wait for distributed coordination. Hence, G-Store achieves both high throughput and strong isolation.
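A toy sketch of the KeyGroup idea follows. The class and method names here are invented for illustration; the real G-Store runs a distributed group-creation protocol across nodes rather than in-process moves:

```python
class Node:
    def __init__(self):
        self.store = {}

class GStore:
    def __init__(self, nodes):
        self.nodes = nodes
        self.group_leader = {}        # frozenset(keys) -> leader Node

    def owner(self, key):
        # Hash partitioning of keys that are not in any group.
        return self.nodes[hash(key) % len(self.nodes)]

    def create_group(self, keys):
        """The expensive, coordinated step: migrate every key in the
        group to a single leader node (paid once, before execution)."""
        leader = self.owner(sorted(keys)[0])
        for k in keys:
            src = self.owner(k)
            leader.store[k] = src.store.pop(k, 0)
        self.group_leader[frozenset(keys)] = leader
        return leader

    def txn(self, keys, fn):
        """The cheap step: the transaction runs entirely on the group's
        leader node, with no distributed commit protocol."""
        leader = self.group_leader[frozenset(keys)]
        fn(leader.store)

g = GStore([Node(), Node(), Node()])
leader = g.create_group({"a", "b"})
g.txn({"a", "b"}, lambda st: st.update(a=st["a"] + 1, b=st["b"] + 1))
assert leader.store == {"a": 1, "b": 1}
```

The coordination cost lives entirely in create_group; once the group exists, every transaction on it is a fast single-node operation, which is exactly the reordering of coordination and isolation described above.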

However, the requirement that transactions restrict their scope to a single KeyGroup favors transactions that execute on keys which have already been grouped. This is "unfair" to transactions that need to execute on a set of as yet ungrouped keys. Before such transactions can begin executing, G-Store must first disband existing KeyGroups to which some keys may belong, and then create the appropriate KeyGroup --- a process with much higher latency than if the desired KeyGroup already existed.

The fundamental reason for the poor performance of conventional distributed transactions is the fact that the mechanisms for guaranteeing atomicity (distributed coordination) and isolation overlap in time. The key to enabling high throughput distributed transactions is to separate these two concerns. This insight leads to two ways of separating atomicity and isolation mechanisms. The first option is to weaken isolation such that conflicting transactions can execute and commit in parallel. The second option is to re-order atomicity and isolation mechanisms so that they do not overlap in time, and in doing so, give up fairness during transaction execution.

(Edit: MongoDB and HBase both have (or will soon have) limited support for multi-key transactions as long as those keys are within the same partition. However, hopefully it is clear to the reader that this post is discussing the difficulties of implementing distributed --- cross-partition --- transactions). 

Wednesday, November 7, 2012

Is Upstart the right way to get college student start-ups funded?

In 2010, the movie “The Social Network” was released, which has had a tremendously positive effect on the computer science department at Yale; and from what I have heard, a similar effect has been observed across the country. My understanding (I have not seen this movie myself) is that the movie’s plot revolves around Mark Zuckerberg and his role in the formation of Facebook. In the time since the movie was released, the number of computer science majors at Yale has nearly quadrupled, and these majors  are increasingly looking at start-ups (either founding their own or joining existing ones) as options for when they graduate (instead of going to Wall Street and working as quants, which has historically been the popular career path for Yale CS majors).

In my opinion, this is unquestionably a good thing. Yale has some of the brightest minds of the next generation, and I feel a lot more confident about the future of our country when I see these great minds being applied to creating new entities and jobs and building something real, instead of being wasted in the zero-sum arms race of who can create the automatic trading algorithm that is epsilon better than anybody else’s.

One consequence of this start-up craze is that I get bombarded with requests from students who want to meet with me to discuss their start-up idea. This is partly because I teach the “Intro to Programming” course at Yale, which has consistently had between 120 and 150 students (many of whom are budding entrepreneurs) enrolled since the release of “The Social Network”, partly because the success of Hadapt is certainly no secret around Yale, and partly because I live on Yale’s campus and part of my job in this capacity is to serve as an adviser and mentor to undergraduates.

When I meet with these students I hear all kinds of ideas. Some of them are good, and some of them are bad. Some of them make me think about an area in a different way, and some of them are carbon copies of something that already exists. Some I could get excited about and some I couldn’t. But just about all of them have one thing in common: the students involved vastly overestimate their probability of success. I suppose this should not surprise me --- after all, these are Yale students that have been successful in just about everything they have ever done in their life. So it follows that they would expect their start-up to be successful. But even when I talk to students at other universities who have start-up ideas --- students who have not necessarily been so successful in their lives --- even they are totally convinced that their startup idea is unlikely to fail. It seems that there is a basic psychological flaw in the human mind --- we so desperately want our dreams to come true that we ignore statistical data about the probability of success and trick ourselves into believing that we are the statistical anomaly and will succeed where others have failed.

Many of these students find out the hard reality regarding their start-up idea when they attempt to raise funding. They find out that investors are extremely conscious of the probability of success of a group of students with no experience, no reputation, and a limited network. Most of these start-ups fail to raise funding from professional investors. Some students give up at this point. Other students continue along with limited funding from friends and family in an attempt to create more meat around the bones of their idea and reduce risk for the professional investors. Most will eventually fail, while a rare few will succeed.

The outcome of all this is that despite all of these students being eager to be entrepreneurs and start companies, very few student ideas receive funding, and most of these ideas never see the light of day. Whether or not this is a good thing is certainly up for debate, but my feeling is that it is a shame that so few student start-ups get funding.

Therefore, when I first heard of Upstart (I think it was in August), I was quite interested in the idea --- it proposed a way to get student start-ups funded. I signed up to receive e-mail updates, but did not hear from them for several months. However, on Monday of this week I received an update from them that they were open for business. I looked through the profiles of the students who were looking for funding and I saw that no fewer than 4 out of the (approximately) 20 profiles that were available online were from Yale University.

However, a deeper look at the Upstart website reveals a problematic clause that is attached with the funding of the student start-up ideas. This is not a traditional crowdfunding model where investors receive equity in the start-up in exchange for their investment dollars. Instead, the investors get a percentage of the student’s income for a 10-year period in exchange for the investment. This way, in the likely event that the student’s start-up idea does not work out, the investor is able to receive a nice return on investment by taking a cut from the student’s hard-earned salary when the student enters the workforce.

This does not seem right to me. On one side you have students who have an inaccurate view of the probability of success of their start-up, and on the other side you have investors who are looking to profit off of the boundless optimism and dreams of these students. These students, with no experience in the real world, no understanding of what skills are necessary to build a company, and a perception of entrepreneurship built more from Hollywood than the cold realities of business, are more than happy to mortgage a percentage of 10 years of future earnings for a chance to receive some short-term money about which they have no idea how to properly evaluate the costs vs. benefits.

In the traditional model, where the investor receives equity in exchange for the investment, at least the investor is in the same boat as the student --- their interests are aligned and focused on making the start-up a success. With the Upstart model, you have almost the exact opposite. Since the salary of a founder is typically below-market in exchange for the equity the founder receives, the expected rate of return for the investor is actually higher if the student were to give up on the start-up and get a normal job. This is especially true when the rate of return for the investor is capped (as it is in Upstart), so that even if the start-up were to take off and the student were to become very wealthy from it, the return to the investor is not markedly different from what it would have been if the company had failed and the student later received a salary at market value. To exacerbate the situation, the investor-investee relationship in Upstart is also supposed to be something of a mentor-mentee relationship, which is particularly dangerous when interests are misaligned.

I think Upstart should be commended for trying to get more funding to college students with ideas for starting companies. And although I don’t know many people involved, the people I do know are good people and I highly doubt they are trying to do anything evil. (Jonathan Eng was a TA for my Introduction to Programming class 4 years ago, and he was a good and honest TA). However, I do not believe the people involved in Upstart realize how hard it is for students to accurately evaluate the costs and benefits of receiving funding in this way. Therefore I am highly concerned about this model as a way forward for student entrepreneurship.

Monday, October 29, 2012

IEEE Computer issue on the CAP Theorem

Due to Hurricane Sandy, Yale gave me a day off from teaching today and I have finally been able to get to a few things on my "to-do" list. One of them is to write a blog post about the IEEE Computer CAP Retrospective edition and make my paper that appeared inside of it publicly available.

Earlier this year, the IEEE Computer magazine came out with an issue largely devoted to a 12-year retrospective of the CAP theorem, containing several articles from distributed systems researchers that contribute various opinions and thoughts about CAP. The first article is from Eric Brewer, who coined the CAP theorem 12 years ago (though he points out in his article that it was actually 14 years ago); a PDF of Brewer’s article is available for free online. The second article is from Seth Gilbert and Nancy Lynch (the same Gilbert and Lynch who proved the CAP theorem 10 years ago).

The third article is from me, and contains my criticisms of CAP that long-time readers of my blog will be familiar with. In particular, I point out that many people assume that modern NoSQL systems relax consistency guarantees in order to gain availability due to the constraints of the CAP theorem, when the reality is that these systems give up on consistency even in the absence of network partitions, which is not required according to the CAP theorem. The reason why they give up on consistency is because of a desire to improve system latency, an increasingly important requirement in the modern impatient world. I then describe the latency-consistency tradeoff in more detail, and end the article with the PACELC reformulation of CAP that debuted on my blog over two years ago. With the permission of the IEEE, I am making a free version of this article available today. This article is the first time that the PACELC formulation and my thoughts on CAP appear in a scholarly article, which gives people a venue to refer to (bibtex code available here) when citing this work (you can stop citing a blog post!)

The fourth article is from Raghu Ramakrishnan, entitled “CAP and Cloud Data Management” and describes the PNUTS system that I have mentioned in the past as a good example of a system for which the consistency-latency tradeoff has had a more direct impact on the system design than the consistency-availability tradeoff of CAP. The fifth article is from Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell of Cornell University on overcoming CAP with soft-state replication. Unfortunately, I cannot find a free link to Raghu’s article, but it is accessible online if you have an IEEE account. The Birman et al. article can be found for free online.

If you have enjoyed my thoughts on CAP on this blog, I highly recommend you read each of these five articles. The Brewer article in particular acknowledges my past criticism of CAP not actually being about picking two of three out of C (consistency), A (availability), and P (partition tolerance) due to the fact that it does not make sense to reason about a system that is ‘CA’. (If there is no partition, any system can be both consistent and available --- the only question is what happens when there is a partition --- does consistency or availability get sacrificed?) Brewer uses this observation to lead into a nice generalization of the consistency-availability tradeoff. In particular, when a partition occurs, the system does three things: (1) detect that the partition occurred, (2) enter a partition mode that may or may not limit some operations, and (3) initiate some sort of reconciliation algorithm when the partition is fixed. Depending on how these three things are implemented, it is possible to obtain much of the spectrum between CP systems and AP systems. The article also contains a nice reference to the CRDT work by Shapiro et al. at INRIA. Overall, I strongly support Brewer’s approach to navigating this tradeoff. It also fits nicely with Mehul Shah’s talk at HPTS in the way that the spectrum between consistency and availability is explicitly considered at system design time, rather than trying to bolt consistency on top of an AP (eventually consistent) system after the fact (a wildly suboptimal endeavor).

While most of Brewer’s article focused on the consistency-availability tradeoff, Brewer also briefly acknowledges that “in its classic interpretation, the CAP theorem ignores latency”, and that some systems reduce consistency for latency (he even refers to the PNUTS example I used in my original blog post). I remain convinced that PACELC is the best way to reason about both of these tradeoffs in a single formulation: if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running as normal in the absence of partitions, how does the system trade off latency (L) and consistency (C)?

Tuesday, June 26, 2012

Defending Matt Welsh’s 'Startup University' Post

A week ago, Matt Welsh released a blog post on attaching a startup incubator to a university in order to create a funding model for some of the research that is performed at the university. Unfortunately, the beginning part of the blog post talked about the “inefficiency” of universities in terms of “producing real products” and the (perhaps overly dramatic) assertion that “nothing of practical value came out of [Matt’s] entire research career”. Although Matt has clarified that it was not his intention to indicate that the goal of academic research was to “produce real, shipping products that people could use”, many people interpreted the opening part of Matt’s post in that way, and reacted negatively (including, notably, Michael Mitzenmacher who responded in a comment and Joe Hellerstein who responded in his own blog post).

If we ignore the problems with the first part of Matt’s post, the rest of the post raises some important points and interesting ideas. As an academic who has spent large chunks of time spinning off a research project into a startup (HadoopDB was commercialized by Hadapt, which by most available metrics has been an example of a research lab-to-startup success story), many parts of Matt’s article rang true:
  1. Matt’s statement: “Most universities make starting a company painfully difficult when it comes to questions of IP ownership [and] licensing” was certainly true for Hadapt. It took way too long, and way too much effort to get an agreement in place. Part of the problem was discussed in the comment thread of Matt’s post --- licensing patents are much better aligned with the core mission of a university than accepting equity in start-ups.
  2.  Matt’s statement: “Most universities also make starting a company painfully difficult when it comes to […] forcing the academic's research to be dissociated with their commercial activities.” This was also true for me. I do not mean to criticize the university --- I absolutely understand the need for the conflict of interest safeguards because of the way that universities (and the assumptions of incoming students) are structured today. However, restructuring some of these assumptions in the way that Matt talks about may reduce the legal liabilities, and allow for fewer safeguards to have to be put in place. I also think that the students are hurt more than helped by some of these safeguards. For example, one of the PhD students involved in HadoopDB wanted to work part time for Hadapt while finishing his PhD. However, due to the COI legal complexities, he was forbidden from doing this and was forced to choose between Hadapt and the PhD program (he, of course, chose to take a leave of absence and join Hadapt).
  3.  Matt’s statement that academics starting companies “involves a high degree of risk (potentially career-ending for pre-tenure faculty)” obviously resonates with me. Whether or not Hadapt is successful, it has certainly taken my time away from publishing papers (though obviously, I'm still trying to publish as much as I can --- see, for example, my last post on the Calvin project). Since publication quantity and quality remain key statistics for academic success, any conscious reduction of them comes with a clear risk.
The bottom line is that I absolutely agree with Matt’s assertion that there are a lot of extremely intelligent faculty in academic institutions across the world that have made the mental calculation and decided that the benefits do not outweigh the risks in spinning off a startup from an ongoing research project. Whether or not this is a bad thing is up for debate --- it is certainly not the core mission of a university to spin off companies or produce real-world products. However most universities do have some number of applied fields, and measuring impact in applied fields is often initiated by looking at real-world deployments of the research ideas. Starting companies is clearly the most direct mechanism for translating research ideas to real-world impact. Hence, it’s probably not a controversial statement to assert that reducing some of the barriers to starting companies would allow faculty in applied fields to increase their impact, the primary goal of research.

Therefore, allowing for explicit relationships between research groups and university-sponsored start-up incubators, where the university invests in a start-up, with proceeds from such investments being used to sponsor additional research in the department, is an idea worth considering. I would, however, change a few things about Matt’s proposal:
  1.  I would not simply replace venture capital money with university money. Although it is easy to get into the trap of assuming that the venture capitalist simply trades investment dollars for equity in the company, it turns out that venture capitalists provide a lot of value in addition to their money. Seeing firsthand the difference at Hadapt before and after we got big-name venture capitalists behind us really drove this point home for me.  Therefore, I would recommend that the university partner with venture capitalists, or otherwise hire successful venture capitalists to work in-house (and continue to compensate them using the standard venture capital compensation schemes). Although the Kauffman report has recently shed some light into how poorly venture capital has performed over the last decade, the top venture capitalists have still done very well, and it is important to remember that the goal for the university is not to turn a profit on the investment, but rather to increase the number of startups coming out of the university, in order to increase the research impact. Break-even performance or even small amounts of losses are totally acceptable.
  2.  The model will not work for every university. The location of the university is critical. Trying to get an incubator going for universities located in the middle of nowhere is a recipe for disaster. Technologists like to think that the technology that the company is commercializing is the most important factor in the company’s success. In fact, it falls way behind ‘market’ and ‘people’ as a determining factor. The company needs competent and experienced people throughout the organization --- the engineering team, marketing, sales, support, etc. Recruiting a competent team in a location that has seen few comparable companies is likely to be futile. Students from the university can only get you so far --- you need a mix of experienced people as well.
  3.  There need to be explicit mechanisms in place to reduce the risk for the faculty member. This means that the faculty member should get credit for certain company metrics at promotion or yearly evaluation time in addition to standard paper citation metrics. Company financial data is probably not a great metric, but customer counts of people actually using the technology, or even customer counts at “me-too” competitors could be used. Three years after publishing the original HadoopDB paper, there are real people using this technology to solve real problems. It’s pretty rare to see such an immediate impact, and it ought to count for something.
Obviously my own experiences have made me predisposed to liking Matt’s ideas, but I do encourage people to read the second half of his post independently of the first half.

Wednesday, May 16, 2012

If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

(This post is coauthored by Alexander Thomson and Daniel Abadi)
In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential---some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let’s ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Take a look at the TPC-C website’s top 10 results, ranked by raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?

The surprising truth is that these new DBMS technologies are not listed in the TPC-C top ten results not because they do not care enough to enter, but rather because they would not win if they did.

To understand why this is the case, one must understand that scalability does not come for free. Something must be sacrificed to achieve high scalability. Today, there are three major categories of tradeoff that can be exploited to make a system scale. The new technologies basically fall into two of these categories; Oracle and DB2 fall into a third. And the later parts of this blog post describe research from our group at Yale that introduces a fourth category of tradeoff that provides a roadmap to end the dominance of Oracle and DB2.

These categories are:

(1) Sacrifice ACID for scalability. Our previous post on this topic discussed this in detail. Basically, we argue that a major class of new scalable technologies falls under the category of “NoSQL”; these systems achieve scalability by dropping ACID guarantees, thereby allowing them to eschew two phase locking, two phase commit, and other impediments to concurrency and processor independence that hurt scalability. All of these systems that relax ACID are immediately ineligible to enter the TPC-C competition since ACID guarantees are one of TPC-C’s requirements. That’s why you don’t see NoSQL databases in the TPC-C top 10---they are immediately disqualified.

(2) Reduce transaction flexibility for scalability. There are many so-called “NewSQL” databases that claim to be both ACID-compliant and scalable. And these claims are true---to a degree. However, the fine print is that they are only linearly scalable when transactions can be completely isolated to a single “partition” or “shard” of data. While these NewSQL databases often hide the complexity of sharding from the application developer, they still rely on the shards to be fairly independent. As soon as a transaction needs to span multiple shards (e.g., update two different user records on two different shards in the same atomic transaction), then these NewSQL systems all run into problems. Some simply reject such transactions. Others allow them, but need to perform two phase commit or other agreement protocols in order to ensure ACID compliance (since each shard may fail independently). Unfortunately, agreement protocols such as two phase commit come at a great scalability cost (see our 2010 paper that explains why). Therefore, NewSQL databases only scale well if multi-shard transactions (also called “distributed transactions” or “multi-partition transactions”) are very rare. Unfortunately for these databases, TPC-C models a fairly reasonable retail application where customers buy products and the inventory needs to be updated in the same atomic transaction. 10% of TPC-C New Order transactions involve customers buying products from a “remote” warehouse, which is generally stored in a separate shard. Therefore, even for basic applications like TPC-C, NewSQL databases lose their scalability advantages. That’s why the NewSQL databases do not enter TPC-C results --- even just 10% of multi-shard transactions causes their performance to degrade rapidly.
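The agreement cost is easy to see in a minimal sketch of two-phase commit (illustrative Python, not any particular system's implementation): the coordinator needs one full round of prepare messages and one full round of decision messages, and every shard holds its locks for the entire exchange.

```python
# Minimal two-phase commit sketch. The point: every participating shard
# holds its locks from prepare() until commit()/abort(), spanning two full
# network round trips --- which is what destroys multi-shard scalability.

class Shard:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self, txn):
        self.state = "prepared"        # locks held from here on
        return self.can_commit         # vote yes/no

    def commit(self, txn):
        self.state = "committed"       # locks finally released

    def abort(self, txn):
        self.state = "aborted"

def two_phase_commit(shards, txn):
    # Phase 1: collect votes; all shards block holding locks.
    votes = [s.prepare(txn) for s in shards]
    decision = all(votes)
    # Phase 2: broadcast the decision; only now are locks released.
    for s in shards:
        (s.commit if decision else s.abort)(txn)
    return decision

shards = [Shard(), Shard(), Shard(can_commit=False)]
print(two_phase_commit(shards, "new-order"))  # False: one 'no' vote aborts all
```

In a real deployment each `prepare` and `commit` call is a network message plus a durable log write, so the lock-hold time per distributed transaction is measured in milliseconds, throttling throughput on any contended record.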

(3) Trade cost for scalability. If you use high end hardware, it is possible to get stunningly high transactional throughput using old database technologies that don’t have shared-nothing horizontal scalability. Oracle tops TPC-C with an incredibly high throughput of 500,000 transactions per second. There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions---machine-generated transactions are a different story). Therefore, Oracle basically has all the scalability that is needed for human scale applications. The only downside is cost---the Oracle system that is able to achieve 500,000 transactions per second costs a prohibitive $30,000,000!

Since the first two types of tradeoffs are immediate disqualifiers for TPC-C, the only remaining thing to give up is cost-for-scale, and that’s why the old database technologies are still dominating TPC-C. None of these new technologies can handle both ACID and 10% remote transactions.

A fourth approach...

TPC-C is a very reasonable application. New technologies should be able to handle it. Therefore, at Yale we set out to find a new dimension in this tradeoff space that could allow a system to handle TPC-C at scale without costing $30,000,000. Indeed, we are presenting a paper next week at SIGMOD (see the full paper) that describes a system that can achieve 500,000 ACID-compliant TPC-C New Order transactions per second using commodity hardware in the cloud. The cost to us to run these experiments was less than $300 (of course, this is renting hardware rather than buying, so it’s hard to compare prices --- but still --- a factor of 100,000 less than $30,000,000 is quite large).
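A quick back-of-the-envelope check on the numbers quoted above (using the figures as stated; the rental-vs-purchase caveat still applies):

```python
# Sanity-checking the cost comparison quoted in the text (assumed figures).
oracle_cost = 30_000_000      # dollars: TPC-C-topping Oracle system
calvin_cost = 300             # dollars: EC2 rental for the Calvin experiments
throughput = 500_000          # New Order transactions per second, both systems

print(oracle_cost / throughput)    # dollars per tps of capacity: 60.0
print(oracle_cost // calvin_cost)  # cost ratio: 100000
```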

Calvin, our prototype system designed and built by a large team of researchers at Yale that includes Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal (in addition to the authors of this blog post), explores a tradeoff very different from the three described above. Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture. In other words, Calvin is designed to handle high-volume OLTP throughput on sharded databases on cheap, commodity hardware stored locally or in the cloud. Calvin significantly improves the scalability over our previous approach to achieving determinism in database systems.

Scaling ACID

The key to Calvin’s strong performance is that it reorganizes the transaction execution pipeline normally used in DBMSs according to the principle: do all the "hard" work before acquiring locks and beginning execution. In particular, Calvin moves the following stages to the front of the pipeline:

  • Replication. In traditional systems, replicas agree on each modification to database state only after some transaction has made the change at some "master" replica. In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute.
  • Agreement between participants in distributed transactions. Database systems traditionally use two-phase commit (2PC) to handle distributed transactions. In Calvin, every node sees the same global sequence of transaction requests, and is able to use this already-agreed-upon information in place of a commit protocol.
  • Disk accesses. In our VLDB 2010 paper, we observed that deterministic systems performed terribly in disk-based environments due to holding locks for the 10ms+ duration of reading the needed data from disk, since they cannot reorder conflicting transactions on the fly. Calvin gets around this setback by prefetching into memory all records that a transaction will need during the replication phase---before locks are even acquired.

As a result, each transaction’s user-specified logic can be executed at each shard with an absolute minimum of runtime synchronization between shards or replicas to slow it down, even if the transaction’s logic requires it to access records at multiple shards. By minimizing the time that locks are held, concurrency can be greatly increased, thereby leading to near-linear scalability on a commodity cluster of machines.
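The scheduling idea can be sketched as a lock manager that grants locks strictly in the agreed global order (a toy Python illustration, not Calvin's actual implementation): a transaction runs as soon as it reaches the head of every queue it is waiting in, with no deadlock detection and no commit protocol.

```python
# Toy sketch of ordered lock acquisition. Because every node already knows
# the global transaction sequence, lock requests are enqueued in that order
# and a transaction executes the moment it heads all of its queues.
from collections import defaultdict, deque

class OrderedLockManager:
    def __init__(self):
        self.queues = defaultdict(deque)   # record -> waiting txn ids, in order

    def request(self, txn_id, records):
        # Must be called in global sequence order.
        for r in sorted(records):
            self.queues[r].append(txn_id)

    def runnable(self, txn_id, records):
        # Executable once at the head of every queue it needs.
        return all(self.queues[r][0] == txn_id for r in records)

    def release(self, txn_id, records):
        for r in records:
            assert self.queues[r].popleft() == txn_id

lm = OrderedLockManager()
lm.request(1, {"a", "b"})              # global order: txn 1 before txn 2
lm.request(2, {"b", "c"})
print(lm.runnable(1, {"a", "b"}))      # True
print(lm.runnable(2, {"b", "c"}))      # False: waits behind txn 1 on "b"
lm.release(1, {"a", "b"})
print(lm.runnable(2, {"b", "c"}))      # True
```

Since every node processes the same sequence, no runtime agreement is needed about who goes first; the only synchronization left is waiting for conflicting predecessors to finish.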

Strongly consistent global replication

Calvin’s deterministic execution semantics provide an additional benefit: replicating transactional input is sufficient to achieve strongly consistent replication. Since replicating batches of transaction requests is extremely inexpensive and happens before the transactions acquire locks and begin executing, Calvin’s transactional throughput capacity does not depend at all on its replication configuration.

In other words, not only can Calvin run 500,000 transactions per second on 100 EC2 instances in Amazon’s US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon’s Europe (Ireland) and US West (California) data centers---at no cost to throughput.

Calvin accomplishes this by having replicas perform the actual processing of transactions completely independently of one another, maintaining strong consistency without having to constantly synchronize transaction results between replicas. (Calvin’s end-to-end transaction latency does depend on message delays between replicas, of course---there is no getting around the speed of light.)
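The replication scheme amounts to shipping the ordered batch of transaction *requests* rather than their results; a toy sketch (hypothetical operations, not Calvin's API) shows why deterministic execution makes this sufficient:

```python
# Toy illustration of input replication: every replica receives the same
# ordered log of transaction requests and executes it deterministically,
# so replicas stay identical without ever synchronizing results.

def apply_log(log):
    state = {}
    for op, key, val in log:
        if op == "set":
            state[key] = val
        elif op == "incr":
            state[key] = state.get(key, 0) + val
    return state

log = [("set", "a", 1), ("incr", "a", 4), ("set", "b", 7)]
us_east = apply_log(log)   # each replica processes independently
eu_west = apply_log(log)
print(us_east == eu_west)  # True: no result synchronization needed
```

Note that this only works because `apply_log` is deterministic; a system that allowed non-deterministic aborts or reordering during execution would need to replicate outcomes instead of inputs.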

Flexible data model

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access). The experiments presented in the paper use a custom key-value store. More recently, we’ve hooked Calvin up to Google’s LevelDB and added support for SQL-based data access within transactions, building relational tables on top of LevelDB’s efficient sorted-string storage.

From an application developer’s point of view, Calvin’s primary limitation compared to other systems is that transactions must be executed entirely server-side. Calvin has to know in advance what code will be executed for a given transaction. Users may pre-define transactions directly in C++, or submit arbitrary Python code snippets on-the-fly to be parsed and executed as transactions.

For some applications, this requirement of completely server-side transactions might be a difficult limitation. However, many applications prefer to execute transaction code on the database server anyway (in the form of stored procedures), in order to avoid multiple round trip messages between the database server and application server in the middle of a transaction.

If this limitation is acceptable, Calvin presents a nice alternative in the tradeoff space to achieving high scalability without sacrificing ACID or multi-shard transactions. Hence, we believe that our SIGMOD paper may present a roadmap for overcoming the scalability dominance of the decades-old database solutions on traditional OLTP workloads. We look forward to debating the merits of this approach in the weeks ahead (and Alex will be presenting the paper at SIGMOD next week).