Wednesday, December 7, 2011

Replication and the latency-consistency tradeoff

As 24/7 availability becomes increasingly important for modern applications, database systems are frequently replicated in order to stay up and running in the face of database server failure. It is no longer acceptable for an application to wait for a database to recover from a log on disk --- most mission-critical applications need immediate failover to a replica.

There are several important tradeoffs to consider when it comes to system design for replicated database systems. The most famous one is CAP --- you have to trade off consistency vs. availability in the event of a network partition. In this post, I will go into detail about a lesser-known but equally important tradeoff --- between latency and consistency. Unlike CAP, where consistency and availability are only traded off in the event of a network partition, the latency vs. consistency tradeoff is present even during normal operations of the system. (Note: the latency-consistency tradeoff discussed in this post is the same as the "ELC" case in my PACELC post).

The intuition behind the tradeoff is the following: there's no way to perform consistent replication across database replicas without some level of synchronous network communication. This communication takes time and introduces latency. For replicas that are physically close to each other (e.g., on the same switch), this latency is not necessarily onerous. But replication over a WAN will introduce significant latency.

The rest of this post adds more meat to the above intuition. I will discuss several general techniques for performing replication, and show how each technique trades off latency or consistency. I will then discuss several modern implementations of distributed database systems and show how they fit into the general replication techniques that are outlined in this post.

There are only three alternatives for implementing replication (each with several variations): (1) data updates are sent to all replicas at the same time, (2) data updates are sent to an agreed-upon master node first, or (3) data updates are sent to a single (arbitrary) node first. Each of these three cases can be implemented in various ways; however, each implementation comes with a consistency-latency tradeoff. This is described in detail below.

  1. Data updates are sent to all replicas at the same time. If updates are not first passed through a preprocessing layer or some other agreement protocol, replica divergence (a clear lack of consistency) could ensue (assuming there are multiple updates to the system that are submitted concurrently, e.g., from different clients), since each replica might choose a different order in which to apply the updates. On the other hand, if updates are first passed through a preprocessing layer, or all nodes involved in the write use an agreement protocol to decide on the order of operations, then it is possible to ensure that all replicas will agree on the order in which to process the updates, but this leads to several sources of increased latency. For the case of the agreement protocol, the protocol itself is the additional source of latency. For the case of the preprocessor, the additional sources of latency are:

    1. Routing updates through an additional system component (the preprocessor) increases latency.

    2. The preprocessor consists of either multiple machines or a single machine. If it consists of multiple machines, an agreement protocol to decide on operation ordering is still needed across those machines. Alternatively, if it runs on a single machine, all updates, no matter where they are initiated (potentially anywhere in the world), are forced to route all the way to the single preprocessor first, even if there is a data replica that is nearer to the location where the update was initiated.


  2. Data updates are sent to an agreed-upon location first (this location can be dependent on the actual data being updated) --- we will call this the “master node” for a particular data item. This master node resolves all requests to update the same data item, and the order that it picks to perform these updates will determine the order in which all replicas perform the updates. After it resolves updates, it replicates them to all replica locations. There are three options for this replication:

    1. The replication is done synchronously, meaning that the master node waits until all updates have made it to the replica(s) before "committing" the update. This ensures that the replicas remain consistent, but synchronous actions across independent entities (especially if they occur over a WAN) increase latency, due to the requirement to pass messages between these entities and the fact that latency is limited by the speed of the slowest entity.

    2. The replication is done asynchronously, meaning that the update is treated as if it were completed before it has been replicated. Typically the update has at least made it to stable storage somewhere before the initiator of the update is told that it has completed (in case the master node fails), but there are no guarantees that the update has been propagated to replicas. The consistency-latency tradeoff in this case is dependent on how reads are dealt with:
      1. If all reads are routed to the master node and served from there, then there is no reduction in consistency. However, there are several latency problems with this approach:
        1. Even if there is a replica close to the initiator of the read request, the request must still be routed to the master node, which could potentially be physically much farther away.

        2. If the master node is overloaded with other requests or has failed, there is no option to serve the read from a different node. Rather, the request must wait for the master node to become free or recover. In other words, there is a potential for increased latency due to lack of load balancing options.

      2. If reads can be served from any node, read latency is much better, but this can result in inconsistent reads of the same data item, since different locations have different versions of a data item while its updates are still being propagated, and a read can potentially be sent to any of these locations. Although the level of reduced consistency can be bounded by keeping track of update sequence numbers and using them to implement “sequential/timeline consistency” or “read-your-writes consistency”, these options are nonetheless reduced consistency options. Furthermore, write latency can be high if the master for a write operation is geographically far away from the requester of the write.

    3. A combination of (a) and (b) is possible. Updates are sent to some subset of replicas synchronously, and the rest asynchronously. The consistency-latency tradeoff in this case again is determined by how reads are dealt with. If reads are routed to at least one node that had been synchronously updated (e.g. when R + W > N in a quorum protocol, where R is the number of nodes involved in a synchronous read, W is the number of nodes involved in a synchronous write, and N is the number of replicas), then consistency can be preserved, but the latency problems of (a), (b)(i)(1), and (b)(i)(2) are all present (though to somewhat lower degrees, since the number of nodes involved in the synchronization is smaller, and there is potentially more than one node that can serve read requests). If it is possible for reads to be served from nodes that have not been synchronously updated (e.g. when R + W <= N), then inconsistent reads are possible, as in (b)(ii) above. (A toy sketch of this quorum condition appears after this list.)

  3. Data updates are sent to an arbitrary location first, the updates are performed there, and are then propagated to the other replicas. The difference between this case and case (2) above is that the location that updates are sent to for a particular data item is not always the same. For example, two different updates for a particular data item can be initiated at two different locations simultaneously. The consistency-latency tradeoff again depends on which of two options is chosen:
    1. If replication is done synchronously, then the latency problems of case (2)(a) above are present. Additionally, extra latency can be incurred in order to detect and resolve cases of simultaneous updates to the same data item initiated at two different locations.

    2. If replication is done asynchronously, then similar consistency problems as described in cases (1) and (2)(b) above present themselves.
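
To make the quorum condition in option (2)(c) concrete, here is a minimal Python sketch of a toy replicated register (my own illustration, not any particular system): a write is applied synchronously to W of the N replicas, and a read contacts R replicas and returns the freshest version it sees. When R + W > N the read quorum is guaranteed to overlap the write quorum; when R + W <= N it is not.

```python
import random

class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

def write(replicas, w, value, version):
    """Apply the write synchronously to W replicas; the remaining replicas lag behind."""
    for replica in random.sample(replicas, w):
        replica.version, replica.value = version, value

def read(replicas, r):
    """Contact R replicas and return the freshest (version, value) pair seen."""
    freshest = max(random.sample(replicas, r), key=lambda rep: rep.version)
    return freshest.version, freshest.value

N = 5
replicas = [Replica() for _ in range(N)]
write(replicas, w=3, value="x=1", version=1)

# R + W > N (3 + 3 > 5): the read quorum must overlap the write quorum,
# so this read is guaranteed to observe version 1.
print(read(replicas, r=3))

# R + W <= N (2 + 3 <= 5): the read quorum can miss every synchronously
# written replica and return the stale version 0 -- lower latency, weaker consistency.
print(read(replicas, r=2))
```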

Therefore, no matter how the replication is performed, there is a tradeoff between consistency and latency. For carefully controlled replication across short distances, there exist reasonable options (e.g. choice (2)(a) above, since network communication latency is small in local data centers); however, for replication over a WAN, there is no way around the significant consistency-latency tradeoff.

To more fully understand the tradeoff, it is helpful to consider how several well-known distributed systems are placed into the categories outlined above. Dynamo, Riak, and Cassandra choose a combination of (2)(c) and (3) from the replication alternatives described above. In particular, updates generally go to the same node, and are then propagated synchronously to W other nodes (case (2)(c)). Reads are synchronously sent to R nodes with R + W typically being set to a number less than or equal to N, leading to a possibility of inconsistent reads. However, the system does not always send updates to the same node for a particular data item (e.g., this can happen in various failure cases, or due to rerouting by a load balancer), which leads to the situation described in alternative (3) above, and the potentially more substantial types of consistency shortfalls. PNUTS chooses option (2)(b)(ii) above, for excellent latency at reduced consistency. HBase chooses (2)(a) within a cluster, but gives up consistency for lower latency for replication across different clusters (using option (2)(b)).

In conclusion, there are two major reasons to reduce consistency in modern distributed database systems, and only one of them is CAP. Ignoring the consistency-latency tradeoff of replicated systems is a great oversight, since it is present at all times during system operation, whereas CAP is only relevant in the (arguably) rare case of a network partition. In fact, the consistency-latency tradeoff is potentially more significant than CAP, since it has a more direct effect on the baseline operations of modern distributed database systems.

Tuesday, October 4, 2011

Overview of the Oracle NoSQL Database

Oracle is the clear market leader in the commercial database community, and therefore it is critical for any member of the database community to pay close attention to the new product announcements coming out of Oracle’s annual Open World conference. The sheer size of Oracle’s sales force, entrenched customer base, and third-party ecosystem instantly gives any new Oracle product the potential for very high impact. Oracle’s new products require significant attention simply because they’re made by Oracle.

I was particularly eager for this year’s Oracle Open World conference, because there were rumors of two separate new Oracle products involving Hadoop and NoSQL --- two of the central research focuses of my database group at Yale --- one of them (Hadoop) also being the focus of my recent startup (Hadapt). Oracle’s Hadoop announcements, while very interesting from a business perspective (everyone is talking about how this “validates” Hadoop), are not so interesting from a technical perspective (the announcements seem to revolve around (1) creating a “connector” between Hadoop and Oracle, where Hadoop is used for ETL tasks, and the output of these tasks are then loaded over this connector to the Oracle DBMS and (2) packaging the whole thing into an appliance, which again is very important from a business perspective since there is certainly a market for anything that makes Hadoop easier to use, but does not seem to be introducing any technically interesting new contributions).

In contrast, the Oracle NoSQL database is actually a brand new system built by the Oracle BerkeleyDB team, and is therefore very interesting from a technical perspective. I therefore spent way too much time trying to find out as much as I could about this new system from a variety of sources. There is not yet a lot of publicly available information about the system; however, there is a useful whitepaper written by the illustrious Harvard professor Margo Seltzer, who has been working with Oracle since they acquired her start-up in 2006 (the aforementioned BerkeleyDB).

Due to the dearth of available information on the system, I thought that it would be helpful to the readers of my blog if I provided an overview of what I’ve learned about it so far. Some of the facts I state below come directly from Oracle; other facts are inferences that I’ve made, based on my understanding of the system architecture and implementation. As always, if I have made any mistakes in my inferences, please let me know, and I will fix them as soon as possible.

The coolest thing about the Oracle NoSQL database is that it is not a simple copy of a currently existing NoSQL system. It is not Dynamo or SimpleDB. It is not Bigtable or HBase. It is not Cassandra or Riak. It is not MongoDB or CouchDB. It is a new system that has chosen a different point (actually --- several different points) in the system-design tradeoff space than any of the above-mentioned systems. Since it makes a different set of tradeoffs, it is entirely inappropriate to call it “better” or “worse” than any of these systems. There will be situations where the Oracle solution will be more appropriate, and there will be situations where other systems will be more appropriate.

Overview of the system:
Oracle NoSQL database is a distributed, replicated key-value store. Given a cluster of machines (in a shared-nothing architecture, with each machine having its own storage, CPU, and memory), each key-value pair is placed on several of these machines depending on the result of a hash function on the key. In particular, the key-value pair will be placed on a single master node, and a configurable number of replica nodes. All write and update operations for a key-value pair go to the master node for that pair first, and then to all replica nodes afterwards. This replication is typically done asynchronously, but it is possible to request that it be done synchronously if one is willing to tolerate the higher latency costs. Read operations can go to any node if the user doesn’t mind incomplete consistency guarantees (i.e. reads might not see the most recent data), but they must be served from the master node if the user requires the most recent value for a data item (unless replication is done synchronously). There is no SQL interface (it is a NoSQL system after all!) --- rather, it supports simple insert, update, and delete operations on key-value pairs.
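
The following sketch is my attempt to illustrate the write and read paths described above. The node names, hash choice, and function names are hypothetical (this is not Oracle's actual API); it simply captures the master-per-key placement and the synchronous vs. asynchronous replication choice.

```python
import hashlib

# Hypothetical cluster state: five nodes, each modeled as a plain dict.
NODES = {name: {} for name in ["node0", "node1", "node2", "node3", "node4"]}
REPLICATION_FACTOR = 3

def placement(key):
    """Hash the key to choose a master node plus (REPLICATION_FACTOR - 1) replicas."""
    names = sorted(NODES)
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(names)
    chosen = [names[(start + i) % len(names)] for i in range(REPLICATION_FACTOR)]
    return chosen[0], chosen[1:]               # (master, replicas)

def put(key, value, synchronous=False):
    master, replicas = placement(key)
    NODES[master][key] = value                 # the write always goes to the master first
    if synchronous:
        for r in replicas:                     # wait for every replica: consistent, higher latency
            NODES[r][key] = value
    # else: replicas would be updated asynchronously in the background, so a read
    # served by a replica may briefly return stale (or missing) data

def get(key, require_latest=False):
    master, replicas = placement(key)
    if require_latest:
        return NODES[master].get(key)          # route to the master for the latest value
    return NODES[replicas[0]].get(key)         # any replica may serve the read (possibly stale)

put("user:42", "v1")                           # asynchronous by default in this sketch
print(get("user:42", require_latest=True))     # -> "v1" (served by the master)
print(get("user:42"))                          # -> None here, since replicas were not updated
```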

The following is where the Oracle NoSQL Database falls in various key dimensions:

CAP
Like many NoSQL databases, the Oracle NoSQL Database is configurable to be either C/P or A/P in CAP. In particular, if writes are configured to be performed synchronously to all replicas, it is C/P in CAP --- a partition or node failure causes the system to be unavailable for writes. If replication is performed asynchronously, and reads are configured to be served from any replica, it is A/P in CAP --- the system is always available, but there is no guarantee of consistency. [Edit: Actually this configuration is really just P of CAP --- minority partitions become unavailable for writes (see comments about eventual consistency below). This violates the technical definition of "availability" in CAP. However, it is obviously the case that the system still has more availability in this case than the synchronous write configuration.]

Eventual consistency
Unlike Dynamo, SimpleDB, Cassandra, or Riak, the Oracle NoSQL Database does not support eventual consistency. I found this to be extremely amusing, since Oracle’s marketing material associates NoSQL with the BASE acronym. But the E in BASE stands for eventual consistency! So by Oracle’s own definition, their lack of support of eventual consistency means that their NoSQL Database is not actually a NoSQL Database! (In my opinion, their database is really NoSQL --- they just need to fix their marketing literature that associates NoSQL with BASE). My proof for why the Oracle NoSQL Database does not support eventual consistency is the following: Let’s say the master node for a particular key-value pair fails, or a network partition separates the master node from its replica nodes. The key-value pair becomes unavailable for writes for a short time until the system elects a new master node from the replicas. Writes can then continue at the new master node. However, any writes that had been submitted to the old master node, but had not yet been sent to the replicas before the master node failure (or partition), are lost. In an eventually consistent system, these old writes can be reconciled with the current state of the key-value pair after the failed node recovers its log from stable storage, or when the network partition is repaired. Of course, if replication had been configured to be done synchronously (at a cost of latency), there will not be data loss during network partitions or node failures. Therefore, there is a fundamental difference between the Oracle NoSQL database system and eventually consistent NoSQL systems: while eventually consistent NoSQL systems choose to trade off consistency for latency and availability during failure and network partition events, the Oracle NoSQL system instead trades off durability for latency and availability. To be clear, this difference is only for inserts and updates --- the Oracle NoSQL database is able to trade off consistency for latency on read requests --- it supports similar types of timeline consistency tradeoffs as the Yahoo PNUTS/Sherpa system.
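
The following toy trace (my own illustration, not Oracle code) walks through the failure scenario above: an asynchronously replicated write is acknowledged by the master and then lost when a replica takes over.

```python
# Toy trace of the scenario above: a master and two replicas, all currently storing "v1".
master   = {"k": "v1"}
replicas = [{"k": "v1"}, {"k": "v1"}]

master["k"] = "v2"            # the update is applied at the master ...
acknowledged = True           # ... and acknowledged to the client, but not yet replicated

# The master now fails (or is partitioned away) before propagating "v2".
new_master = replicas[0]      # a replica is elected as the new master; writes resume here

print(new_master["k"])        # -> "v1": the acknowledged write "v2" has been lost
# An eventually consistent system would instead retain "v2" (e.g., in the old master's
# recovered log or as a divergent version) and reconcile it once the failure heals.
```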

[Two of the members of the Oracle NoSQL Database team have commented below. There is a little bit of a debate about my statement that the Oracle NoSQL Database lacks eventual consistency, but I stand by the text I wrote above. For more, see the comments.]

Joins
Like most NoSQL systems, the Oracle NoSQL database does not support joins. It only supports simple read, write, update, and delete operations on key-value pairs.

Data Model
The Oracle NoSQL database actually has a more subtle data model than simple key-value pairs. In particular, the key is broken down into a “major key path” and “minor key path” where all keys with the same “major key path” are guaranteed to be stored on the same physical node. I expect that the way minor keys will be used in the Oracle NoSQL database will map directly to the way column families are used in Bigtable, HBase and Cassandra. Rather than trying to gather together every possible attribute about a key in a giant “value” for the single key-value pair, you can separate them into separate key-value pairs where the “major key path” is the same for all the keys in the set of key-value pairs, but the “minor key path” will be different. This is similar to how column families for the same key in Bigtable, HBase, and Cassandra can also be stored separately. Personally, I find the major and minor key path model to be more elegant than the column family model (I have ranted against column-families in the past).
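
As a hypothetical illustration of this data model (the key layout below is my own sketch, not the Oracle NoSQL Database API), here is how the attributes of one entity can be stored as separate key-value pairs that share a major key path and are therefore guaranteed to be co-located on one node:

```python
# Each key is (major_key_path, minor_key_path); everything sharing a major key path
# is guaranteed (in this sketch) to live on the same physical node.
store = {
    (("users", "alice"), ("profile",)):  {"name": "Alice", "city": "New Haven"},
    (("users", "alice"), ("settings",)): {"theme": "dark"},
    (("users", "alice"), ("friends",)):  ["bob", "carol"],
    (("users", "bob"),   ("profile",)):  {"name": "Bob"},
}

def records_for(major_path):
    """Return every key-value pair that shares the given major key path."""
    return {minor: value
            for (major, minor), value in store.items()
            if major == major_path}

# Analogous to reading all column families for one row key in Bigtable/HBase/Cassandra:
print(records_for(("users", "alice")))
```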

ACID compliance
Like most NoSQL systems, the Oracle NoSQL database is not ACID compliant. Besides the durability and consistency tradeoffs mentioned above, the Oracle NoSQL database also does not support arbitrary atomic transactions (the A in ACID). However, it does support atomic operations on the same key, and even allows atomic transactions on sets of keys that share the same major key path (since keys that share the same major key path are guaranteed to be stored on the same node, atomic operations can be performed without having to worry about distributed commit protocols across multiple machines).
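
A sketch of the restriction described above (again hypothetical, not Oracle's API): a batch of writes can be applied atomically only if every key shares the same major key path, since the whole batch then lands on a single node and no distributed commit protocol is needed.

```python
def execute_atomically(store, operations):
    """Apply a batch of (key, value) writes all-or-nothing, but only if every key
    shares one major key path (i.e., the whole batch lives on a single node)."""
    majors = {major for (major, _minor), _value in operations}
    if len(majors) != 1:
        raise ValueError("atomic batches must stay within a single major key path")
    for key, value in operations:      # single-node apply; no distributed commit needed
        store[key] = value

store = {}
execute_atomically(store, [
    ((("users", "alice"), ("profile",)),  {"name": "Alice"}),
    ((("users", "alice"), ("settings",)), {"theme": "dark"}),
])

# A batch spanning ("users", "alice") and ("users", "bob") would be rejected,
# because those keys may live on different nodes.
```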

Summary
The sweet spot for the Oracle NoSQL database seems to be in single-rack deployments (e.g. the Oracle Big Data appliance) with a low-latency network, so that the system can be set up to use synchronous replication while keeping the latency costs of this type of replication small (and the probability of network partitions is small). Another sweet spot is wider-area deployments where the application is able to work around reduced durability guarantees. It therefore seems to present the largest amount of competition for NoSQL databases like MongoDB, which have similar sweet spots. However, the Oracle NoSQL database will need to add additional “developer-friendly” features if it wants to compete head-to-head with MongoDB. Either way, there are clearly situations where the Oracle NoSQL database will be a great fit, and I love that Oracle (in particular, the Oracle BerkeleyDB team) built this system from scratch as an interesting and technically distinct alternative to currently available NoSQL systems. I hope Oracle continues to invest in the system and the team behind it.

Tuesday, July 19, 2011

Hadoop's tremendous inefficiency on graph data management (and how to avoid it)

Hadoop is great. It seems clear that it will serve as the basis of the vast majority of analytical data management within five years. Already today it is extremely popular for unstructured and polystructured data analysis and processing, since it is hard to find other options that are superior from a price/performance perspective. The reader should not take the following as me blasting Hadoop. I believe that Hadoop (with its ecosystem) is going to take over the world.

The problem with Hadoop is that its strength is also its weakness. Hadoop gives the user tremendous flexibility and power to scale all kinds of different data management problems. This is obviously great. But it is this same flexibility that allows the user to perform incredibly inefficient things and not care, because (a) they can simply add more machines and use Hadoop's scalability to hide inefficiency in user code, and (b) they can convince themselves that since everyone talks about Hadoop as being designed for "batch data processing" anyways, they can let their process run in the background and not care about how long it will take for it to return.

Although not the subject of this post, an example of this inefficiency can be found in a SIGMOD paper that a bunch of us from Yale and the University of Wisconsin published 5 weeks ago. The paper shows that using Hadoop on structured (relational) data is at least a factor of 50 less efficient than it needs to be (an incredibly large number given how hard data center administrators work to yield less than a factor of two improvement in efficiency). As many readers of this blog already know, this factor of 50 improvement is the reason why Hadapt was founded. But this post is not about Hadapt or relational data. In this post, the focus is on graph data, and how if one is not careful, using Hadoop can be well over a factor of 1000 less efficient than it needs to be.

Before we get into how to improve Hadoop's efficiency on graph data by a factor of 1000, let's pause for a second to comprehend how dangerous it is to let inefficiencies in Hadoop become widespread. Imagine a world where the vast majority of data processing runs on Hadoop (a not entirely implausible scenario). If people allow these factors of 50 or 1000 to exist in their Hadoop utilization, these inefficiency factors translate directly to factors of 50 or 1000 more power utilization, more carbon emissions, more data center space, and more silicon waste. The disastrous environmental consequences in a world where everyone standardizes on incredibly inefficient technology are downright terrifying. And this is ignoring the impact on businesses in terms of server and energy costs, and lower performance. It seems clear that developing a series of "best practices" around using Hadoop efficiently is going to be extremely important moving forward.

Let's delve into the subject of graph data in more detail. Recently there was a paper by Rohloff et al. that showed how to store graph data (represented in vertex-edge-vertex "triple" format) in Hadoop, and perform sub-graph pattern matching in a scalable fashion over this graph of data. The particular focus of the paper is on Semantic Web graphs (where the data is stored in RDF and the queries are performed in SPARQL), but the techniques presented in the paper are generalizable to other types of graphs. This paper and the resulting system (called SHARD) have received significant publicity, including a presentation at HadoopWorld 2010, a presentation at DIDC 2011, and a feature on Cloudera's Website. In fact, it is a very nice technique. It leverages Hadoop to scale sub-graph pattern matching (something that has historically been difficult to do); and by aggregating all outgoing edges for a given vertex into the same key-value pair in Hadoop, it even scales queries in a way that is 2-3 times more efficient than the naive way to use Hadoop for the same task.
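
To show the aggregation idea concretely (this is a minimal sketch, not the actual SHARD implementation), here is how outgoing edges can be grouped by their source vertex, the way a MapReduce job would group triples by subject:

```python
from collections import defaultdict

# Toy triple data (subject, predicate/edge label, object).
triples = [
    ("alice", "advises",  "bob"),
    ("alice", "worksAt",  "yale"),
    ("bob",   "memberOf", "dbGroup"),
]

# "Map": emit each triple keyed by its subject vertex.
# "Reduce": collect all outgoing edges of that vertex into one key-value pair.
adjacency = defaultdict(list)
for subject, predicate, obj in triples:
    adjacency[subject].append((predicate, obj))

# One lookup now returns every outgoing edge of a vertex, which is what lets
# sub-graph pattern matching proceed roughly one hop per MapReduce job.
print(adjacency["alice"])   # -> [('advises', 'bob'), ('worksAt', 'yale')]
```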

The only problem is that, as shown by an upcoming VLDB paper that we're releasing today, this technique is an astonishing factor of 1340 times less efficient than an alternative technique for processing sub-graph pattern matching queries within a Hadoop-based system that we introduce in our paper. Our paper, led by my student, Jiewen Huang, achieves these enormous speedups in the following ways:

  1. Hadoop, by default, hash partitions data across nodes. In practice (e.g., in the SHARD paper) this results in data for each vertex in the graph being randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). Therefore, data that is close to each other in the graph can end up very far away from each other in the cluster, spread out across many different physical machines. For graph operations such as sub-graph pattern matching, this is wildly suboptimal. For these types of operations, the graph is traversed by passing through neighbors of vertexes; it is hugely beneficial if these neighbors are stored physically near each other (ideally on the same physical machine). When using hash partitioning, since there is no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency. Using a clustering algorithm to graph partition data across nodes in the Hadoop cluster (instead of using hash partitioning) is a big win.

  2. Hadoop, by default, has a very simple replication algorithm, where all data is generally replicated a fixed number of times (e.g. 3 times) across the cluster. Treating all data equally when it comes to replication is quite inefficient. If data is graph partitioned across a cluster, the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. This is because vertexes that are on the border of a partition might have several of their neighbors stored on different physical machines. For the same reasons why it is a good idea to graph partition data to keep graph neighbors local, it is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors. Hence, allowing different data to be replicated at different factors can further improve system efficiency. (A toy sketch of partition-aware placement with this kind of border replication follows this list.)

  3. Hadoop, by default, stores data on a distributed file system (HDFS) or a sparse NoSQL store (HBase). Neither of these data stores is optimized for graph data. HDFS is optimized for unstructured data, and HBase for semi-structured data. But there has been significant research in the database community on creating optimized data stores for graph-structured data. Using a suboptimal store for the graph data is another source of tremendous inefficiency. By replacing the physical storage system with graph-optimized storage, but keeping the rest of the system intact (similar to the theme of the HadoopDB project), it is possible to greatly increase the efficiency of the system.
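
Here is the toy sketch of improvements 1 and 2 referenced above (my own illustration, not the system from the paper), assuming a graph partitioner has already assigned vertexes to partitions: each machine stores its own partition plus replicas of the border vertexes it needs from other partitions, so one-hop traversals stay local.

```python
graph = {                      # vertex -> neighbors (a tiny undirected example)
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],      # "c" has a neighbor in the other partition
    "d": ["c", "e"],           # so does "d"
    "e": ["d"],
}
partitions = {0: {"a", "b", "c"}, 1: {"d", "e"}}   # output of a graph partitioner

def home(vertex):
    """The partition (machine) that owns a vertex."""
    return next(p for p, members in partitions.items() if vertex in members)

# Improvement 1: each machine stores an entire partition, not a hash-scattered slice.
machines = {p: set(members) for p, members in partitions.items()}

# Improvement 2: additionally replicate border vertexes, i.e., neighbors that live
# in a different partition, so one-hop traversals never cross the network.
for vertex, neighbors in graph.items():
    for neighbor in neighbors:
        if home(neighbor) != home(vertex):
            machines[home(vertex)].add(neighbor)

print(machines)   # -> {0: {'a', 'b', 'c', 'd'}, 1: {'c', 'd', 'e'}}
```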

To a first degree of approximation, each of the above three improvements yields an entire order of magnitude speedup (a factor of 10). By combining them, we therefore saw the factor of 1340 improvement in performance on the identical benchmark that was run in the SHARD paper. (For more details on the system architecture, partitioning and data placement algorithms, query processing, and experimental results, please see our paper).

It is important to note that since we wanted to run the same benchmark as the SHARD paper, we used the famous Lehigh University Benchmark (LUBM) for Semantic Web graph data and queries. Semantic Web sub-graph pattern matching queries tend to contain quite a lot of constants (especially on edge labels) relative to other types of graph queries. The next step for this project is to extend and benchmark the system on other graph applications (the types of graphs for which people tend to use systems based on Google's Pregel project today).

In conclusion, it is perfectly acceptable to give up a little bit of efficiency for improved scalability when using Hadoop. However, once this decrease in efficiency starts to reach a factor of two, it is likely a good idea to think about what is causing this inefficiency, and attempt to find ways to avoid it (while keeping the same scalability properties). Certainly once the factor extends beyond the factor of two (such as the enormous 1340 factor we discovered in our VLDB paper), the sheer waste in power and hardware cannot be ignored. This does not mean that Hadoop should be thrown away; however it will become necessary to package Hadoop with "best practice" solutions to avoid such unnecessarily high levels of waste.

Thursday, May 19, 2011

Why Sam Madden is wrong about peer review

Yesterday my former PhD advisor, Sam Madden, wrote a blog post consisting of a passionate defense of the status quo in the peer review process (though he does say that review quality needs to be improved). In an effort to draw attention to his blog (Sam is a super-smart guy, and you will get a lot out of reading his blog), I intend to start a flame war with him in this space.

At issue: The quality of reviews of research paper submissions in the database community is deteriorating rapidly. It is clear that something needs to be fixed. Jeff Naughton offered several suggestions for how to fix the problem in his ICDE keynote. A few days ago, I publicly supported his fifth suggestion (eliminating the review process altogether) on Twitter. Sam argued against this suggestion using five main points. Below I list each of Sam's points, and explain why everything he says is wrong:


Sam's point #1: Most of the submissions aren't very good. The review process does the community a favor in making sure that these bad papers do not get published.

My response: I think only a few papers are truly embarrassing, but who cares? Most of the videos uploaded to YouTube aren't very good. They don't in any way detract from the good videos that are uploaded. The cost of publishing a bad paper is basically zero if everybody knows that all papers will be accepted. The cost of rejecting a good paper, which then gets sent to a non-database venue and receives all kinds of publicity there, yields tremendous opportunity cost to the database community. Sam Madden should know this very well since (perhaps) his most famous paper fits in that category. The model of "accept everything and let the good submissions carry you" has always proven to be a better model than "let's have a committee of busy people who basically have zero incentive to do a good job (beyond their own ethical standards) decide what to accept" when the marginal cost of accepting an additional submission is near zero. In the Internet age, the good submissions (even from unknown authors) get their appropriate publicity with surprising speed (see YouTube, Hacker News, Quora, etc.).

Sam's point #2: If every paper is accepted, then how do we decide which papers get the opportunity to be presented at the conference? It seems we need a review committee at least for that.

My response: First of all, there might be fewer submissions under the "accept everything" model, since there will not be any resubmissions, and there is incentive for people to make sure that their paper is actually ready for publication before submitting it (because the onus of making sure their paper is not an embarrassment now falls on the authors and not on the PC --- assuming once something is published, you can't take it back). So it might be possible to just let everyone give a talk (if you increase the number of tracks). However, if that is not feasible, there are plenty of other options. For example, all papers are accepted immediately; over the course of one calendar year, each paper sits out there in public and can be cited by other papers. The top sixty papers in terms of citations after one year get to present at the conference. This only extends the delay between submission and the actual conference by 4 months --- today there is usually an 8 month delay while papers are being reviewed, and camera-ready papers are being prepared.

Sam's point #3: Eliminating the review system will discourage people from working hard on their papers.

My response: I could not disagree more. Instead of having your paper reviewed by three people in private, every problem, every flaw in logic, every typo is immediately out there in the public for people to look at and comment on. As long as submissions cannot be withdrawn, the fear of long term embarrassment yields enough incentive for the authors to ensure that the paper is in good shape at the time of submission.

Sam's point #4: Having papers in top conferences is an important metric for evaluating researchers.

My Response: This is a horrible, horrible metric, and being able to finally eliminate it might be the best outcome of switching to an "accept everything" model. Everybody knows that it is much easier to get a paper accepted that goes into tremendous depth on an extremely narrow (and ultimately inconsequential) problem than to write a broad paper that solves a higher level (and important) problem, but has less depth. The "paper counting" metric incentivizes people to write inconsequential papers. Good riddance.

Sam's point #5: Having papers accepted provides a form of validation, a way to measure progress and success. There is also some kind of psychological benefit.

My response: People who measure themselves in this way are doomed to failure. If you have a paper accepted that nobody ever reads or cites over the long term, you have made zero impact. Just because you managed to get a paper through three poor reviewers is no cause for celebration. We should be celebrating impact, not publication. Furthermore, I strongly disagree with the psychological benefit argument. Getting a paper rejected does FAR more psychological damage than getting a paper accepted does good.


In conclusion, it's time to eliminate the private peer review process and open it up to the public. All papers should be accepted for publication, and people should be encouraged to review papers in public (on their blogs, on Twitter, in class assignments that are published on the Web, etc.). Let the free market bring the good papers to the top and let the bad papers languish in obscurity. This is the way the rest of the Internet works. It's time to bring the database community to the Internet age. Imagine how much more research could be done if we didn't have to waste so much of the world's top researchers' time on PC duties and on revising good papers because they were improperly rejected. Imagine how many good researchers we have lost because of the psychological trauma of working really hard on a good paper, only to see it rejected. The current system is antiquated and broken, and the solution is obvious and easy to implement. It's time for a change.

Monday, March 28, 2011

Why I'm doing a start-up pre-tenure

Thanks to the tireless work of the entire Hadapt team, we had a very successful launch at GigaOM's Structure Big Data conference last week. In coming out of stealth, we told the world what we're doing (in short, we're building the only Big Data analytical platform architected from scratch to be (1) optimized for cloud deployments and (2) closely integrated with Hadoop so you don't need those annoying connectors to non-Hadoop-based data management systems anymore; i.e. we're bringing high performance SQL to Hadoop). Although a lot of people knew I was involved in a start-up, several people were surprised to find out at the launch how centrally involved I am in Hadapt, and I have received a lot of questions along the lines of what Maryland professor Jimmy Lin (@lintool) tweeted last week:

.@daniel_abadi wondering how the tenure track thing fits in with @Hadapt (r u on leave?) - but congrats on coming out of the Ivory tower! :)

Although Jimmy did not question my sanity in his tweet, others have, so I think it is time for me to explain my (hopefully rational) decision-making process that led me to start a company while still on the tenure track at Yale.

A few facts to get out of the way: although I am currently on teaching leave from Yale, I am not taking a complete leave of absence, which means my tenure clock is still ticking while I'm putting all this effort into Hadapt. The time I'm spending on Hadapt necessarily subtracts from the time I have available to spend on more traditional research activities of junior faculty (publishing papers, serving on program committees and editorial boards of publication venues, and attending conferences), which means that there is a huge risk that when I come up for tenure, if I am evaluated using traditional evaluation metrics, I will not have optimized my performance in these areas, and thereby will reduce the probability of receiving tenure. When I was considering starting Hadapt, I sent e-mails to several senior faculty members in my field and asked them if they could think of an example of a database systems professor doing a start-up while still a junior faculty member, and going on to eventually receive tenure (I desperately wanted a precedent that I could use to justify my decision). Not a single one of the people I e-mailed was able to think of such a case (in fact, one of them called the chair of my department to yell at him for even thinking of letting me start a company while still pre-tenure). Starting Hadapt is a gamble --- there's no doubt about it.

So why am I doing it? I want my research to make impact, which to me means that my research ideas should make it into real systems that are used by real people. Unfortunately for me, the research I enjoy the most is research that envisions complete system designs (rather than research on individual techniques that can be applied directly to today's systems). It's hard enough to publish these system design papers; but it's almost impossible to get other people to actually adopt your design in real-world deployments unless an extensive and complete prototype is available, or your design is already proven in real-world applications. For example, there have been many papers published by academics that fall in the same general space as the Google Bigtable paper. Yet the Bigtable paper has had a tremendous amount of impact, while the other papers languish in obscurity. Why? Because when Powerset and Zvents needed to implement a scalable real-time database, they felt safer using the design suggested in the Google paper (in their respective HBase and Hypertable projects) than the design from some other academic paper that has not been proven in the real world (even if the other design is more elegant and a better fit for the problem at hand).

Therefore, if you want to publish system design papers that make impact on the real world, you seemingly only have three choices:

(1) You can use the resources in your lab to build a complete prototype of your idea. That way, when people are considering using your idea, their risk is significantly reduced by trying out your system on their application without significant upfront development cost. Unfortunately, building a complete prototype is a much harder task than building enough of a prototype to get a paper published. It involves a ton of work to deal with all of the corner cases, and to make it work well out of the box --- this amount of work is far too much for a small handful of students to do (especially if they want to graduate before they retire). Therefore additional engineers must be hired to complete the prototype. In the DARPA glory days, this was possible --- I've heard stories of database projects burning over a million dollars per year to complete the engineering of an academic prototype. Unfortunately, those days are long gone. My attempts to get just one tiny programmer to build out the HadoopDB prototype were rebuffed by the National Science Foundation.

(2) You leave academia and work for Google, Yahoo, Facebook, IBM, etc. Matt Welsh has discussed in significant detail his decision to leave Harvard and do exactly that. This is a great solution in many ways --- it increases the probability of your research making impact by orders of magnitude, and has the added bonus of eliminating a lot of the administrative time sinks inherent in academic jobs. If I didn't love other aspects of being part of an academic community so much, this is certainly what I would do.

(3) You do a start-up. This is basically the same as choice (1), except you raise the money to build out the prototype from angel investors and venture capitalists instead of from the government (which typically funds the academic lab). The main downside is that starting a company is highly non-trivial, and you end up having to spend a lot of time in all kinds of non-technical tasks --- meeting with investors, meeting with potential customers, interviewing potential employees, investing the time to understand the market, coming up with a go-to-market strategy, attending board meetings, dealing with patents, participating in boring trade-shows, etc., etc., etc. It adds up to an extraordinary amount of time. It's also more competitive than academia --- there are far more people who want to see you fail in the start-up world than in academia, and some of these people go to great lengths to increase the probability of your failure. There are all kinds of hurdles that come up, and you need to have a strong will to overcome them. If it wasn't for the most determined person I have ever met, Justin Borgman, the CEO of Hadapt, we would never have made it to where we are today. It's hard to start a company, but in my mind, it was the only viable option if I wanted my three years of research on HadoopDB to make impact (Hadapt is a commercialization of the HadoopDB research project).

If it wasn't for the fact that I spent the majority of the last decade soaking up the wisdom of Mike Stonebraker, I might not have chosen option (3). But I watched as my PhD thesis on C-Store was commercialized by Vertica (which was sold last month to HP), and another one of my research projects (H-Store) was commercialized by VoltDB. Thanks to Stonebraker and the first-class engineers at Vertica, I can claim that my PhD research is in use today by Groupon, Verizon, Twitter, Zynga, and hundreds of other businesses. When I come up for tenure, I want to be able to make similar claims about my research at Yale on HadoopDB. So I'm taking the biggest gamble of my career to see that happen. I just hope that the people writing letters for me at tenure time take my contributions to Hadapt into consideration when they are evaluating the impact I have made on the database systems field. I know that this will require a departure from the traditional way junior faculty are evaluated, but it's time to increase the value we place on building real, usable systems. Otherwise, there'll be no place left in academia for systems researchers.


[Note: Hadapt has successfully raised a round of financing and is hiring. If you have experience building real systems, especially database systems --- or even if you have built reasonably complex academic prototypes --- please send an e-mail to hackers@hadapt.com. I personally read every e-mail that goes to that address.]

Tuesday, February 15, 2011

Why I no longer trust EMC [Update: maybe they are not so bad]

[Update: After publishing this blog post I received a very pleasant phone call from two representatives from Mozy informing me they had managed to recover my data. See the end of this blog post for more details.]

It's possible to argue that my entire research agenda over the past few years has focused on cloud computing. HadoopDB can be thought of as a large scale analytical database system for the cloud. My work on database determinism that focuses on building horizontally scalable database systems is entirely motivated by the elastic scalability of the cloud. In order for this research to make impact, "the cloud" needs to be more than a temporary phenomenon. Therefore, I feel quite invested in the success or failure of the cloud.

One common argument people make against the cloud (amongst others) is that if you put your data in the cloud, you are losing control over your data. If the cloud provider does not have appropriate processes in place to safeguard data, it's quite possible that your data could get corrupted or lost. This is problematic since most users do not get to see the internal processes, so they need to (to some extent) blindly trust the cloud provider --- a tricky proposition for many people. The way I usually answer this criticism is that a competitive business climate will solve this problem --- the companies that have bad processes will lose data and go out of business, and the ones that have more safeguards in place will win.

However, the above argument only works if cases of data loss get publicized so that the companies that lose data will lose business. Since I recently went through the horrible experience of losing data I put in the cloud, I therefore feel obligated to share this experience on this blog.

About a year ago I felt that if I was going to go around talking about how great the cloud was, I should at least be using a cloud data backup service for my PC. I ended up deciding between Mozy and DropBox, and went with Mozy because it was owned by EMC. I figured that EMC was a trustworthy company, and they understand storage and the cloud better than most. I figured I would start out with the free version, and then would upgrade to the paid version when I ran out of space.

Around 2 months ago, the hard drive on my Sony Vaio laptop failed. Since the laptop was owned by Yale, I had to go through the Yale processes to get it replaced. It turned out to be a nightmare because Yale did not buy the laptop directly from Sony, but went through an intermediary organization. Although the laptop was under warranty, neither Sony nor the intermediary organization was willing to take responsibility for following through on the warranty. This caused significant delays in getting the hard drive replaced, especially during the holiday season at the end of the semester.

After around two months, my laptop was finally returned with a new hard drive. I was excited that I had an opportunity to take advantage of my EMC Mozy backup for the first time --- theoretically they should have been able to recover all my files and put them in the same places where they existed on my laptop before it failed.

When I went to log in, Mozy claimed that I had the wrong password. I tried again. And again. Mozy would not let me in. Finally, I gave up and clicked on "Forgot my password" and Mozy claimed to reset it and send it to me. But I never received an e-mail. So I tried again. And again. Still no e-mails. I e-mailed support. Four days later, I still had not received a response. I e-mailed again --- another few days and still no response. At this point I was getting desperate --- it had been a week and I had no way of logging in to retrieve my files. Since I wasn't a "MozyPro" customer, all my attempts to call up support were rebuffed. I tried calling up the Mozy sales number to see if they could help me, but they were unable to. I tried the online chat, and they were unable to as well, but suggested that I email "forgot@mozy.com" to try to get my password reset manually that way.

This last suggestion worked, and I was finally able to log in. But to my horror, all my files were gone! It's hard to describe the despair as one starts to realize that one put the trust in the wrong place and all files created in the last year might actually be gone. I e-mailed support and this time received a much faster response:

" Mozy may terminate your account and these Terms immediately and without notice if your computer fails to access the Services to perform a backup for more than thirty (30) days or you fail to comply with these Terms."

So, because it took so long to get my computer replaced (the whole reason why I was using Mozy in the first place), Mozy decided to delete my account (without telling me). I e-mailed back support and asked if there was any way to recover the files even though the account was deleted. They wrote back:

"I wish there was something I could do for you. I have even checked with an L2 tech to try and get the files back and he said that he was not able to recover them."

So I trusted EMC Mozy to back up my files, and they decided to delete them. And they do not have the processes in place to recover them. This is not how the cloud is supposed to work. Clearly EMC does not understand the cloud. I hope that anybody reading this blog does not make the same mistake: EMC's cloud services are not trustworthy. If you have similar stories, please share them with me --- cloud providers need to feel pressure not to arbitrarily delete data without first warning their customers. Otherwise the cloud cannot work.

[Update: It turns out that EMC Mozy does have important safeguards in place. After coming across this blog post, several members of the Mozy technical team met with each other to try to understand what happened, managed to recover my data, and called me afterward. Here's the scoop: the Mozy software is designed to notify you before your account is deleted. The problem was that my computer with the Mozy software installed had failed, so the software couldn't notify me. Mozy does indeed wait six months before deleting an account, but for me, due to a weird corner case involving a second computer that had previously been backing up to Mozy that I had stopped using, the six month clock had started ticking in July. Thus, the timing of my second computer failing was really unlucky. However, since they did have safeguards in place, they did manage to recover this deleted data. I am obviously really grateful that they went to this great effort. They told me on the phone that they learned from this experience and are making improvements as a result --- most notably to do more than rely on the software to notify a user before the account is deleted. Given how helpful and straightforward the Mozy employees were over the course of this phone call, I wholeheartedly believe that they really are going to fix this issue. Hence, I have no qualms recommending Mozy to other people moving forward. Again, the most important thing was that there were safeguards in place --- obviously it took some additional motivation for these safeguards to be used, but as long as they exist, I feel comfortable using cloud storage moving forward.]