Tuesday, July 19, 2011

Hadoop's tremendous inefficiency on graph data management (and how to avoid it)

Hadoop is great. It seems clear that it will serve as the basis of the vast majority of analytical data management within five years. Already today it is extremely popular for unstructured and polystructured data analysis and processing, since it is hard to find other options that are superior from a price/performance perspective. The reader should not take the following as me blasting Hadoop. I believe that Hadoop (with its ecosystem) is going to take over the world.

The problem with Hadoop is that its strength is also its weakness. Hadoop gives the user tremendous flexibility and power to scale all kinds of different data management problems. This is obviously great. But it is this same flexibility that allows the user to perform incredibly inefficient things and not care because (a) they can simply add more machines and use Hadoop's scalability to hide inefficiency in user code (b) they can convince themselves that since everyone talks about Hadoop as being designed for "batch data processing" anyways, they can let their process run in the background and not care about how long it will take for it to return.

Although not the subject of this post, an example of this inefficiency can be found in a SIGMOD paper that a bunch of us from Yale and the University of Wisconsin published 5 weeks ago. The paper shows that using Hadoop on structured (relational) data is at least a factor of 50 less efficient than it needs to be (an incredibly large number given how hard data center administrators work to yield less than a factor of two improvement in efficiency). As many readers of this blog already know, this factor of 50 improvement is the reason why Hadapt was founded. But this post is not about Hadapt or relational data. In this post, the focus is on graph data, and how if one is not careful, using Hadoop can be well over a factor of 1000 less efficient than it needs to be.

Before we get into how to improve Hadoop's efficiency on graph data by a factor of 1000, let's pause for a second to comprehend how dangerous it is to let inefficiencies in Hadoop become widespread. Imagine a world where the vast majority of data processing runs on Hadoop (a not entirely implausible scenario). If people allow these factors of 50 or 1000 to exist in their Hadoop utilization, these inefficiency factors translate directly to factors of 50 or 1000 more power utilization, more carbon emissions, more data center space, and more silicon waste. The disastrous environmental consequences in a world where everyone standardizes on incredibly inefficient technology is downright terrifying. And this is ignoring the impact on businesses in terms of server and energy costs, and lower performance. It seems clear that developing a series of "best practices" around using Hadoop efficiently is going to be extremely important moving forward.

Let's delve into the subject of graph data in more detail. Recently there was a paper by Rohloff et. al. that showed how to store graph data (represented in vertex-edge-vertex "triple" format) in Hadoop, and perform sub-graph pattern matching in a scalable fashion over this graph of data. The particular focus of the paper is on Semantic Web graphs (where the data is stored in RDF and the queries are performed in SPARQL), but the techniques presented in the paper are generalizable to other types of graphs. This paper and resulting system (called SHARD) has received significant publicity, including a presentation at HadoopWorld 2010, a presentation at DIDC 2011, and a feature on Cloudera's Website. In fact, it is a very nice technique. It leverages Hadoop to scale sub-graph pattern matching (something that has historically be difficult to do); and by aggregating all outgoing edges for a given vertex into the same key-value pair in Hadoop, it even scales queries in a way that is 2-3 times more efficient than the naive way to use Hadoop for the same task.

The only problem is that, as shown by an upcoming VLDB paper that we're releasing today, this technique is an astonishing factor of 1340 times less efficient than an alternative technique for processing sub-graph pattern matching queries within a Hadoop-based system that we introduce in our paper. Our paper, led by my student, Jiewen Huang, achieves these enormous speedups in the following ways:

  1. Hadoop, by default, hash partitions data across nodes. In practice (e.g., in the SHARD paper) this results in data for each vertex in the graph being randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). Therefore, data that is close to each other in the graph can end up very far away from each other in the cluster, spread out across many different physical machines. For graph operations such as sub-graph pattern matching, this is wildly suboptimal. For these types of operations, the graph is traversed by passing through neighbors of vertexes; it is hugely beneficial if these neighbors are stored physically near each other (ideally on the same physical machine). When using hash partitioning, since there is no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency. Using a clustering algorithm to graph partition data across nodes in the Hadoop cluster (instead of using hash partitioning) is a big win.

  2. Hadoop, by default, has a very simple replication algorithm, where all data is generally replicated a fixed number of times (e.g. 3 times) across the cluster. Treating all data equally when it comes to replication is quite inefficient. If data is graph partitioned across a cluster, the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. This is because vertexes that are on the border of a partition might have several of their neighbors stored on different physical machines. For the same reasons why it is a good idea to graph partition data to keep graph neighbors local, it is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors. Hence, allowing different data to be replicated at different factors can further improve system efficiency.

  3. Hadoop, by default, stores data on a distributed file system (HDFS) or a sparse NoSQL store (HBase). Neither of these data stores are optimized for graph data. HDFS is optimized for unstructured data, and HBase for semi-structured data. But there has been significant research in the database community on creating optimized data stores for graph-structured data. Using a suboptimal store for the graph data is another source of tremendous inefficiency. By replacing the physical storage system with graph-optimized storage, but keeping the rest of the system intact (similar to the theme of the HadoopDB project), it is possible to greatly increase the efficiency of the system.

To a first degree of approximation, each of the above three improvements yield an entire order of magnitude speedup (a factor of 10). By combining them, we therefore saw the factor of 1340 improvement in performance on the identical benchmark that was run in the SHARD paper. (For more details on the system architecture, partitioning and data placement algorithms, query processing, and experimental results please see our paper).

It is important to note that since we wanted to run the same benchmark as the SHARD paper, we used the famous Lehigh University Benchmark (LUBM) for Semantic Web graph data and queries. Semantic Web sub-graph pattern matching queries tend to contain quite a lot of constants (especially on edge labels) relative to other types of graph queries. The next step for this project is to extend and benchmark the system on other graph applications (the types of graphs that people tend to use systems based on Google's Pregel project today).

In conclusion, it is perfectly acceptable to give up a little bit of efficiency for improved scalability when using Hadoop. However, once this decrease in efficiency starts to reach a factor of two, it is likely a good idea to think about what is causing this inefficiency, and attempt to find ways to avoid it (while keeping the same scalability properties). Certainly once the factor extends beyond the factor of two (such as the enormous 1340 factor we discovered in our VLDB paper), the sheer waste in power and hardware cannot be ignored. This does not mean that Hadoop should be thrown away; however it will become necessary to package Hadoop with "best practice" solutions to avoid such unnecessarily high levels of waste.