Thursday, December 30, 2010

Machine vs. human generated data

Curt Monash has recently been discussing the differences between machine-generated data and human-generated data, and trying to define these terms on his blog. I think this is a good subject to dive into, since I frequently use the existence of machine-generated data to justify to myself why 90% of my research cycles are spent on scalability problems in database systems. Rather than try to fit a response as a comment on his post, I thought I would devote a post to this subject here.

In short, the following are the main reasons why machine-generated data is important:

  1. Machines are capable of producing data at very high rates. In the time it took you to read this sentence, my three-year old laptop could have produced the entire works of Shakespeare.

  2. The human population is not growing anywhere near as fast as Moore’s law. In the last decade, the world’s population has increased by about 20%. Meanwhile transistor counts (and also hard-disk capacity since it increases by roughly the same rate) has increased by over 2000%, If all data was closely tied to human actions, then the “Big Data” research area would be a dying field, as technological advancements would eventually render today’s “Big Data” miniscule, and there would be no new “Big Data” to take its place. (All this assumes that women don’t start to routinely give birth to 15 children, and nobody figures out how to perform human cloning in a scalable fashion). No researcher dreams of writing papers that makes only a temporary impact. With machine-generated data, we have the potential for data generation to increase at the same rate as machines are getting faster, which means that “Big Data” today will still be “Big Data” tomorrow (even though the definition of “Big” will be adjusted).

  3. The predicted demise of the magnetic hard disk for solid state alternatives will not come as fast as some people think. As long as hard disk capacity maintains pace with the rate of machine-generated data generation, it will remain the most cost-efficient option for machine-generated “Big Data” (at least until race-track memory becomes a viable candidate). Yes, I/O bandwidth does not increase at the same rate as capacity, but if the machine-generated data is to be kept around, the biggest of “Big Data” databases will need the high capacity of hard disks, at least at a low tier of storage. Which means that we must remain conscious of disk-speed limitations when it comes to complete data scans.

Curt attempts to define “machine-generated data” in his post as the following:

Machine-generated is data that was produced entirely by machines OR data that is more about observing humans than recording their choices.

He then goes on to include Web log data (including user clickstream logs), and social media and gaming records data as examples of machine-generated data.

If you agree with the three reasons listed above on why machine-generated data is important, then there is a problem with both the above definition of machine-generated data and the examples. Clickstream data and social media/gaming data are fundamentally different from environmental sensor data that has no human involvement whatsoever. Certainly the scale of clickstream and gaming datasets is much larger than the scale of other human-generated datasets such as point of sale data (humans can make clicks on the Internet or in a computer game at a much faster rate than they can buy things, or write things down). And certainly, for every human click, there might be 5X more network log data (as Monash writes about in his post). But ultimately, without humans making clicks, there would be no data, and as long as the additional machine-generated data is linearly related to each human action (e.g. this 5X number remains relatively constant over time) then these datasets are not always going to be “Big Data”, for the reasons described in point (2) above.

The basic source of confusion here is that click-stream datasets and social gaming data sets are some of the biggest datasets known to exist (eBay, Facebook, and Yahoo’s multi-petabyte clickstream data warehouses are known to be amongst the largest data warehouses in the world). Since machines are well-known to have the ability to produce data at a faster rate than humans, it is easy to fall into the trap of thinking that these huge datasets are machine generated.

However, these datasets are not increasing at the same rate that machines are getting faster. It might seem that way since the companies that broadcast the size of their datasets are getting larger and gaining users a rapid pace, and these companies are deciding to throw away less data, but over the long term the rate of increase of these datasets must slow down due to the human limitation. This makes them less interesting for the future of “Big Data” research.

I don’t necessarily have a better way to define machine-generated data, but I’ll end this blog post with my best attempt:

Machine-generated data is data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action.

Machine generated “Big Data” is machine-generated data whose rate of generation increases with the speed of the underlying hardware of the machines that generate it.

Under this definition, stock trade data (independent computation agents), environmental sensor data, RFID data, and satellite data all fall under the category of machine-generated data. An interesting debate could form over whether genomic sequencing data is machine-generated or not. To the extent that DNA and mRNA are being produced outside of humans, I think it is fair to put genomic sequencing data under the machine-generated category as well.

Tuesday, August 31, 2010

The problems with ACID, and how to fix them without going NoSQL

(This post is coauthored by Alexander Thomson and Daniel Abadi)

It is a poorly kept secret that NoSQL is not really about eliminating SQL from database systems (e.g., see these links). Rather, systems such as Bigtable, HBase, Hypertable, Cassandra, Dynamo, SimpleDB (and a host of other key-value stores), PNUTS/Sherpa, etc. are mostly concerned with system scalability. It turns out to be quite difficult to scale traditional, ACID-compliant relational database systems on cheap, shared-nothing scale-out architectures, and thus these systems drop some of the ACID guarantees in order to achieve shared-nothing scalability (letting the application developer handle the increased complexity that programming over a non-ACID compliant system entails). In other words, NoSQL really means NoACID.

Our objective in this post is to explain why ACID is hard to scale. At the same time, we argue that NoSQL/NoACID is the lazy way around these difficulties---it would be better if the particular problems that make ACID hard to scale could be overcome. This is obviously a hard problem, but we have a few new ideas about where to begin.

ACID, scalability and replication

For large transactional applications, it is well known that scaling out on commodity hardware is far cheaper than scaling up on high-end servers. Most of the largest transactional applications therefore use a shared-nothing architecture where data is divided across many machines and each transaction is executed at the appropriate one(s).

The problem is that if a transaction accesses data that is split across multiple physical machines, guaranteeing the traditional ACID properties becomes increasingly complex: ACID's atomicity guarantee requires a distributed commit protocol (such as two-phase commit) across the multiple machines involved in the transaction, and its isolation guarantee insists that the transaction hold all of its locks for the full duration of that protocol. Since many of today's OLTP workloads are composed of fairly lightweight transactions (each involving less than 10 microseconds of actual work), tacking a couple of network round trips onto every distributed transaction can easily mean that locks are held for orders of magnitude longer than the time each transaction really spends updating its locked data items. This can result in skyrocketing lock contention between transactions, which can severely limit transactional throughput.

In addition, high availability is becoming ever more crucial in scalable transactional database systems, and is typically accomplished via replication and automatic fail-over in the case of a crash. The developer community has therefore come to expect ACID's consistency guarantee (originally promising local adherence to user-specified invariants) to also imply strong consistency between replicas (i.e. replicas are identical copies of one other, as in the CAP/PACELC sense of the word consistency).

Unfortunately, strongly consistent replication schemes either come with high overhead or incur undesirable tradeoffs. Early approaches to strongly consistent replication attempted to synchronize replicas during transaction execution. Replicas executed transactions in parallel, but implemented some protocol to ensure agreement about any change in database state before committing any transaction. Because of the latency involved in such protocols (and due to the same contention issue discussed above in relation to scalability), synchronized active replication is seldom used in practice today.

Today's solution is usually post-write replication, where each transaction is executed first at some primary replica, and updates are propagated to other replicas after the fact. Basic master-slave/log-shipping replication is the simplest example of post-write replication, although other schemes which first execute each transaction at one of multiple possible masters fall under this category. In addition to the possibility of stale reads at slave replicas, these systems suffer a fundamental latency-durability-consistency tradeoff: either a primary replica waits to commit each transaction until receiving acknowledgement of sufficient replication, or it commits upon completing the transaction. In the latter case, either in-flight transactions are lost upon failure of the primary replica, threatening durability, or they are retrieved only after the failed node has recovered, while transactions executed on other replicas in the meantime threaten consistency in the event of a failure.

In summary, it is really hard to guarantee ACID across scalable, highly available, shared-nothing systems due to complex and high overhead commit protocols, and difficult tradeoffs in available replication schemes.

The NoACID solution

Designers of NoSQL systems, aware of these issues, carefully relax some ACID guarantees in order to achieve scalability and high availability. There are two ways that ACID is typically weakened. First, systems like Bigtable, SQL Azure, sharded MySQL, and key-value stores support atomicity and isolation only when each transaction only accesses data within some convenient subset of the database (a single tuple in Bigtable and KV stores, or a single database partition in SQL Azure and sharded MySQL). This eliminates the need for expensive distributed commit protocols, but at a cost: Any logical transaction which spans more than one of these subsets must be broken up at the application level into separate transactions; the system therefore guarantees neither atomicity nor isolation with respect to arbitrary logical transactions. In the end, the programmer must therefore implement any additional ACID functionality at the application level.

Second, lazy replication schemes such as eventual consistency sacrifice strong consistency to get around the tradeoffs of post-write replication (while also allowing for high availability in the presence of network partitions, as specified in the CAP theorem). Except with regard to some well-known and much-publicized Web 2.0 applications, losing consistency at all times (regardless of whether a network partition is actually occurring) is too steep a price to pay in terms of complexity for the application developer or experience for the end-user.

Fixing ACID without going NoSQL

In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues. Responsibility for atomicity, consistency and isolation is simply being pushed onto the developer. What is really needed is a way for ACID systems to scale on shared-nothing architectures, and that is what we address in the research paper that we will present at VLDB this month.

Our view (and yes, this may seem counterintuitive at first), is that the problem with ACID is not that its guarantees are too strong (and that therefore scaling these guarantees in a shared-nothing cluster of machines is too hard), but rather that its guarantees are too weak, and that this weakness is hindering scalability.

The root of these problems lies in the isolation property within ACID. In particular, the serializability property (which is the standard isolation level for fully ACID systems) guarantees that execution of a set of transactions occurs in a manner equivalent to some sequential, non-concurrent execution of those transactions, even if what actually happens under the hood is highly threaded and parallelized. So if three transactions (let's call them A, B and C) are active at the same time on an ACID system, it will guarantee that the resulting database state will be the same as if it had run them one-by-one. No promises are made, however, about which particular order execution it will be equivalent to: A-B-C, B-A-C, A-C-B, etc.

This obviously causes problems for replication. If a set of (potentially non-commutative) transactions is sent to two replicas of the same system, the two replicas might each execute the transactions in a manner equivalent to a different serial order, allowing the replicas' states to diverge.

More generally, most of the intra- and inter-replica information exchange that forms the basis of the scalability and replication woes of ACID systems described above occurs when disparate nodes in the system have to forge agreement about (a) which transactions should be executed, (b) which will end up being committed, and (c) with equivalence to which serial order.

If the isolation property were to be strengthened to guarantee equivalence to a predetermined serial order (while still allowing high levels of concurrency), and if a layer were added to the system which accepts transaction requests, decides on a universal order, and sends the ordered requests to all replicas, then problems (a) and (c) are eliminated. If the system is also stripped of the right to arbitrarily abort transactions (system aborts typically occur for reasons such as node failure and deadlock), then problem (b) is also eliminated.

This kind of strengthening of isolation introduces new challenges (such as deadlock avoidance, dealing with failures without aborting transactions, and allowing highly concurrent execution without any on-the-fly transaction reordering), but also results in a very interesting property: given an initial database state and a sequence of transaction requests, there exists only one valid final state. In other words, determinism.

The repercussions of a deterministic system are broad, but one advantage is immediately clear: active replication is trivial, strongly consistent, and suffers none of the drawbacks described above. There are some less obvious advantages too. For example, the need for distributed commit protocols in multi-node transactions is eliminated, which is a critical step towards scalability. (Why distributed commit protocols can be omitted in distributed systems is non-obvious, and will be discussed in a future blog post; the topic is also addressed at length in our paper.)

A deterministic DBMS prototype

In our paper, entitled “The Case for Determinism in Database Systems”, we propose an architecture and execution model that avoids deadlock, copes with failures without aborting transactions, and achieves high concurrency. The paper contains full details, but the basic idea is to use ordered locking coupled with optimistic lock location prediction, while exploiting deterministic systems' nice replication properties in the case of failures.

We go on in the paper to present measurements and analyses of the performance characteristics of a fully ACID deterministic database prototype based on our execution model, which we implemented alongside a standard (nondeterministic) two-phase locking system for comparison. It turns out that the deterministic scheme performs horribly in disk-based environments, but that as transactions get shorter and less variable in length (thanks to the introduction of flash and the ever-plummeting cost of memory) our scheme becomes more viable. Running the prototype on modern hardware, deterministic execution keeps up with the traditional system implementation on the TPC-C benchmark, and actually shows drastically more throughput and scalability than the nondeterministic system when the frequency of multi-partition transactions increases.

Our prototype system is currently being reworked and extended to include several optimizations which appear to be unique to explicitly deterministic systems (see the Future Work section in our paper's appendix for details), and we look forward to releasing a stable codebase to the community in the coming months, in hopes that it will spur further dialogue and research on deterministic systems and on the scalability of ACID systems in general.

Monday, August 2, 2010

Thoughts on Kickfire’s apparent demise

There have been some recent conflicting reports on the future prospects of Kickfire’s analytical database technology. Forbes reported a couple of months ago that Kickfire sold $5 million worth of boxes in their first year of existence (they launched their product in April 2009), and was extremely positive about Kickfire’s outlook. Then, a couple of months later, Curt Monash reported that Kickfire was discontinuing their product, and selling their IP and engineers. Obviously, Monash is the more reliable source here (and I independently heard a rumor from a reputable source that Teradata was acquiring Kickfire at a “firesale” price).

In my interactions with the company, I have been impressed with Raj Cherabuddi and members of the Kickfire technical team, and it is always sad when good technology fails to gain any traction in the marketplace. I’m also sad (though obviously to a lesser extent) to see the many thousands of words I have written about Kickfire in this blog (including an in-depth post on their technology) become largely obsolete. In fact, one of the first posts I wrote for this blog was on the subject of Kickfire --- a mostly positive post --- but questioning their go-to-market strategy. In that post, I took issue with the assumption that there is a “mass market” for data warehousing in the MySQL ecosystem, especially for a proprietary approach like Kickfire's. The CEO of Kickfire kindly took the time to respond to this original post, quoting a bunch of IDC numbers about the size of the MySQL data warehouse market. I chose to respond to this comment in a separate post, in which I said (amongst other things):

"The point of my post was that I think the [MySQL data warehouse] market is smaller than these [IDC] numbers indicate. Sure, there are a lot of MySQL deployments, but that's because it's free. The number of people actually paying for the MySQL Enterprise Edition is far less, but those are probably the people who'd be willing to pay for a solution like Kickfire's. Furthermore, […] a lot of people who use MySQL for warehousing are using sharded MySQL, which is nontrivial (or at least not cheap) to port to non-shared-nothing solutions like Kickfire and Infobright. Finally, the amount of data that corporations are keeping around is increasing rapidly, and the size of data warehouses are doubling faster than Moore's law. So even if most warehouses today are pretty small, this might not be the case in the future. I'm a strong believer that MPP shared-nothing parallel solutions are the right answer for the mass market of tomorrow. Anyway, the bottom line is that I'm openly wondering if the market is actually much smaller than the IDC numbers would seem to suggest. But obviously, if Kickfire, Infobright, or Calpont achieves a large amount of success without changing their market strategy, I'll be proven incorrect."


I think the above paragraph lists two of the three most probable reasons why Kickfire seems to have failed:

(1) Building a propriety database stack of hardware and software around a MySQL codebase that attributes much of its success to being open and free is a poor cultural match.

(2) Trying to make it in the "Big Data Era" without a scalable MPP product is a recipe for disaster. It is well known that over 95% of data warehouses are smaller than 5TB, and that MPP is not strictly necessary for less than 5TB, so it is easy to get into the trap of Kickfire’s thinking that the mass market is addressable without building a MPP product. However, businesses are looking forward, and seeing much more data in their future (whether this is wishful or realistic thinking is entirely irrelevant), and can often be reluctant to select a product with known scalability limits.

(The third alternative reason why Kickfire might have failed is the TPC-H benchmark orientation. It is really easy to spend a lot of time working on an analytical database to get it to run TPC-H --- even optimizing it for TPC-H --- before realizing that the marketing benefit that the product gets from stellar TPC-H numbers does not justify the time investment of getting it to run --- and in fact find out that many of the features that were added for TPC-H are not actually used by real-life customers.)

It is tempting to add a fourth reason for Kickfire’s demise --- the long list of failed hardware-accelerated DBMS companies and Kickfire’s obvious inclusion in this group. However, I believe that Netezza’s success is a demonstration of the potential of the benefits of hardware acceleration and the appliance approach in the modern era where the rate of performance improvements with each successive processor generation is slowing significantly.

Anyway, RIP Kickfire (assuming the rumors are correct). Good technology. Bad go-to-market strategy. Tough fit for the “Big Data” era.

Tuesday, July 6, 2010

Quick thoughts on EMC acquiring Greenplum

EMC announced today that they are acquiring Greenplum. Below are the first thoughts that crossed my mind when I heard about this deal.

  • Congratulations to the whole team at Greenplum. Every interaction I’ve had with a Greenplum employee has been very positive (especially Florian Waas, Luke Lonergan, and I guess Joe Hellerstein even though he’s just an advisor), and I’m really happy for all of them.
  • Aster Data’s launch a couple of years ago seems to have hurt Greenplum more than any other company. Aster Data and Greenplum have extremely similar products (a parallelization layer over PostgreSQL with MapReduce integration), though Greenplum has made more changes and innovations at the DBMS level than Aster Data has (most notably the column-storage option which I have written about in the past). However, Aster Data has reshaped their focus on deep, embedded analytics with extensive analytic libraries, which has met the market with more success than Greenplum’s focus on the enterprise data cloud (EDC) vision, since Greenplum’s product was not ready yet to compete with the likes of Teradata in the EDC space. Greenplum’s recent Chorus offering also seems to have also been a failure. Hence, I do not think the acquisition price was an extremely large number.
  • This deal seems like bad news for Greenplum’s direct competitor, ParAccel, which is extremely close to EMC and relies on EMC for their “enterprise-class” solution that includes high availability and disaster recovery. I believe EMC routinely helps win ParAccel some business.
  • People predicted that the DatAllegro acquisition by Microsoft would spur additional industry consolidation. That clearly did not happen, as two years passed and there were no non-trivial acquisitions in the data warehouse space. But then SAP squired Sybase (and Sybase IQ) and EMC acquired Greenplum, so I am sure people will be predicting that 2010 is the year for all the predicted consolidation. I have my doubts, since HP is the only major player that clearly needs to upgrade its “Big Data” offerings. But I wouldn’t be surprised if there was one more acquisition this year.
Other "Quick Thoughts" posts worth reading:

Friday, April 23, 2010

Problems with CAP, and Yahoo’s little known NoSQL system

Over the past few weeks, in my advanced database system implementation class I teach at Yale, I’ve been covering the CAP theorem, its implications, and various scalable NoSQL systems that would appear to be influenced in their design by the constraints of CAP. Over the course of my coverage of this topic, I am convinced that CAP falls far short of giving a complete picture of the engineering tradeoffs behind building scalable, distributed systems.

My problems with CAP

CAP is generally described as the following: when you build a distributed system, of three desirable properties you want in your system: consistency, availability, and tolerance of network partitions, you can only choose two.

Already there is a problem, since this implies that there are three types of distributed systems one can build: CA (consistent and available, but not tolerant of partitions), CP (consistent and tolerant of network partitions, but not available), and AP (available and tolerant of network partitions, but not consistent). The definition of CP looks a little strange --- “consistent and tolerant of network partitions, but not available” --- the way that this is written makes it look like such as system is never available --- a clearly useless system. Of course, this is not really the case; rather, availability is only sacrificed when there is a network partition. In practice, this means that the roles of the A and C in CAP are asymmetric. Systems that sacrifice consistency (AP systems) tend to do so all the time, not just when there is a network partition (the reason for this will become clear by the end of this post). The potential confusion caused by the asymmetry of A and C is my first problem.

My second problem is that, as far as I can tell, there is no practical difference between CA systems and CP systems. As noted above, CP systems give up availability only when there is a network partition. CA systems are “not tolerant of network partitions”. But what if there is a network partition? What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical. So in reality, there are only two types of systems: CP/CA and AP. I.e., if there is a partition, does the system give up availability or consistency? Having three letters in CAP and saying you can pick any two does nothing but confuse this point.

But my main problem with CAP is that it focuses everyone on a consistency/availability tradeoff, resulting in a perception that the reason why NoSQL systems give up consistency is to get availability. But this is far from the case. A good example of this is Yahoo’s little known NoSQL system called PNUTS (in the academic community) or Sherpa (to everyone else).

(Note, readers from the academic community might wonder why I’m calling PNUTS “little known”. It turns out, however, that outside the academic community, PNUTS/Sherpa is almost never mentioned in the NoSQL discussion --- in fact, as of April 2010, it’s not even categorized in the list of 35+ NoSQL systems at the nosql-database.org Website).

PNUTS and CAP

If you examine PNUTS through the lens of CAP, it would seem that the designers have no idea what they are doing (I assure you this is not the case). Rather than giving up just one of consistency or availability, the system gives up both! It relaxes consistency by only guaranteeing “timeline consistency” where replicas may not be consistent with each other but updates are guaranteed to be applied in the same order at all replicas. However, they also give up availability --- if the master replica for a particular data item is unreachable, that item becomes unavailable for updates (note, there are other configurations of the system with availability guarantees similar to Dynamo/Cassandra, I’m focusing in this post on the default system described in the original PNUTS paper). Why would anyone want to give up both consistency and availability? CAP says you only have to give up just one!

The reason is that CAP is missing a very important letter: L. PNUTS gives up consistency not for the goal of improving availability. Instead, it is to lower latency. Keeping replicas consistent over a wide area network requires at least one message to be sent over the WAN in the critical path to perform the write (some think that 2PC is necessary, but my student Alex Thomson has some research showing that this is not the case --- more on this in a future post). Unfortunately, a message over a WAN significantly increases the latency of a transaction (on the order of hundreds of milliseconds), a cost too large for many Web applications that businesses like Amazon and Yahoo need to implement. Consequently, in order to reduce latency, replication must be performed asynchronously. This reduces consistency (by definition). In Yahoo’s case, their method of reducing consistency (timeline consistency) enables an application developer to rely on some guarantees when reasoning about how this consistency is reduced. But consistency is nonetheless reduced.

Conclusion: Replace CAP with PACELC

In thinking about CAP the past few weeks, I feel that it has become overrated as a tool for explaining the design of modern scalable, distributed systems. Not only is the asymmetry of the contributions of C, A, and P confusing, but the lack of latency considerations in CAP significantly reduces its utility.

To me, CAP should really be PACELC --- if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?

Systems that tend to give up consistency for availability when there is a partition also tend to give up consistency for latency when there is no partition. This is the source of the asymmetry of the C and A in CAP. However, this confusion is not present in PACELC.

For example, Amazon’s Dynamo (and related systems like Cassandra and SimpleDB) are PA/EL in PACELC --- upon a partition, they give up consistency for availability; and under normal operation they give up consistency for lower latency. Giving up C in both parts of PACELC makes the design simpler --- once the application is configured to be able to handle inconsistencies, it makes sense to give up consistency for both availability and lower latency.

Fully ACID systems are PC/EC in PACELC. They refuse to give up consistency, and will pay the availability and latency costs to achieve it.

However, there are some interesting counterexamples where the C’s of PACELC are not correlated. One such example is PNUTS, which is PC/EL in PACELC. In normal operation they give up consistency for latency; however, upon a partition they don’t give up any additional consistency (rather they give up availability).

In conclusion, rewriting CAP as PACELC removes some confusing asymmetry in CAP, and, in my opinion, comes closer to explaining the design of NoSQL systems.


(A quick plug to conclude this post: the PNUTS guys are presenting a new benchmark for cloud data serving which compares PNUTS vs. other NoSQL systems at the first annual ACM Symposium on Cloud Computing 2010 (ACM SOCC 2010) in Indianapolis on June 10th and 11th. SOCC 2010 is held in conjunction with SIGMOD 2010 and the recently released program looks amazing.)

Monday, March 29, 2010

Distinguishing Two Major Types of Column-Stores

I have noticed that Bigtable, HBase, Hypertable, and Cassandra are being called column-stores with increasing frequency (e.g. here, here, and here), due to their ability to store and access column families separately. This makes them appear to be in the same category as column-stores such as Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright, which also are able to access columns separately. I believe that calling both groups of systems column-stores has lead to a great deal of confusion and misplaced expectations. This blog post attempts to clear up some of this confusion, highlighting the high level differences between these groups of systems. At the end, I will propose some potential ways to rename these groups to avoid confusion in the future.

For this blog post, I will refer to the following two groups as Group A and Group B:
  • Group A: Bigtable, HBase, Hypertable, and Cassandra. These four systems are not intended to be a complete list of systems in Group A --- these are simply the four systems I understand the best in this category and feel the most confident discussing.

  • Group B: Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright. Again, this is not a complete list, but these are the systems from this group I know best. (Row/column hybrid systems such as Oracle or Greenplum are ignored from this discussion to avoid confusion, but the column-store aspects of these systems are closer to Group B than Group A.)
Differences between Group A and Group B
  1. Data Model. Group A uses a multi-dimensional map (something along the lines of a sparse, distributed, persistent multi-dimensional sorted map). Typically a row-name, column-name, and timestamp are sufficient to uniquely map to a value in the database. Group B uses a traditional relational data model. This distinction has caused great confusion. People more familiar with Group A are very much aware that Group A does not use a relational data model and assume that since Group B are also called column-stores, then Group B also does not use a relational data model. This has resulted in many intelligent people saying “column-stores are not relational”, which is completely incorrect.

  2. Independence of Columns. Group A stores parts of a data entity or “row” in separate column-families, and has the ability to access these column-families separately. This means that not all parts of a row are picked up in a single I/O operation from storage, which is considered a good thing if only a subset of a row is relevant for a particular query. However, column-families may consist of many columns, and these columns within column-families are not independently accessible.

    Group B stores columns from a traditional relational database table separately so that they can be accessed independently. Like Group A, this is useful for queries that only access a subset of table attributes in any particular query. However, the main difference is that every column is stored separately, instead of families of columns as in Group A (this statement ignores fine-grained hybrid options within Group B).

  3. Interface. Group A is distinguished by being part of the NoSQL movement and does not typically have a traditional SQL interface. Group B supports standard SQL interfaces.

  4. Optimized workload. Group B is optimized for read-mostly analytical workloads. These systems support reasonably fast load times, but high update rates tend to be problematic. Hence, data warehouses are an ideal market for Group B, since they are typically bulk-loaded, require many complex read queries, and are updated rarely. In contrast, Group A can handle a more diverse set of application requirements (Cassandra, in particular, can handle a much higher rate of updates). Group B systems tend to struggle on workloads that “get” or “put” individual rows in the data set, but thrive on big aggregations and summarizations that require scanning many rows as part of a single query. In contrast, Group A generally does better for individual row queries, and does not perform well on aggregation-heavy workloads. Much of the reason for this difference can be explained in the “pure column” vs “column-family” difference between the systems. Group A systems can put attributes that tend to be co-accessed in the same column-family; this saves the seek cost that results from column-stores needing to find different attributes from the same row in many different places. Another reason for the difference is the storage layer implementation, explained below.

  5. Storage layer. Assume the following customer table



    Although there is some variation within the systems in Group B, to the first order of approximation, this group will store the table in the following way:

    (ID) 1, 2, 3, 4, 5, 6

    (First Name) Joe, Jack, Jill, James, Jamie, Justin

    (Last Name) Smith, Williams, Davis, Miller, Wilson, Taylor

    (Phone) 555-1234, 555-5668, 555-5432, NULL, 555-6527, 555-8247

    (Email) jsmith@gmail.com, jwilliams@gmail.com, NULL, jmiller@yahoo.com, NULL, jtaylor@aol.com

    Note that each value is stored by itself, without information about what row or column it came from. We can figure out what column it came from since all values from the same column are stored consecutively. We can figure out what row it came from by counting how many values came before it in the same column. The fourth value in the id column matches up to the same row as the fourth value in the last name column and the fourth value in the phone column, etc. Note that this means that columns that are undefined for a particular row must be explicitly stored as NULL in the column list; otherwise we can no longer match up values based on their position in their corresponding lists.

    Meanwhile, systems in Group A will either explicitly store the row-name, the column-name or both with each value. E.g.: row2, lastname: Williams; row5, phone: 555-6527, etc. The reason is that Group A uses a sparse data model (different rows can have a very different set of columns defined). Storing NULLs for every undefined column could soon lead to the majority of the database being filled with NULLs. Hence these systems will explicitly have column-name/value pairs for each element in a row within a column-family, or row-name/value pairs for each element within a single column column-family. (Group A will also typically store a timestamp per value, but explaining this will only complicate this discussion).

    This results in Group B typically taking much less space on storage than Group A (at least for structured data that easily fits into a relational model). Furthermore, by storing just column-values without column-names or row-names, Group B optimizes performance for column-operations where each element within a column is read and an operation (like a predicate evaluation or an aggregation) is applied. Hence, the data model combined with the storage layer implementation results in wildly different target applications for Group A and Group B.


Renaming the Groups

Clearly along each of these five dimensions, Group A and Group B are extremely different. Consequently, even though calling them both column-stores has some advantages (it makes it seem like the “column-store movement” is large and a really hot area), I believe that lumping Group A and Group B together does more damage than good, and that we need to make a greater effort to avoid confusing these two groups in the future. Here are some suggestions for names for Group A and Group B towards this goal:

  • Group A: “column-family store”
    Group B: “column-store”

    (The problem here is that Group B doesn’t get a new name, which means that “column-store” could either refer to Group B or both Group A/B)

  • Group A: “non-relational column-store”
  • Group B: “relational column-store”

  • Group A: “sparse column-store”
  • Group B: “dense column-store”

Of these, the relational/non-relational distinction is probably the most important, and would be my vote for the new names. If you have a different idea for names, or want to vote on one of the above schemes, please leave a comment below.


Addendum:

Mike Stonebraker e-mailed me his vote which I reproduce here with his permission:
Group A are really row stores. I.e. they store a column family in a row-by-row fashion. Effectively, they are using materialized views for column families and storing materialized views row-by-row. Systems in Group B have a sophisticated column-oriented optimizer -- no such thing exists for Group A.
He then went on to call Group A "MV row-stores" and sub-categorize Group B depending on how materialized views are stored, but his overarching point was that referring to anything in Group A as a "column-store" is really misleading.


Addendum 2:

This post seems to have attracted some high quality comments. I recommend reading the comment thread.

Wednesday, January 20, 2010

New England Database Summit 2010 Program

I just finished putting together the program for New England Database Summit, 2010 (thanks to the PC: Yanlei Diao, Olga Papaemmanouil, and Elke Rundensteiner). The schedule is really packed this year, with three keynotes/invited talks, eight technical session talks, and approximately thirty posters from students and researchers. The featured talks are from Raghu Ramakrishnan (Chief Scientist for Audience & Cloud Computing at Yahoo!), Curt Monash (President, Monash Research), and C. Mohan (IBM Fellow and IBM India Chief Scientist).

Registration for New England Database Summit is free (thanks Netezza!), but we need to know how much food, coffee, appetizers, and drinks to order (breakfast and lunch is included in the free admission, and there will be free beer and wine during the poster session), so registration will be closed after 5PM on Friday, January 22nd.

In the past we've had 200+ attendees, so it's a great opportunity to come network with researchers, academics, and database professionals from all over New England.

A summary of the current program is listed below. However, the NEDBSummit Website has more details (including talk abstracts).


Time Event
9:00 AM Welcoming remarks
9:10-10:10 Raghu Ramakrishnan (Chief
Scientist for Audience & Cloud Computing at Yahoo!) Cloud Data Serving

10:10-10:35 Coffee Break
Technical Session 1
10:35-10:55 Carlo Curino, Evan Jones, Yang
Zhang, Eugene Wu, Sam Madden RelationalCloud: The case for a database
service

10:55-11:15 Mike Dirolf An Introduction to MongoDB
11:15-11:35 Elke Rundensteiner, R. Nehme,
and E. Bertino The Query Mesh Project: A
Powerful Multi-Route Query Processing Paradigm

11:35-11:55 Andy Pavlo MapReduce and Parallel
DBMSs: Together At Last

11:55-12:15 Gregory Malecha, Greg
Morrisett, Avraham Shinnar, and Ryan Wisnesky Toward a Verified
Relational Database Management System

12:15 PM Lunch
1:10-2:10 Curt Monash (President, Monash Research). Database and analytic technology: The state of the union

Technical Session 2
2:10-2:30 Paul Brown SciDB: Massively
Parallel Array Data Storage, Processing and Analysis

2:30-2:50 Coffee Break
2:50-3:10 Julia Stoyanovich, William Mee,
Kenneth A. Ross Semantic Ranking and Result Visualization for Life
Sciences Publications

3:10-3:30 Mirek Riedewald, Alper Okcan,
Daniel Fink Scalable Search and Ranking for Scientific Data
3:30-4:30 C. Mohan (IBM Fellow and
IBM India Chief Scientist). Implications of Storage Class
Memories (SCMs) on Software Architectures

4:30 PM Poster Session and Appetizers / Drinks (Building 32, R&D Area, 4th Floor)
6:00 PM Adjourn

Thursday, January 7, 2010

Exadata's columnar compression

I recently came across a nice whitepaper from Oracle that describes the Exadata columnar compression scheme. I wrote up a brief overview of Oracle's columnar compression in the past (in my hybrid storage layout post), but I was pleased to see the whitepaper give some additional details. Before discussing the main content of the whitepaper, I want to correct a few technical inaccuracies in the article:

"Storing column data together, with the same data type and similar characteristics, drastically increases the storage savings achieved from compression. However, storing data in this manner can negatively impact database performance when application queries access more than one or two columns, perform even a modest number of updates, or insert small numbers of rows per transaction."

The first part of the above statement is usually true; columns tend to be self-similar since they contain data from the same attribute domain, and therefore laying out columns contiguously on storage typically yields lower entropy and higher compression ratios than storing rows contiguously. However, the second part of the statement is quite misleading and only true for the most na├»ve of column-oriented layout schemes. True, the benefit of storing data in columns decreases as a query accesses a higher percentage of columns from a table. However, the number of columns that need to be accessed such that storing data in columns impacts performance negatively relative to a storing data in rows is quite a bit larger than one or two columns. There’s a VLDB 2006 paper by Stavros Harizopoulos et. al. that runs a bunch of benchmarks to understand this tradeoff in more detail. There are a bunch of factors that affect whether column-storage or row-storage is best beyond simply the number of columns accessed (e.g. prefetch size, query selectivity, compression types), most notably the fact that column-stores struggle on needle-in-a-haystack queries since the multiple I/Os needed to get data from different columns are not amortized across the retrieval of multiple rows, but the bottom line is that the query space where pure column-oriented layout outperforms pure row-oriented layout is quite a bit larger than what the Oracle whitepaper claims. Regarding updates and inserts, the VLDB 2005 C-Store paper and OSDI 2006 Google Bigtable paper discuss how to alleviate this column-store disadvantage by temporarily storing the updates in a write-optimized store.

"Oracle’s Exadata Hybrid Columnar Compression technology is a new method for organizing data within a database block."

I feel that this is not giving enough credit to the VLDB 2001 paper by Ailamaki et. al. on PAX. Even though the PAX paper didn’t apply compression to this layout, and PAX didn’t include multiple blocks inside a single compression unit, the basic idea is the same.

One of the key benefits of the Exadata Hybrid Columnar approach is that it provides both the compression and performance benefits of columnar storage without sacrificing the robust feature set of the Oracle Database. For example, while optimized for scan-level access, because row data is self-contained within compression units, Oracle is still able to provide efficient row-level access, with entire rows typically being retrieved with a single I/O.

To people who have even a rudimentary knowledge of column-stores, the juxtaposed statements “entire rows being retrieved with a single I/O” and “provides the performance benefits of columnar storage” is clearly inaccurate. The most often cited advantage of column-stores (indeed cited even more often than the compression advantages) is that if you have a query that has to scan two out of twenty columns in a table (i.e. “tell me the sum of revenue per store”), a column-store only has to read in those two columns while a row-store has to scan the entire table (because multiple entire rows are picked up with each I/O). Hence, a column-store is very I/O efficient, an increasingly important bottleneck in database performance (albeit somewhat reduced with good compression algorithms). Again, Oracle’s statement is true for needle-in-a-haystack queries, but false for the more frequent scan-based queries one finds in data warehouse workloads (which is where Exadata is primarily marketed).


Despite the above overstatements, it is clear that Exadata’s Hybrid Columnar scheme is a big win (relative to vanilla Exadata) for most Exadata datasets and workloads, for the following reasons:

  1. Storage costs are lowered by having a smaller data footprint.
  2. I/O performance is improved by having to read less data from storage.
  3. Data is kept compressed as it is scanned in the storage layer. Selections and projections performed in the storage layer can be performed directly on the compressed data. (Those of you who are familiar with my academic work will not be surprised I like this, given that I wrote a paper on integrating compression and execution for column-oriented storage layouts in a SIGMOD 2006 paper). This yields all the storage costs savings and runtime I/O savings of compression without the performance disadvantage of runtime decompression.

    The whitepaper is vague when it comes to the question of whether
    data remains in compressed form after it is scanned in the Exadata Storage Servers as it is shipped over the Infiniband network to Oracle RAC for further processing. It appears from the way I read the paper (I could be wrong about this) that data is decompressed before it is shipped over the network, which means that Oracle is not gaining all of the potential performance from operating directly on compressed data, as data does have to be decompressed eventually, though the amount of data that needs to be decompressed is smaller, since selections and projections have occurred. If Oracle RAC could operate directly on the columnar compressed format, then decompression would never have to occur, which would further increase performance (query results would obviously have to be decompressed, but usually the magnitude of query results is much smaller than the magnitude of data processed to produce them). Not only is Oracle missing out on potential performance by decompressing data before sending it over the network, but also network bottlenecks could be introduced.
  4. If storage costs are the most important consideration (e.g. for historical data that would otherwise be archived to tape or other offline device), there’s a knob that allows you to further increase compression at the cost of having to decompress data at runtime (and this decompression is heavyweight), thereby reducing runtime performance. Furthermore, the same table can be compressed using the lightweight (high-performance) compression and heavyweight (low-performance) compression using Oracle’s partitioning feature (e.g. you could partition by date and have the historical partitions use the heavyweight compression scheme and the active partitions use the lightweight compression scheme). Note that lightweight still yields good compression --- Oracle claims up to 10X, but this is obviously going to be very dependent on your data.

The most interesting part of the whitepaper was the figure of a compression unit on page four. It is clear that one complication that arises from compressing each column separately is that for a fixed number of rows (e.g. tuples numbered 1-1000), the column sizes are going to vary wildly (depending on how compressible the data from each column is). Figuring out how to fit all columns from a set of rows into a fixed-size block (or set of blocks inside a compression unit) without leaving large chunks of the block empty is a nontrivial optimization problem, and something that Oracle probably had to think about in detail in their implementation.

Anyway, if you’re interested in Oracle’s approach to hybrid columnar storage, the whitepaper is worth a read. With rumors of additional developments in Oracle’s hybrid columnar storage starting to surface, it is likely that this will not be my last blog post on this subject .