Wednesday, July 15, 2009

Sybase IQ throws its hat into the in-DBMS analytics ring

Sybase announced this week the availability of Sybase IQ 15.1. The press release made it clear that this version is all about in-DBMS analytics. Perhaps the most notable addition is the integration of the DB Lytix analytics library (a product from Fuzzy Logix) into Sybase IQ, so that DB Lytix functions can be run inside the DBMS.

It's possible that I'm looking in the wrong places, but I have seen very little attention paid to this announcement. I take this as a symptom of a very good thing: so many DBMS vendors are adding in-DBMS analytics features to their products that announcements such as this are no longer really news. In the last 18 months we've seen:

- the announcement of the Teradata-SAS partnership, in which a number of SAS functions are being implemented inside Teradata and run in parallel on Teradata's shared-nothing architecture;

- Netezza opening up its development platform so that Netezza partners and customers can implement complex analytical functions inside the Netezza system;

- the announcement of in-database MapReduce functionality by Greenplum, Aster Data, and Vertica (though, as explained by Colin White, the vision of when MapReduce should be used --- e.g., for ETL or user queries --- varies across these three companies).

Though not yet announced, I'm told that Microsoft Project Madison (the shared-nothing version of SQL Server to be released in 2010) will natively run windowed analytic functions in parallel inside the DBMS.

As a DBMS researcher, I find this to be great news: the DBMS is starting to become the center of the universe, the location where everything happens. Though some would argue that in-DBMS analytics has been available for decades via user-defined functions (UDFs), many agree that UDF performance has been disappointing at best, and shipping data out of the DBMS to perform complex analytics has been common practice. The reasons for this are manifold: query optimizers have trouble estimating the cost of UDFs, arbitrary user-written code is hard to run in parallel automatically, and various security and implementation bugs manifest themselves (see, e.g., Section 4.3.5 of the "A Comparison of Approaches to Large Scale Data Analysis" paper).

One interesting trend to note is that two schools of thought are emerging on how to allow complex in-DBMS analytics without resorting to regular UDFs. Teradata, Microsoft, Sybase, and, to an extent, Netezza all seem to believe that providing a library of pre-optimized functions distributed with the software is the way to go. This allows the vendor to build into the system the ability to run these functions in parallel across all computing resources (a shared-nothing MPP cluster in all cases except Sybase) and to make sure these functions can be planned appropriately along with other data processing operations. The other school of thought is adopted by vendors that allow customers more freedom to implement their own functions, but constrain the language in which this code is written (such as MapReduce or LINQ) to facilitate its automatic parallelization inside the DBMS.
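The appeal of the second school is easiest to see in code. Below is a toy sketch of the MapReduce programming model in Python (not any vendor's actual API; the function names and sales data are invented for illustration). Because the user supplies only a map function and a reduce function, the framework controls the data flow and can partition the input and run the map phase in parallel without having to analyze arbitrary user code:

```python
# A toy MapReduce runner: the user writes only map_fn and reduce_fn;
# the framework decides how to partition the data and parallelize.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def map_fn(row):
    # User-supplied: emit (key, value) pairs from one input row.
    product, amount = row
    return [(product, amount)]

def reduce_fn(key, values):
    # User-supplied: combine all values observed for one key.
    return key, sum(values)

def run_mapreduce(rows, mapper, reducer, workers=4):
    # The framework partitions the input and applies the map phase to
    # each partition concurrently -- possible only because the user's
    # code is confined to the per-row mapper.
    chunks = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(
            lambda chunk: [kv for row in chunk for kv in mapper(row)],
            chunks)
        # Shuffle: group all emitted values by key.
        groups = defaultdict(list)
        for key, value in chain.from_iterable(mapped):
            groups[key].append(value)
    # Reduce phase: one reducer call per key (also parallelizable).
    return dict(reducer(key, vals) for key, vals in groups.items())

sales = [("widget", 3), ("gadget", 5), ("widget", 2), ("gadget", 1)]
totals = run_mapreduce(sales, map_fn, reduce_fn)
# totals == {"widget": 5, "gadget": 6}
```

A real in-database implementation would run the map partitions across the nodes of an MPP cluster rather than threads on one machine, but the design point is the same: constraining the shape of user code is what makes automatic parallelization tractable.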

Obviously, these camps are not mutually exclusive. As in-DBMS analytics continues to grow in popularity, it is likely we'll start to see vendors adopt both options.

Whatever school of thought you prefer, it is clear that the last 18 months have seen tremendous progress for in-database analytics. Shipping data out of the DBMS for analysis never made a lot of sense, and viable alternatives are finally emerging. Database systems are becoming increasingly powerful platforms for data processing, and this is good for everyone.
