Archive for October, 2011

muSOAing for 10/14/11 – The Big Data Universe

October 14, 2011

With each passing day, the Big Data universe keeps growing and new offerings appear in this space. With all this noise and clutter, how is one to formulate a Big Data strategy? A few patterns of Big Data usage seem to be emerging. The basic tenet is still the same: MPP (massively parallel processing) on commodity servers. The code is co-located with the data, the data is processed where it lives, and the results are sent back to a master node.
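To make the paradigm concrete, here is a minimal Python sketch of the scatter and gather idea. It is only an illustration: the node names, the SHARDS data and the helper functions are made up for this example, and threads stand in for real worker machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each worker node holds its own shard of the data.
SHARDS = {
    "node-1": ["hadoop hive pig", "hive hbase"],
    "node-2": ["cassandra hadoop"],
    "node-3": ["hive hadoop hadoop"],
}

def count_words(lines):
    """The 'code' that is shipped to a node and run against its local shard."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def run_on_cluster(task, shards):
    """Master node: run the task against every shard, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(task, shards.values()))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the per-node counts together
    return total

totals = run_on_cluster(count_words, SHARDS)
```

Only the small per-node counts travel back to the master; the raw lines never leave their shard, which is the whole point of moving the code to the data.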

With this as the paradigm, there is now a plethora of offerings in the Big Data space. You of course have the Hadoop universe with HBase, Hive, Pig etc., and other open source platforms like Cassandra. Then you have the commercial or commercially backed ones like MongoDB, Couchbase, AllegroGraph, MapR, LexisNexis… You then have a third category: vendors who were erstwhile purveyors of traditional data management technologies, like Teradata, EMC and Oracle, now trying to re-invent themselves as Big Data leaders with offerings like AsterData, Greenplum and Exadata.

So how are these folks getting mindshare? Here is my take on this. Hadoop is still the bread and butter of Big Data computing. The barrier to entry is low: anyone can download and play around with the various distributions from Apache and Cloudera, and its ecosystem of products like HBase and Hive can solve, and is solving, real-life data management problems across verticals. This is evidenced by the various implementations running in production today, of which Yahoo is probably the best example.

The MongoDBs and Couchbases of the world seem to be less general-purpose solutions than ones aligned with specific verticals like advertising, telcos or healthcare. Given the unique requirements of these domains, the vendors have built value-added layers on top of the basic frameworks in the form of search and aggregation algorithms. For now, then, they can probably be viewed as purpose-built Big Data platforms aligned with specific verticals.

At the other end you have Teradata/AsterData, EMC/Greenplum and Oracle/Exadata. These can be viewed as vertically integrated solutions that have everything you need in a box, kind of like a Big Data Happy Meal. For all intents and purposes it is a black box. You get a refrigerator-sized Big Data appliance with everything you need (the software, the processors and the storage) packaged into one and based on proprietary standards. These would probably be useful for folks who have been using the traditional offerings from these companies and now want to upgrade their existing infrastructure to support Big Data paradigms.


muSOAing for 10/7/11 – Hive Partitions

October 8, 2011

Partitions are a very interesting concept in Hive. Hive lets you segment and categorize data based on system and business attributes like timestamp or date, or fields like customerid, orderid etc. One potential use of this feature is to facilitate updates of information in Hive. For example, you could move the data in an existing partition into an ODS (operational data store), change it there, and then rewrite the partition in Hive with the changed information.
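The update pattern described above can be sketched in a few lines of Python. This is not Hive code: the orders dictionary is a made-up stand-in for a table partitioned by date, and export_partition and overwrite_partition are hypothetical helpers that play the roles of the ODS export and the partition rewrite.

```python
# Hypothetical in-memory stand-in for a Hive table partitioned by order date.
# Each key plays the role of a partition such as dt='2011-10-07'.
orders = {
    "2011-10-06": [{"orderid": 1, "customerid": "c1", "status": "shipped"}],
    "2011-10-07": [
        {"orderid": 2, "customerid": "c2", "status": "open"},
        {"orderid": 3, "customerid": "c1", "status": "open"},
    ],
}

def export_partition(table, key):
    """Copy one partition out to the staging store (the ODS in the text)."""
    return [dict(row) for row in table[key]]

def overwrite_partition(table, key, rows):
    """Replace the whole partition; other partitions are left untouched."""
    table[key] = [dict(row) for row in rows]

# 1. Move the partition into the staging store.
staged = export_partition(orders, "2011-10-07")

# 2. Change the information there.
for row in staged:
    if row["orderid"] == 2:
        row["status"] = "shipped"

# 3. Rewrite the partition with the changed rows.
overwrite_partition(orders, "2011-10-07", staged)
```

In Hive itself the last step would correspond to an INSERT OVERWRITE TABLE ... PARTITION (...) statement. The appeal of the pattern is that only the one affected partition is rewritten rather than the whole table.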

Dynamic partitions are also very useful and powerful. Some of the features in the upcoming version of Hive (0.8.0) will be very useful, especially the table append feature with the INSERT INTO syntax. Hive has already proved to be a very robust client for a Big Data warehouse. One thing still lacking, though, is effective front-end tools like Business Objects or Cognos. A lot of the implementations for GUI-driven analytics are currently home-grown, so there is a lot of opportunity.
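As a rough illustration of how dynamic partitioning and append differ from overwrite, here is a small Python sketch. It only models the semantics: insert_rows and the dictionary-based table are hypothetical, and real Hive does this with files and directories rather than in-memory lists.

```python
from collections import defaultdict

def insert_rows(table, rows, partition_col, overwrite=False):
    """Route each row to a partition chosen from its own column value.

    overwrite=False appends to existing partitions (INSERT INTO style);
    overwrite=True replaces every partition touched by this batch
    (INSERT OVERWRITE style).
    """
    # Group the incoming batch by partition value (the "dynamic" part).
    batch = defaultdict(list)
    for row in rows:
        batch[row[partition_col]].append(row)

    for key, new_rows in batch.items():
        if overwrite or key not in table:
            table[key] = list(new_rows)
        else:
            table[key].extend(new_rows)
    return table

table = {}
# First load creates two partitions from the data itself.
insert_rows(table, [{"dt": "2011-10-07", "orderid": 1},
                    {"dt": "2011-10-08", "orderid": 2}], "dt")
# Append: the existing 2011-10-08 partition keeps its old row.
insert_rows(table, [{"dt": "2011-10-08", "orderid": 3}], "dt")
# Overwrite: only the 2011-10-07 partition is replaced.
insert_rows(table, [{"dt": "2011-10-07", "orderid": 4}], "dt", overwrite=True)
```

The partition for each row is never named by the caller; it falls out of the row's dt value, which is what makes the load dynamic.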