muSOAing for 2/17/12 – Hadoop usage patterns

February 17, 2012

It is very interesting to note the various usage patterns that Hadoop is being put to. Unlike other offerings, Hadoop is a complete ecosystem. Its key components for data storage and retrieval are Hadoop itself, HBase and Hive. Depending on the use case, usage pattern and need, any one of them, or all three, can play a critical role in your Big Data strategy.

The typical rule of thumb is this: if you are dealing with semi-structured data like clickstream or weblog data, you will stick to Hadoop and use either Java or Pig to run some map/reduce on it, or you could look at analytical tools and engines like R and Mahout.
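
To make this concrete, here is a minimal map/reduce sketch in Java that counts hits per URL in an Apache-style access log. This is an illustration rather than anything from a real deployment: the class names are made up, and it assumes the standard org.apache.hadoop.mapreduce API of the 0.20-era releases.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts hits per URL in a weblog with one request per line.
public class UrlHitCount {

  public static class HitMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(" ");
      if (fields.length > 6) {      // assumes Apache common log format
        url.set(fields[6]);         // field 6 is the requested URL
        context.write(url, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text url, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      context.write(url, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "url-hit-count");
    job.setJarByClass(UrlHitCount.class);
    job.setMapperClass(HitMapper.class);
    job.setCombinerClass(SumReducer.class);   // pre-aggregate on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Pig would express the same pipeline in three or four lines of Pig Latin, which is exactly the trade-off between control and convenience alluded to above.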

If you are from the data warehousing world, you talk the language of relational databases. If that is the case, then HBase and Hive will be the answer to a lot of your needs. The good thing is that this ecosystem is able to cater to every type of need you might have in the Big Data space.

You now have an infrastructure that is truly elastic: you can add servers on demand. You do not have to throw away any data and can park all of it in your infrastructure. You also have all the tools and platforms needed to process and analyze this data.

Too easy? Well, yes and no. Setting up all of the above and getting to a level of agility where you can draw meaningful insights from your data still takes a lot of work. You need highly skilled people to set up, maintain and run such an infrastructure. Still, this ecosystem is growing and getting better with each passing day. The question these days is not whether you should adopt such a strategy, but when you should hop onto the bandwagon.

muSOAing for 1/20/12 – 2012 – year of BigData?

January 20, 2012

The question is whether 2012 will be the year of Big Data, or the 2010s the decade of Big Data. All the developments seem to validate the latter. Apart from swimming in an ocean of data, vendors, clients and potential users all face a giant tsunami of chatter and offerings in which it is becoming increasingly difficult to sort the grain from the chaff.

One can only hope that this does not become a stumbling block and barrier to Big Data adoption. It is quite obvious that Big Data has matured beyond the buzzword and hype stage to become very mainstream. There are challenges at every stage of adoption: selecting the technology and matching it to your actual needs, setting up the environment, moving data into this infrastructure and, finally, the biggest challenge of all, mining that data for meaningful information.

Though all this may seem a bit daunting to someone who is just getting their feet wet in Big Data, there are offerings out there that make this space very palatable to users across the spectrum. One is reminded of the EAI space, where one had to deal with proprietary server frameworks that lacked any kind of user interface and needed a lot of elbow grease and coding to set up and use. That problem was solved to a great extent by the advent of BPM technology, which put a friendly face in front of archaic EAI platforms.

The very same thing is happening in the Big Data space. Vendors like Pentaho are putting a very friendly face on Big Data offerings like Hadoop, Cassandra and MongoDB, and making this field very sexy.

muSOAing for 12/10/11 – Show me the analytics

December 11, 2011

Big Data is now poised for its next big leap (or leaps). Infrastructures like Hadoop, Cassandra and HBase have created gigantic parking lots for oceans of information. People no longer have to throw away their data and can hold on to every bit. The next frontiers to conquer will be ease of use, analytics and orchestration between heterogeneous big data and traditional middleware platforms; in short, an application layer over foundational building blocks like Hadoop, HBase and Hive. Once these are in place, adoption will accelerate, and the day will not be far off when these platforms become the bread and butter of data warehousing, BI and advanced analytics. In fact, this phenomenon is already starting to occur, and there is no looking back.

muSOAing for 11/24/11 – LINUXization of Hadoop

November 24, 2011

I was tempted to call it the Balkanization of Big Data, with Hortonworks being the Slovenia, Cloudera the Serbia and Apache's own distribution, probably, the Croatia. But that would be totally inappropriate. There will be no Balkanization, but there certainly will be LINUXization. At least these three versions will be pre-eminent, and which one people will opt for is anybody's guess. There are really two camps in the Hadoop world: the ones that favor RPM distributions and the die-hard tarball fans. I personally belong to the latter camp.

Personally, I think tarballs are the way to go, and anyone in the Hadoop release business should definitely support both. Once you have the process nailed, tarballs are the easiest to install. They lend themselves to super-fast distribution and deployment with tools like Puppet, or even with custom home-grown tools built from shell and tcl/expect scripts. Best of all, you can do a Hadoop install without root access, and that is the best perk that tarballs offer.

RPMs certainly have a mind of their own. Moreover, you need root access to install them. The files end up scattered in myriad places, and it is very hard to track and maintain an installation deployed that way.

All said and done, the Hadoop universe is truly unique in that it brings a hitherto non-existent synergy between hardware, software and distributed computing. For me personally it is very good to be part of this universe, and I would even go to the extent of calling it the Vedanta of high-end computing. Enough pontification for Thanksgiving Day; with this, I am taking my hands off the keyboard and closing the screen on my MacBook Pro (another great perk of working with Hadoop).

muSOAing for 10/14/11 – The Big Data Universe

October 14, 2011

It seems that with each passing day the Big Data universe keeps growing and new offerings appear in this space. With all this noise and clutter, how is one to formulate a Big Data strategy? A few patterns of Big Data usage are emerging. The basic tenet is still the same: massively parallel processing (MPP) on commodity servers. The code is co-located with the data, which is processed in place, with results sent back to a master node.

This being the paradigm, there is now a plethora of offerings in the Big Data space. You of course have the Hadoop universe with HBase, Hive, Pig etc., and other open-source platforms like Cassandra. Then you have the commercial ones like MongoDB, Couchbase, AllegroGraph, MapR, LexisNexis… Finally there is a third category of vendors, erstwhile purveyors of traditional data management technologies like Teradata, EMC and Oracle, now trying to re-invent themselves as Big Data leaders with offerings like AsterData, Greenplum and Exadata.

So how are these folks getting mindshare? Here is my take. Hadoop is still the bread and butter of Big Data computing. The barrier to entry is low: anyone can download and play around with the various distributions from Apache and Cloudera, and its ecosystem of products like HBase and Hive can solve, and is solving, real-life data management problems across verticals. This is evidenced by the various implementations running in production today, of which Yahoo is probably the best example.

The MongoDBs and Couchbases of the world seem to be solutions that are not so much general-purpose as aligned with specific verticals like advertising, telecom or healthcare. Given the unique requirements of these domains, the vendors have built value-added layers on top of the basic frameworks, in the form of search and aggregation algorithms, so for now they can probably be viewed as purpose-built Big Data platforms for specific verticals.

At the other end you have Teradata/AsterData, EMC/Greenplum and Oracle/Exadata. These can be viewed as vertically integrated solutions that have everything you need in a box, kind of like a Big Data Happy Meal. For all intents and purposes it is a black box: a refrigerator-sized Big Data appliance with everything you need, the software, the processors and the storage, all packaged into one and based on proprietary standards. These would probably be useful for folks who have been using the traditional offerings from these companies and now want to upgrade their existing infrastructures to support Big Data paradigms.

muSOAing for 10/7/11 – Hive Partitions

October 8, 2011

Partitioning is a very interesting concept in Hive. It lets you segment and categorize data based on system and business attributes: timestamps and dates, or fields like customer ID and order ID. One potential use of this feature is to facilitate updates of information in Hive: move the data in an existing partition out into an ODS, change the information there, and then rewrite the Hive partition with the changed data.
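
As a rough sketch of that partition-rewrite pattern, assuming Hive's bundled JDBC driver (the old HiveServer flavor on port 10000) and made-up table and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: rewrite one date partition of a Hive table after the data
// has been corrected in an external ODS. Table/column names are made up.
public class RewritePartition {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // A table partitioned by load date; each day's data lives in its own partition.
    stmt.execute("CREATE TABLE IF NOT EXISTS orders ("
        + " orderid STRING, customerid STRING, amount DOUBLE)"
        + " PARTITIONED BY (dt STRING)");

    // After exporting dt='2011-10-07' to the ODS and fixing it there, the
    // corrected rows are staged in orders_ods; overwrite just that partition.
    stmt.execute("INSERT OVERWRITE TABLE orders PARTITION (dt='2011-10-07')"
        + " SELECT orderid, customerid, amount FROM orders_ods");

    con.close();
  }
}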

Dynamic partitions are also very useful and powerful, and some of the features in the upcoming version of Hive (0.8.0) will be welcome, especially the table append feature with the INSERT INTO syntax. Hive has already proved to be a very robust platform for a Big Data warehouse. One thing still lacking, though, is effective front-end tools along the lines of Business Objects or Cognos; a lot of the implementations of GUI-driven analytics are currently home-grown, so there is a lot of opportunity here.
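
In the same hedged spirit, here is a sketch of dynamic partitions plus the 0.8.0 INSERT INTO append, again over JDBC and with hypothetical table names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: dynamic partitions plus the Hive 0.8.0 INSERT INTO (append)
// syntax. Table names are made up.
public class DynamicPartitions {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // Dynamic partitioning is off by default; enable it for this session.
    stmt.execute("set hive.exec.dynamic.partition=true");
    stmt.execute("set hive.exec.dynamic.partition.mode=nonstrict");

    // Hive routes each row to the right dt partition based on the last
    // column of the SELECT, creating partitions on the fly.
    stmt.execute("INSERT OVERWRITE TABLE orders PARTITION (dt)"
        + " SELECT orderid, customerid, amount, dt FROM staging_orders");

    // New in 0.8.0: append to a table/partition instead of overwriting it.
    stmt.execute("INSERT INTO TABLE orders PARTITION (dt='2011-10-08')"
        + " SELECT orderid, customerid, amount FROM staging_orders"
        + " WHERE dt='2011-10-08'");

    con.close();
  }
}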

muSOAing for 9/15/11 – Hive is buzzing

September 15, 2011

So there has always been a lot of buzz around Hive (no pun intended). Traditional DB users have always loved Hive, what with its SQL-like language for performing DDL and DML operations. When it comes to features like updates and indexing, though, things start to get a bit rough. Hive is certainly not a contender for low-latency queries, since it runs map/reduce for pretty much every type of query. Updates are another big issue, Hive being an append-only store. Innovative solutions like partitioning the data properly seem to be the way to go, but a lot of other strategies are being adopted to deal with this issue as well, including hooking Hive up with HBase. Watch this space for more updates (again, no pun intended).

muSOAing for 8/7/11 – The World of HBase

August 8, 2011

It seems that with each passing day there is something new happening in the HBase ecosystem. I have already mentioned the availability of co-processors from v0.91 onwards, but let us look at the plethora of options available today.

For someone from the relational world, HBase may come across as a very loosely typed database system. Even I was led to believe the same, but beware: this simplicity is very deceptive, akin to an iron hand in a velvet glove. Who would not be fooled, when all it takes to create a table is to specify its name and a column family? You need not even specify the table key or its columns; those can be defined on the fly. But that does not mean you can let your guard down. HBase was designed to handle mountains of information, and the way it does that is through its very rich Java API, which gives you myriad options for both loading information into and querying it out of HBase's highly distributed and scalable architecture.
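
To see just how little HBase asks for up front, here is a minimal table-creation sketch against the 0.90-era Java API; the table and family names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Sketch: creating an HBase table takes nothing more than a table name
// and a column family; row keys and columns are supplied later, on the fly.
public class CreateTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("weblogs");
    desc.addFamily(new HColumnDescriptor("hits"));
    admin.createTable(desc);
  }
}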

Mull this over (with a glass of your favorite libation): just for ETL you have a variety of options, from the most atomic Put to batch Puts, HFileOutputFormat, bulk loads and endless combinations of these, including map/reduce. There is an equally diverse buffet of options for querying, such as atomic Gets, batch Gets, range scans, co-processors etc.
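
A small taste of that buffet, sketched against the hypothetical "weblogs" table from the previous example: an atomic Put, an atomic Get and a range scan. Batch Puts are simply a List<Put> handed to the same table.put call.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: atomic Put/Get plus a range scan. Row keys are made up and
// chosen as date/url so that one day's rows sort together.
public class PutGetScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "weblogs");

    // Atomic Put: one row, one cell.
    Put put = new Put(Bytes.toBytes("2011-08-08/index.html"));
    put.add(Bytes.toBytes("hits"), Bytes.toBytes("count"), Bytes.toBytes(1L));
    table.put(put);

    // Atomic Get: fetch the row back.
    Result row = table.get(new Get(Bytes.toBytes("2011-08-08/index.html")));
    long count = Bytes.toLong(row.getValue(Bytes.toBytes("hits"), Bytes.toBytes("count")));
    System.out.println("count = " + count);

    // Range scan: all rows for one day, exploiting the sorted row keys.
    Scan scan = new Scan(Bytes.toBytes("2011-08-08"), Bytes.toBytes("2011-08-09"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();
    table.close();
  }
}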

Watch this space for more detailed information including examples of these features.

muSOAing for 7/19/11 – more on HBase

July 19, 2011

It seems that with each new version the HBase feature and function list keeps growing. One may complain that HBase is too loosely typed: you do not specify any key explicitly, and for that matter you do not even spell out the column names. All it asks of you initially is a table name and a column family. Maybe it was designed this way so you could hide data from others and operate in stealth mode?

When it comes to querying and mining information, even though HBase does not support SQL-like operations, you have a plethora of API options to perform a wide range of operations: atomic Gets, batch Gets, scans, map/reduce and, now with co-processors, complex aggregations. Even for the data-loading part it gives you a lot of rich API options. For the most part, it seems that HBase can support all your data management needs; even indexing can be achieved in a number of innovative ways.
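
As a hedged sketch of that co-processor aggregation idea, here is what it looks like with the AggregationClient that ships with later releases (0.92 and up); it assumes the table was created with the AggregateImplementation co-processor attached, the table and column names are hypothetical, and the exact class and method names may differ by version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: server-side aggregation via co-processors. Assumes the
// AggregateImplementation coprocessor is loaded on the target table.
public class CoprocessorSum {
  public static void main(String[] args) throws Throwable {
    Configuration conf = HBaseConfiguration.create();
    AggregationClient aggregator = new AggregationClient(conf);

    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("hits"), Bytes.toBytes("count"));

    // Each region computes its partial sum locally; only the partial
    // results travel back to the client.
    long total = aggregator.sum(Bytes.toBytes("weblogs"),
        new LongColumnInterpreter(), scan);
    System.out.println("total hits = " + total);
  }
}

The appeal is the same as map/reduce: the computation moves to the data instead of the data moving to the client.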

Watch this space for some detailed technical information on each of these features.

muSOAing for 7/2/11 – Hadoop, what is in store?

July 2, 2011

We can now safely say that Hadoop and its ecosystem of offerings are mainstream. This is evidenced by several indicators: first, the hockey-stick growth in the adoption of Hadoop across all verticals and horizontals; second, the plethora of new features being incorporated into Hadoop, HBase etc.; third, the formal announcement of Hortonworks at this week's Hadoop Summit. I could go on in the same vein.

Suffice it to say that Hadoop is doing to information management what the Web did to information access a decade and a half ago. It is a game-changing, industry-defining technology. Having said that, you have to be prepared for the usual hype and me-toos, the discordant noise emitted by all the folks who want to ride this gravy train. Cutting through the chaff to get to the grain will be the challenge, as with all such paradigm-shifting technologies. Expect the inevitable shakeouts and consolidations to follow in the next few months. A lot of that consolidation is already happening with acquisitions of key players; more will follow, and the ones left standing will be the true contenders that the industry at large can safely deal with.