Archive for February, 2012

muSOAing for 2/17/12 – Hadoop usage patterns

February 17, 2012

It is very interesting to note the various usage patterns that Hadoop is being subjected to. Hadoop, unlike other offerings, is a complete ecosystem. Its key components in terms of data storage and retrieval are Hadoop itself, HBase and Hive. Depending on the use case, usage pattern and need, any one or all three of them can play a critical role in your Big Data strategy.

The typical rule of thumb is this: if you are dealing with semi-structured data like clickstream or weblog data, you will stick to Hadoop and use either Java or Pig to run map/reduce jobs over it. You could also look at analytical tools and engines like R and Mahout.
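To make the map/reduce idea a little more concrete, here is a minimal sketch in plain Python that counts hits per page in some weblog lines. It is not the Hadoop API and runs entirely in memory; the log layout ("ip method path status"), the sample lines and the function names are all assumptions made up for illustration. On a real cluster the same three steps would be spread across many machines.

```python
from collections import defaultdict

# Hypothetical sample lines in an assumed "ip method path status" layout.
SAMPLE_LOG = [
    "10.0.0.1 GET /home 200",
    "10.0.0.2 GET /products 200",
    "10.0.0.1 GET /home 200",
    "10.0.0.3 GET /missing 404",
    "malformed line",
]


def map_line(line):
    """Map step: emit a (path, 1) pair for each well-formed log line."""
    fields = line.split()
    if len(fields) != 4:
        return []  # skip lines that do not match the assumed layout
    _ip, _method, path, _status = fields
    return [(path, 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def count_hits(lines):
    """Run map, shuffle and reduce in sequence over the given lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_hits(SAMPLE_LOG))
```

The point of the pattern is that the map and reduce functions only ever see one line or one key at a time, which is what lets a framework like Hadoop run them in parallel over data that is too large for one server.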

If you are from the data warehousing world, then you talk the language of relational databases. If this is the case, then HBase and Hive will be the answer to a lot of your needs. The good thing is that this ecosystem is able to cater to every type of need that you might have in the Big Data space.

You now have an infrastructure that is truly elastic: you can add servers on demand. You do not have to throw away any data and can park all of it in your infrastructure. You also have all the tools and platforms needed to process and analyze this data.

Too easy? Well, yes and no. Setting up all of the above and getting to a level of agility where you are able to get meaningful insights into your data still takes a lot of work. You do need highly skilled people to set up, maintain and run such an infrastructure. This ecosystem is growing and becoming better with each passing day. The question these days is not if you should adopt such a strategy but when you should hop onto this bandwagon.