muSOAing for 12/14/14

December 14, 2014

Hi there. It’s been a while; I have been very busy in the Big Data trenches. The frontlines are advancing, and there have been new frontiers to explore since my last post. A lot has transpired, and there will be a lot more action in the days, weeks, months and years to come.

Big Data has been adopted by many diverse types of organizations in the past couple of years. All this has done is bring to the fore a lot of lacunae and shortcomings, not only in the products themselves but also in the solutions that have been put together using them. To put it simply, a lot of creative engineering and elbow grease has gone into the solutions you see running in production today.

This is not to say that the people and organizations driving these technologies, on both the user side and the solution provider side, have let the grass grow under their feet, so to speak. This space is set to explode in the coming weeks and months. A lot of creative and innovative products and solutions are in the works. Let us put it this way: be ready to be disrupted (once again). Being on the frontlines, I have seen and am seeing a lot of this taking shape, and I will be sharing my thoughts, views and opinions on it in terms of the problem space, the appropriate technology, and who is building what to address the gaps that exist today. So watch this space closely.


Packt Publishing Offer

March 29, 2014

Check out #Packt’s amazing Buy One, Get One Free offer #Packt2k

muSOAing for 9/4/13 – Good book on Hive

September 4, 2013

Instant Apache Hive Essentials How-to by Darren Lee

Finally, a well-rounded and well-grounded book on Hive. Everyone thinks that HQL is the beginning and end of Hive, without realizing that there is so much more to this interesting technology. Darren has finally come out with this much-needed work, which explores the other key aspects of Hive: its storage formats and how they are essential for running efficient map/reduce, more advanced features such as UDFs, and the ability to interact with Hive through non-interactive mechanisms like drivers and the Thrift server. I personally found this book to be very useful and eye-opening, even though I have been working with Hive for the past two years. I give this work two thumbs up.

muSOAing for 7/11/13 – Hadoop the O/S for Big Data

July 11, 2013

It has been over a year since my last post. Hadoop has clearly emerged as the O/S for Big Data, the sun of the Big Data interplanetary system. What I have seen in the past year is:

a.  A lot more adoption of Hadoop

b.  Some real production deployments of Hadoop

c.  A lot of focus still on the ELT and Infrastructure portion

d.  An increasing shift to Advanced Scientific Analysis

This may sound like a cliche, but Hadoop has truly arrived. People are using it for real work. A lot of OLAP and decision support systems are now being run on this platform. I will be sharing my Hadoop implementation experiences both at a broad, holistic level and at the down-in-the-weeds technical level. There is a lot to talk about, especially how to get all these moving parts firing in harmony to keep the engine purring smoothly.

muSOAing for 6/3/12 – RegionObserver coprocessor

June 4, 2012

Coprocessors have introduced very powerful capabilities into HBase. The lifecycle methods of a coprocessor like RegionObserver give you trigger-like capability. For instance, you can override the preGet and postGet methods and perform your own logic before and after a read.

The code thus written can be deployed server-wide by registering the coprocessor class in hbase-site.xml (for example, via the hbase.coprocessor.region.classes property), or for a specific table from the HBase shell.

You can also now get stored-procedure-like capability by using coprocessor endpoints. I plan to explore this next.
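
The trigger-like behavior of observer hooks can be sketched in a few lines. This is only a toy model in plain Python and not the HBase API: the names Region, Observer and AuditObserver are hypothetical, and a real RegionObserver is written in Java against the HBase coprocessor interfaces. It just shows how a pre-hook can veto a read and a post-hook can see the result.

```python
# Toy model of observer-style hooks. This is NOT the HBase API: Region,
# Observer, and AuditObserver are hypothetical names used only to show
# how pre/post hooks wrap a read, the way a trigger wraps a statement.

class Observer:
    """Base class with no-op hooks; subclasses override what they need."""

    def pre_get(self, row_key):
        # Return True to let the read proceed, False to veto it.
        return True

    def post_get(self, row_key, value):
        # May inspect or replace the value before it reaches the caller.
        return value


class AuditObserver(Observer):
    """Hypothetical observer: blocks private rows, logs successful reads."""

    def __init__(self):
        self.log = []

    def pre_get(self, row_key):
        return not row_key.startswith("private:")

    def post_get(self, row_key, value):
        self.log.append(row_key)
        return value


class Region:
    """Stand-in for a store that calls the hooks around every get."""

    def __init__(self, rows, observers):
        self.rows = dict(rows)
        self.observers = list(observers)

    def get(self, row_key):
        for obs in self.observers:
            if not obs.pre_get(row_key):
                return None  # a pre-hook vetoed the read
        value = self.rows.get(row_key)
        for obs in self.observers:
            value = obs.post_get(row_key, value)
        return value


audit = AuditObserver()
region = Region({"user:1": "alice", "private:1": "secret"}, [audit])
print(region.get("user:1"))     # the read goes through and is logged
print(region.get("private:1"))  # the pre-hook blocks this read
print(audit.log)
```

The caller only ever calls get; the extra logic lives in the observer, which is what makes the mechanism feel like a database trigger.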

muSOAing for 6/1/12 – CDH4B2 HBase coprocessors

June 1, 2012

Simple. I mean the setup process to run my first coprocessor app, which tests out the built-in AggregationClient. I still think that the scan on the table is a bit slow. I see Lucene and some indexing on the horizon.
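
The idea behind server-side aggregation can be sketched roughly as follows. This is a plain Python toy and not the AggregationClient API: the region layout and the function names region_partial and client_aggregate are made up for illustration. The point is that each region reduces its own rows to a small partial result, and the client only merges those partials instead of pulling every row over the wire.

```python
# Toy model of endpoint-style aggregation. NOT the HBase AggregationClient
# API: the region layout and function names here are hypothetical.

def region_partial(rows):
    """Runs 'inside' one region: returns (sum, count) for its rows only."""
    values = list(rows.values())
    return sum(values), len(values)


def client_aggregate(regions):
    """Runs on the client: merges one small partial result per region."""
    total, count = 0, 0
    for rows in regions:
        part_sum, part_count = region_partial(rows)
        total += part_sum
        count += part_count
    average = total / count if count else None
    return total, count, average


# Three pretend regions, each holding a slice of the table.
regions = [
    {"row1": 10, "row2": 20},
    {"row3": 30},
    {"row4": 40, "row5": 50},
]
print(client_aggregate(regions))
```

Each region still has to scan its own rows, so this sketch says nothing about why the scan felt slow; it only shows where the work is done.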

muSOAing for 5/31/12 – CDH4B2

June 1, 2012

Up and running with CDH4B2 Hadoop and HBase (pseudo-distributed mode; Mac OS X rocks). I checked out MRv2, and the whole underlying architecture has changed. There is no more JobTracker and TaskTracker; instead you now have the ResourceManager and NodeManager, based on the new YARN framework. My map/reduce jobs did seem to run a lot faster, though I have not done any benchmarks yet; I am planning to do some for m/r and HBase.

The new YARN framework definitely needs a lot more memory on startup (4GB). The HBase processes are the same, but the nice thing about this version is that there is no ambiguity about ZooKeeper: you can start the process separately. Next on the list to check out are Hadoop high availability and HBase coprocessors (check this space for future updates). And did I mention that Mac OS X rocks? Cloudera, you rock too. Thank you for continuing to provide these tarballs.

muSOAing for 4/28/12 – Wazzup Big Data?

April 29, 2012

A question I keep getting asked is: what are the different ways in which I can use Hadoop? I always tell them the sky is the limit. OK, jokes aside, what can you not do with Hadoop? After you get through the initial challenges of ETL, data formats, normalization and so on, the fun really begins. Now you think about what sort of insights you want into your data: is it related to clustering or classification, or is it related to things like real-time analytics and information targeted to specific users? Whatever the application may be, there are tools ready to serve your needs. Of late I have seen tools like Mahout, R and SAS being used increasingly to address these needs. Complex algorithms beyond k-means and Bayesian classification are also ideal for an HPC environment like Hadoop.
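
For readers who have not met k-means, here is a minimal single-machine sketch of the standard algorithm (Lloyd's iteration) in plain Python. The points and starting centroids are made up for illustration, and this is not how Mahout or any particular tool implements it. On a cluster, the assignment step is the part that is usually spread across many workers.

```python
# Minimal k-means (Lloyd's algorithm) on 2-D points, single machine.
# The data and starting centroids below are made up for illustration.

def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    distances = [
        (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2 for c in centroids
    ]
    return distances.index(min(distances))


def kmeans(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                xs = [p[0] for p in cluster]
                ys = [p[1] for p in cluster]
                updated.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                updated.append(old)  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

With this toy data the two groups separate after the first pass, so the loop stops on the second iteration.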

muSOAing for 4/7/12 – Hadoop cures depression

April 8, 2012

Well, the title of this post may seem a bit dramatic, but believe me, it is quite true. This is because it lets me do all my work in the very eclectic and democratic world of Linux (if you know what I mean). Not only is the technology so cool, a perfect marriage of powerful parallel processing software and powerful server-grade hardware, it also lets your creativity take wing with all the other great tools and technologies in the ever-growing Hadoop ecosystem.

This ecosystem is being sustained and expanded by a great bunch of individuals with whom it has been my honor and privilege to work. Take, for instance, the time I was really stuck trying to resolve a quirky network issue when setting up my cluster. What do you know, this great dude and my close friend and colleague Sujee Maniyam had just the solution I was looking for. Not only that, he had published this piece on GitHub. This is what makes this whole space so fun and exciting to be in; it almost seems that you can have your cake and eat it too. And last but not least, I now get to work only on my beloved MacBook Pro.

muSOAing for 3/7/12 – Hadooptopia?

March 8, 2012

Installed CDH4 B1 on my new super duper quad-core MacBook Pro. I plan to check out all its new features, especially HBase coprocessors and the new features of Hive. Of course, trying out the NameNode high availability feature is also on the list. With the release of all these features, Hadoop has become an even stronger and more compelling platform. The ever-growing Hadoop ecosystem of products has something for everyone.

Along with this, almost everybody who is a somebody in the Big Data space is building application layers on top of Hadoop to make it more user-friendly. Companies like Pentaho, Tableau and Karmasphere come to mind. Watch this space for updates on all of the features mentioned in this post.