Archive for the ‘NoSQL’ Category

muSOAing for 7/11/13 – Hadoop the O/S for Big Data

July 11, 2013

It has been over a year since my last post. Hadoop has clearly emerged as the O/S for Big Data, the sun of the Big Data interplanetary system. What I have seen in the past year is:

a.  A lot more adoption of Hadoop

b.  Some real production deployments of Hadoop

c.  A lot of focus still on the ELT and Infrastructure portion

d.  An increasing shift to Advanced Scientific Analysis

This may sound like a cliche, but Hadoop has truly arrived. People are using it for real work. A lot of OLAP and decision support systems are now being run on this platform. I will be sharing my Hadoop implementation experiences both at a broad, holistic level and at the down-in-the-weeds technical level. There is a lot to talk about, especially how to get all these moving parts firing in harmony to keep the engine purring smoothly.


muSOAing for 6/3/12 – RegionObserver coprocessor

June 4, 2012

Coprocessors have introduced very powerful capabilities into HBase. The lifecycle methods of a coprocessor like RegionObserver give you trigger-like capability. For instance, you can override the preGet and postGet methods and perform your own logic.

Coprocessor code written this way can be deployed server-wide by registering it in hbase-site.xml, or for a specific table from the HBase command shell.
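The trigger-like behaviour of the preGet and postGet hooks can be modelled in a few lines. What follows is only a rough sketch in Python, not the HBase API: real coprocessors are Java classes, and every name here (ObserverSketch, AuditAndMask, RegionSketch, the "ssn" column) is hypothetical and chosen purely for illustration.

```python
class ObserverSketch:
    """Hypothetical stand-in for a region observer; NOT the HBase API."""

    def pre_get(self, row_key):
        """Hook called before the row is read."""

    def post_get(self, row_key, result):
        """Hook called after the row is read; returns the result to hand back."""
        return result


class AuditAndMask(ObserverSketch):
    """Example observer: records every read and hides selected columns."""

    def __init__(self, hidden_columns):
        self.hidden_columns = set(hidden_columns)
        self.audit_log = []

    def pre_get(self, row_key):
        # Trigger-like side effect that fires before the read happens.
        self.audit_log.append(row_key)

    def post_get(self, row_key, result):
        # Alter the result on its way back to the caller.
        return {col: val for col, val in result.items()
                if col not in self.hidden_columns}


class RegionSketch:
    """Toy region: a dict of rows plus a list of registered observers."""

    def __init__(self, rows, observers):
        self.rows = rows
        self.observers = observers

    def get(self, row_key):
        for observer in self.observers:
            observer.pre_get(row_key)
        # Copy the stored row so observers never mutate the data itself.
        result = dict(self.rows.get(row_key, {}))
        for observer in self.observers:
            result = observer.post_get(row_key, result)
        return result


observer = AuditAndMask(hidden_columns=["ssn"])
region = RegionSketch(
    {"row1": {"name": "alice", "ssn": "000-00-0000"}},
    [observer],
)
```

Calling region.get("row1") runs the pre hook, reads the row, then runs the post hook, so the caller gets the row back without the hidden column while the observer keeps a record of the access.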

You can also now get stored-procedure-like capability by using coprocessor endpoints. I plan to explore this next.
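The general idea behind an endpoint is that a small piece of logic runs next to the data in each region and only a compact partial result travels back to the client, which merges the partials. Here is a minimal sketch of that pattern in plain Python, assuming regions are simply lists of row dicts; the function names and the sample data are hypothetical and this is not the HBase endpoint API.

```python
def count_and_sum_endpoint(region_rows, column):
    """Endpoint-style function: runs against ONE region, returns a small partial."""
    values = [row[column] for row in region_rows if column in row]
    return len(values), sum(values)


def aggregate(regions, column):
    """Client side: invoke the endpoint once per region and merge the partials."""
    total_count, total_sum = 0, 0
    for region_rows in regions:
        count, partial_sum = count_and_sum_endpoint(region_rows, column)
        total_count += count
        total_sum += partial_sum
    return total_count, total_sum


# Two toy "regions"; one row lacks the column and is skipped.
REGIONS = [
    [{"amount": 10}, {"amount": 5}],
    [{"amount": 7}, {"other": 1}],
]
```

The point of the design is that the client never sees the individual rows, only one (count, sum) pair per region.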

muSOAing for 6/1/12 – CDH4B2 HBase coprocessors

June 1, 2012

Simple. I mean the setup process for running my first coprocessor app to test out the built-in AggregationClient. I still think that the scan on the table is a bit slow. I see Lucene and some indexing on the horizon.

muSOAing for 5/31/12 – CDH4B2

June 1, 2012

Up and running with CDH4B2 Hadoop and HBase (pseudo-distributed mode, Mac OS X rocks). Checked out MRv2. The whole underlying architecture has changed: no more JobTracker and TaskTracker; instead you now have the ResourceManager and NodeManager of the new YARN framework. My map/reduce jobs did seem to run a lot faster, though I have not done any benchmarks yet. I am planning to do some for m/r and HBase.

The new YARN framework definitely needs a lot more memory on startup (4GB). The HBase processes are the same, but the nice thing about this version is that there is no ambiguity about ZooKeeper: you can start the process separately. Next on my list to check out are Hadoop high availability and HBase coprocessors (watch this space for updates). And did I mention that Mac OS X rocks? Cloudera, you rock too. Thank you for continuing to provide these tarballs.

muSOAing for 4/28/12 – Wazzup Big Data?

April 29, 2012

A question I get asked very frequently is: what are the different ways in which I can use Hadoop? I always tell them the sky is the limit. OK, jokes aside, what can you not do with Hadoop? After you get through the initial challenges of ETL, data formats, normalization etc., the fun really begins. Now you think: what sort of insights do I want into my data? Is it related to clustering or classification, or is it related to stuff like real-time analytics and information that is targeted to specific users? Whatever the application may be, there are tools ready to serve your needs. Of late I have seen tools like Mahout, R and SAS being used increasingly to address these needs. Complex algorithms, in addition to k-means and Bayesian methods, are ideal for an HPC environment like Hadoop.

muSOAing for 4/7/12 – Hadoop cures depression

April 8, 2012

Well, the title of this post may sound a bit dramatic, but believe me, it is quite true. This is because Hadoop lets me do all my work in the very eclectic and democratic world of Linux (if you know what I mean). Not only is the technology so cool, a perfect marriage between powerful parallel processing software and powerful server-grade hardware, it also lets your creativity take wing with all the other great tools and technologies in the ever-growing Hadoop ecosystem.

This ecosystem is being sustained and expanded by a great bunch of individuals with whom it has been my honor and privilege to work. Take for instance the time I was really stuck trying to resolve a quirky network issue when setting up my cluster. What do you know, this great dude, my close friend and colleague Sujee Maniyam, had just the solution I was looking for, and not only that, he had published it on GitHub. This is what makes this whole space so fun and exciting to be in. It almost seems that you can have your cake and eat it too. Last but not least, I now get to work only on my beloved MacBook Pro.

muSOAing for 3/7/12 – Hadooptopia?

March 8, 2012

Installed CDH4 B1 on my new super-duper quad-core MacBook Pro. I plan to check out all its new features, especially HBase coprocessors and the new features of Hive. Trying out the NameNode high availability feature is also on the list, of course. With the release of all these features, Hadoop has become an even stronger and more compelling platform. The ever-growing Hadoop ecosystem of products has something for everyone.

Along with this, almost everybody who is a somebody in the Big Data space is building application layers on top of Hadoop to make it more user-friendly and easier to use. Companies like Pentaho, Tableau and Karmasphere come to mind. Watch this space for updates on all of the features mentioned in this post.

muSOAing for 2/17/12 – Hadoop usage patterns

February 17, 2012

It is very interesting to note the various usage patterns that Hadoop is being subjected to. Hadoop, unlike other offerings, is a complete ecosystem. Its key components in terms of data storage and retrieval are Hadoop, HBase and Hive. Depending on the use case, usage pattern and need, any one or all three of them can play a critical role in your Big Data strategy.

The typical rule of thumb is: if you are dealing with semi-structured data like clickstream or weblog data, you will stick to Hadoop and use either Java or Pig to run some map/reduce on it, or you could also look at analytical tools and engines like R and Mahout.
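To make the map/reduce idea concrete, here is a minimal in-memory sketch in Python that counts hits per URL in weblog lines. It only imitates the map, shuffle and reduce phases in a single process; it is not a Hadoop job, and the log layout ("ip timestamp method url status") and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_line(line):
    """Map phase: emit (url, 1) for each well-formed log line."""
    parts = line.split()
    # Assumed layout: "<ip> <timestamp> <method> <url> <status>"
    if len(parts) == 5:
        yield parts[3], 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce over an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


SAMPLE_LOG = [
    "10.0.0.1 2012-02-17T10:00:00 GET /home 200",
    "10.0.0.2 2012-02-17T10:00:01 GET /products 200",
    "10.0.0.1 2012-02-17T10:00:05 GET /home 200",
    "malformed line",
]
```

On a real cluster the framework does the shuffle for you and runs many mappers and reducers in parallel; the per-record logic stays about this small.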

If you are from the data warehousing world, then you talk the language of relational databases. If this is the case, then HBase and Hive will be the answer to a lot of your needs. The good thing is that this ecosystem is able to cater to every type of need that you might have in the Big Data space.

You now have an infrastructure that is truly elastic, you can add servers on demand.  You do not have to throw away any data and can park all of it in your infrastructure.    You also have all the tools and platforms needed to process and analyze this data.

Too easy? Well, yes and no. Setting up all of the above and getting to a level of agility where you are able to get meaningful insights into your data still takes a lot of work. You do need highly skilled people to set up, maintain and run such an infrastructure. This ecosystem is growing and becoming better with each passing day. The question these days is not whether you should adopt such a strategy but when you should hop onto this bandwagon.

muSOAing for 1/20/12 – 2012 – year of BigData?

January 20, 2012

The question is whether 2012 will be the year of Big Data or the 2010s will be the decade of Big Data. All the developments seem to validate the latter. Apart from swimming in an ocean of data, vendors, clients and potential users are all faced with a giant tsunami of chatter and offerings, which makes it increasingly difficult to sort the wheat from the chaff.

One can only hope that this does not become a stumbling block and barrier to Big Data adoption. It is quite obvious that Big Data has matured from the buzzword and hype stage to become very mainstream. There are challenges that one faces at every stage of Big Data adoption: technology selection and matching it up with the actual needs, setting up the environment, the whole process of moving data into this infrastructure and finally, the biggest challenge, mining this data to get meaningful information from it.

Though all this seems a bit daunting to someone who is just getting their feet wet in BigData, there are offerings out there that make this space very palatable to users across the spectrum. One is reminded of the EAI space, where one had to deal with proprietary server frameworks which lacked any type of user interface and needed a lot of elbow grease and coding to set up and use. This problem was solved to a great extent with the advent of BPM technology, which put a friendly face in front of archaic EAI platforms.

The very same is happening in the BigData space. Vendors like Pentaho are putting a very friendly face on Big Data offerings like Hadoop, Cassandra, MongoDB etc. and making this field very sexy.

muSOAing for 12/10/11 – Show me the analytics

December 11, 2011

BigData is now poised for its next big leap(s). Infrastructures like Hadoop, Cassandra, HBase etc. have created gigantic parking lots for oceans of information. People no longer have to throw away their data and can hold on to every bit. Now the next frontiers to conquer will be ease of use, analytics and orchestration between heterogeneous big data and traditional middleware platforms. In short, what is needed is an application layer over foundational building blocks like Hadoop, HBase and Hive. Once these are in place, adoption will accelerate, and the day will not be far off when these become the bread-and-butter platforms for data warehousing, BI and Advanced Analytics. In fact, this phenomenon is already starting to occur and there is no looking back.