Archive for the ‘Cloud Computing’ Category

muSOAing for 6/3/12 – RegionObserver coprocessor

June 4, 2012

Coprocessors have introduced very powerful capabilities into HBase. The lifecycle methods of a coprocessor like RegionObserver give you trigger-like capability. For instance, you can override the preGet and postGet methods and perform your own logic.
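In HBase itself you would do this server-side by extending the observer base class; the trigger-like idea can be sketched in plain, self-contained Java (a toy stand-in, not the real HBase API; all class and method names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the coprocessor idea: hooks fire around a get, like triggers.
interface Observer {
    default String preGet(String row) { return row; }                   // may rewrite the request
    default String postGet(String row, String value) { return value; }  // may rewrite the result
}

class Region {
    private final Map<String, String> store = new HashMap<>();
    private final Observer observer;

    Region(Observer observer) { this.observer = observer; }

    void put(String row, String value) { store.put(row, value); }

    String get(String row) {
        String effectiveRow = observer.preGet(row);    // pre-hook, like preGet
        String value = store.get(effectiveRow);
        return observer.postGet(effectiveRow, value);  // post-hook, like postGet
    }
}

public class ObserverSketch {
    public static void main(String[] args) {
        // An observer that masks digits on the way out, trigger-style.
        Region region = new Region(new Observer() {
            @Override public String postGet(String row, String value) {
                return value == null ? null : value.replaceAll("\\d", "*");
            }
        });
        region.put("user1", "ssn:123456789");
        System.out.println(region.get("user1")); // prints ssn:*********
    }
}
```

The point is that the caller never invokes the hooks directly; the region does, which is what makes the observer behave like a database trigger.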

The code thus written can be deployed server-wide by including the change in hbase-site.xml, or for a specific table from the HBase command shell.
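For the server-wide case, the observer class goes on the region servers' classpath and is named in hbase-site.xml, along these lines (the class name is hypothetical; the property shown is the region-observer one from the coprocessor docs of this era):

```xml
<!-- hbase-site.xml: load the observer on every region server -->
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>com.example.MyRegionObserver</value>
</property>
```

For a single table, the shell's alter command attaches the coprocessor as a table attribute instead, so only that table's regions run it.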

You can also now get stored-procedure-like capability by using coprocessor endpoints. I plan to explore this next.

muSOAing for 6/1/12 – CDH4B2 HBase coprocessors

June 1, 2012

Simple: that describes the setup process to run my first coprocessor app, testing out the built-in AggregationClient. I still think the scan on the table is a bit slow. I see Lucene and some indexing on the horizon.

muSOAing for 5/31/12 – CDH4B2

June 1, 2012

Up and running with CDH4B2 Hadoop and HBase (pseudo-distributed mode; Mac OS X rocks). Checked out MRv2. The whole underlying architecture has changed: no more JobTracker and TaskTracker; instead you now have the NodeManager and ResourceManager, based on the new YARN framework. I have not done any benchmarks yet but am planning to do some for map/reduce and HBase; my map/reduce jobs did seem to run a lot faster, though.

The new YARN framework definitely needs a lot more memory on startup (4GB). The HBase processes are the same, but the nice thing about this version is that there is no ambiguity about ZooKeeper: you can start the process separately. I plan to check out the coprocessor feature (check this space for future updates). Next on the list are Hadoop high availability and HBase coprocessors. And did I mention that Mac OS X rocks? Cloudera, you rock too. Thank you for continuing to provide these tarballs.

muSOAing for 10/14/11 – The Big Data Universe

October 14, 2011

It seems that with each passing day the big data universe keeps growing, and one sees new offerings in this space. With all this noise and clutter, how is one to formulate a Big Data strategy? It seems that a few patterns of Big Data usage are emerging. The basic tenet is still the same: MPP on commodity servers. The code is co-located with the data, and results are processed locally and sent back to a master node.
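That tenet, the same code shipped to each data partition with only small partial results returned to a master, can be sketched in a toy single-process form (illustrative only; real MPP systems do this across machines):

```java
import java.util.Arrays;
import java.util.List;

public class ScatterGather {
    // Each "node" runs the same code against its local partition of the data
    // and returns only a small partial result.
    static long countLocal(List<String> partition, String term) {
        return partition.stream().filter(line -> line.contains(term)).count();
    }

    public static void main(String[] args) {
        // Two partitions standing in for two data nodes.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("big data rising", "hadoop rocks"),
                Arrays.asList("hbase scans", "more big data"));
        // The "master" merely sums the partials; the raw data never moves.
        long total = partitions.stream().mapToLong(p -> countLocal(p, "big data")).sum();
        System.out.println(total); // prints 2
    }
}
```

The design point is in the last line: only one number per partition crosses the "network", which is why this scales on commodity hardware.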

This being the paradigm, there is a plethora of offerings now in the Big Data space. You of course have the Hadoop universe with HBase, Hive, Pig, etc., plus other open-source platforms like Cassandra. Then you have the commercial ones like MongoDB, Couchbase, AllegroGraph, MapR, LexisNexis… And there is a third category: vendors who were erstwhile purveyors of traditional data management technologies, like Teradata, EMC and Oracle, now trying to re-invent themselves as Big Data leaders with offerings like Aster Data, Greenplum and Exadata.

So how are these folks getting mindshare? Here is my take. Hadoop is still the bread and butter of Big Data computing. The barrier to entry is low: anyone can download and play around with the various distributions from Apache and Cloudera, and its ecosystem of products like HBase and Hive can solve, and is solving, real-life data management problems across verticals. This is evidenced by the various implementations running in production today; Yahoo is probably the best example.

The MongoDBs and Couchbases of the world seem to be solutions that are not so much general purpose as aligned with specific verticals like advertising, telcos or healthcare. Given the unique requirements of these domains, the vendors have built value-added layers on top of the basic frameworks in the form of search and aggregation algorithms, so for now they can probably be viewed as purpose-built Big Data platforms aligned with specific verticals.

At the other end you have Teradata/Aster Data, EMC/Greenplum and Oracle/Exadata. These can be viewed as vertically integrated solutions that have everything you need in a box, kind of like a Big Data Happy Meal. For all intents and purposes it is a black box: you get a refrigerator-sized Big Data appliance with everything you need, with the software, the processors and the storage all packaged into one, based on proprietary standards. These would probably be useful for folks who have been using the traditional offerings from these companies and now want to upgrade their existing infrastructures to support Big Data paradigms.

muSOAing for 7/2/11 – Hadoop, what is in store?

July 2, 2011

We can now safely say that Hadoop and its ecosystem of offerings are mainstream. This is evidenced by several indicators. First is the hockey-stick-like growth in the adoption of Hadoop across all verticals and horizontals. The second indicator is the plethora of new features being incorporated into Hadoop, HBase, etc. The third is the formal announcement of Hortonworks at this week's Hadoop Summit, and I could go on in the same vein.

Suffice it to say that Hadoop is doing to information management what the Web did to information access a decade and a half ago. It is a game-changing, industry-defining technology. Having said that, you have to be prepared for the usual hype and me-toos, the discordant noise emitted by all the folks who want to ride this gravy train. Separating the wheat from the chaff will be the challenge, as with all such paradigm-shifting technologies. Expect the inevitable shakeouts and consolidations over the next few months. A lot of that consolidation is already happening with acquisitions of key players, and more will follow; the ones left standing will be the true contenders that the industry at large can safely deal with.

muSOAing for 6/17/11 – Hadoop Metrics

June 17, 2011

One big area of interest for me is Hadoop metrics. This is the stuff you can see in the default web page on port 50030 and that is also available to you through the JobClient API. As of this writing, this whole infrastructure is undergoing a sea change, as is the case with the APIs in general: all the APIs are changing for both Hadoop and HBase.

This is a good sign. It says that adoption of these tools is increasing, and as people use these products they are demanding more features. It seems at this point that the primary users of these platforms are from the DW camp. Their interest lies in using alternate platforms to perform the same DW and BI tasks that they have been doing with traditional relational DB-oriented infrastructures. Moving to a new platform like Hadoop also requires a different mindset, which means unlearning a few of the concepts associated with traditional platforms. It will also mean coming out of the comfort zone and no longer expecting Hadoop+HBase to behave exactly like your Oracle data warehouse. This is a bit like switching from Windows to Mac: things will only get better, you just have to get used to it.

I think I have digressed a bit; I started off with Hadoop metrics but then wandered off into some pontification.

muSOAing for 5/30/11 – Can I use HBase instead of…….

May 30, 2011

This is an oft-posed question these days. Folks looking to cut down on their DW licensing costs are looking for alternatives, and the Hadoop/HBase platform is definitely a candidate for many popular use cases. This platform is rapidly evolving and is already turning out to be a big challenge to all the traditional DW players. Talk about disruptive technology: this is as disruptive as it gets. Map/Reduce has changed the entire status quo. It has unleashed a veritable tech tsunami that threatens to wash away those who are unwilling to acknowledge and adapt to this rapidly changing landscape. All my blogs in the near future will focus on Big Data, as there is a lot happening in this space and hence a lot to mull and ponder.

muSOAing for 5/13/11 – Are you Hadooping yet?

May 13, 2011

It should be quite obvious by now that Big Data is catching on big time across all industry verticals. In my personal interactions there hasn't been one organization that is not looking at Big Data solutions for its data analytics and storage needs. Top of this list is of course Hadoop and its ecosystem of offerings, chiefly HBase, and also Hive, Pig, ZooKeeper, Mahout, etc. It seems that almost every day something new is happening in this space. Folks are talking about fault-tolerant Hadoop installations and next-generation map reduce, and a host of companies like NetApp, EMC, etc. are offering Hadoop-based solutions. These are very exciting times for Big Data and advanced analytics.

muSOAing for 4/17/11 – Write once Read Many?

April 17, 2011

One of the features of a Big Data setup is its write-once-read-many paradigm. A Big Data infrastructure like Hadoop is still a data warehousing infrastructure used for analyzing historical information. Your relational store will still be your repository for ongoing OLTP needs, with data being ETL'd into your Big Data infrastructure, written to the file system, and analyzed using map/reduce at the lowest level. Advocates encourage the use of higher-level tools like Pig and Hive to perform analytics. These tools execute map/reduce for you, but provide the familiar higher-level SQL-like interfaces for issuing your commands, which are translated into map/reduce directives under the covers.
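For instance, a Hive-style GROUP BY boils down to a map step that emits a key per row and a reduce step that aggregates per key; conceptually (a toy in-process sketch, not Hive's actual planner):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupBySketch {
    // Conceptually: SELECT dept, COUNT(*) FROM emp GROUP BY dept.
    // "Map" emits the dept key for each row; "reduce" counts per key.
    static Map<String, Long> groupCount(List<String> depts) {
        return depts.stream()
                .collect(Collectors.groupingBy(d -> d, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("eng", "sales", "eng");
        System.out.println(groupCount(rows).get("eng")); // prints 2
    }
}
```

In a real cluster the "emit key" and "count per key" halves run on different machines with a shuffle in between, but the user only ever writes the SQL-like statement.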

With the adoption of Hadoop increasing by the day across all verticals, the need in this area is only going to grow. It also has something for everybody, from the technology nerd who can get started on the cheap to the CIO who can now have a multi-node Big Data infrastructure up and running in no time, churning out useful and timely business analytics.

muSOAing for 3/24/11 – The Vertically Integrated Architect II

March 24, 2011

So what are the other qualities? Being able to provide thought leadership, and the willingness to evangelize and put forth your thoughts and ideas in various fora such as personal blogs, articles, whitepapers and even downloadable code artifacts. All of these make for a well-rounded Architect, a.k.a. a vertically integrated one.

An Architect should be able to see the big picture, the so-called 1,000-foot view, as well as roll up his or her sleeves and build reusable artifacts. If you do not code, then you are not the master of your own domain and you cannot be a true Architect.

Being able to take a problem, however nebulous, and turn it into a solution: that is the ultimate challenge. It means bringing true value, solving real-world problems, and at the same time addressing all the key business drivers such as ROI and value proposition. In a sense, I would not be wrong if I said that the ends here truly drive the means: the end goal is a well-architected system, and the means is to bring to bear your full set of Architectural firepower.