muSOAing for 10/14/11 – The Big Data Universe

October 14, 2011

It seems that with each passing day, the big data universe keeps growing and one sees new offerings in this space. With all this noise and clutter, how is one to formulate a Big Data strategy? A few patterns of Big Data usage are emerging. The basic tenet is still the same: MPP (massively parallel processing) on commodity servers. The code is co-located with the data, each node processes its own share, and the results are sent back to a master node.
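To make that tenet concrete, here is a minimal sketch in plain Python. It is only a toy: the node names, the sample lines and the two helper functions are my own assumptions for illustration, and nothing here is a Hadoop API. Each "node" counts words in the data it already holds, and a "master" merges the partial results.

```python
from collections import Counter
from functools import reduce

# Hypothetical data, already split across three "worker nodes".
node_data = {
    "node-1": ["big data", "hadoop hive"],
    "node-2": ["hadoop hbase", "big hadoop"],
    "node-3": ["hive pig"],
}

def local_count(lines):
    """Runs where the data lives: count words in one node's lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def master_merge(partials):
    """Runs on the master: combine the partial counts from every node."""
    return reduce(lambda a, b: a + b, partials, Counter())

# Each node works only on its own share; only the small partial results travel.
partials = [local_count(lines) for lines in node_data.values()]
totals = master_merge(partials)
```

The point of the design is that the bulky raw data never moves; only the compact per-node summaries are shipped to the master.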

This being the paradigm, there is a plethora of offerings now in the Big Data space. You of course have the Hadoop universe with HBase, Hive, Pig etc., and other open source platforms like Cassandra. Then you have the commercial ones like MongoDB, Couchbase, AllegroGraph, MapR, LexisNexis… You then have a third category of vendors, erstwhile purveyors of traditional data management technologies like Teradata, EMC and Oracle, now trying to re-invent themselves as Big Data leaders with offerings like Aster Data, Greenplum and Exadata.

So how are these folks getting mindshare? Here is my take on this. Hadoop is still the bread and butter of Big Data computing. The barrier to entry is low: anyone can download and play around with the various distributions from Apache and Cloudera. Its ecosystem of products like HBase and Hive can solve, and is solving, real-life data management problems across verticals. This is evidenced by the various implementations that are running in production today, and Yahoo is probably the best example.

The MongoDBs and Couchbases of the world seem to be solutions that are not so much general purpose as aligned with specific verticals like Advertising, Telcos or Healthcare. Given the unique requirements of these domains, the vendors have built value-added layers on top of the basic frameworks in the form of search and aggregation algorithms. For now they can probably be viewed as purpose-built Big Data platforms.

At the other end you have Teradata/Aster Data, EMC/Greenplum and Oracle/Exadata. These can be viewed as vertically integrated solutions that have everything you need in a box, kind of like a Big Data Happy Meal. For all intents and purposes it is a black box. You get a refrigerator-sized Big Data appliance with the software, the processors and the storage all packaged into one, based on proprietary standards. These would probably be useful for folks who have been using the traditional offerings from these companies and now want to upgrade their existing infrastructures to support Big Data paradigms.

muSOAing for 10/7/11 – Hive Partitions

October 8, 2011

This is a very interesting concept in Hive. Hive lets you segment and categorize data based on system and business attributes like timestamp, date or fields like customerid, orderid etc. One potential use of this feature is to facilitate updates of information in Hive. For example, you could move the data in an existing partition into an ODS (operational data store), change the information there, and then rewrite the partition in Hive with the changed information.
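The copy-out, change and rewrite pattern can be sketched in a few lines of plain Python. This is a toy model under my own assumptions, not Hive code: the "table" is just a dict keyed by a date partition, and the function and sample rows are hypothetical.

```python
# Toy model: a "table" maps a partition key (here a date) to its rows.
table = {
    "2011-10-06": [{"orderid": 1, "status": "open"}],
    "2011-10-07": [{"orderid": 2, "status": "open"},
                   {"orderid": 3, "status": "open"}],
}

def rewrite_partition(table, partition_key, change):
    """Copy one partition out, apply changes, then replace it wholesale."""
    staged = [dict(row) for row in table[partition_key]]  # copy into the "ODS"
    updated = [change(row) for row in staged]             # edit outside the store
    new_table = dict(table)
    new_table[partition_key] = updated                    # overwrite that partition only
    return new_table

def close_order_3(row):
    """Hypothetical business change: mark order 3 as closed."""
    if row["orderid"] == 3:
        row = {**row, "status": "closed"}
    return row

table_after = rewrite_partition(table, "2011-10-07", close_order_3)
```

Only the one partition is rewritten; every other partition is left exactly as it was, which is what makes this workable on an append-only store.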

The feature of dynamic partitions is also very useful and powerful. Some of the features in the upcoming version of Hive (0.8.0) will be very useful, especially the table append feature with the INSERT INTO syntax. Hive has already proved to be a very robust client for a BigData warehouse. One thing lacking, though, is effective front-end tools like Business Objects or Cognos. A lot of the implementations for GUI-driven analytics are currently home grown, so there is a lot of opportunity.

muSOAing for 9/15/11 – Hive is buzzing

September 15, 2011

So there has always been a lot of buzz around Hive (no pun intended). Traditional DB users have always loved Hive, what with its SQL-like language to perform DDL and DML operations. Now when it comes to features like updates and indexing, that is where things start to get a bit rough. Hive is certainly not a contender for low-latency queries, since it runs map/reduce for pretty much every type of query. Updates are another big issue, it being an append-only store. Innovative solutions like partitioning data properly seem to be the way to go, but there are a lot of other strategies being adopted to deal with this issue, including hooking it up with HBase. Watch this space for more updates (again no pun intended).

muSOAing for 8/7/11 – The World of HBase

August 8, 2011

It seems that with each passing day there is something new happening in the HBase ecosystem. I have already mentioned the availability of co-processors from v0.91 onwards, but let us look at the plethora of options available today.

For someone from the relational world, HBase may come across as a very loosely typed database system. Even I was led to believe the same, and who would not be, if all it takes to create a table is to specify its name and a column family? You need not even specify the table key or its columns; that can be done on the fly. But beware: this simplicity is very deceptive, akin to an iron hand in a velvet glove, and it does not mean that you can let your guard down. HBase was designed to handle mountains of information, and the way it does that is through its very rich Java API, which gives you myriad options to both ETL and query information in and out of its highly distributed and scalable architecture.

Mull this over (with a glass of your favorite libation): just for ETL you have a variety of options, starting with the most atomic PUT and moving on to batch PUTs, HFileOutputFormat, bulk uploads and an endless combination of these, including map/reduce. There is an equally diverse buffet of options for querying, such as atomic GETs, batch GETs, range scans, co-processors etc.
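To give a feel for how these operations relate to each other, here is a small in-memory stand-in written in plain Python. To be clear, this is not the HBase API: the `ToyTable` class, its method names and the sample keys are all my own assumptions. It only mimics the idea of rows kept sorted by key, which is what makes a range scan cheap. The scan takes a start key that is included and a stop key that is excluded, which is my understanding of how HBase scans behave.

```python
import bisect

class ToyTable:
    """In-memory stand-in for a sorted key-value table (not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, key, columns):
        """Atomic PUT: create the row if needed, then merge in the columns."""
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def put_batch(self, items):
        """Batch PUT: apply many (key, columns) pairs in one call."""
        for key, columns in items:
            self.put(key, columns)

    def get(self, key):
        """Atomic GET: return the row's columns, or None if absent."""
        return self._rows.get(key)

    def get_batch(self, keys):
        """Batch GET: one result per requested key, in the same order."""
        return [self.get(k) for k in keys]

    def scan(self, start, stop):
        """Range scan: yield (key, columns) for start <= key < stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

t = ToyTable()
t.put("cust-002", {"cf:name": "Bob"})
t.put_batch([("cust-001", {"cf:name": "Ann"}),
             ("cust-003", {"cf:name": "Cy"})])
scanned = [key for key, _ in t.scan("cust-001", "cust-003")]
```

Note that no columns were declared up front; each row simply carries whatever columns were put into it, which mirrors the "loosely typed" feel described above.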

Watch this space for more detailed information including examples of these features.

muSOAing for 7/19/11 – more on HBase

July 19, 2011

It seems that with each new version the HBase feature and function list keeps growing. One may complain that HBase is too loosely typed: you do not specify any key explicitly. For that matter you don’t even spell out the column names explicitly. All it asks of you initially is a table name and a column family. Maybe there was a reason to design it this way, so you could hide data from others and operate in stealth mode?

When it comes to querying and mining information, even though HBase does not support SQL-like operations, you have a plethora of API options to perform a wide range of operations such as atomic GETs, batch GETs, scans and map/reduce, and now with co-processors you can do complex aggregations. Even for the data loading part, it gives you a lot of rich API options. For the most part, it seems that HBase can support all your data management needs. Even indexing can be achieved in a lot of innovative ways.

Watch this space for some detailed technical information on each of these features.

muSOAing for 7/2/11 – Hadoop, what is in store?

July 2, 2011

We can now safely say that Hadoop and its ecosystem of offerings are now mainstream. This is evidenced by several indicators. The first is the hockey-stick-like growth in the adoption of Hadoop across all verticals and horizontals. The second is the plethora of new features that are being incorporated into Hadoop, HBase etc. The third is the formal announcement of Hortonworks at this week’s Hadoop Summit, and I could go on and on in the same vein.

Suffice it to say that Hadoop is doing to Information Management what the Web did to Information Access a decade and a half ago. It is a game-changing, industry-defining technology. Having said that, you have to be prepared for the usual hype and me-toos, the discordant noise emitted by all the folks who want to ride on this gravy train. To be able to cut through this chaff and get to the grain will be the challenge, as with all such paradigm-shifting technologies. Expect the inevitable shakeouts and consolidations that will follow in the next few months. A lot of this consolidation is already happening with acquisitions of key players, and more will follow. Finally, the ones that are left will be the true contenders that the industry at large can safely deal with.

muSOAing for 6/17/11 – Hadoop Metrics

June 17, 2011

One big area of interest for me is the metrics for Hadoop. This is the stuff you can see in the default web page through port 50030, and it is also available to you through the JobClient API. As of this writing, this whole infrastructure is undergoing a sea change, as is the case with the APIs in general. All the APIs are changing for both Hadoop and HBase.

This is a good sign. It says that the adoption of these tools is increasing and that, as people use these products, they are demanding more features. It seems at this point that the primary users of these platforms are from the DW camp. Their interests lie in using alternate platforms to perform the same DW and BI tasks that they have been doing with traditional relational DB oriented infrastructures. Moving to a new platform like Hadoop also requires a different mindset, which means unlearning a few of the concepts associated with traditional platforms. It will also mean coming out of the comfort zone and not expecting Hadoop+HBase to behave exactly like your Oracle data warehouse. This is a bit like switching from Windows to Mac. Things will only get better; it is just that you have to get used to it.

I think I have digressed a bit. I started off with Hadoop metrics but then wandered off and did some pontificating.

muSOAing for 5/30/11 – Can I use HBase instead of…….

May 30, 2011

This is an oft-posed question these days, as folks looking to cut down on their DW licensing costs look for alternatives. The Hadoop/HBase platform is definitely a candidate for many popular use cases. This platform is rapidly evolving and is already turning out to be a big challenge to all the traditional DW players. Talk about disruptive technology, this is as disruptive as it can get. Map/Reduce has changed the entire status quo. It has unleashed a veritable Tech Tsunami that is threatening to wash away those who are unwilling to acknowledge and adapt to this rapidly changing landscape. All my blogs in the near future will focus on BigData, as there is a lot happening in this space and hence a lot to mull over and ponder.

muSOAing for 5/13/11 – Are you Hadooping yet?

May 13, 2011

It should be quite obvious by now that Big Data is catching on big time across all industry verticals. In my personal interactions there hasn’t been one organization that is not looking at Big Data solutions to deal with their data analytics and storage needs. At the top of this list is of course Hadoop and its ecosystem of offerings, chiefly HBase and also Hive, Pig, Zookeeper, Mahout etc. It seems that almost every day something new is happening in this space. Folks are talking about fault-tolerant Hadoop installations and next-generation map/reduce, and a host of companies like NetApp, EMC etc. are offering Hadoop-based solutions. These are very exciting times for Big Data and Advanced Analytics.

muSOAing for 4/27/11 – Relax and be RESTful

April 27, 2011

Having always dealt with WSDL based SOAP services, when the RESTful mantra started to be bandied around a few years ago, I was really curious. I was of the firm belief that there should always be a firm contract between the caller and provider of the web service, even if the caller or client is internal to the organization. Having lived in that comfort zone, I was one for dismissing REST services as trivial and not to be taken too seriously. This idea of overloading the URL to send the metadata and then sending just the payload and processing it at the backend did not appeal to me a lot.

However, with increased adoption and annotation support from frameworks like Jersey, I have started taking a serious look at REST. The ease with which one can churn out a service warrants a serious second look at this paradigm. I am of the firm opinion now that unless services need to be published for external (B2B) or enterprise-wide consumption through a service registry, there is really no need to adopt WSDL-based SOAP services. Where a contract is not of prime importance and you can do away with the overhead, REST services will suffice.
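The idea of letting the URL carry the metadata while the body carries only the payload can be sketched without any framework at all. What follows is a toy dispatcher in plain Python, not Jersey and not a real HTTP server; the `orders` store, the `/orders/<id>` path layout and the `handle` function are hypothetical names of my own.

```python
import json

# Hypothetical in-memory resource store, for illustration only.
orders = {"42": {"item": "widget", "qty": 3}}

def handle(method, path, body=None):
    """Dispatch on HTTP verb and URL path; return (status, JSON text)."""
    parts = [p for p in path.split("/") if p]
    # The URL itself says which resource and which instance is meant.
    if len(parts) != 2 or parts[0] != "orders":
        return 404, json.dumps({"error": "not found"})
    order_id = parts[1]
    if method == "GET":
        if order_id not in orders:
            return 404, json.dumps({"error": "not found"})
        return 200, json.dumps(orders[order_id])
    if method == "PUT":
        # The body is just the payload; no envelope or contract wraps it.
        orders[order_id] = json.loads(body)
        return 200, json.dumps(orders[order_id])
    return 405, json.dumps({"error": "method not allowed"})

status, text = handle("GET", "/orders/42")
```

There is no WSDL and no generated stub here: the verb plus the path is the whole contract, which is exactly the trade-off discussed above.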

