muSOAing for 4/9/11 – What is all this buzz around Hive?

It need not be stressed anymore that Hadoop has taken the lead in Big Data infrastructures. Nearly everyone I speak to has a Hadoop cluster installation. While Hadoop by itself has been quite groundbreaking, the tools evolving in its ever-growing ecosystem are even more interesting. Of these, I want to focus today on Hive. For someone from the relational world, exposure to Hive is like a kid being let into a candy store.

Hadoop should still be viewed as a massive data warehouse on steroids that adheres to the write-once, read-many paradigm, with the data still stored in HDFS and the bulk of the analytics done by MapReduce jobs. Hive, on the other hand, acts as a layer over HDFS that provides two key features. One is its ability to map files in HDFS to tables, keeping the table definitions in its own relational metastore; the other is a SQL-like query language, HiveQL, for running analytics on that data.
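To make those two features a little more concrete, here is a minimal sketch in Python that only builds the two kinds of statements one would hand to Hive: a table definition pointing at files already sitting in HDFS, and a SQL-like query (HiveQL) over that table. The table name `web_logs`, its columns, and the path `/data/web_logs` are all made up for illustration, and the script just prints the statements rather than talking to a cluster.

```python
# Hypothetical names throughout: the table, columns, and HDFS path are
# illustrative assumptions, not taken from any real cluster.

def external_table_ddl(table, columns, hdfs_dir, delimiter="\\t"):
    """Build a HiveQL statement that maps an existing HDFS directory to a table.

    EXTERNAL means Hive records only the table definition in its metastore;
    the files under hdfs_dir stay where they are.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"STORED AS TEXTFILE "
        f"LOCATION '{hdfs_dir}'"
    )


def hits_per_page_query(table):
    """Build a SQL-like aggregate query over the mapped table."""
    return (
        f"SELECT page, COUNT(*) AS hits FROM {table} "
        f"GROUP BY page"
    )


ddl = external_table_ddl(
    "web_logs",
    [("ts", "STRING"), ("page", "STRING"), ("user_id", "BIGINT")],
    "/data/web_logs",
)
query = hits_per_page_query("web_logs")
print(ddl)
print(query)
```

For anyone from the relational world the second statement should look very familiar, which is really the whole appeal: the files stay in HDFS, and Hive turns the query into jobs that run over them.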

Another amazing feature is its ability to deal with multiple terabytes of information, possibly even a few petabytes. As if this were not enough, along comes HBase, a massively distributed, column-oriented data store layered over HDFS, sort of a Hive on steroids.

