Hive Introduction

Introduction

Hive was created because writing MapReduce programs was too hard for many users.
A hand-written MapReduce program for Word Count takes 63 lines of code; the same task in Hive takes only 7.
Hive brings familiar database concepts to Hadoop’s unstructured world.

So what is Hive?

  1. Data warehouse system built on top of Hadoop.
  2. Facilitates easy data summarization, ad-hoc queries, and the analysis of very large datasets stored in Hadoop.
  3. Provides a SQL-like interface, known as HiveQL or HQL, that allows easy querying of data in Hadoop. HQL has its own data definition and data manipulation languages, similar to the DDL and DML many of us already have experience with.
  4. HQL queries are implicitly translated into one or more MapReduce jobs, shielding the user from much more advanced and time-consuming programming.
  5. Hive provides a mechanism to project structure (like tables and partitions) onto the data in Hadoop and uses a metastore to map file structure to tabular form.
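
To make point 4 concrete, here is a minimal sketch in plain Python of the map, shuffle, and reduce steps that a word-count query would be broken into. It is only an illustration, not Hive's actual execution plan: the sample rows and the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for this example.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for one row of a table
# with a single text column.
lines = ["hive runs on hadoop", "hadoop stores data", "hive queries data"]


def map_phase(rows):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for row in rows:
        for word in row.split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts for each word, as a reducer would."""
    return {word: sum(counts) for word, counts in groups.items()}


word_counts = reduce_phase(shuffle(map_phase(lines)))
for word in sorted(word_counts):
    print(word, word_counts[word])
```

In Hive the user writes only the query (roughly a `GROUP BY` over the words with a `count`); the three phases above are what the engine takes care of.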

And what is Hive not?

  1. Hive is not a full database.
  2. Hive is not a real-time processing system; it is best suited for batch jobs and huge datasets. Think heavy analytics and large aggregations.
  3. Latencies are often much higher than in a traditional database system. Hive is schema-on-read, which provides fast loads and flexibility at the expense of query time.
  4. Lacks full SQL support. Row-level inserts, updates, and deletes were not available in early releases; they arrived with ACID tables in Hive 0.14.
  5. Transactions were likewise unsupported before Hive 0.14, and subquery support is limited.
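
The schema-on-read trade-off in point 3 can be illustrated with a small Python sketch. The data is stored as raw text and a schema is applied only when rows are read, so loading is cheap and parsing is paid at query time. Everything here (the sample lines, the `schema` list, the `read_rows` helper) is hypothetical and only mimics the idea; turning unparseable values into `None` is meant to mirror how Hive returns NULL for fields that do not match the declared type.

```python
# Raw tab-delimited lines, standing in for a file already sitting in Hadoop.
raw_lines = ["1\talice\t34", "2\tbob\tnot_a_number", "3\tcarol\t29"]

# Hypothetical table definition: column names and the cast applied on read.
schema = [("id", int), ("name", str), ("age", int)]


def read_rows(lines, schema):
    """Apply the schema while reading; values that do not fit become None."""
    for line in lines:
        fields = line.split("\t")
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        yield row


# "Loading" was just storing the text; parsing happens here, at query time.
rows = list(read_rows(raw_lines, schema))
adults = [r["name"] for r in rows if r["age"] is not None and r["age"] >= 30]
print(adults)
```

Nothing is validated when the file is written, which is why loads are fast and why a bad value surfaces only when a query touches it.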

Applications and organizations using Hive include (alphabetically):

CNET: We use Hive for data mining, internal log analysis and ad hoc queries.
Digg: We use Hive for data mining, internal log analysis, R&D, and reporting/analytics.
Grooveshark: We use Hive for user analytics, dataset cleaning, and machine learning R&D.
Papertrail: We use Hive as a customer-facing analysis destination for our hosted syslog and app log management service.
Scribd: We use Hive for machine learning, data mining, ad-hoc querying, and both internal and user-facing analytics.
VideoEgg: We use Hive as the core database for our data warehouse where we track and analyze all the usage data of the ads across our network.

References

https://cwiki.apache.org/confluence/display/Hive/PoweredBy
http://bigdatauniversity.com/