The short tutorial below demonstrates Hive. It uses the Google NGrams dataset, which is available in HDFS at /var/ngrams.
    # Open the interactive hive console (run this from your shell)
    hive --hiveconf mapreduce.job.queuename=<your_queue>

    -- Create a table with the Google NGrams data in /var/ngrams
    CREATE EXTERNAL TABLE ngrams(ngram STRING, year INT, count BIGINT, volumes BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/var/ngrams';

    -- Look at the schema of the table
    DESCRIBE ngrams;

    -- Count the total number of rows (should be 1201784959)
    SELECT COUNT(*) FROM ngrams;

    -- Select the number of words, by year, that have only appeared in a single volume
    SELECT year, COUNT(ngram) FROM ngrams WHERE volumes = 1 GROUP BY year;

    -- Optional: delete your ngrams table
    DROP TABLE ngrams;

    -- Exit the Hive console
    QUIT;
The last few lines of output should look something like this:
More information can be found on the Apache Hive website.
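To make the single-volume query in the tutorial more concrete, here is a minimal Python sketch of the same filter-and-group logic, run on a handful of made-up rows. The sample data and the function name `single_volume_counts` are assumptions for illustration only; they are not part of the dataset or of Hive, and the sketch ignores details such as NULL handling.

```python
from collections import Counter

# Hypothetical sample rows in the same tab-delimited layout as the ngrams table:
# ngram, year, count, volumes. These values are invented for illustration.
SAMPLE = (
    "alpha\t1990\t12\t1\n"
    "beta\t1990\t7\t3\n"
    "gamma\t1990\t2\t1\n"
    "alpha\t1991\t15\t2\n"
    "delta\t1991\t1\t1\n"
)


def single_volume_counts(text):
    """Roughly mirror:
    SELECT year, COUNT(ngram) FROM ngrams WHERE volumes = 1 GROUP BY year;
    """
    counts = Counter()
    for line in text.splitlines():
        ngram, year, _count, volumes = line.split("\t")
        # WHERE volumes = 1
        if int(volumes) == 1:
            # GROUP BY year, counting one row per matching ngram
            counts[int(year)] += 1
    return dict(counts)


result = single_volume_counts(SAMPLE)
for year in sorted(result):
    print(year, result[year])
```

On the real table Hive performs this grouping as a distributed job over the files in /var/ngrams, so the sketch is only useful for checking your understanding of the query on a small sample.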