Using Hive

The short tutorial below demonstrates Hive. It uses the Google NGrams dataset, which is available in HDFS at /var/ngrams.
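Before opening Hive, it can help to confirm that the dataset is visible from your account. The following is a minimal sketch, assuming the standard hdfs command-line client is on your PATH; the variable name NGRAMS_DIR is only illustrative and not part of the tutorial.

```shell
#!/bin/sh
# Sketch: check that the NGrams data is readable before starting the tutorial.
# NGRAMS_DIR is an illustrative variable name; the path itself comes from the tutorial.
NGRAMS_DIR="/var/ngrams"

if command -v hdfs >/dev/null 2>&1; then
    # List the files that will back the external table.
    hdfs dfs -ls "$NGRAMS_DIR"
    # Show the total size of the directory in human-readable units.
    hdfs dfs -du -s -h "$NGRAMS_DIR"
else
    echo "hdfs client not found on PATH" >&2
fi
```

If the listing fails with a permissions or "No such file" error, the CREATE EXTERNAL TABLE step in the tutorial is unlikely to return any rows.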

# Open the interactive hive console
hive --hiveconf mapreduce.job.queuename=<your_queue>

# Create a table with the Google NGrams data in /var/ngrams
CREATE EXTERNAL TABLE ngrams(ngram STRING, year INT, count BIGINT,
     volumes BIGINT)
     ROW FORMAT DELIMITED
     FIELDS TERMINATED BY '\t'
     STORED AS TEXTFILE
     LOCATION '/var/ngrams';

# Look at the schema of the table
DESCRIBE ngrams;

# Count the total number of rows (should be 1201784959)
SELECT COUNT(*) FROM ngrams;

# Select the number of words, by year, that have only appeared in a single volume
SELECT year, COUNT(ngram) FROM ngrams WHERE volumes = 1
GROUP BY year;

# Optional: delete your ngrams table (the table is EXTERNAL, so the data in /var/ngrams is not removed)
DROP TABLE ngrams;

# Exit the Hive console
QUIT;

The last few lines of output should look something like this:

[Screenshot: hive output]
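The same query can also be run without the interactive console, which is convenient when the results should be saved to a file. The following is a sketch, assuming the ngrams table still exists (that is, the optional DROP step was skipped). The QUEUE variable and the output file name single_volume_by_year.txt are placeholders, not part of the original tutorial.

```shell
#!/bin/sh
# Sketch: run the single-volume query non-interactively with "hive -e".
# QUEUE is a placeholder; set it to your own queue name before running.
QUEUE="${QUEUE:-your_queue}"

# Same query as in the tutorial; assumes the ngrams table has been created.
QUERY="SELECT year, COUNT(ngram) FROM ngrams WHERE volumes = 1 GROUP BY year;"

if command -v hive >/dev/null 2>&1; then
    # -e executes the quoted statement and exits; query results go to stdout,
    # while job progress messages go to stderr and stay on the terminal.
    hive --hiveconf mapreduce.job.queuename="$QUEUE" -e "$QUERY" > single_volume_by_year.txt
else
    echo "hive not found on PATH" >&2
fi
```

Redirecting only stdout keeps the saved file free of job progress messages, so it holds just the per-year counts.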

More information can be found on the Apache Hive website.