Pig is similar to Hive and can do the same thing. The Pig code to do this is a little bit longer due to its design. However, writing long Pig code is generally easier that writing multiple SQL queries that that chain together, since Pig’s language, PigLatin, allows for variables and other high-level constructs.
# Open the interactive pig console pig -Dtez.job.queuename=<your_queue> # Load the data ngrams = LOAD '/var/ngrams' USING PigStorage('\t') AS (ngram:chararray, year:int, count:long, volumes:long); # Look at the schema of the ngrams variable describe ngrams; # Count the total number of rows (should be 1430731493) ngrp = GROUP ngrams ALL; count = FOREACH ngrp GENERATE COUNT(ngrams); DUMP count; # Select the number of words, by year, that have only appeared in a single volume one_volume = FILTER ngrams BY volumes == 1; by_year = GROUP one_volume BY year; yearly_count = FOREACH by_year GENERATE group, COUNT(one_volume); DUMP yearly_count;
The last few lines of output should look like this:
More information on Pig can be found on the Apache website.