It is also possible to write a job in any programming language, such as Python or C, that operates on tab-separated key-value pairs. The same example done above with Hive and Pig can also be written in Python and submitted as a Hadoop job using Hadoop Streaming. Submitting a job with Hadoop Streaming requires writing a mapper and a reducer. The mapper reads input line by line and generates key-value pairs for the reducer to “reduce” into some sort of sensible data. For our case, the mapper will read in each line and, if the ngram on that line appears in only a single volume, output the year as the key and a ‘1’ as the value. The Python code to do this is:
(Save this file as map.py)
#!/usr/bin/env python2.7
import fileinput

# Read tab-separated input lines; if the ngram appeared in only a
# single volume, emit the year as the key and '1' as the value.
for line in fileinput.input():
    arr = line.split("\t")
    try:
        if int(arr[3]) == 1:
            print("\t".join([arr[1], '1']))
    except IndexError:
        pass
    except ValueError:
        pass
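As a quick sanity check, the mapper can be run locally on a fabricated input line. This assumes the data uses a tab-separated layout of ngram, year, match count, and volume count (which is what the column indexing in map.py suggests); the values in the example line are made up:

printf 'hadoop\t2007\t15\t1\n' | python2.7 map.py

This should print the year 2007 and a 1, separated by a tab.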
Now that the mapper has done this, the reducer merely needs to sum the values for each key:
(Save this file as red.py)
#!/usr/bin/env python2.7
import fileinput

# Sum the counts emitted by the mapper, keyed by year.
data = dict()
for line in fileinput.input():
    arr = line.split("\t")
    if arr[0] not in data:
        data[arr[0]] = int(arr[1])
    else:
        data[arr[0]] = data[arr[0]] + int(arr[1])

# Print one "year<TAB>count" line per key.
for key in data:
    print("\t".join([key, str(data[key])]))
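Because Hadoop Streaming simply pipes data through these scripts over standard input and output, the whole map-and-reduce flow can be tested locally before submitting anything to the cluster. Here sort stands in for the shuffle phase that groups keys between the map and reduce stages, and sample_ngrams.tsv is a hypothetical small extract of the data:

cat sample_ngrams.tsv | python2.7 map.py | sort | python2.7 red.py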
Submit this streaming job by running the command below:
yarn jar $HADOOP_STREAMING \
    -Dmapreduce.job.queuename=<your_queue> \
    -input /var/ngrams/data \
    -output ngrams-out \
    -mapper map.py \
    -reducer red.py \
    -file map.py \
    -file red.py \
    -numReduceTasks 10

Once the job finishes, you can view the last few lines of the output and then remove the output directory:

hdfs dfs -cat ngrams-out/* | tail -5

hdfs dfs -rm -r -skipTrash /user/<your_uniqname>/ngrams-out
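If the streaming job fails because the scripts cannot be run on the compute nodes, one common cause is that they were not marked executable before submission. Making them executable is a general Hadoop Streaming precaution rather than a step specific to this example:

chmod +x map.py red.py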