By | | No Comments

Fuse HDFS allows you use standard posix system commands with HDFS. This may be useful, for example, if you have a program that needs to use data that is stored in HDFS. 

To use Fuse HDFS, change directories to /hadoop-fuse/user/<your_uniqname>

Once in this directory, you can use commands on your HDFS files just as you would on any other files. For example, the ls command will list the contents of your HDFS home directory.

You could also run a Python or R program that uses a file in HDFS.

You can save the below file and run it as you would regularly run a python program to access an example data file we have available to all users in HDFS.

f = open("/hadoop-fuse/var/examples/romeojuliet.txt", "r")
data = f.read()
d = {}
for word in data.split(' '):
        if word in d:
                d[word] += 1
                d[word] = 1
for word, count in d.items():
        print word + str(count)

Using Hadoop and HDFS

By | | No Comments

Hadoop consists of two components; HDFS, a filesystem built for high read speeds, and YARN, a resource manager. HDFS is not a POSIX filesystem, so normal command line tools like “cp” and “mv” will not work. Most of the common tools have been reimplemented for HDFS and can be run using the “hdfs dfs” command. All data must be in HDFS for jobs to be able to read it.

Here are a few basic commands:

# List the contents of your HDFS home directory
hdfs dfs -ls

# Copy local file data.csv to your HDFS home directory
hdfs dfs -put data.csv data.csv

# Copy HDFS file data.csv back to your local home directory
hdfs dfs -get data.csv data2.csv

A complete reference of HDFS commands can be found on the Apache website.

Understanding MapReduce And Tez

By | | No Comments

Writing Hadoop MapReduce code in Java is the lowest level way to program against a Hadoop cluster. Hadoop’s libraries do not contain any abstractions, like Spark RDDs or a Hive or Pig-like higher level language. All code must implement the MapReduce paradigm.

This video provides a great introduction to MapReduce. This documentation provides a written explanation and an example.

However, some of our tools, mainly Hive and Pig, run on Tez rather than MapReduce. Some key points about the differences between Tez and Spark (with much credit to this post):

How Tez works (very similar to Spark):
Tez is a DAG (Directed acyclic graph) architecture.
1. Execute the plan but no need to read data from disk.
2. Once ready to do some calculations (similar to actions in spark), get the data from disk and perform all steps and produce output.
Notice the efficiency introduced by not going to disk multiple times. Intermediate results are stored in memory (not written to disks). On top of that there is vectorization (process batch of rows instead of one row at a time). All this adds to efficiencies in query time.
Only one read and one write.


How Mapreduce works:
1. Read data from file –>one disk access
2. Run mappers
3. Write map output –> second disk access
4. Run shuffle and sort –> read map output, third disk access
5. write shuffle and sort –> write sorted data for reducers –> fourth disk access
6. Run reducers which reads sorted data –> fifth disk output
7. Write reducers output –>sixth disk access