The ARC-TS Data Science Platform is an upgraded Hadoop cluster currently available as a technology preview with no associated charges to U-M researchers. The ARC-TS Hadoop cluster is an on-campus resource that provides a different service level than most cloud-based Hadoop offerings, including:
- high-bandwidth data transfer to and from other campus data storage locations with no data transfer costs
- very high-speed inter-node connections using 40Gb/s Ethernet
The cluster provides 112TB of total usable disk space, 40GbE inter-node networking, Hadoop, and several additional data science tools; the installed software versions are listed in the table below.
Aside from Hadoop and its Distributed File System, the ARC-TS data science service includes:
- Pig, a high-level language that enables substantial parallelization, allowing the analysis of very large data sets.
- Hive, data warehouse software that facilitates querying and managing large datasets residing in distributed storage using a SQL-like language called HiveQL.
- Sqoop, a tool for transferring data between SQL databases and the Hadoop Distributed File System.
- rmr, an extension of the R statistical language to support distributed processing of large datasets stored in the Hadoop Distributed File System.
- Spark, a general processing engine compatible with Hadoop data.
- mrjob, a Python library that allows MapReduce jobs written in Python to run on Hadoop.
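For readers new to the programming model behind tools such as mrjob and rmr, the following is a minimal sketch of a MapReduce word count in plain Python. It is an illustration only: it runs in memory on a single machine, does not use mrjob, Hadoop, or the cluster, and the function names (`mapper`, `shuffle`, `reducer`, `word_count`) are hypothetical names chosen for this example. On a real cluster the framework distributes the map and reduce steps across nodes and performs the shuffle itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group emitted values by key (done by the framework on Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: combine all values for one key into a single total."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read from HDFS instead.
    sample = ["big data on hadoop", "big clusters process big data"]
    print(word_count(sample))
```

Keeping the mapper and reducer as small, independent functions mirrors how such jobs are usually structured, since each step must be able to run on a different node without shared state.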
The software versions are as follows:
Software | Version |
Hadoop | 2.5.0 |
Hive | 0.13.1 |
Sqoop | 1.4.5 |
Pig | 0.12.0 |
R/rhdfs/rmr | 3.0.3 |
Spark | 1.2.0 |
mrjob | 0.4.3-dev, commit 226a741548cf125ecfb549b7c50d52cda932d045 |
If a cloud-based system is more suitable for your research, ARC-TS can support your use of Amazon cloud resources through MCloud, the UM-ITS cloud service.
For more information on the Hadoop cluster, please see this documentation or contact us at data-science-support@umich.edu.
A Flux account is required to access the Hadoop cluster. Visit the Establishing a Flux allocation page for more information.