Accessing the Internet from ARC-TS compute nodes

By |

Normally, compute nodes on ARC-TS clusters cannot directly access the Internet because they have private IP addresses. This increases cluster security while reducing the costs (IPv4 addresses are limited, and ARC-TS clusters do not currently support IPv6). However, this also means that jobs cannot install software, download files, or access databases on servers located outside of University of Michigan networks: the private IP addresses used by the cluster are routable on-campus but not off-campus.

If your work requires these tasks, there are three ways to allow jobs running on ARC-TS clusters to access the Internet, described below. The best method to use depends to a large extent on the software you are using. If your software supports HTTP proxying, that is the best method. If not, SOCKS proxying or SSH tunneling may be suitable.

HTTP proxying

HTTP proxying, sometimes called “HTTP forward proxying”  is the simplest and most robust way to access the Internet from ARC-TS clusters. However, there are two main limitations:

  • Some software packages do not support HTTP proxying.
  • HTTP proxying only supports HTTP, HTTPS and FTP protocols.

If either of these conditions apply (for example, if your software needs a database protocol such as MySQL), users should explore SOCKS proxying or SSH tunneling, described below.

Some popular software packages that support HTTP proxying include:

HTTP proxying is automatically set up when you log in to ARC-TS clusters and it should be used by any software which supports HTTP proxying without any special action on your part.

Here is an example that shows installing the Python package opencv-python from within an interactive job running on a Great Lakes compute node:

 


[user@gl-login ~]$ module load python3.7-anaconda/2019.07
[user@gl-login ~]$ srun --pty --account=test /bin/bash
[user@gl3288 ~]$ pip install --user opencv-python
Collecting opencv-python
Downloading https://files.pythonhosted.org/packages/34/a3/403dbaef909fee9f9f6a8eaff51d44085a14e5bb1a1ff7257117d744986a/opencv_python-4.2.0.32-cp37-cp37m-manylinux1_x86_64.whl (28.2MB)
|████████████████████████████████| 28.2MB 3.2MB/s
Requirement already satisfied: numpy>=1.14.5 in /sw/arcts/centos7/python3.7-anaconda/2019.07/lib/python3.7/site-packages (from opencv-python) (1.16.4)
Installing collected packages: opencv-python
Successfully installed opencv-python-4.2.0.32

If HTTP proxying were not supported by pip (or was otherwise not working), you’d be unable to access the Internet to install the opencv-python package and receive “Connection timed out”, “No route to host”, or “Connection failed” error messages when you tried to install it.

Information for advanced users

HTTP proxying is controlled by the following environment variables which are automatically set on each compute node:

export http_proxy="http://proxy1.arc-ts.umich.edu:3128/"
export https_proxy="http://proxy1.arc-ts.umich.edu:3128/"
export ftp_proxy="http://proxy1.arc-ts.umich.edu:3128/"
export no_proxy="localhost,127.0.0.1,.localdomain,.umich.edu"
export HTTP_PROXY="${http_proxy}"
export HTTPS_PROXY="${https_proxy}"
export FTP_PROXY="${ftp_proxy}"
export NO_PROXY="${no_proxy}"

Once these are set in your environment, you can access the Internet from compute nodes — for example, you can install Python and R libraries from compute nodes. There’s no need to start any daemons as is needed with the first two solutions above. The HTTP proxy server proxy.arc-ts.umich.edu does support HTTPS but does not terminate the TLS session at the proxy; traffic is encrypted by the software the user runs and the traffic is not decrypted until it reaches the destination server on the Internet.

To prevent software from using HTTP proxying, run the following command:

unset http_proxy https_proxy ftp_proxy no_proxy HTTP_PROXY HTTPS_PROXY FTP_PROXY NO_PROXY

The above command will only affect software started from the current shell.  If you start a new shell (for example, if you open a new window or log in again) you’ll need to re-run the command above each time.  To permanently disable HTTP proxying for all software, add the command above to the end of your ~/.bashrc file.

Finally, note that HTTP proxying (which is forward proxying) should not be confused with reverse proxying.  Reverse proxying, which is done by the ARC Connect service, allows researchers to start web applications (including Jupyter notebooks, RStudio sessions, and Bokeh apps) on compute nodes and then access those web applications through the ARC Connect.

SOCKS

A second solution is available for any software that either supports the SOCKS protocol or that can be “made to work” with SOCKS. Most software does not support SOCKS, but here is an example using curl (which does have built-in support for SOCKS) to download a file from the Internet from inside an interactive job running on a Great Lakes compute node. We use “ssh -D” to set up a “quick and dirty” SOCKS proxy server for curl to use:

[user@gl-login ~]$ module load python3.7-anaconda/2019.07
[user@gl-login ~]$ srun --pty --account=test /bin/bash
[user@gl3288 ~]$ ssh -f -N -D 1080 greatlakes-xfer.arc-ts.umich.edu
[user@gl3288 ~]$ curl --socks localhost -O ftp://ftp.gnu.org/pub/gnu/bc/bc-1.06.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 272k 100 272k 0 0 375k 0 --:--:-- --:--:-- --:--:-- 375k
[user@gl3288 ~]$ ls -l bc-1.06.tar.gz
-rw-rw-r-- 1 user user 278926 Feb 3 16:09 bc-1.06.tar.gz

A limitation of “ssh -D” is that it only handles TCP traffic, not UDP traffic (including DNS lookups, which happen over UDP). However, if you have a real SOCKS proxy accessible to you elsewhere on the U-M network (such as on a server in your lab), you can specify its hostname instead of “localhost” above and omit the ssh command in order to have UDP traffic handled.

Local SSH tunneling (“ssh -L”)

A final option for accessing the Internet from an ARC-TS  compute node is to set up a local SSH tunnel using the “ssh -L” command. This provides a local port on the compute node that processes can connect to to access a single specific remote port on a single specific host on a non-UM network.

scp example

Here is an example that shows how to use a local tunnel to copy a file using scp from a remote system (residing on a non-UM network) named “far-away.example.com” onto an ARC-TS cluster from inside a job running on a compute node.

You should run the following commands inside an interactive Slurm job the first time so that you can respond to prompts to accept various keys, as well as enter your password for far-away.example.com when prompted.

# Start the tunnel so that port 2222 on the compute node connects to port 22 on far-away.example.com:
ssh -N -L 2222:far-away.example.com:22 greatlakes-xfer.arc-ts.umich.edu &
# Give the tunnel time to completely start up:
sleep 5
# Copy the file “my-data-set.csv” from far-away.example.com to the compute node:
# Replace “your-user-name” with the username by which far-away.example.com knows you.
# If you don’t have public key authentication set up from the cluster for far-away.example.com, you’ll
# be prompted for your far-away.example.com password
scp -P 2222 your-user-name@localhost:my-data-set.csv .
# When you are all done using it, tear down the tunnel:
kill %1

Once you have run these commands once, interactively, from a compute node, they can then be used in non-interactive Slurm batch jobs, if you’ve also set up public key authentication for far-away.example.com.

Data Science Platform (Hadoop)

By |

The ARC-TS Data Science Platform is an upgraded Hadoop cluster currently available as a technology preview with no associated charges to U-M researchers. The ARC-TS Hadoop cluster is an on-campus resource that provides a different service level than most cloud-based Hadoop offerings, including:

  • high-bandwidth data transfer to and from other campus data storage locations with no data transfer costs
  • very high-speed inter-node connections using 40Gb/s Ethernet

The cluster provides 112TB of total usable disk space, 40GbE inter-node networking, Hadoop version 2.3.0, and several additional data science tools.

Aside from Hadoop and its Distributed File System, the ARC-TS data science service includes:

  • Pig, a high-level language that enables substantial parallelization, allowing the analysis of very large data sets.
  • Hive, data warehouse software that facilitates querying and managing large datasets residing in distributed storage using a SQL-like language called HiveQL.
  • Sqoop, a tool for transferring data between SQL databases and the Hadoop Distributed File System.
  • Rmr, an extension of the R Statistical Language to support distributed processing of large datasets stored in the Hadoop Distributed File System.
  • Spark, a general processing engine compatible with Hadoop data
  • mrjob, allows MapReduce jobs in Python to run on Hadoop

The software versions are as follows:

Title Version
Hadoop 2.5.0
Hive 0.13.1
Sqoop 1.4.5
Pig 0.12.0
R/rhdfs/rmr 3.0.3
Spark 1.2.0
mrjob 0.4.3-dev, commit

226a741548cf125ecfb549b7c50d52cda932d045

If a cloud-based system is more suitable for your research, ARC-TS can support your use of Amazon cloud resources through MCloud, the UM-ITS cloud service.

For more information on the Hadoop cluster, please see this documentation or contact us at data-science-support@umich.edu.

A Flux account is required to access the Hadoop cluster. Visit the Establishing a Flux allocation page for more information.