Software Library
Take a look at our full list of software.
Armis is currently offered as a pilot program.
To request an Armis account: Fill Out Form
Armis is operated in much the same way as Flux, except for access to the login nodes described below.
Limit/Default | Value |
---|---|
Max Walltime | 28 days |
Max CPUs per account | 180 cores |
Max memory per account | 960540 MB |
Default memory per CPU (if not specified) | 768 MB |
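If you do not want the 768 MB per-CPU default, you can request memory and walltime explicitly in your PBS submit script. A minimal sketch, with illustrative values only (the allocation name below is hypothetical; use your own, and check the current resource syntax for your queue):

#PBS -N example_job
#PBS -A youralloc_fluxm              # illustrative allocation name
#PBS -l nodes=1:ppn=4,pmem=2gb       # 4 cores with 2 GB of memory per core
#PBS -l walltime=24:00:00            # well under the 28-day maximum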
The login node is the front end to the cluster. It is accessible only from Ann Arbor, Dearborn, and Flint campus IP addresses and from the U-M VPN network, and it requires a valid user account and Duo authentication to log in. The login node is a shared resource and, as such, users are expected not to monopolize it.
The Armis login node is accessible via the following hostnames.
Appropriate uses for the login nodes:
Any other use of the login node may result in the termination of the offending process. Any production processes (including post-processing) should be submitted to the cluster through the batch system. If interactive use is required, you should submit an interactive job to the cluster.
Take a look at our full list of software.
Almost all software requires that you modify your environment in some way. Your environment consists of the running shell, typically bash on Flux, and the set of environment variables that are set. The environment variable most familiar to most people is PATH, which lists the directories in which the shell will search for a command, but there may be many others, depending on the particular software package.
Beginning in July 2016, Flux uses a program called Lmod to resolve the changes needed to accommodate having many versions of the same software installed. We use Lmod to help manage conflicts among the environment variables across the spectrum of software packages. Lmod can be used to modify your own default environment settings, and it is also useful if you install software for your own use.
Lmod provides the module command, an easy mechanism for changing the environment as needed to add or remove software packages from your environment. This should be done before submitting a job to the cluster and not from within a PBS submit script.
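For example, a typical sequence on the login node might look like this (the job script name is illustrative):

$ module load matlab        # set up the environment before submitting
$ qsub myjob.pbs            # then submit the job script to the batch system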
A module is a collection of environment variable settings that can be loaded or unloaded. When you first log into Flux, the system will look to see if you have defined a default module set, and if you have, it will restore that set of modules. See below for information about module sets and how to create them. To see which modules are currently loaded, you can use the command
$ module list

Currently Loaded Modules:
  1) intel/16.0.3   2) openmpi/1.10.2/intel/16.0.3   3) StdEnv
We try to make the names of the modules as close to the official name of the software as we can, so you can see what is available by using, for example,
$ module av matlab

------------------------ /sw/arcts/centos7/modulefiles -------------------------
   matlab/R2016a

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".
where av stands for avail (available). To make the software that was found available for use, you use
$ module load matlab
(You can also use add instead of load, if you prefer.) If you need to use software that is incompatible with Matlab, you would remove it using
$ module unload matlab
In the output from module av matlab
, module suggests a couple of alternate ways to search for software. When you use module av
, it will match the search string anywhere in the module name; for example,
$ module av gcc

------------------------ /sw/arcts/centos7/modulefiles -------------------------
   fftw/3.3.4/gcc/4.8.5                        hdf5-par/1.8.16/gcc/4.8.5
   fftw/3.3.4/gcc/4.9.3                   (D)  hdf5-par/1.8.16/gcc/4.9.3  (D)
   gcc/4.8.5                                   hdf5/1.8.16/gcc/4.8.5
   gcc/4.9.3                                   hdf5/1.8.16/gcc/4.9.3      (D)
   gcc/5.4.0                              (D)  openmpi/1.10.2/gcc/4.8.5
   gromacs/5.1.2/openmpi/1.10.2/gcc/4.9.3      openmpi/1.10.2/gcc/4.9.3
   gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0 (D)  openmpi/1.10.2/gcc/5.4.0   (D)

  Where:
   D:  Default Module
However, if you are looking for just gcc, that is more than you really want. So, you can use one of two commands. The first is
$ module spider gcc

----------------------------------------------------------------------------
  gcc:
----------------------------------------------------------------------------
    Description:
      GNU compiler suite

     Versions:
        gcc/4.8.5
        gcc/4.9.3
        gcc/5.4.0

     Other possible modules matches:
        fftw/3.3.4/gcc  gromacs/5.1.2/openmpi/1.10.2/gcc  hdf5-par/1.8.16/gcc  ...

----------------------------------------------------------------------------
  To find other possible module matches do:
      module -r spider '.*gcc.*'
----------------------------------------------------------------------------
  For detailed information about a specific "gcc" module (including how to
  load the modules) use the module's full name. For example:

     $ module spider gcc/5.4.0
----------------------------------------------------------------------------
That is probably more like what you are looking for if you really are searching just for gcc. That also gives suggestions for alternate searching, but let us return to the first set of suggestions and see what we get with keyword searching.

At the time of writing, if you were to use module av to look for Python, you would get this result.
[bennet@flux-build-centos7 modulefiles]$ module av python

------------------------ /sw/arcts/centos7/modulefiles -------------------------
   python-dev/3.5.1
However, some of the Python distributions that are installed do not have python as part of the module name, so in this case module spider will also not help. Instead, you can use
$ module keyword python

----------------------------------------------------------------------------
The following modules match your search criteria: "python"
----------------------------------------------------------------------------

  anaconda2: anaconda2/4.0.0
    Python 2 distribution.

  anaconda3: anaconda3/4.0.0
    Python 3 distribution.

  epd: epd/7.6-1
    Enthought Python Distribution

  python-dev: python-dev/3.5.1
    Python is a general purpose programming language

----------------------------------------------------------------------------
  To learn more about a package enter:

     $ module spider Foo

  where "Foo" is the name of a module

  To find detailed information about a particular package you
  must enter the version if there is more than one version:

     $ module spider Foo/11.1
----------------------------------------------------------------------------
That displays all the modules that have been tagged with the python keyword or that have python in the module name.

Note that Lmod will indicate the default version in the output from module av; the default is the version that will be loaded if you do not specify one.
$ module av gromacs

------------------------ /sw/arcts/centos7/modulefiles -------------------------
   gromacs/5.1.2/openmpi/1.10.2/gcc/4.9.3
   gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0 (D)

  Where:
   D:  Default Module
When loading modules with complex names, for example, gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0, you can specify up to the second-from-last element to load the default version. That is,
$ module load gromacs/5.1.2/openmpi/1.10.2/gcc
will load gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0.
To load a version other than the default, specify the version as it is displayed by the module av command; for example,
$ module load gromacs/5.1.2/openmpi/1.10.2/gcc/4.9.3
When unloading a module, only the base name need be given; for example, if you loaded either gromacs module,
$ module unload gromacs
Some modules rely on other modules. For example, the gromacs module has many dependencies, some of which conflict with the default modules. To load it, you might first clear all modules with module purge, then load the dependencies, then finally load gromacs.
$ module list

Currently Loaded Modules:
  1) intel/16.0.3   2) openmpi/1.10.2/intel/16.0.3   3) StdEnv

$ module purge
$ module load gcc/5.4.0 openmpi/1.10.2/gcc/5.4.0 boost/1.61.0 mkl/11.3.3
$ module load gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0
$ module list

Currently Loaded Modules:
  1) gcc/5.4.0                  4) mkl/11.3.3
  2) openmpi/1.10.2/gcc/5.4.0   5) gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0
  3) boost/1.61.0
That’s a lot to do each time. Lmod provides a way to store a set of modules and give it a name. So, once you have the above list of modules loaded, you can use
$ module save my_gromacs
to save the whole list under the name my_gromacs. We recommend that you make each set fully self-contained and that you use the full name/version for each module (to prevent problems if the default version of one of them changes). You can then use the combination
$ module purge
$ module restore my_gromacs
Restoring modules to user's my_gromacs
To see a list of the named sets you have (which are stored in ${HOME}/.lmod.d), use
$ module savelist
Named collection list:
  1) my_gromacs
and to see which modules are in a set, use
$ module describe my_gromacs
Collection "my_gromacs" contains:
  1) gcc/5.4.0                  4) mkl/11.3.3
  2) openmpi/1.10.2/gcc/5.4.0   5) gromacs/5.1.2/openmpi/1.10.2/gcc/5.4.0
  3) boost/1.61.0
We try to provide some helpful information about the modules. For example,
$ module help openmpi/1.10.2/gcc/5.4.0

------------- Module Specific Help for "openmpi/1.10.2/gcc/5.4.0" --------------

OpenMPI consists of a set of compiler 'wrappers' that include the appropriate
settings for compiling MPI programs on the cluster.  The most commonly used of
these are

    mpicc
    mpic++
    mpif90

Those are used in the same way as the regular compiler program, for example,

    $ mpicc -o hello hello.c

will produce an executable program file, hello, from C source code in hello.c.

In addition to adding the OpenMPI executables to your path, the following
environment variables set by the openmpi module.

    $MPI_HOME
For some generic information about the program you can use
$ module whatis openmpi/1.10.2/gcc/5.4.0
openmpi/1.10.2/gcc/5.4.0    : Name: openmpi
openmpi/1.10.2/gcc/5.4.0    : Description: OpenMPI implementation of the MPI protocol
openmpi/1.10.2/gcc/5.4.0    : License information: https://www.open-mpi.org/community/license.php
openmpi/1.10.2/gcc/5.4.0    : Category: Utility, Development, Core
openmpi/1.10.2/gcc/5.4.0    : Package documentation: https://www.open-mpi.org/doc/
openmpi/1.10.2/gcc/5.4.0    : ARC examples: /scratch/data/examples/openmpi/
openmpi/1.10.2/gcc/5.4.0    : Version: 1.10.2
and for information about what the module will set in the environment (in addition to the help text), you can use
$ module show openmpi/1.10.2/gcc/5.4.0

[ . . . . Help text edited for space -- see above . . . . ]

whatis("Name: openmpi")
whatis("Description: OpenMPI implementation of the MPI protocol")
whatis("License information: https://www.open-mpi.org/community/license.php")
whatis("Category: Utility, Development, Core")
whatis("Package documentation: https://www.open-mpi.org/doc/")
whatis("ARC examples: /scratch/data/examples/openmpi/")
whatis("Version: 1.10.2")
prereq("gcc/5.4.0")
prepend_path("PATH","/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/bin")
prepend_path("MANPATH","/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/share/man")
prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/lib")
setenv("MPI_HOME","/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0")
where the lines to attend to are prepend_path(), setenv(), and prereq(). There is also an append_path() function that you may see. The prereq() function sets the list of other modules that must be loaded before the one being displayed. The rest set or modify the environment variable listed as the first argument; for example,
prepend_path("PATH", "/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/bin")
adds /sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/bin to the beginning of the PATH environment variable.
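You can see the effect of such a prepend_path() call from the shell. A minimal sketch, using the module shown above (and assuming no other module has prepended to PATH since):

$ module load gcc/5.4.0 openmpi/1.10.2/gcc/5.4.0
$ echo $PATH | tr ':' '\n' | head -1
/sw/arcts/centos7/openmpi/1.10.2-gcc-5.4.0/bin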
Libraries are collections of functions that are already compiled and that can be included in your program without your having to write the functions yourself or compile them separately.
Saving yourself time by not having to write the functions is one obvious reason to use a library. Additionally, many of the libraries focus on high performance and accuracy. Many of the libraries are very well-tested and proven. Others can add parallelism to computationally intensive functions without you having to write your own parallel code. In general, libraries can provide significant performance or accuracy dividends with a relatively low investment of time. They can also be cited in publications to assure readers that the fundamental numerical components of your work are fully tested and stable.
To use libraries, you must link them with your own code. When you compile your own code, the compiler turns it into object code, which is understandable by the machine. Even though most modern compilers hide it from you, there is a second step in which the object code is glued together with all the standard functions you use and with any external libraries; that step is called linking. When linking libraries that are not included with your compiler, you must tell the compiler/linker where to find the file that contains the library, typically a .so and/or .a file. For libraries that require prototypes (C/C++, etc.), you must also tell the preprocessor/compiler where to find the header (.h) files. Fortran modules are also needed if you are compiling Fortran code.
When we install libraries on Flux, we usually create modules for them that set the appropriate environment variables, making it easier for you to provide the right information to the compiler and the linker. The naming scheme is typically a prefix indicating the library, for example FFTW, followed by a suffix indicating the variable's function, for example _INCLUDE for the directory containing the header files. So, for example, the module for FFTW3 includes the variables FFTW_INCLUDE and FFTW_LIB for the include and library directories, respectively. We also typically set a variable to the top level of the library path, for example FFTW_ROOT; some configuration schemes want that and infer the rest of the directory structure relative to it. Libraries are often tied to specific versions of a compiler, so you will want to run
$ module av
to see which compilers and versions are supported. One other variable that is often set by the library module is LD_LIBRARY_PATH, which is used when you run the program to tell it where to find the libraries needed at run time. If you compile and link against an external library, you will almost always need to load the library module when you want to run the program so that this variable gets set. To see the variable names that a module provides, you can use the show option of the module command to display what is being set by the module. Here is an edited example of what that would print if you were to run it for FFTW3.
[markmont@flux-login2 ~]$ module show fftw/3.3.4/gcc/4.8.5
-------------------------------------------------------------------------------
   /sw/arcts/centos7/modulefiles/fftw/3.3.4/gcc/4.8.5.lua:
-------------------------------------------------------------------------------
help([[
FFTW consists of libraries for computation of the discrete Fourier transform
in one or more dimensions.

In addition to adding entries to the PATH, MANPATH, and LD_LIBRARY_PATH, the
following environment variables are created.

FFTW_ROOT      The root of the FFTW installation folder
FFTW_INCLUDE   The FFTW3 include file folder
FFTW_LIB       The FFTW3 library folder, which includes single (float), double,
               and long-double versions of the library, as well as OpenMP and
               MPI versions.  To use the MPI libary, you must load the
               corresponding OpenMPI module.

An example of usage of those variables on a compilation command is, for gcc
and icc,

    $ gcc -o fftw3_prb fftw3_prb-c -I${FFTW_INCLUDE} -L${FFTW_LIB} -lfftw3 -lm
    $ icc -o fftw3_prb fftw3_prb-c -I${FFTW_INCLUDE} -L${FFTW_LIB} -lfftw3 -lm
]])
whatis("Name: fftw")
whatis("Description: Libraries for computation of discrete Fourier transform.")
whatis("License information: http://www.fftw.org/fftw3_doc/License-and-Copyright.html")
whatis("Category: Library, Development, Core")
whatis("Package documentation: http://www.fftw.org/fftw3_doc/")
whatis("Version: 3.3.4")
prepend_path("PATH","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/bin")
prepend_path("MANPATH","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/share/man")
prepend_path("LD_LIBRARY_PATH","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/lib")
prepend_path("FFTW_ROOT","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5")
prepend_path("FFTW_INCLUDE","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/include")
prepend_path("FFTW_LIB","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/lib")
setenv("FFTW_HOME","/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5")
[markmont@flux-login2 ~]$
In addition to the environment variables being set, the show option also displays the names of other modules with which FFTW3 conflicts (in this case, just itself), and there may be links to documentation and the vendor web site (not shown above).
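Once the module is loaded, you can inspect these variables directly from the shell. A short sketch, based on the module file shown above (depending on the module's prerequisites, you may need to load the matching gcc module first):

$ module load gcc/4.8.5 fftw/3.3.4/gcc/4.8.5
$ echo $FFTW_INCLUDE
/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/include
$ echo $FFTW_LIB
/sw/arcts/centos7/fftw/3.3.4-gcc-4.8.5/lib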
Here is an example of compiling and linking a C program with the FFTW3 libraries.
gcc -I$FFTW_INCLUDE -L$FFTW_LIB mysource.c -lfftw3 -o myprogram
Here is a breakdown of the components of that command: -I$FFTW_INCLUDE tells the compiler where to find the FFTW3 header files; -L$FFTW_LIB tells the linker where to find the FFTW3 library files; mysource.c is your C source file; -lfftw3 links your program against the FFTW3 library; and -o myprogram names the final executable myprogram.
Sometimes you will need or want to compile some files without creating the final executable program, for example, if you have many smaller source files that all combine to make a complete executable. Here is an example.
gcc -c -I$FFTW_INCLUDE source1.c
gcc -c -I$FFTW_INCLUDE source2.c
gcc -L$FFTW_LIB source1.o source2.o -o myprogram -lfftw3
The -c compiler option tells the compiler to produce an object file only. Note that only the -I option is needed when you are not linking: the header files are needed to create the object code, which contains references to the functions in the library. The last line does not actually compile anything; rather, it links the components. The -L and -l options are the same as on the one-step compile-and-link command and specify where the binary library files are located. The -o option specifies the name of the final executable, in this case myprogram. The header file location is only needed for compilation, so the -I flag can be left off the final (link) step; likewise, the -L and -l flags are only needed for the final link step and can be left off the compilation steps. Note that all the object files to be linked need to be named.
You will typically see this method used in large, complex projects, with many functions spread across many files with lots of interdependencies. This method minimizes the amount of time it takes to recompile and relink a program if only a small part of it is changed. This is best managed with make and makefiles.
Researchers in the College of Literature, Science, and the Arts have four options for using the Great Lakes High-Performance Computing cluster:
 | Cost | Wait for jobs to start | Special usage limits | Notes |
---|---|---|---|---|
Public LSA accounts | Free (paid for by LSA) | •••• | Yes (cores per user, jobs per user, job size, maximum walltime) | Only for researchers who do not have access to another account on Great Lakes. Resources are shared by researchers College-wide, so there may frequently be waits for jobs to start. |
Department or multiple-group account | $ (paid for by department or cooperating research groups) | •• | Optional (up to purchaser) | Resources are shared only between researchers within the department or groups; whether a job waits to start depends on how the account has been sized relative to the needs of the researchers. |
Private account | $$$ (paid for by researcher) | • | Optional (up to purchaser) | Resources are shared only with whoever is specified as an authorized user by the account owner. This could be a single individual, several collaborators, or a larger research group. |
Lighthouse | $$$ (paid for by researcher) | • | Optional (up to purchaser) | Typically used with external grants that require the purchase of computing hardware rather than services. Researchers purchase specific hardware for their exclusive use for 4 years. Custom hardware configurations (e.g., amount of memory per node) are possible when approved by ARC. |
The College of Literature, Science, and the Arts provides three public Great Lakes accounts to LSA researchers at no cost. A researcher can use one of the public accounts if they do not have access to another account on Great Lakes.
Account name | Great Lakes Partition | Size | Usage limits |
---|---|---|---|
lsa1 | standard | 120 cores and 600 GB memory | Size (maximums, per person): up to 24 cores and 120 GB RAM. Runtime: maximum 5 core*months remaining across all running jobs (per person). |
lsa2 | gpu | 2 GPUs | Only for jobs that require use of a GPU. Size: 1 GPU, 2 cores, 10 GB RAM per person. Walltime: maximum 1 day per job. |
lsa3 | largemem | 36 cores and 180 GB memory | Only for jobs that require more memory or cores per node than possible under lsa1. Size: 36 cores / 180 GB per person. Walltime: maximum 1 week per job. |
Uses of these accounts include but are not limited to:
The LSA public accounts are neither intended to replace nor supplement other Great Lakes accounts. Research groups who need more computation than is provided under the public account usage limits, or who need their jobs to start running faster than under the public accounts, should obtain their own Great Lakes paid accounts. Shared accounts can also be obtained for use of multiple research groups across departments, centers, institutes, or other units. Graduate students in particular may want to use Rackham Graduate Student Research Grants to provision their own, private Great Lakes paid accounts.
The LSA public accounts (lsa1, lsa2, lsa3) are not meant for use by anyone who has their own Great Lakes account, nor by those who have access to another shared account such as a departmental account or an account for their center or institute.
LSA has imposed additional usage limits on its public accounts in order to avoid a single user (or a small group of users) monopolizing the accounts for extended periods of time to the detriment of other researchers who want to use the accounts.
LSA HPC support staff will periodically monitor jobs which are running under the LSA public accounts. Users who have running jobs in contradiction with the usage limits or policies will receive an email asking them to remove any inappropriate jobs. Users who receive four or more such emails within 120 days may be temporarily or permanently removed from the accounts, at the discretion of LSA HPC support staff.
To check the list of accounts you have access to, run the following:
my_accounts
If you are listed on no accounts other than lsa1, lsa2, or lsa3, then you are eligible to use them within the limits specified.
Users of lsa1 can use up to 24 cores or up to 120 GB of memory across all of their running jobs at any point in time.
Additionally, individual users are restricted to having no more than 4 core*months (2,880 core*hours) worth of jobs running at any one time. This limit is calculated by summing the product of the remaining walltime and the number of cores for all of a given user's running jobs, as shown by the command “squeue -u $USER -t r”. Four (4) core*months are sufficient to run a 4-core job for 30 days, an 8-core job for 15 days, a 16-core job for 7 days, and many other combinations.
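To check your own running jobs against this limit, you can list the allocated cores and remaining walltime for each one. A hedged sketch (the format string is just one of several that would work):

# List job ID (%i), allocated CPUs (%C), and remaining walltime (%L) for your running jobs:
squeue -u $USER -t running -o "%.12i %.5C %.12L"
# Multiply cores by remaining time for each job and sum the results to estimate your current core*hours.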
Each user of lsa2 can run one job at a time, using a single GPU, up to 2 cores, and up to 10 GB RAM, for a maximum of 1 day (24 hours). Use of lsa2 is restricted to jobs that require use of a GPU.
The lsa3 account is for running jobs on the largemem partition, and there is a limit per user of 36 cores and 180 GB of memory.
The requested walltime for each job under lsa3 must be no more than 1 week (168 hours). This permits a single researcher to use the full lsa3 account (all 36 cores / 180 GB RAM) in a single job, but it can also result in very long waits for jobs to start. Researchers who need jobs to start more quickly should either open their own paid account, use lsa1 (if they need 120 GB RAM or less), or use ACCESS.
Use of lsa3 is restricted to jobs that require more memory or more cores per node than is possible under lsa1.
Run the command my_accounts on a Great Lakes login node. If you see the LSA public account names listed, and no other accounts listed, then you are able to run jobs under them.
If you are a member of LSA but the public account names do not show up in the my_accounts output for your Great Lakes user login, please contact arc-support@umich.edu and ask to be added to the accounts.
To use lsa1, include the following in your Slurm script:
#SBATCH --account=lsa1
#SBATCH --partition=standard
To use lsa2, include the following in your Slurm script:
#SBATCH --account=lsa2
#SBATCH --partition=gpu
To use lsa3, include the following in your Slurm script:
#SBATCH --account=lsa3
#SBATCH --partition=largemem
For more information about Slurm scripts, see the Slurm User Guide.
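Putting this together, here is a minimal sketch of a complete batch script for lsa1; the job name, resource requests, module, and program name are illustrative only, not a prescribed configuration:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=lsa1
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4        # stays within the 24-core per-person limit
#SBATCH --mem-per-cpu=4g         # stays within the 120 GB per-person limit
#SBATCH --time=24:00:00

module load python3.7-anaconda/2019.07   # module name taken from the example later in this document
python my_analysis.py                    # illustrative program name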
Run the following commands on a Great Lakes login node to see what jobs are running or pending under lsa1:
squeue -A lsa1 -t running
squeue -A lsa1 -t pending
Replace “lsa1” above with “lsa2” or “lsa3” as desired.
Because the LSA accounts are public resources, they can often be in high demand, resulting in jobs taking hours or even days to start, even if each individual user is under the usage limits. Options for getting jobs to start more quickly include:
The usage limits for the LSA public accounts are automatically enforced wherever possible. In rare cases when this does not occur, exceptions which are identified will be handled by LSA HPC staff.
A variety of options are available to manage your usage:
Yes. You can send an email to arc-support@umich.edu and ask to be removed from the other accounts in order to become eligible to use one or more of the LSA public accounts.
Please send any questions or requests to arc-support@umich.edu.
The ARC-TS Data Science Platform is an upgraded Hadoop cluster currently available as a technology preview with no associated charges to U-M researchers. The ARC-TS Hadoop cluster is an on-campus resource that provides a different service level than most cloud-based Hadoop offerings, including:
The cluster provides 112TB of total usable disk space, 40GbE inter-node networking, Hadoop version 2.3.0, and several additional data science tools.
Aside from Hadoop and its Distributed File System, the ARC-TS data science service includes:
The software versions are as follows:
Title | Version |
---|---|
Hadoop | 2.5.0 |
Hive | 0.13.1 |
Sqoop | 1.4.5 |
Pig | 0.12.0 |
R/rhdfs/rmr | 3.0.3 |
Spark | 1.2.0 |
mrjob | 0.4.3-dev, commit 226a741548cf125ecfb549b7c50d52cda932d045 |
If a cloud-based system is more suitable for your research, ARC-TS can support your use of Amazon cloud resources through MCloud, the UM-ITS cloud service.
For more information on the Hadoop cluster, please see this documentation or contact us at data-science-support@umich.edu.
A Flux account is required to access the Hadoop cluster. Visit the Establishing a Flux allocation page for more information.
Normally, compute nodes on ARC-TS clusters cannot directly access the Internet because they have private IP addresses. This increases cluster security while reducing the costs (IPv4 addresses are limited, and ARC-TS clusters do not currently support IPv6). However, this also means that jobs cannot install software, download files, or access databases on servers located outside of University of Michigan networks: the private IP addresses used by the cluster are routable on-campus but not off-campus.
If your work requires these tasks, there are three ways to allow jobs running on ARC-TS clusters to access the Internet, described below. The best method to use depends to a large extent on the software you are using. If your software supports HTTP proxying, that is the best method. If not, SOCKS proxying or SSH tunneling may be suitable.
HTTP proxying, sometimes called “HTTP forward proxying,” is the simplest and most robust way to access the Internet from ARC-TS clusters. However, there are two main limitations:
If either of these conditions apply (for example, if your software needs a database protocol such as MySQL), users should explore SOCKS proxying or SSH tunneling, described below.
Some popular software packages that support HTTP proxying include:
install.packages()
download.file()
RCurl
cpan
git
curl
wget
nc
links / elinks
web browsers

HTTP proxying is automatically set up when you log in to ARC-TS clusters, and it should be used by any software which supports HTTP proxying without any special action on your part.
Here is an example that shows installing the Python package opencv-python from within an interactive job running on a Great Lakes compute node:
[user@gl-login ~]$ module load python3.7-anaconda/2019.07
[user@gl-login ~]$ srun --pty --account=test /bin/bash
[user@gl3288 ~]$ pip install --user opencv-python
Collecting opencv-python
Downloading https://files.pythonhosted.org/packages/34/a3/403dbaef909fee9f9f6a8eaff51d44085a14e5bb1a1ff7257117d744986a/opencv_python-4.2.0.32-cp37-cp37m-manylinux1_x86_64.whl (28.2MB)
|████████████████████████████████| 28.2MB 3.2MB/s
Requirement already satisfied: numpy>=1.14.5 in /sw/arcts/centos7/python3.7-anaconda/2019.07/lib/python3.7/site-packages (from opencv-python) (1.16.4)
Installing collected packages: opencv-python
Successfully installed opencv-python-4.2.0.32
If HTTP proxying were not supported by pip (or were otherwise not working), you would be unable to access the Internet to install the opencv-python package and would receive “Connection timed out”, “No route to host”, or “Connection failed” error messages when you tried to install it.
HTTP proxying is controlled by the following environment variables which are automatically set on each compute node:
export http_proxy="http://proxy1.arc-ts.umich.edu:3128/"
export https_proxy="http://proxy1.arc-ts.umich.edu:3128/"
export ftp_proxy="http://proxy1.arc-ts.umich.edu:3128/"
export no_proxy="localhost,127.0.0.1,.localdomain,.umich.edu"
export HTTP_PROXY="${http_proxy}"
export HTTPS_PROXY="${https_proxy}"
export FTP_PROXY="${ftp_proxy}"
export NO_PROXY="${no_proxy}"
Once these are set in your environment, you can access the Internet from compute nodes; for example, you can install Python and R libraries from compute nodes. There is no need to start any daemons, as there is with the SOCKS proxying and SSH tunneling solutions described below. The HTTP proxy server proxy.arc-ts.umich.edu supports HTTPS but does not terminate the TLS session at the proxy; traffic is encrypted by the software the user runs and is not decrypted until it reaches the destination server on the Internet.
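To confirm that proxying is active in your current shell, you can check the variables and try a request to an off-campus site. A quick sketch (pypi.org is just an example destination; any non-UM site not covered by no_proxy will do):

$ echo $https_proxy
http://proxy1.arc-ts.umich.edu:3128/
$ curl -sI https://pypi.org/ | head -n 1     # a 200-series status line means the request went through the proxy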
To prevent software from using HTTP proxying, run the following command:
unset http_proxy https_proxy ftp_proxy no_proxy HTTP_PROXY HTTPS_PROXY FTP_PROXY NO_PROXY
The above command will only affect software started from the current shell. If you start a new shell (for example, if you open a new window or log in again), you will need to re-run the command above each time. To permanently disable HTTP proxying for all software, add the command above to the end of your ~/.bashrc file.
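For example, one way to append it (this simply adds the unset command shown above to your ~/.bashrc):

echo 'unset http_proxy https_proxy ftp_proxy no_proxy HTTP_PROXY HTTPS_PROXY FTP_PROXY NO_PROXY' >> ~/.bashrc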
Finally, note that HTTP proxying (which is forward proxying) should not be confused with reverse proxying. Reverse proxying, which is done by the ARC Connect service, allows researchers to start web applications (including Jupyter notebooks, RStudio sessions, and Bokeh apps) on compute nodes and then access those web applications through the ARC Connect.
A second solution is available for any software that either supports the SOCKS protocol or that can be “made to work” with SOCKS. Most software does not support SOCKS, but here is an example using curl (which does have built-in support for SOCKS) to download a file from the Internet from inside an interactive job running on a Great Lakes compute node. We use “ssh -D” to set up a “quick and dirty” SOCKS proxy server for curl to use:
[user@gl-login ~]$ module load python3.7-anaconda/2019.07
[user@gl-login ~]$ srun --pty --account=test /bin/bash
[user@gl3288 ~]$ ssh -f -N -D 1080 greatlakes-xfer.arc-ts.umich.edu
[user@gl3288 ~]$ curl --socks localhost -O ftp://ftp.gnu.org/pub/gnu/bc/bc-1.06.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 272k 100 272k 0 0 375k 0 --:--:-- --:--:-- --:--:-- 375k
[user@gl3288 ~]$ ls -l bc-1.06.tar.gz
-rw-rw-r-- 1 user user 278926 Feb 3 16:09 bc-1.06.tar.gz
A limitation of “ssh -D” is that it only handles TCP traffic, not UDP traffic (including DNS lookups, which happen over UDP). However, if you have a real SOCKS proxy accessible to you elsewhere on the U-M network (such as on a server in your lab), you can specify its hostname instead of “localhost” above and omit the ssh command in order to have UDP traffic handled.
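For example, if such a proxy were listening on port 1080 on a hypothetical lab server named socks.example.umich.edu, you could skip the ssh step and point curl at it directly:

# socks.example.umich.edu is a hypothetical SOCKS server name; replace it with your own
curl --socks5-hostname socks.example.umich.edu:1080 -O ftp://ftp.gnu.org/pub/gnu/bc/bc-1.06.tar.gz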
A final option for accessing the Internet from an ARC-TS compute node is to set up a local SSH tunnel using the “ssh -L” command. This provides a local port on the compute node that processes can connect to in order to access a single specific remote port on a single specific host on a non-UM network.
Here is an example that shows how to use a local tunnel to copy a file using scp from a remote system named “far-away.example.com” (residing on a non-UM network) onto an ARC-TS cluster from inside a job running on a compute node.
You should run the following commands inside an interactive Slurm job the first time so that you can respond to prompts to accept various keys, as well as enter your password for far-away.example.com when prompted.
# Start the tunnel so that port 2222 on the compute node connects to port 22 on far-away.example.com:
ssh -N -L 2222:far-away.example.com:22 greatlakes-xfer.arc-ts.umich.edu &
# Give the tunnel time to completely start up:
sleep 5
# Copy the file “my-data-set.csv” from far-away.example.com to the compute node:
# Replace “your-user-name” with the username by which far-away.example.com knows you.
# If you don’t have public key authentication set up from the cluster for far-away.example.com, you’ll
# be prompted for your far-away.example.com password
scp -P 2222 your-user-name@localhost:my-data-set.csv .
# When you are all done using it, tear down the tunnel:
kill %1
Once you have run these commands once, interactively, from a compute node, they can then be used in non-interactive Slurm batch jobs, if you’ve also set up public key authentication for far-away.example.com.
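As a rough sketch, the same sequence could then be placed in a batch script, under the assumptions above (public key authentication already set up, host keys already accepted); the account name, username, and file name below are illustrative:

#!/bin/bash
#SBATCH --job-name=fetch-data
#SBATCH --account=test           # illustrative account name
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=1g

# Start the tunnel in the background, give it time to come up, copy the file, then tear the tunnel down.
ssh -N -L 2222:far-away.example.com:22 greatlakes-xfer.arc-ts.umich.edu &
sleep 5
scp -P 2222 your-user-name@localhost:my-data-set.csv .
kill %1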