Summer 2022 Maintenance

Due to maintenance, the high-performance computing (HPC) clusters and their storage systems (/home and /scratch) will be unavailable:

  • Great Lakes: Monday, August 8 to Wednesday, August 10
  • Turbo: Monday, August 8 to Wednesday, August 10
  • Armis2 and Lighthouse: Tuesday, August 9 to Wednesday, August 10

More detail is available on the ITS website.

Attention

  • Jobs that cannot be completed before the beginning of maintenance will not start. 
    • Great Lakes maintenance begins on Monday, August 8. 
    • Armis2 and Lighthouse maintenance begins on Tuesday, August 9.
  • Make a copy of any files in /home or /scratch that you might need during maintenance to your local drive before maintenance begins; those file systems will not be available while maintenance is underway. Use Globus File Transfer or another file-transfer method (see the User Guide for your cluster), or the example commands after this list.
  • All running and queued jobs will be deleted at the start of maintenance.
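
For example, from your local machine you could pull down what you need with rsync or scp before maintenance begins (a minimal sketch: the login host, uniqname, account, and paths shown are examples, so adjust them for your cluster and directories; Globus remains the recommended option for large transfers):

  # Run these from your LOCAL machine, not on a cluster login node.
  # Replace "uniqname", "myaccount", and the remote paths with your own.
  mkdir -p ~/maintenance-backup
  # Copy a directory from /home on Great Lakes:
  rsync -av uniqname@greatlakes.arc-ts.umich.edu:/home/uniqname/myproject ~/maintenance-backup/
  # Copy a single file from /scratch:
  scp uniqname@greatlakes.arc-ts.umich.edu:/scratch/myaccount_root/myaccount/uniqname/results.csv ~/maintenance-backup/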

ARC recommends that you test, if you can

Testing is an important component of the maintenance process. Discovering broken code early will ensure that there is enough time to put a fix in place so that your research is not disrupted.

Module names and the available versions of installed software have changed, so we recommend that you log in to check on the availability of software you are accustomed to using. It may be important to determine whether existing code will run with the versions that will be available after maintenance.
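
For example, after logging in to one of the preview clusters (see "Accessing the preview clusters" below) you can check whether the software you rely on is still available, and in which versions (a minimal sketch; the package name is just an example):

  # List everything available through the module system:
  module avail
  # Search for a specific package and see which versions exist (Python here as an example):
  module spider python
  # Load one of the reported versions and confirm it is what you expect:
  module load python        # or the specific name/version reported by "module spider"
  python --version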

SPECIAL NOTICE: If you plan to test your software and/or workflows on the preview clusters prior to maintenance, be aware that directories under /home and /nfs on all clusters are shared between the preview clusters and the current production clusters.

Because directories are shared, we highly recommend you test using copies of data and programs.
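
For example, rather than pointing a test run at your production data, work on a throwaway copy (a sketch; the paths are placeholders):

  # Make a separate copy of the code and data you want to test with:
  cp -r /home/uniqname/myproject /home/uniqname/myproject-preview-test
  cd /home/uniqname/myproject-preview-test
  # Run your tests here; the production copy in /home/uniqname/myproject is left untouched.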

Software in the following categories is especially susceptible to version changes.

Accessing the preview clusters

Great Lakes:

Hardware  
  • 10 standard compute nodes (36 cores, 192G memory)
  • 2 standard on-campus nodes
  • 1 GPU node with 3 V100 GPUs
  • The large memory partition is available for testing job scripts, but it uses a standard compute node (it is not a large memory node)
Limits  
  • 1 running job per user
  • Standard jobs: 6 hour max job time, 18 CPUs, 90G memory
  • GPU jobs: 2 hour max job time and 1 GPU per root account
  • 100 hours of compute for the root account (please only use this for testing; an example batch script that fits within these limits follows this list)
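
Below is a minimal batch script that stays within these preview limits (a sketch: the account, partition, and program names are placeholders, so use your own Slurm account and check sinfo for the partitions that exist on the preview cluster):

  #!/bin/bash
  #SBATCH --job-name=preview-test
  #SBATCH --account=myaccount        # placeholder: your Slurm (root) account
  #SBATCH --partition=standard       # placeholder: confirm partition names with "sinfo"
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --cpus-per-task=18         # at or below the 18-CPU preview limit
  #SBATCH --mem=90G                  # at or below the 90G preview limit
  #SBATCH --time=06:00:00            # at or below the 6-hour preview limit

  # Load whatever modules your code needs on the new software stack, then run a short test.
  module load gcc
  ./my_test_program                  # placeholder: your own test executable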

Armis2:

Hardware  
  • 10 standard compute nodes (24 cores, 120G memory)
  • 1 GPU node with 4 Titan V GPUs
  • The large memory partition is available for testing job scripts, but it uses a standard compute node (it is not a large memory node)
Limits  
  • 1 running job per user
  • Standard jobs: 6 hour max job time, 18 CPUs, 90G memory
  • GPU jobs: 2 hour max job time and 1 GPU per root account
  • 100 hours of compute for the root account (please only use this for testing)

Lighthouse:

Hardware  
  • 2 standard compute nodes (20 cores, 62G memory)
  • No GPU nodes are available. We recommend testing on Great Lakes or using one of your team’s nodes (see the next bullet)
  • If you would like to test your codes on one or more of your compute nodes, email arc-support@umich.edu and we can migrate those nodes to the Lighthouse preview cluster for your team’s testing.
Limits  
  • 1 running job per user
  • Standard jobs: 6 hour max job time, 18 CPUs, 90G memory
  • 100 hours of compute for the root account (please only use this for testing)

How to get help 

  • Send an email to arc-support@umich.edu.
  • Attend a virtual, drop-in office hour session (CoderSpaces) to get hands-on help from experts, available 9:30-11 a.m. and 2-3:30 p.m. on Tuesdays and Thursdays.  

Changes

This maintenance period will include significant updates, including changes to the operating system, OFED drivers, NVIDIA drivers, and software and Slurm versions on all three clusters: Great Lakes, Armis2, and Lighthouse. See the details below. 

System software changes

For each component, the new version is listed first, followed by the version it replaces.

Operating system
  • New: Red Hat 8.4
    • kernel 4.18.0-305.45.1.el8_4.x86_64
    • glibc 2.28-151
    • ucx 1.12.0-1.55103 (OFED provided)
    • gcc-8.4.1-1.el8
  • Old: CentOS 7.9
    • kernel 3.10.0-1160.45.1
    • glibc 2.17-325.el7_9
    • ucx 1.11.1-1.54303 (OFED provided)
    • gcc-4.8.5-44.el7

Mlnx-ofa_kernel-modules
  • New: OFED 5.5-1.0.3.1, kver 4.18.0_305.25.1.el8_4
  • Old: OFED 5.4-3.0.3.0, kver 3.10.0-1160.45.1

Slurm
  • New: Slurm 21.08.8-2, compiled with:
    • PMIx: /opt/pmix/2.2.5, /opt/pmix/3.2.3, /opt/pmix/4.1.2
    • hwloc 2.2.0-1 (OS provided)
    • ucx 1.12.0-1.55103 (OFED provided)
    • slurmrestd, slurm-libpmi, slurm-contribs
  • Old: Slurm 21.08.4, compiled with:
    • PMIx: /opt/pmix/2.2.5, /opt/pmix/3.2.3, /opt/pmix/4.1.0
    • hwloc 1.11.8-4 (OS provided)
    • ucx 1.11.1 (OFED provided)
    • slurmrestd, slurm-libpmi, slurm-contribs

PMIx LD config
  • New: /opt/pmix/2.2.5/lib
  • Old: /opt/pmix/2.2.5/lib (unchanged)

PMIx versions available in /opt
  • New: 2.2.5, 3.2.3, 4.1.2
  • Old: 2.2.5, 3.2.3, 4.1.0

Singularity (Sylabs.io)
  • New: Singularity CE 3.9.8
  • Old: Singularity 3.7.3 and 3.8.4

NVIDIA driver
  • New: 510.73.08
  • Old: 495.44

Open OnDemand
  • New: 2.0.23-1
  • Old: 2.0.20
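
After maintenance (or on a preview login node, which already runs the new stack), you can confirm which versions you are getting; for example (a quick sketch; the expected values follow the table above, and exact output strings may differ slightly):

  # Operating system and kernel:
  cat /etc/redhat-release       # expect Red Hat Enterprise Linux 8.4
  uname -r                      # expect a 4.18.0-305 kernel
  # Compiler and Slurm:
  gcc --version                 # expect 8.4.1
  sinfo --version               # expect slurm 21.08.8
  # Singularity and, on a GPU node, the NVIDIA driver:
  singularity --version         # expect 3.9.8
  nvidia-smi --query-gpu=driver_version --format=csv,noheader   # expect 510.73.08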


User software changes

Python
  • New: 3.6.8 is system provided; ARC will provide newer versions via modules
  • Deprecated (RIP): 2.x


FAQ

Q: When is summer maintenance? 

A: Summer 2022 maintenance is happening August 8-10. The high-performance computing (HPC) clusters and their storage systems (/home and /scratch) will be unavailable:

  • Great Lakes: Monday, August 8 to Wednesday, August 10 
  • Armis2 and Lighthouse: Tuesday, August 9 to Wednesday, August 10 


Q: How should I prepare for the summer 2022 maintenance? 

A: There are a number of actions you can take ahead of maintenance, including: 

  • Use the preview clusters to test your code. 
  • Make a copy of any files in /home or /scratch that you might need during maintenance to your local drive prior to maintenance, using Globus File Transfer or another file-transfer method (see the User Guide for your cluster).


Q: Can I run any of my jobs during maintenance? 

A: No. You can submit jobs at any time, but any job that has not completed before maintenance begins will be deleted and will need to be resubmitted.


Q: Will I have access to the clusters during the maintenance?

A: No. The clusters and their storage systems will be unavailable during maintenance, including files, jobs, and the command line. 


Q: Will there be any changes to my jobs after maintenance?

A: Use the preview clusters to recompile and test your code before maintenance. If you don’t get a chance to recompile or test on a preview cluster beforehand, you may need to recompile your code after maintenance.
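
For example, a recompile-and-test pass on a preview cluster might look like this (a sketch assuming a hypothetical ~/myproject directory with a Makefile that builds with GCC; adapt it to your own build system):

  # After logging in to the preview cluster (e.g., ssh gl8-login.arc-ts.umich.edu):
  module load gcc              # pick up the toolchain on the new software stack
  gcc --version                # should report the RHEL 8 compiler (8.4.1)
  cd ~/myproject
  make clean && make           # rebuild against the new libraries
  sbatch preview-test.sbat     # placeholder: a short test job script, like the example earlier on this page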


Q: Where can I get help?

A: You can send an email to arc-support@umich.edu. Or attend a virtual, drop-in office hour session (CoderSpaces) to get hands-on help from experts, available 9:30-11 a.m. and 2-3:30 p.m. on Tuesdays and Thursdays. 


Q: How do I access the preview clusters? 

Great Lakes:

  • Command line: ssh gl8-login.arc-ts.umich.edu
  • Open OnDemand: https://greatlakes8.arc-ts.umich.edu

Armis2:

  • Command line: ssh a28-login.arc-ts.umich.edu
  • Open OnDemand: https://armis28.arc-ts.umich.edu

Lighthouse:

  • Command line: ssh lh8-login.arc-ts.umich.edu
  • Open OnDemand: https://lighthouse8.arc-ts.umich.edu


Q: What are the technical changes happening during maintenance?

A: See the chart above. 


Q: Will Python version 2.x work after maintenance?

A: No. After maintenance, the system-provided Python will be version 3.6.8, and ARC will provide newer versions via modules; Python 2.x will no longer be supported. Be sure to test and/or update your processes, code, and/or libraries.
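
For example, to confirm your code works under Python 3 on the new stack (a sketch; check module spider python for the versions ARC actually provides rather than relying on the names here):

  # The system-provided Python after maintenance:
  python3 --version                    # expect Python 3.6.8
  # Newer ARC-provided versions, if you need them:
  module spider python                 # list the available Python modules
  # Quick compatibility checks for your own code:
  python3 -m py_compile my_script.py   # Python 2-only syntax (e.g., print "x") fails here
  python3 my_script.py                 # placeholder: run your own tests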