HPC Updates (Great Lakes, Armis2, and Lighthouse)

ARC will be performing its summer maintenance during the week of June 3rd, 2024. The /home and /scratch storage systems connected to the clusters will be unavailable during this time. We will be updating the operating system, the Slurm job scheduler, Open OnDemand, and the ethernet network, as well as performing some data center preventative maintenance. In addition, we will be updating the Globus services for ARC storage, although these services will remain accessible during the update. Details of the maintenance will be posted on the ARC maintenance status page.

Schedule:

Great Lakes:

  • June 3-4

Armis2 and Lighthouse:

  • June 3-5 

Globus for the ARC storage services:

  • June 4-5 (a short outage on the morning of June 4; see below)

Maintenance notes:

  • /home and /scratch storage systems on all HPC clusters will be unavailable
  • The ARC storage systems (Turbo, Locker, and Data Den) will have no downtime during their maintenance.
  • Copy any data and files that may be needed during maintenance to your local drive using Globus File Transfer before maintenance begins. 
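
If you use the Globus CLI rather than the web app, a recursive copy can also be started from the command line. This is only a sketch: the collection UUIDs and paths below are placeholders, and it assumes the globus CLI is installed and authenticated.

    # Look up the UUID of a collection by name (search text is an example)
    globus endpoint search "umich#greatlakes"

    # Recursively copy a directory off /scratch before maintenance begins
    # (replace SRC_UUID, DEST_UUID, and both paths with your own values;
    # DEST_UUID could be a Globus Connect Personal collection on your own computer)
    globus transfer SRC_UUID:/scratch/example_root/example_project/uniqname \
        DEST_UUID:/~/backup/ --recursive --label "pre-maintenance copy"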

Countdown to maintenance

For HPC jobs, use the command “maxwalltime” to see how much time remains until maintenance begins. Jobs that request more walltime than remains until maintenance will automatically be queued and will start once maintenance is complete.
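
For example, a quick check before submitting (the job script names are placeholders, and the output of maxwalltime varies with the date):

    # Check how much walltime remains before the maintenance window opens
    maxwalltime

    # A job requesting more walltime than remains is queued and starts after maintenance;
    # a job that fits within the remaining window can still run before it
    sbatch --time=7-00:00:00 long_job.sbat
    sbatch --time=12:00:00 short_job.sbat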

Status updates and additional information

  • Status updates will be available on the ARC maintenance status page throughout the course of the maintenance.
  • ARC will send an email to all HPC users when the maintenance has been completed. 

How can we help you?

For assistance or questions, please contact ARC at arc-support@umich.edu.

System software changes

Great Lakes, Armis2 and Lighthouse

New versions (after the maintenance):

Red Hat 8.8 EUS
  • Kernel 4.18.0-477.51.1.el8_8.x86_64
  • glibc-2.28-225
  • ucx-1.14.0-1.58415.x86_64 (OFED LTS provided)
  • gcc-8.5.0-18.2.el8

Mlnx-ofa_kernel-modules
  • OFED 5.8.4.1.5.1
    • 4.18.0_477.51.1.el8

Slurm 23.11.6, compiled with:
  • PMIx
    • /opt/pmix/4.2.9
    • /opt/pmix/5.0.2
  • hwloc 2.2.0-3 (OS provided)
  • ucx-1.14.0-1.58415.x86_64 (OFED LTS provided)
  • slurm-libpmi
  • slurm-contribs

PMIx LD config: /opt/pmix/3.2.5/lib

PMIx versions available in /opt:
  • 4.2.9
  • 5.0.2

Singularity CE (Sylabs.io)
  • 3.11.5
  • 4.1.2

NVIDIA driver 550.54.15

Open OnDemand 3.1.4

Old versions (before the maintenance):

Red Hat 8.6 EUS
  • Kernel 4.18.0-372.75.1.el8_6.x86_64
  • glibc-2.28-189.6
  • ucx-1.15.0-1.59056 (OFED provided)
  • gcc-8.5.0-10.1.el8

Mlnx-ofa_kernel-modules
  • OFED 5.9.0.5.5.1
    • kver.4.18.0_372.51.1.el8_6

Slurm 23.02.5, compiled with:
  • PMIx
    • /opt/pmix/3.2.5
    • /opt/pmix/4.2.6
  • hwloc 2.2.0-3 (OS provided)
  • ucx-1.15.0-1.59056 (OFED provided)
  • slurm-libpmi
  • slurm-contribs

PMIx LD config: /opt/pmix/3.2.5/lib

PMIx versions available in /opt:
  • 3.2.5
  • 4.2.6

Singularity CE (Sylabs.io)
  • 3.10.4
  • 3.11.1

NVIDIA driver 545.23.06

Open OnDemand 3.0.3

Open OnDemand and Slurm Changes

  • OOD Globus integration: from the Open OnDemand file browser, you can now navigate to the cluster’s corresponding Globus endpoint.

  • Slurm 23.11
    • Private /tmp and /dev/shm directories via the Slurm job_container/tmpfs plugin. A filesystem namespace is created for each job with a unique, private instance of /tmp and /dev/shm for the job to use; the contents of these directories are removed at job termination.
    • Mutual exclusivity of --cpus-per-gpu and --cpus-per-task (see the sketch after this list). Specifying both options, whether on the command line or in the environment, results in an error. If one of the options is set on the command line, it overrides the corresponding setting in the environment.
    • New --external-launcher option in srun, which allows various MPI implementations (such as ORTE or Hydra) to use their own launchers within a special step. This step has access to all allocated node resources without consuming them, enabling other steps to run concurrently.
    • The environment variable SRUN_CPUS_PER_TASK has been replaced with SLURM_CPUS_PER_TASK to revert to the functionality prior to Slurm version 22.05. Previously, starting with Slurm 22.05, using --cpus-per-task automatically implied --exact, which required modifications to prevent srun from reading SLURM_CPUS_PER_TASK. With the introduction of the new external launcher feature (srun --external-launcher), srun can now recognize this environment variable within an allocation. This means that even if -c1 is specified, mpirun will execute without being restricted to a single CPU.
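
A minimal sketch of the exclusivity change described above (the job script is hypothetical, not an ARC template; account and partition options are omitted):

    #!/bin/bash
    # Hypothetical job script illustrating the Slurm 23.11 exclusivity rule
    #SBATCH --job-name=gpu-example
    #SBATCH --gpus=1
    #SBATCH --cpus-per-gpu=4      # CPU count is tied to the GPU request
    ##SBATCH --cpus-per-task=4    # commented out: combining this with --cpus-per-gpu
                                  # is rejected as an error in Slurm 23.11
    #SBATCH --time=01:00:00

    # Under 23.11, srun reads SLURM_CPUS_PER_TASK (not SRUN_CPUS_PER_TASK) inside an allocation
    srun ./my_gpu_program         # placeholder executable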

Storage Updates (Turbo, Locker, and Data Den)

Globus

Network switch updates will interrupt all Turbo, Locker, and Data Den transfers for about 10-15 minutes at some point between 5 a.m. and 7 a.m. on Tuesday morning, June 4th. Transfers will pick up again immediately after the outage is over.