HPC Updates (Great Lakes, Armis2, and Lighthouse)

We are still finalizing the exact maintenance schedule, so the dates and duration below may be updated as planning continues. During the maintenance, the high-performance computing (HPC) clusters and their storage systems (/home and /scratch) will be unavailable:

  • Great Lakes, Armis2, and Lighthouse: Monday, June 5, 2023, 8am – Friday, June 9, 2023, 5pm

Copy any files you might need during the maintenance window to your local drive using Globus File Transfer.
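
If you prefer scripting the copy instead of using the Globus web interface, the Globus CLI can drive the same transfer. The endpoint UUIDs and paths below are placeholders for your cluster collection and your personal endpoint.

  # Authenticate, then recursively copy a /scratch directory to a Globus Connect Personal endpoint
  globus login
  globus transfer SRC_ENDPOINT_UUID:/scratch/example_root/example/uniqname \
      DST_ENDPOINT_UUID:/home/uniqname/maintenance-backup \
      --recursive --label "pre-maintenance backup"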

Run the command maxwalltime at the command line of any cluster login node to see how much time remains until maintenance begins. Jobs that request more walltime than remains until maintenance will automatically be queued and will start once maintenance is complete.
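
For example, on any cluster login node (the account name and job script below are placeholders, and the requested walltime is illustrative):

  # Print the walltime remaining before the maintenance window begins
  maxwalltime

  # A job that asks for more time than remains will stay queued until after maintenance
  sbatch --time=5-00:00:00 --account=example_account myjob.sh
  squeue -u $USER -t PD    # the job will wait in the pending (PD) state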

Contact arc-support@umich.edu if you have any questions.

System hardware changes

Great Lakes

  • Recabling the networking for the single-precision GPU nodes (spgpu)
  • Possible complete update to the Ethernet networking
  • Networking switch removals

Armis2 and Lighthouse

  • Maintenance on the Modular Data Center (MDC) (power will be out)
  • Hardware reorganization at the MDC
  • Upgrades to the Ethernet and InfiniBand networking systems

System software changes

Great Lakes, Armis2, and Lighthouse

For each component, the new version (after maintenance) is listed first, followed by the old version:

  • Operating system
    • New: Red Hat 8.6
      • Kernel 4.18.0-372.51.1.el8_6.x86_64
      • glibc-2.28-189.5
      • ucx-1.15.0-1.59056 (OFED provided)
      • gcc-8.5.0-10.1.el8
    • Old: Red Hat 8.4
      • Kernel 4.18.0-305.65.1.el8_4.x86_64
      • glibc 2.28-151
      • ucx 1.12.0-1.55103 (OFED provided)
      • gcc-8.4.1-1.el8
  • Mlnx-ofa_kernel-modules
    • New: OFED 5.9.0.5.5.1 (kver.4.18.0_372.51.1.el8_6)
    • Old: OFED 5.5.1.0.3.1 (kver.4.18.0_305.25.1.el8_4)
  • Slurm
    • New: Slurm 23.02.1, compiled with:
      • PMIx (/opt/pmix/2.2.5, /opt/pmix/3.2.3, /opt/pmix/4.2.3)
      • hwloc 2.2.0-3 (OS provided)
      • ucx-1.15.0-1.59056 (OFED provided)
      • slurm-libpmi
      • slurm-contribs
    • Old: Slurm 21.08.8-2, compiled with:
      • PMIx (/opt/pmix/2.2.5, /opt/pmix/3.2.3, /opt/pmix/4.1.2)
      • hwloc 2.2.0-1 (OS provided)
      • ucx 1.12.0-1.55103 (OFED provided)
      • slurm-libpmi
      • slurm-contribs
  • PMIx LD config
    • New: /opt/pmix/2.2.5/lib
    • Old: /opt/pmix/2.2.5/lib (unchanged)
  • PMIx versions available in /opt
    • New: 2.2.5, 3.2.3, 4.2.3
    • Old: 2.2.5, 3.2.3, 4.1.2
  • Singularity CE (Sylabs.io)
    • New: 3.10.4, 3.11.1
    • Old: 3.10.4
  • NVIDIA driver
    • New: 530.30.02
    • Old: 520.61.05
  • Open OnDemand
    • New: 3.0
    • Old: 2.0.29

Slurm Changes

Changes in behavior:

  • srun will no longer read in SLURM_CPUS_PER_TASK. This means you will explicitly have to specify --cpus-per-task on your srun calls, or set the new SRUN_CPUS_PER_TASK environment variable to accomplish the same thing (see the sketch after this list).
  • srun --ntasks-per-core now applies to job and step allocations. Use of --ntasks-per-core=1 now implies --cpu-bind=cores, and --ntasks-per-core>1 implies --cpu-bind=threads.
  • srun --overlap now allows the step to share all resources (CPUs, memory, and GRES), where previously --overlap only allowed the step to share CPUs with other steps.
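
As a minimal sketch of the first change (the account name and program are placeholders): a batch job that requests --cpus-per-task must now repeat that value on its srun calls, or export it via SRUN_CPUS_PER_TASK.

  #!/bin/bash
  #SBATCH --job-name=cpus-per-task-example
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8
  #SBATCH --time=00:10:00
  #SBATCH --account=example_account        # placeholder account

  # Option 1: repeat the flag explicitly on the srun call
  srun --cpus-per-task=8 ./my_threaded_program

  # Option 2: export the new variable once so later srun calls pick it up
  export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
  srun ./my_threaded_program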

NEW Features:

  • GPU Compute Mode option in srun/sbatch (see the batch-script sketch after this list)
    • --gpu_cmode=<shared|exclusive|prohibited>
      • Sets the GPU compute mode on the allocated GPUs to shared, exclusive, or prohibited. The default is exclusive.

 

  • Email checkbox
    • Previously, all email notifications were enabled; now users can opt out with a checkbox at the top of the form, which is enabled by default.
    • All apps have been updated with a new auto-account feature that collects the list of accounts the user has access to and presents a pull-down list instead of having the user type the account.
  • Auto partition list
    • Partitions are automatically detected when a new PI is added to Lighthouse or Armis2, or when a new partition is added to Great Lakes.
  • Added the SLURM_JOB_START_TIME and SLURM_JOB_END_TIME environment variables (also shown in the batch-script sketch after this list).
  • --extra=<string>
    • An arbitrary string, enclosed in double quotes if it contains spaces or special characters. Available in the job details.

      Example:

      srun --time=00:10:00 --gpus-per-node=1 --nodes=1 -A hpcstaff --extra="Extra Metadata Field :)" --pty /bin/bash

      scontrol show job 151 | grep -i extra

      Extra=Extra Metadata Field :)

  • --tres-per-task
    • For salloc/sbatch/srun. Unifies the TRES request in a single option.
    • Example:
      • Without the new option:
        • srun --time=00:10:00 --nodes=1 -A hpcstaff -p spgpu --ntasks=1 --cpus-per-task=8 --gpus-per-task=1 --pty /bin/bash
      • With the new option:
        • srun --time=00:10:00 --nodes=1 -A hpcstaff -p spgpu --ntasks=1 --tres-per-task=cpu:8,gres/gpu:1 --pty /bin/bash
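
To illustrate the GPU compute mode option and the new job time variables in one place, here is a minimal sbatch sketch; the account and program names are placeholders, and the --gpu_cmode flag is assumed to behave as described above.

  #!/bin/bash
  #SBATCH --job-name=gpu-cmode-example
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --partition=spgpu
  #SBATCH --gpus-per-node=1
  #SBATCH --gpu_cmode=shared               # shared, exclusive (default), or prohibited
  #SBATCH --time=00:30:00
  #SBATCH --account=example_account        # placeholder account

  # The new variables hold Unix timestamps for the job's scheduled start and end
  echo "Job start: $(date -d @"$SLURM_JOB_START_TIME")"
  echo "Job end:   $(date -d @"$SLURM_JOB_END_TIME")"

  srun ./my_gpu_program                    # placeholder executable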

SLURM install path, now the system default

  • Since the new paths are system default paths, SLURM will be available without having to set up an environment path. If you have hard-coded the old /opt/slurm path, make sure you remove it (a quick check is sketched after the path table below).
  • The following are the updated path changes:

 

Old path → New path (system default)

  • /opt/slurm/bin → /usr/bin/
  • /opt/slurm/etc → /etc/slurm/
  • /opt/slurm/include → /usr/include/
  • /opt/slurm/include/slurm → /usr/include/slurm
  • /opt/slurm/lib64 → /usr/lib64/slurm
  • /opt/slurm/share → /usr/share
  • /opt/slurm/share/man → /usr/share/man
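
A quick way to confirm your environment picks up the new default locations and to find any hard-coded references to the old prefix in your own scripts (the script directory below is just an example):

  # The Slurm client commands should now resolve to /usr/bin
  which sbatch srun squeue

  # Search your job scripts for hard-coded references to the old install prefix
  grep -rn '/opt/slurm' ~/jobscripts    # adjust the path to wherever your scripts live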

User software changes

  • Unknown at this point

Storage Updates (Turbo, Locker, and Data Den)

We are not planning any major changes to the storage services themselves; the main reason for the storage outage is a network migration to the new MACC core network. This date is still flexible and subject to change. If you need access to any data during the outage, copy it to your desktop and/or laptop using Globus File Transfer and work on it there during the maintenance.

  • Storage (Turbo, Locker, and Data Den): Tuesday, June 6, 2023, 8am – Thursday, June 8, 2023, 6pm

All Storage Services

  • Network migration for all storage systems