2024 ARC Winter Maintenance

By | Feature, General Interest, Great Lakes, HPC, News, Systems and Services

Winter maintenance is coming up! See the details below. Reach out to arc-support@umich.edu with questions or if you need help. 

HPC

Like last year, we will have a rolling update which, outside of a few brief interruptions, should keep the clusters in production. Here is the schedule:

December 6, 11 p.m.

  • Update Slurm Controllers: Expect a brief 1-minute interruption when querying Slurm. All jobs will continue to run.
  • Update Open OnDemand Servers: Expect a few seconds of interruption if you are using Open OnDemand.
  • Login Servers Update: We will begin updating our login servers. This update is not expected to impact any users.

December 7:

  • Compute Rolling Updates: We will start rolling updates across all clusters a few nodes at a time, so there should be minimal impact on access to resources.

December 19, 10:30 a.m.:

  • Update Globus transfer (xfer) nodes: These nodes are deployed in pairs for each cluster, so we will take down one node of each pair at a time and all Globus services will remain available. If you are using scp/sftp, your transfers may be interrupted, so please schedule them accordingly or use Globus (see the sketch below this list). Total maintenance time should be approximately one hour.
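
If you normally move data with scp/sftp around this window, the Globus command-line interface is one way to keep transfers running, since the Globus service itself stays up. A minimal sketch, assuming you have the globus-cli package installed; the endpoint UUIDs and paths below are placeholders, not real ARC endpoint IDs:

    # Authenticate with Globus (opens a browser window)
    globus login

    # Transfer a directory from one endpoint to another.
    # Replace the UUIDs and paths with your actual source and destination.
    globus transfer \
        11111111-2222-3333-4444-555555555555:/home/uniqname/dataset \
        66666666-7777-8888-9999-000000000000:/scratch/example_root/uniqname/dataset \
        --recursive --label "pre-maintenance copy"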

January 3, 8 a.m.:

  • Reboot Slurm Controller Nodes: This will cause an approximately 10-minute Slurm outage. All running jobs will continue to run.
  • Armis2 Open OnDemand Node: We will reload and reboot the Armis2 Open OnDemand node. This will take approximately 1 hour.
  • Great Lakes and Lighthouse Open OnDemand Nodes: These nodes will be down approximately 10 to 15 minutes.
  • Globus Transfer (xfer) Nodes: These nodes will be rebooted. This will take approximately 15 minutes.
  • Sigbio Login Reload/Reboot: This will take approximately 1 hour.
  • Some Armis2 Faculty Owned Equipment (FOE) nodes will require physical and configuration updates. Expected downtime is 4 hours.

HPC Maintenance Notes:

  • Open OnDemand (OOD) users will need to re-login. Any existing jobs will continue to run and can be reconnected in the OOD portal.
  • Login servers will be updated, and the maintenance should not have any effect on most users. Those who are affected will be contacted directly by ARC. 
  • New viz partition: there will be a new partition called viz with 16 new GPUs; jobs in this partition can use exactly one GPU each (see the example job script after this list).
  • The --cpus-per-gpu Slurm bug has been fixed.
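
For reference, here is a minimal job script targeting the new viz partition once it is in place. This is a sketch rather than an ARC-provided template; the account name, CPU count, and module are placeholders:

    #!/bin/bash
    #SBATCH --job-name=viz-example
    #SBATCH --partition=viz          # new partition described above
    #SBATCH --gpus=1                 # viz supports exactly one GPU per job
    #SBATCH --cpus-per-gpu=4         # usable again now that the --cpus-per-gpu bug is fixed
    #SBATCH --mem-per-gpu=16g
    #SBATCH --time=01:00:00
    #SBATCH --account=example_root   # replace with your own Slurm account

    module load cuda                 # load whatever your application needs
    nvidia-smi                       # confirm the allocated GPU is visible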

HPC Maintenance Details:

New software versions are listed below, with the previous versions noted where they changed:

Red Hat 8.6 EUS

  • Kernel 4.18.0-372.75.1.el8_6.x86_64 (previously 4.18.0-372.51.1.el8_6.x86_64)
  • glibc-2.28-189.6 (unchanged)
  • ucx-1.15.0-1.59056 (OFED provided, unchanged)
  • gcc-8.5.0-10.1.el8 (unchanged)

Mlnx-ofa_kernel-modules (unchanged)

  • OFED 5.9.0.5.5.1
    • kver.4.18.0_372.51.1.el8_6

Slurm 23.02.6 (previously 23.02.5), compiled with:

  • PMIx
    • /opt/pmix/3.2.5
    • /opt/pmix/4.2.6
  • hwloc 2.2.0-3 (OS provided)
  • ucx-1.15.0-1.59056 (OFED provided)
  • slurm-libpmi
  • slurm-contribs

PMIx LD config: /opt/pmix/3.2.5/lib (unchanged)

PMIx versions available in /opt (unchanged):

  • 3.2.5
  • 4.2.6

Singularity CE (Sylabs.io) (unchanged):

  • 3.10.4
  • 3.11.1

NVIDIA driver 545.23.06 (previously 530.30.02)

Open OnDemand 3.0.3 (unchanged)

 

Storage

There is no scheduled downtime for Turbo, Locker, or Data Den.

 

Secure Enclave Service (SES)

  • SES team to add details here

Maintenance notes:

  • No downtime for ARC storage systems maintenance (Turbo, Locker, and Data Den).
  • Open OnDemand (OOD) users will need to re-login. Any existing jobs will continue to run and can be reconnected in the OOD portal.
  • Login servers will be updated, and the maintenance should not have any effect on most users. Those who are affected will be contacted directly by ARC. 
  • Copy any data and files that may be needed during maintenance to your local drive using Globus File Transfer before maintenance begins. 

Status updates and additional information

  • Status updates will be available on the ARC Twitter feed and the ITS service status page throughout the maintenance.
  • ARC will send an email to all HPC users when the maintenance has been completed. 

How can we help you?

For assistance or questions, please contact ARC at arc-support@umich.edu.

Open OnDemand Update on Great Lakes and Lighthouse May 21, 2020

By | Great Lakes, HPC, News

We are migrating Open OnDemand from version 1.4 to 1.6 on May 21, 2020, to fix a security issue. Users will not be able to use the service during the upgrade, but running jobs should continue to run, based on our testing. If you need access during this period, we recommend ending your existing job and resubmitting when the service is restored.

Lighthouse will be upgraded from 9 a.m. to 12:00 p.m.  (ITS Status Page Link)

Great Lakes will be upgraded from 1 p.m. to 5:00 p.m.  (ITS Status Page Link)

Armis2 Update : May 2020 (Increased Compute/GPU Capacity and Limits)

By | Armis2, News

ARC-TS is pleased to announce the addition of compute resources in the standard, large memory, and GPU partitions, new V100 GPUs (graphics processing units), and increased Slurm root account limits for Armis2 effective May 20, 2020. 

 

Additional Compute Capability added

ARC-TS will be adding 93 standard compute nodes, 4 large memory nodes, and 3 new GPU nodes (each with 4 NVIDIA K40x GPUs).  These nodes are the same hardware type as the existing Armis2 nodes.  We plan on migrating the new hardware on May 20, 2020.

 

New GPUs added

ARC-TS has added five nodes, each with three V100 GPUs, to the GPU partition for faster service. These are the same types of GPU nodes that are in the Great Lakes HPC cluster. Learn more about the V100 GPU.

 

What do I need to do? 

You can access the new GPUs by submitting your jobs to the Armis2 gpu partition. Refer to the Armis2 user guide, section 1.2 Getting started, Part 5 “Submit a job,” or contact arcts-support@umich.edu if you have questions. 
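
As a starting point, here is a minimal Slurm batch script for the Armis2 gpu partition. This is a sketch, not the official template from the user guide; the account name, resource amounts, and module are placeholders:

    #!/bin/bash
    #SBATCH --job-name=v100-example
    #SBATCH --partition=gpu            # Armis2 GPU partition
    #SBATCH --gres=gpu:1               # request one GPU
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16g
    #SBATCH --time=02:00:00
    #SBATCH --account=example_root     # replace with one of your Armis2 accounts

    module load cuda                   # load the software your job needs
    nvidia-smi                         # verify the GPU allocation

Submit the script with sbatch and monitor it with squeue -u $USER.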

 


How do I get help? 

Contact arcts-support@umich.edu to get help or if you have questions. 

 

Slurm default resource limits increased

ARC-TS will be raising the default Slurm resource limits (set at the per-PI/project root account level) to give each researcher up to 33% of the resources in the standard partition, and 25% of the resources in the largemem and gpu partitions, to better serve your research needs. This will happen on May 20, 2020.

 

What do I need to do? 

Review, enable, or modify limits on your Armis2 Slurm accounts. Because of the higher cpu limit, your researchers will be able to run more jobs, which could generate a larger bill. Contact arcts-support@umich.edu if you would like to modify or add any limits. 

 

What is a Slurm root account?

A per-principal investigator (PI) or per-project root account contains one or more Slurm sub-accounts, each with their own users, limits, and shortcode(s). The entire root account has limits for overall cluster and /scratch usage in addition to any limits put on the sub-accounts.
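
To see how your own root account is organized, the standard Slurm accounting tools can display the account tree and any group limits. A sketch, assuming the stock sacctmgr client; the account name is a placeholder, and the exact limit fields ARC configures may differ:

    # Show the sub-accounts and users under a root account, with group TRES limits
    sacctmgr show association tree account=example_root \
        format=Account,User,GrpTRES%40,MaxJobs

    # Show only your own associations and limits
    sacctmgr show association user=$USER format=Cluster,Account,GrpTRES%40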

 

What is the new Slurm root account limit? 

Each PI’s or project’s collection of Slurm accounts will be increased to 1,032 cores and 5,160GB of memory, and 10 GPUs, effective May 20, 2020. The Slurm root account level limit is currently set to 90 cores. We will document all of the updated limits, including large memory and GPU limits, on the Armis2 website when they go into effect.

 

How do I get help? 

Contact arcts-support@umich.edu to get help or if you have questions. 

Armis2 is available for general access

By | Armis2, HPC, Systems and Services

U-M Armis2: Now available

What is Armis2?

The Armis2 service is a HIPAA-aligned HPC platform for all University of Michigan researchers and is the successor to the current Armis cluster. It is based on the same hardware as the current Armis system but uses the Slurm resource manager rather than the Torque/Moab environment. 

If your data falls under a non-HIPAA restricted use agreement, contact arcts-support@umich.edu to discuss whether you can run your jobs on Armis2. 

Key features of Armis2 

  • 24 standard nodes using the Intel Haswell processor, each with 24 cores. More capacity will be added in the coming weeks.
  • Slurm provides the resource manager and scheduler.
  • The scratch storage system will provide high-performance temporary storage for compute. See the User Guide for quotas and the file purge policy.
  • The EDR InfiniBand network is 100Gb/s to each node.
  • Large memory nodes have 1.5TB of memory per node.
  • GPU nodes with NVIDIA K40 GPUs (4 GPUs per node).

ARC-TS will be adding more standard, large memory, and GPU nodes in the coming weeks during the transition from Armis to Armis2, as well as migrating hardware from Flux to Armis2. For more information on the technical aspects of Armis2, see the Armis2 configuration page.

When will Armis2 be available?

Armis2 is available now; you can log in at: armis2.arc-ts.umich.edu.  

Using Armis2

Armis2 has a simplified accounting structure. On Armis2, you can use the same account and simply request the resources you need, including standard, GPU, and large memory nodes.

Active accounts have been migrated from Armis to Armis2. To see which accounts you have access to, type my_accounts to view the list of accounts. See the User Guide for more information on accounts and partitions.
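
Because one account can submit to any partition, moving between standard, large memory, and GPU work is mostly a matter of the partition and resource flags. A hedged sketch using the standard/largemem/gpu partition names from this post; the account name, resource amounts, and script name are placeholders:

    # List the Slurm accounts you can charge jobs to
    my_accounts

    # Same account, different partitions
    sbatch --account=example_root --partition=standard --ntasks=1 \
           --cpus-per-task=8 --mem=32g --time=04:00:00 job.sh

    sbatch --account=example_root --partition=largemem --ntasks=1 \
           --mem=500g --time=04:00:00 job.sh

    sbatch --account=example_root --partition=gpu --gres=gpu:1 \
           --cpus-per-task=4 --mem=32g --time=04:00:00 job.sh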

Armis2 rates

Armis2 uses a “pay only for what you use” model. We will be sharing rates shortly. Send us an email at hpc-support@umich.edu if you have any questions. 

Previously, Armis was in tech-preview mode and you were not billed for the service. Use of Armis2 is currently free, but beginning on December 2, 2019, all jobs run on Armis2 will be subject to applicable rates.

View the Armis2 rates page for more information. 

How does this change impact me?

All migrations from Armis to Armis2 must be completed by November 25, 2019, as Armis will not run any jobs beyond that date. View the Armis2 HPC timeline.

The primary difference between Armis2 and Armis is the resource manager. You will have to update your job submission scripts to work with Slurm; see the Armis2 User Guide for details on how to do this. Additionally, you’ll need to migrate any data and software from Armis to Armis2.
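
To illustrate the kind of change involved, here is a rough side-by-side of a simple Torque/PBS resource request and its approximate Slurm equivalent. The directive names are standard for the two schedulers, but the account, queue, and resource values are placeholders; consult the Armis2 User Guide for the exact options ARC recommends:

    # Old Torque/Moab directives (Armis):
    #PBS -N myjob
    #PBS -l nodes=1:ppn=4,mem=8gb,walltime=02:00:00
    #PBS -A example_account
    #PBS -q example_queue

    # Approximate Slurm equivalent (Armis2):
    #SBATCH --job-name=myjob
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=4
    #SBATCH --mem=8g
    #SBATCH --time=02:00:00
    #SBATCH --account=example_account
    #SBATCH --partition=standard

Jobs are then submitted with sbatch rather than qsub and monitored with squeue rather than qstat.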

How do I learn how to use Armis2?

To learn how to use Slurm and for a list of policies, view the documentation on the Armis2 website.

Additionally, ARC-TS and academic unit support teams will be offering training sessions around campus. As information becomes available, there will be a schedule on the ARC-TS website as well as Twitter and email.

 

Great Lakes Update: August 2019

By | General Interest, Great Lakes, Happenings, HPC, News

Great Lakes cluster is available for general access

What is the current status of the Great Lakes cluster?

Now that we have completed Early User testing, the Great Lakes cluster is available for general access to the University community. Until the migration from Flux is complete on November 25, 2019, there will be no charge for using the Great Lakes cluster.

Noteworthy Features

  • The Great Lakes cluster compute nodes use the new Intel Skylake processor. In particular, the Skylake CPUs on the standard and large memory compute nodes will provide researchers more consistent performance, regardless of how many other jobs are on the machine. 
  • The Great Lakes cluster has 20 GPU nodes, each containing two NVidia V100 GPUs, which are significantly faster than the K20 and K40 GPUs on Flux.
  • The HDR100 InfiniBand network will provide consistent 100Gb/s performance across all nodes. On Flux, this ranged from 40-100Gb/s, depending on the node your job used.
  • The high performance GPFS /scratch system, with a capacity of approximately two petabytes, is significantly faster than /scratch on Flux. 
  • The Torque-based batch job submission environment has been replaced with the Slurm resource manager. We expect this system to be significantly more responsive and quicker at starting jobs than was the case on Flux.
  • For web-based job submission, the Open OnDemand system will replace the ARC Connect environment for providing web based file access, job submission, remote desktop, graphical Matlab, Jupyter Notebooks, and more. For more information, see the web-based access section in our user guide. 

How do I get access?

Every Flux user has a login on the Great Lakes cluster; you should be able to log in via ssh to greatlakes.arc-ts.umich.edu. We have created Slurm accounts for each PI or project based on the current Flux accounts. You can see what Slurm accounts you have access to by running the command `my_accounts`.
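
For example (replace uniqname with your own U-M login):

    # Log in to the Great Lakes cluster
    ssh uniqname@greatlakes.arc-ts.umich.edu

    # List the Slurm accounts created for your PI or project
    my_accounts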

Additionally, you can access the Great Lakes cluster via the web through our Open OnDemand portal. Here you can submit jobs, see submitted jobs, create Jupyter Notebooks and more. Please see the Great Lakes Cluster User Guide for more information.

Where do I read more about the Great Lakes cluster and how to use it?

The current documentation for the Great Lakes cluster, including configuration, user guides, and known issues can be found at https://arc-ts.umich.edu/greatlakes.

There is a schedule for upcoming training sessions on the CSCAR website, and we will communicate new sessions through Twitter and email.

Software

Almost all of the software packages available on Flux have been recompiled on the Great Lakes cluster for improved performance anticipated from the Intel SkyLake architecture. In most cases, the latest software version available is being provided. If you need older versions or need additional packages, let us know via email at arcts-support@umich.edu.

We have also reorganized the software module structure to make it easier to find packages you want to load as well as automatically loading prerequisites. To search for packages, use the “module spider” command along with the name of the package or keywords. In many cases we combined similar packages into “Collections” such as Chemistry and BioInformatics. The command “module load Chemistry” will make any Chemistry package available to you and packages in the Chemistry collection will then be discoverable via the “module available” command. After loading a specific collection, you must then load any individual packages within that collection that you would like to use.
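
Putting those commands together, a typical search-and-load sequence might look like the following; the package name is only an example, and the exact modules available on the cluster may differ:

    # Search for a package by name or keyword
    module spider gromacs

    # Load a collection, then a package inside it
    module load Chemistry
    module avail              # packages in the Chemistry collection are now discoverable
    module load gromacs       # add /version to pin a specific version

    # Review everything currently loaded
    module list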

What are the rates? 

We are working with ITS and UM Finance on approved service rates. Current plans are to have proposed rates¹ identified by the end of August. As soon as this information is more concrete, we will provide an update on the Great Lakes cluster website and in our email communication. We understand that this information is necessary for planning purposes and apologize for any impact this has had on your budget planning. 

What can be shared at this time is the new approach to billing that will be used for the Great Lakes cluster. Unlike Flux, there are no monthly allocations with fixed fees regardless of whether they are used or not. On the Great Lakes cluster, the monthly charge for an account will be calculated based on the resources used by jobs each month. The cost calculation for each job will be based on the amount and type of resources the job reserves and how long the job runs. This should be a significantly more flexible system and won’t require updating allocations as your computing needs change over time.
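
As a purely illustrative example of that calculation (the rate here is hypothetical, since rates had not yet been approved): a job that reserves 4 cores for 10 hours consumes 40 core-hours, so at an assumed $0.01 per core-hour it would add roughly $0.40 to that month's charge for the account.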

¹ Rates are not considered final until they have been formally approved by OFA.

Flux to the Great Lakes cluster transition efforts

If you have not already, you should be developing a plan to migrate your work from Flux to the Great Lakes cluster.  If you need help in developing a plan, please contact us and we can provide assistance during this migration period. 

  • ARC-TS and academic unit support teams will be offering training sessions around campus. A schedule of training sessions will be posted on the ARC-TS website, and new sessions will also be announced through Twitter and email.
  • To assist your transition, if you have any Turbo or MiStorage NFS mounts on Flux, those mounts will also be available on the Great Lakes cluster. If you would prefer not to have those volumes mounted on the Great Lakes cluster, email us at arcts-support@umich.edu.

Ensure that your migration from Flux to the Great Lakes cluster is completed by November 25, 2019. No jobs on Flux will run after November 25, 2019.

Additional Information

We will be adding new capabilities in the coming weeks and months and will continue to communicate these capabilities by email as they become available. If you have any questions, email us at arcts-support@umich.edu.

Modular Data Center Electrical Work

By | Flux, Systems and Services, Uncategorized

[Update 2019-05-17] The MDC electrical work was completed successfully, and Flux has been returned to full production.

 

The Modular Data Center (MDC), which houses Flux, Flux Hadoop, and other HPC resources, has an electrical issue that requires us to bring power usage below 50% for some racks in order to resolve the problem. To do this, we have placed reservations on some of the nodes to reduce the power draw so the issue can be fixed by ITS Data Centers. Once we hit the target power level and the issue is resolved, we will remove the reservations and return Flux and Flux Hadoop to full production.