Great Lakes Update: March 2019

By March 20, 2019 May 12th, 2020 Flux, General Interest, Great Lakes, Happenings, HPC, News

ARC-TS previously shared much of this information through the December 2018 ARC Newsletter and on the ARC-TS website. We have added some additional details surrounding the timeline for Great Lakes as well as for users who would like to participate in Early User testing.

What is Great Lakes?

The Great Lakes service is a next generation HPC platform for University of Michigan researchers, which will provide several performance advantages compared to Flux. Great Lakes is built around the latest Intel CPU architecture called Skylake and will have standard, large memory, visualization, and GPU-accelerated nodes.  For more information on the technical aspects of Great Lakes, please see the Great Lakes configuration page.

Key Features:

  • Approximately 13,000 Intel Skylake Gold processors providing AVX512 capability providing over 1.5 TFlop of performance per node
  • 2 PB scratch storage system providing approximately 80 GB/s performance (compared to 8 GB/s on Flux)
  • New InfiniBand network with improved architecture and 100 Gb/s to each node
  • Each compute node will have significantly faster I/O via SSD-accelerated storage
  • Large Memory Nodes with 1.5 TB memory per node
  • GPU Nodes with NVidia Volta V100 GPUs (2 GPUs per node)
  • Visualization Nodes with Tesla P40 GPUs

Great Lakes will be using Slurm as the resource manager and scheduler, which will replace Torque and Moab on Flux. This will be the most immediate difference between the two clusters and will require some work on your part to transition from Flux to Great Lakes.

Another significant change is that we are making Great Lakes easier to use through a simplified accounting structure.  Unlike Flux where you need an account for each resource, on Great Lakes you can use the same account and simply request the resources you need, from GPUs to large memory.

There will be two primary ways to get access to compute time: 1) the on-demand model, which adds up the account’s job charges (reserved resources multiplied by the time used) and is billed monthly, similar to Flux On-Demand and 2) node purchases.  In the node purchase model, you will own the hardware which will reside in Great Lakes through the life of the cluster. You will receive an equivalent credit which you can use anywhere on the cluster, including on GPU and large memory nodes. We believe this will be preferable to buying actual hardware in the FOE model, as your daily computational usage can increase and decrease as your research requires. Send us an email at arcts-support@umich.edu if you have any questions or are interested in purchasing hardware on Great Lakes.

When will Great Lakes be available?

The ARC-TS team will prepare the cluster in April 2019 for an Early User period beginning in May, which will continue for approximately 4 weeks to ensure sufficient time to address any issues. General availability of Great Lakes should occur in June 2019.  We have a timeline for the Great Lakes project which will have more detail.

How does this impact me? Why Great Lakes?

After being the primary HPC cluster for the University for 8 years, Flux will be retired in September 2019.  Once Great Lakes becomes available to the University community, we will provide a few months to transition from Flux to Great Lakes.  Flux will be retired after that period due to aging hardware as well as expiring service contracts and licenses. We highly recommend preparing to migrate as early as possible so your research will not be interrupted.  Later in this email, we have suggestions for what you can do to make this migration process as easy as possible.

When Great Lakes becomes generally available to the University community, we will no longer be accepting new Flux accounts or allocations.  All new work should be focused on Great Lakes.

You can see the HPC timeline, including Great Lakes, Beta and Flux, here.

What is the current status of Great Lakes?

Today, the Great Lakes HPC compute hardware and high-performance Storage System has been fully installed and configured. In parallel with this work, the ARC-TS and Unit Support team members have been readying the new service with new software, modules as well as developing training to support the transition onto Great Lakes. A key feature of the new Great Lakes service is the just released HDR InfiniBand from Mellanox. Today, the hardware is installed but the firmware is still in its final stages of testing with the supplier with a target delivery of of mid-April 2019. Given the delays, ARC-TS and the suppliers have discussed an adjusted plan that allows quicker access to the cluster while supporting the future update once the firmware becomes available.

We are working with ITS Finance to define rates for Great Lakes.  We will update the Great Lakes documentation when we have final rates and let everyone know in subsequent communications.

What should I do to transition to Great Lakes?

We hope the transition from Flux to Great Lakes will be relatively straightforward, but to minimize disruptions to your research, we recommend you do your testing early.  In October 2018, we announced availability of the HPC cluster Beta in order to help users with this migration. Primarily, it allows users to migrate their PBS/Torque job submission scripts to Slurm.  You can and should also see the new Modules environments, as they have changed from their current configuration on Flux. Beta is using the same generation of hardware as Flux, so your performance will be similar to that on Flux. You should continue to use Flux for your production work; Beta is only to help test your Slurm job scripts and not for any production work.

Every user on Flux has an account on Beta.  You can login into Beta at beta.arc-ts.umich.edu.  You will have a new home directory on Beta, so you will need to migrate any scripts and data files you need to test your workloads into this new directory.  Beta should not be used for any PHI, HIPAA, Export Controlled, or any sensitive data!  We highly recommend that you use this time to convert your Torque scripts to Slurm and test that everything works as you would expect it to.  

To learn how to use Slurm, we have provided documentation on our Beta website.  Additionally, ARC-TS and academic unit support teams will be offering training sessions around campus. We will have a schedule on the ARC-TS website as well as communicate new sessions through Twitter and email.

If you have compiled software for use on Flux, we highly recommend that you recompile on Great Lakes once it becomes available.  Great Lakes is using the latest CPUs from Intel and by recompiling, your code may get performance gains by taking advantage of new capabilities on the new CPUs.

Questions? Need Assistance?

Contact arcts-support@umich.edu.