Slurm partitions represent collections of nodes for a computational purpose, and are equivalent to Torque queues. For more Armis2 hardware specifications, see the Configuration page.
- debug: The goal of debug is to allow users to run jobs quickly for debugging purposes.
  - Max walltime: 4 hours
  - Max jobs per user: 1
  - Higher scheduling priority
- standard: Standard compute nodes used for most work.
  - Max walltime: 14 days
  - Default partition if none specified
- gpu: Allows use of NVIDIA Tesla V100 GPUs.
  - Max walltime: 14 days
- largemem: Allows use of a compute node with 1.5 TB of RAM.
  - Max walltime: 14 days
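As a quick illustration of the partition limits above, the following minimal sketch checks a requested walltime against the listed maximums before a job is submitted. The function name `fits_partition` and the table layout are hypothetical and only mirror the values in this list; the scheduler itself remains the authority on what it will accept.

```python
from datetime import timedelta

# Max walltime per partition, copied from the list above.
PARTITION_MAX_WALLTIME = {
    "debug": timedelta(hours=4),
    "standard": timedelta(days=14),
    "gpu": timedelta(days=14),
    "largemem": timedelta(days=14),
}
DEFAULT_PARTITION = "standard"  # used when no partition is specified


def fits_partition(walltime, partition=None):
    """Return True if the requested walltime is within the partition's limit."""
    name = partition or DEFAULT_PARTITION
    if name not in PARTITION_MAX_WALLTIME:
        raise ValueError(f"unknown partition: {name}")
    return walltime <= PARTITION_MAX_WALLTIME[name]


print(fits_partition(timedelta(hours=6), "debug"))  # False: over the 4-hour debug limit
print(fits_partition(timedelta(hours=6)))           # True: falls back to standard
```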
In order to facilitate fairness between accounts, we have set resource limits on each Armis2 root account which are described here.
Limits can be set on a Slurm association or on a Slurm account. This allows a PI to limit individual users or the collective set of users in an account as the PI sees fit. The following values can be used to limit either an account or user association, unless noted otherwise below:
Current Armis2 partition limits:
- Maximum number of jobs allowed to run at one time
  - Account example: testaccount can have 10 simultaneously running jobs (testuser1 has 8 running jobs and testuser2 has 2 running jobs for a total of 10 running jobs)
  - Association example: testuser can have 2 simultaneously running jobs
- Maximum duration of a job
  - Account example: all users on testaccount can run jobs for up to 3 days
  - Association example: testuser’s jobs can run up to 3 days
- MaxTRES (CPU, Memory, GPU or billing units)
  - Maximum number of TRES the running jobs can simultaneously use
  - NOTE: CPU, Memory, and GPU can also be limited on a user’s individual job
  - Account example: testaccount’s running jobs can collectively use up to 5 GPUs (testuser1’s jobs are using 3 GPUs and testuser2’s jobs are using 2 GPUs for a total of 5 GPUs)
  - Association example: testuser’s running jobs can collectively use up to 10 cores
  - Job example: testuser can run a single job using up to 10 cores
- GrpTRESMins (billing units)
  - The total number of TRES minutes that can be used by past, present, and future jobs. This is primarily used for setting spending limits.
  - Account example: all users on testaccount share a spending limit of $1000
  - Association example: testuser has a spending limit of $1000
- GrpTRESRunMins
  - The total number of TRES minutes used by all running jobs. This takes into consideration the time limit of running jobs. If the limit is reached, no new jobs are started until other jobs finish.
  - Account example: all users on testaccount share a pool of 1000 CPU minutes for running jobs (users have 10 serial jobs each with 100 minutes remaining to completion)
  - Association example: testuser can have up to 100 CPU minutes of running jobs (1 job with 100 CPU minutes remaining, 2 with 50 minutes remaining, etc.)
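The running-jobs pool in the last limit above (known in Slurm as GrpTRESRunMins) is easiest to see as arithmetic: each running job counts its cores multiplied by the minutes remaining on its time limit. The sketch below reproduces the account example under that reading. The helper names are hypothetical, and the exact comparison the scheduler applies at the boundary is an assumption.

```python
def running_cpu_minutes(jobs):
    """Sum cores * remaining minutes over running jobs.

    Each job is a (cores, remaining_minutes) pair; remaining time is
    measured against the job's time limit, not its actual progress.
    """
    return sum(cores * remaining for cores, remaining in jobs)


def can_start(jobs, new_job, limit):
    """Return True if starting new_job keeps the pool within the limit.

    Assumption: a job that exactly fills the pool is still allowed.
    """
    cores, time_limit = new_job
    return running_cpu_minutes(jobs) + cores * time_limit <= limit


# Account example from above: 10 serial jobs, 100 minutes remaining each.
running = [(1, 100)] * 10
print(running_cpu_minutes(running))            # 1000
print(can_start(running, (1, 30), 1000))       # False: the pool is full
print(can_start(running[:9], (1, 100), 1000))  # True: 900 + 100 fits
```

Because the pool is measured against time limits, requesting a shorter, more accurate walltime leaves more room for other jobs on the same account.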
Periodic Spending Limits
The PI has the ability to set a monthly or yearly (fiscal year) spending limit on a Slurm account. Spending limits will be updated at the beginning of each month. As an example, if the testaccount account has a monthly spending limit of $1000 and this is used up on January 22nd, jobs will be unable to run until February 1st when the limit will reset with another $1000 to spend.
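To make the monthly reset concrete, here is a minimal sketch that computes when an exhausted monthly limit would become available again, assuming the reset happens on the first day of the following month as in the example above. The function name and the year are hypothetical.

```python
from datetime import date


def next_monthly_reset(today):
    """Return the first day of the month after `today`."""
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


exhausted_on = date(2024, 1, 22)  # hypothetical year for the January 22nd example
reset_on = next_monthly_reset(exhausted_on)
print(reset_on)                        # 2024-02-01
print((reset_on - exhausted_on).days)  # 10 days without running jobs
```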
Please contact ARC if you would like to implement any of these limits.
ARC operates our HPC clusters to the best of our abilities, but there can be events, both within and outside of our control, which may cause interruptions to your jobs. You are responsible for due diligence around your use of the ARC HPC resources and taking measures to maximize your research. These actions may include:
- Backing up data to permanent storage locations
- Checkpointing your code to minimize impacts from job interruptions
- Error checking in your scripts
- Understanding the operation of the system and the user guide for the HPC cluster, including per job charges which may be greater than expected
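As one illustration of the checkpointing item above, the following minimal sketch saves progress periodically and resumes from the last saved state after an interruption. It is a generic pattern, not an ARC-provided tool: the function names, the JSON state layout, and the placeholder workload are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The temp file lives in the same directory, so os.replace can swap it
    in as a single rename and an interruption mid-write leaves the
    previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, every=100):
    """Resume from the checkpoint at `path` and run up to n_steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        state["total"] += step  # placeholder for real work
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run` again with the same path after a job is interrupted continues from the last saved step instead of starting over. The checkpoint interval is a trade-off between lost work and I/O overhead.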
Refunds, if any, are at the discretion of ARC and will only be enacted for system-wide preventable issues. This does not include hardware failures, power failures, job failures, or similar issues.
ARMIS2 TERMS OF USAGE
- This service is for sensitive data only. Be advised that you should not move sensitive data off of this system, unless it is to another service or machine that has been approved for hosting the same types of sensitive data.
- Limited data restoration. The data in your home directory can be restored from snapshots going back 3 days. Anything beyond 3 days cannot be retrieved. Data stored outside your home directory, such as on a group share, is subject to the data-lifetime policies that were set up when the respective Turbo NFS volume was purchased. You are responsible for mitigating your own risk. We suggest you store copies of hard-to-reproduce data in your home directory or on HIPAA-aligned storage you own or purchased from Turbo.
- System usage is tracked and is used for billing reports and capacity planning. Job metadata (for example, walltime, resource utilization, and software accessed) is stored and used to generate usage reports and to analyze patterns and trends. ARC may report this metadata, including your individual metadata, to your adviser, department head, dean, or other administrator or supervisor for billing or capacity planning purposes.
- Maintaining the overall stability of the system is paramount to us. While we make every effort to ensure that every job completes in the most efficient and accurate way possible, the good of the whole is more important to us than the good of an individual. This may affect you, but mostly we hope it benefits you. System availability is based on our best efforts. We are staffed to provide support during normal business hours. We try very hard to provide support as broadly as possible, but cannot guarantee support on a 24 hour per day basis. Additionally, we perform system maintenance on a periodic basis, driven by the availability of software updates, staffing availability, and input from the user community. We do our best to schedule around your needs, but there will be times when the system is unavailable. We will announce scheduled outages at least one month in advance on the ARC home page; we will announce unscheduled outages on that same page as quickly as we can, with as much detail as we have. You can also follow ARC on Twitter at @umichARC.
- Armis2 is intended only for non-commercial, academic research and instruction. Commercial use of some of the software on Armis2 is prohibited by software licensing terms. Prohibited uses include product development or validation, software use supporting any service for which a fee is charged, and, in some cases, research involving proprietary data that will not be made available publicly, regardless of whether the research is published. Please contact email@example.com if you have any questions about this policy, or about whether your work may violate these terms.
- Data subject to export control and HIPAA regulations may be stored or processed on the cluster. The appropriate storage solution for storing export controlled information or PHI that can be accessed on the Armis2 cluster is the Turbo-NFSv4 with Kerberos offering (see the Sensitive Data Restrictions for Turbo-NFSv4 with Kerberos for further details). It is your responsibility, not ARC’s, to be aware of and comply with all applicable laws, regulations, and university policies (e.g., ITAR, EAR, HIPAA) as part of any research activity that may raise compliance issues under those laws. For assistance with export controlled research, contact the U-M Export Control Officer at firstname.lastname@example.org. For assistance with HIPAA-related computational research, contact the ARC liaison to the Medical School at email@example.com.
Users should make requests by email to firstname.lastname@example.org:
- One day in advance, request users be added to Armis2 accounts you may administer. All users need approval to be added to an account on Armis2 before they can have a user login created on the cluster.
Users are responsible for security and compliance related to sensitive code and/or data. Security and compliance are shared responsibilities. If you process or store sensitive university data, software, or libraries on the cluster, you are responsible for understanding and adhering to any relevant legal, regulatory or contractual requirements.
Users are responsible for maintaining MCommunity groups used for MReport authorizations.
Users must manage PHI (protected health information) appropriately and can use the following locations:
- /home (80 GB quota)
- /scratch (more information below)
- Any appropriate PHI-compliant NFS volume mounted on Armis2
SCRATCH STORAGE POLICIES
Every user has a /scratch directory for every Slurm account they are a member of. Each account also has a shared data directory for collaboration among its members. Group ownership of the account directory is set using the Slurm account-based UNIX groups, so all files created in the /scratch directory are accessible by any group member.
There is a 10 TB quota on /scratch per root account (a PI or project account), which is shared between child accounts (individual users).
If you are in need of more scratch space for your account please email us at email@example.com. Please note that these requests need to come from an administrator on the account and should include an explanation of why the increase is required.
Users should keep in mind that scratch has an auto-purge policy: any file that has not been accessed for 60 days is automatically deleted by the system. Scratch file systems are not backed up. Critical files should be backed up to another location.
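To see which files are approaching the purge window, a short script can walk a directory and report files whose last access time is old. The sketch below is one way to do this, assuming the purge is based on the file access time (`st_atime`); how the system actually measures access is not stated here, and the function name and the 50-day warning threshold are hypothetical.

```python
import os
import time

PURGE_AFTER_DAYS = 60  # from the policy above


def files_at_risk(root, warn_days=PURGE_AFTER_DAYS - 10, now=None):
    """Yield (path, idle_days) for files not accessed in warn_days or more."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            idle_days = (now - atime) / 86400
            if idle_days >= warn_days:
                yield path, idle_days
```

Pass your own scratch directory as `root` and copy anything important that it reports to permanent storage. Short walks like this are reasonable on a login node, but a scan of a very large directory tree is better run as a job.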
LOGIN NODE POLICIES
Appropriate uses for the login nodes:
- Transferring small files to and from the cluster
- Creating, modifying, and compiling code and submission scripts
- Submitting and monitoring the status of jobs
- Testing executables to ensure they will run on the cluster and its infrastructure. Processes are limited to a maximum of 15 minutes of CPU time to prevent runaway processes and overuse.
Any other use of the login nodes may result in termination of the offending process. Any production processes (including post-processing) should be submitted through the batch system to the cluster. If interactive use is required, submit an interactive job to the cluster.