NFS latency and MATLAB directories
In a cluster environment, your home directory and other data directories are on a shared filesystem that is accessed via the network. Updates to files on that filesystem do not always propagate quickly enough to all of the nodes that use them; this is a latency problem. MATLAB uses files to store information about its configuration and about parallel jobs, and there are a couple of error conditions that this latency can trigger.
Depending on the nature of the parallel session, some configuration options can be used to ameliorate the situation; they are detailed below. There are two main classes of MATLAB parallel jobs on the cluster: single-machine jobs and multi-machine jobs. The preferences are changed using a combination of environment variables that are set before running MATLAB (and which MATLAB inherits) and commands that modify MATLAB settings from within your MATLAB scripts.
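For example, the preference directory is controlled by an environment variable that MATLAB reads at startup, so a single line in your job script is enough to redirect it (the exact value to use is developed in the next section):

export MATLAB_PREFDIR=$HOME/matlabdata/$SLURM_JOBID

The job storage location, on the other hand, is changed from within your MATLAB script, as shown in the later sections.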
In what follows, we assume that your job will use one CPU per task, i.e., you will have
#SBATCH --cpus-per-task=1
in your job script; this is also the default on ARC clusters. If you change that, you will need to adjust the code below that sets the value of NP from the environment.
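As a sketch of one possible adjustment, if you request more than one CPU per task and want one worker per allocated CPU inside a Slurm job (an assumption; you may instead want one worker per task with multiple threads each), you could compute NP like this. Note that SLURM_CPUS_PER_TASK is only set by Slurm when --cpus-per-task is given.

% Sketch: one worker per allocated CPU instead of one per task
nTasks      = str2double(getenv('SLURM_NTASKS'));
cpusPerTask = str2double(getenv('SLURM_CPUS_PER_TASK'));
if isnan(cpusPerTask)
    cpusPerTask = 1;   % --cpus-per-task was not set, so Slurm's default of 1 applies
end
NP = nTasks * cpusPerTask;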
Changing the preferences directory
The first thing you can change is the preferences directory. By default, MATLAB will use a directory called .matlab under your home directory to store preferences for the MATLAB session.
In the cluster environment, however, instances of MATLAB may be running in many jobs at the same time, and preference information written by one job could overwrite information written by another.
To change this behavior, you can create a unique folder for the MATLAB preferences for each job, then provide that location to MATLAB at startup. This is done in your Slurm job script with the following commands.
# Create parent matlab data directory
mkdir -p $HOME/matlabdata

# If the directory doesn't exist, exit with large error code
test -d $HOME/matlabdata || exit 999

# Check whether we are in a Slurm job and set preference directory
# to either something random or the JobID
if [ "$SLURM_JOBID" == "" ] ; then
    export MATLAB_PREFDIR=$(mktemp -d $HOME/matlabdata/matlab-prefs-XXXXXXXX)
else
    mkdir $HOME/matlabdata/$SLURM_JOBID
    export MATLAB_PREFDIR=$HOME/matlabdata/$SLURM_JOBID
fi

# Finish by setting the MATLAB_CLUSTER_WORKDIR to the MATLAB_PREFDIR
export MATLAB_CLUSTER_WORKDIR=$MATLAB_PREFDIR
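If you want to confirm that MATLAB picked up the new location, you can print the preference directory from within your MATLAB script; it should match the MATLAB_PREFDIR exported above. This is a quick check only and is not required for the setup to work.

% Display the preference directory MATLAB is actually using
disp(prefdir)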
Changing the job storage location within MATLAB
Using network storage
The safest way to do this is to create the shared location in your Slurm job script before starting MATLAB, run MATLAB using it, and then remove it when MATLAB has finished. This can be done with
# Create the per-job storage directory on the shared filesystem
mkdir ${HOME}/matlabdata/${SLURM_JOBID}
# Pause briefly so the new directory has time to appear on all nodes
sleep 5
matlab -nodisplay -r my_script
# Clean up the per-job directory once MATLAB has finished
rm -rf ${HOME}/matlabdata/${SLURM_JOBID}
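If you want the per-job directory removed even when MATLAB exits with an error, a bash trap is one alternative to the final rm (a sketch; keep the path identical to the one created above):

# Remove the per-job directory on exit, whether MATLAB succeeds or fails
trap 'rm -rf ${HOME}/matlabdata/${SLURM_JOBID}' EXIT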
Then, in your MATLAB script (my_script.m in the job script above), create the pool like this
% This assumes a multinode job

% Set the value for the job storage location
JSL = fullfile(getenv('HOME'), 'matlabdata', getenv('SLURM_JOBID'))

% If not inside a Slurm job, use 4 processors; otherwise use all
% the processors assigned to the Slurm job
if isempty(getenv('SLURM_NTASKS'))
    NP = 4;
else
    NP = str2double(getenv('SLURM_NTASKS'));
end

% Initialize ARCTS cluster 'current' profile
setupUmichClusters

% Create the cluster object, set the job storage location, start pool
myCluster = parcluster('current')
myCluster.JobStorageLocation = JSL
myPool = parpool(myCluster, NP);

[ . . . . Put your MATLAB code here . . . . ]

delete(myPool);
exit
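As an illustration of what might replace the placeholder above, here is a minimal parfor loop that runs its iterations on the workers of the pool just created; it is purely an example, so substitute your own computation.

% Example workload: square the integers 1..100 in parallel on the pool
results = zeros(1, 100);
parfor i = 1:100
    results(i) = i^2;
end
disp(sum(results))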
Using local disk
If you are using the local profile and staying on one physical node, then you may see a performance increase by using local disk for the shared files. In that case, use
##### Request free space on /tmp for your job data; using 10gb
##### as an example
#SBATCH --tmp=10g

mkdir -p /tmp/${USER}/${SLURM_JOBID}
matlab -nodisplay -r my_script
rm -rf /tmp/${USER}/${SLURM_JOBID}
and this for your MATLAB code
% This assumes a local job

% Set the value for the job storage location
JSL = fullfile('/tmp', getenv('USER'), getenv('SLURM_JOBID'))

% If not inside a Slurm job, use 4 processors; otherwise use all
% the processors assigned to the Slurm job
if isempty(getenv('SLURM_NTASKS'))
    NP = 4;
else
    NP = str2double(getenv('SLURM_NTASKS'));
end

% Create the cluster object, set the job storage location, start pool
myCluster = parcluster('local')
myCluster.JobStorageLocation = JSL
myPool = parpool(myCluster, NP);

[ . . . . Put your MATLAB code here . . . . ]

delete(myPool);
exit
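Either variant assumes the job storage location already exists when parpool is called; if the mkdir in the job script is skipped, the pool will fail to start. A small defensive check before creating the pool (an optional sketch) makes that failure explicit:

% Fail early with a clear message if the job storage location is missing
if ~isfolder(JSL)
    error('Job storage location %s does not exist; check the job script.', JSL);
end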