How distributed jobs work

Distributed jobs, in this case independent distributed jobs, are regular PBS jobs that Matlab submits on your behalf. Each iteration of the loop is submitted as a separate PBS job, to which the usual rules apply, just as if you had submitted it from outside Matlab. Distributed jobs should not be assumed to run in real time as part of the job that submits them.
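In outline, the pattern looks like this; myFunction and the loop bounds are placeholders, not part of the spectral processing example:

% a minimal sketch of the pattern; myFunction and 1:10 are placeholders
c = parcluster('flux');                 % cluster object using the flux profile
j = createJob(c);                       % a job is a container for tasks
for k = 1:10
    createTask(j, @myFunction, 1, {k}); % each task becomes one PBS job
end
submit(j);                              % hand the tasks off to PBS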

The spectral processing example

We will use the spec function from the spectral processing example to demonstrate how this can be done. The complete file can be found on Flux at

/scratch/data/examples/matlab/ovarian/spec_distributed.m

For this example, we begin by setting variables to contain the file names and pre-creating some data structures to hold the final results. The first command from spec_distributed.m that pertains to the distributed jobs is

% create a cluster object to use to submit PBS jobs
myCluster = parcluster('flux')

% use myCluster.SubmitArguments to set the qsub arguments, which can be put
% into a struct for easier housekeeping

qsub.account   = ' -A example_flux'
qsub.q         = ' -q flux'
qsub.qos       = ' -l qos=flux'
qsub.walltime  = ' -l walltime=15:00'
qsub.joe       = ' -j oe'
qsub.env       = ' -V'
qsub.mail      = [ ' -M ', getenv('USER'), '@umich.edu -m n' ]

myCluster.SubmitArguments = [qsub.account, qsub.q,  qsub.qos,... 
     qsub.walltime, qsub.joe, qsub.env, qsub.mail ]

The parcluster command creates a Matlab cluster object, which keeps track of the attributes of jobs submitted to the cluster. For distributed jobs, the correct parallel profile is flux, which sets things up so that jobs will be submitted properly.

We also need to set many of the parameters that you would normally include in your PBS script so that Matlab includes them when it submits the jobs. See our web page on Torque and PBS options for information about each item. We have chosen here to name them to match the PBS options as closely as possible. Once they have all been set, they are added to the myCluster object as its SubmitArguments.
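If your tasks need additional resources, you can append more PBS options to the same string. For example, a per-process memory request could be added like this (the option value is only an illustration, not part of the example file):

% hypothetical extra resource request; adjust the value to your needs
qsub.mem = ' -l pmem=4gb'

myCluster.SubmitArguments = [qsub.account, qsub.q, qsub.qos,...
     qsub.walltime, qsub.joe, qsub.env, qsub.mail, qsub.mem ]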

Matlab stores information about jobs in the directory given by the JobStorageLocation property of the cluster object. In the example file, we show how to specify a different location, using a subdirectory of the default location. You should make sure that the directory exists prior to setting it.

mkdir( fullfile(myCluster.JobStorageLocation, 'ovarian'));
myCluster.JobStorageLocation = fullfile(myCluster.JobStorageLocation, 'ovarian');
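Note that mkdir warns if the directory already exists, so if you rerun the script you may prefer a guarded version of the two lines above (a small variation, not from the example file):

% create the subdirectory only if it is not already there
jobDir = fullfile(myCluster.JobStorageLocation, 'ovarian');
if ~exist(jobDir, 'dir')
    mkdir(jobDir);
end
myCluster.JobStorageLocation = jobDir;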

The last of the “housekeeping” details is to create a job, which is a collection of tasks to be performed.

myJob = createJob(myCluster);

The heart of things is this loop,

for k = 1:N
   data_file = [ data_repository files{k} ];
   createTask(myJob, @spec, 1, {data_file, k});
end

which creates one task for each value of the loop index. Each task is assigned to myJob and is a call to the spec function, which returns one value and takes two arguments, passed to it via the cell array {data_file, k}.
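For this to work, the signature of spec must match the task definition: two input arguments and one output. Assuming it does, its declaration would look something like the sketch below (the body is a placeholder, not the actual contents of the example's spec function):

% interface sketch only; the real processing happens in the example's spec.m
function result = spec(data_file, k)
    % ... read data_file, process spectrum k ...
    result = zeros(100, 1);   % placeholder for one column of output values
end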

Finally, once the tasks are created, the job is submitted using

submit(myJob);

Once the job has been submitted, each task will show up as a single PBS job, and you can see them with

$ qstat -u $USER

If you have allowed sufficient walltime for the Matlab session that submits the jobs, you can use myJob.wait, and Matlab will wait until all the jobs have completed before returning control to the session that submitted them. Alternatively, you can quit Matlab and monitor the progress of your jobs with qstat.
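For example, you can block until everything finishes, or check on the job without blocking:

% block until all tasks have finished (requires enough walltime in this session)
wait(myJob);

% or poll the job's state without blocking
myJob.State      % e.g., 'queued', 'running', or 'finished'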

If you did not stop Matlab, then you can collect the results with

myResults = myJob.fetchOutputs;

which will read each task’s output into an element of the myResults cell array, and those can be extracted with something like

for k = 1:N
    Y(:,k) = myResults{k};
end

where, in this case, the output from each task becomes a column of the output matrix Y.
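Because fetchOutputs returns one cell per task, an equivalent one-liner is possible, assuming every task returns a column of the same length:

% concatenate all task outputs side by side into a single matrix
Y = horzcat(myResults{:});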

Finally, you should always be tidy and, once you have collected all the output from all the tasks, delete the job with

myJob.delete

Collecting job results after exiting the Matlab session

If you did end your Matlab session, you need to restart Matlab, recreate the cluster object, and, if you set a custom storage location, reset that as well

myCluster = parcluster('flux')
myCluster.JobStorageLocation = fullfile(myCluster.JobStorageLocation, 'ovarian');

You can then use

% Re-establish the job definition
myJob = myCluster.findJob;
% Collect the results into an output structure
myResults = myJob.fetchOutputs;

as was used above when we waited for the jobs to complete.
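If more than one job is present in the storage location, findJob returns an array of jobs; you can narrow the search by property, for example by the numeric ID you noted at submission (the ID shown is only an illustration):

% list every job stored in this JobStorageLocation
allJobs = myCluster.findJob;

% or select a particular job by its ID (ID 1 is a hypothetical value)
myJob = findJob(myCluster, 'ID', 1);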