Running R on HPC clusters

Author

Marie-Hélène Burle

This section will show you how to use R once you have logged in to a remote cluster via SSH.

Log in to our temporary training cluster

As mentioned previously, once you have developed your code in an interactive fashion, running scripts allows you to request large hardware resources only when you need them (i.e. when your code is executed).

This prevents these heavy resources from sitting idle when not in use (for instance while you think or type code), saving you money on commercial clusters or waiting time on Alliance clusters.

So let’s log in to the temporary training cluster.

Loading modules

Once on the cluster, we need to make the software we will use available by loading the corresponding modules.

C compiler

To compile R packages, we need a C compiler.

In theory, one could use the proprietary Intel compiler, which is loaded by default on the Alliance clusters, but it is recommended to replace it with the GCC compiler: while R packages can be compiled by any C compiler (including Clang and LLVM), the default GCC compiler is the best way to avoid headaches.

It is thus much simpler to always load a gcc module before loading an r module.

To see which gcc versions are available on a cluster, run:

module spider gcc

To see the dependencies of a particular version (e.g. gcc/11.3.0), run:

module spider gcc/11.3.0

This shows us that we need StdEnv/2020 to load gcc/11.3.0.

R module

Now, of course, we also need an R module.

As before, to see which versions of R are available on a cluster, run:

module spider r

To see the dependencies of a particular version (e.g. r/4.2.2), run:

module spider r/4.2.2

Load the modules

Now that we know which modules we need, let's load them. The order is important: the dependencies (here StdEnv/2020) must be listed before the modules that depend on them.

module load StdEnv/2020 gcc/11.3.0 r/4.2.2
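
To check which modules are currently loaded, run:

module list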

Installing R packages

For this course, all packages have already been installed in a communal library. You thus don’t have to install anything.

To install a package, launch the interactive R console with:

R

In the R console, run:

install.packages("<package_name>", repos="<url-cran-mirror>")
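
For instance, to install the package glue (an arbitrarily chosen example) from the main CRAN mirror:

install.packages("glue", repos = "https://cloud.r-project.org/")

You can list the library paths where R looks for installed packages by running .libPaths() in the R console.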

The first time you install a package, R will ask you whether you want to create a personal library in your home directory. Answer yes to both questions. Your packages will then be installed under ~/.

Some packages require additional modules to be loaded before they can be installed; others need additional R packages as dependencies. In either case, you will get explicit error messages. Adding the argument dependencies = TRUE helps in the second case, but you will still have to add some packages manually from time to time.
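
As an illustration (the module and package names here are only examples; the error messages will tell you what is actually missing in your case), installing a geospatial package could look like this:

module load gdal

then, in the R console:

install.packages("sf", dependencies = TRUE, repos = "https://cloud.r-project.org/")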

To leave the R console, press <Ctrl+D>.
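
Alternatively, you can quit from within R by running:

q(save = "no")   # quit without saving the workspace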

Running R jobs

There are two types of jobs that can be launched on an Alliance cluster: interactive jobs and batch jobs. We will practice both and discuss their respective merits and when to use which.

For this course, I purposefully built a rather small cluster (10 nodes with 4 CPUs and 30 GB of RAM each) to give a tangible illustration of the constraints of resource sharing.

Interactive jobs

While it is fine to run R on the login node to install packages, you must start a SLURM job before running any heavy computation.

To run R interactively, you should launch an salloc session.

Example:

salloc --time=1:10:00 --mem-per-cpu=3700M --ntasks=8
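
This example requests 8 tasks and 3700M of memory per CPU for a maximum walltime of 1 hour and 10 minutes.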

This takes you to a compute node where you can now launch R to run computations:

R

This, however, leads to the same inefficient use of resources as running an RStudio server: all the resources you requested remain blocked for you while your job is running, whether you are making use of them (running heavy computations) or not (thinking, typing code, or running computations that use only a fraction of the requested resources).

Interactive jobs are thus best reserved for developing code.

Scripts

To run an R script called <your_script>.R, you first need to write a job script:

Example:

<your_job>.sh
#!/bin/bash
#SBATCH --account=def-<your_account>   # your Alliance account
#SBATCH --time=15                      # maximum walltime (in minutes)
#SBATCH --mem-per-cpu=3000M            # memory per CPU
#SBATCH --cpus-per-task=4              # number of CPUs
#SBATCH --job-name="<your_job>"        # name of the job in the queue
module load StdEnv/2020 gcc/11.3.0 r/4.2.2
Rscript <your_script>.R   # Note that R scripts are run with the command `Rscript`

Then launch your job with:

sbatch <your_job>.sh

You can monitor your job with sq (an alias for squeue -u $USER $@).

Batch jobs are the best approach for running parallel computations, particularly when they require a lot of hardware: they will save you a lot of waiting time (on Alliance clusters) or money (on commercial clusters).
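
As a concrete (and entirely hypothetical) illustration, a parallel <your_script>.R matching the job script above could contain something like:

library(parallel)                             # part of base R

# Use the number of CPUs allocated by Slurm (falling back to 1 if the
# SLURM_CPUS_PER_TASK environment variable is not set)
ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))

cl <- makeCluster(ncores)                     # create a cluster of workers
res <- parLapply(cl, 1:100, function(x) x^2)  # placeholder computation
stopCluster(cl)                               # release the workers

print(sum(unlist(res)))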