(My way of) Running Jobs on Alvis
This post describes how to run jobs on Alvis, the HPC cluster at Chalmers University of Technology, and how to run Jupyter Notebooks from a container.
There are two main schools of thought on how to run jobs on Alvis:
- one using the module system and virtual environments;
- the other using containers.
I recommend using virtual environments and, when possible, restricting yourself to pip-installed packages, as this makes it easier to share code and environments with others, for example via requirements.txt files (pip freeze > requirements.txt to create one, and pip install -r requirements.txt to install the packages from it).
Using containers allows for better local reproducibility and isolation of dependencies, but it can be more complex to set up initially and harder to share with others. I recommend using containers when the code to run has complex dependencies that are hard to install via pip or conda, for example non-Python libraries or packages that require specific system configurations.
Using the Module System and Virtual Environments
To create and use a virtual environment on Alvis with Python 3.11, first ssh into Alvis and log in, then go to the directory where you want to create the virtual environment and run:
module load Python/3.11.3-GCCcore-12.3.0
module load PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
# NOTE: If you try to run the following command without first loading the
# modules above, Alvis will complain that python3 is not even an available
# command.
python3 -m venv my-env-name
The environment will be created at my-env-name. To install packages, run the following:
source path/to/my/environment/my-env-name/bin/activate
pip install --upgrade pip
pip install jupyter
# pip install <what you need>, for example:
pip install bitsandbytes
pip install optuna
pip install h5py
pip install rdkit
pip install pynvml tqdm jsonargparse nltk rouge_score evaluate
pip install pandas tensorboard tabulate scikit-learn
pip install datasets
pip install tokenizers
pip install accelerate
pip install huggingface_hub
pip install trl
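The individual install commands above can also be collected in a requirements.txt file, which makes the environment easier to rebuild and share. The sketch below shows one possible way to do that; the helper name write_requirements and the package list are purely illustrative and not part of pip.

```shell
#!/bin/bash
# Hypothetical helper (not part of pip): write a sorted, deduplicated
# requirements file from a list of package names.
write_requirements() {
    local out_file="$1"
    shift
    printf '%s\n' "$@" | sort -u > "$out_file"
}

# Illustrative package list; the repeated "optuna" is dropped by `sort -u`.
req_file="$(mktemp)"
write_requirements "$req_file" optuna h5py rdkit optuna tqdm
cat "$req_file"

# Inside the activated environment one would then run:
#   pip install -r "$req_file"
```

In practice the file would be named requirements.txt and kept next to the code, so that a single pip install -r call replaces the list of separate commands.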
Once everything is ready, one can “wrap” the above setup steps into a bash script, for example named setup_environment.sh, containing:
#!/bin/bash
# Load necessary modules
module load Python/3.11.3-GCCcore-12.3.0
module load PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
# Source the virtual environment
source path/to/my/environment/my-env-name/bin/activate
The script can be run via the command:
source setup_environment.sh
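One must use the source command; making the script executable and running it is not enough, because an executed script runs in a child shell, and the loaded modules and the activated environment disappear when that shell exits.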
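The difference between sourcing and executing can be seen with a small self-contained experiment. In the sketch below, demo_setup.sh and the variable DEMO_ENV_READY are made-up stand-ins for setup_environment.sh and for whatever the module and activate commands change in the shell.

```shell
#!/bin/bash
# Demo: why a setup script must be sourced rather than executed.
# demo_setup.sh is a hypothetical stand-in for setup_environment.sh.
tmp_dir="$(mktemp -d)"
cat > "$tmp_dir/demo_setup.sh" <<'EOF'
export DEMO_ENV_READY=yes
EOF

unset DEMO_ENV_READY

# Executing the script runs it in a child shell; its exports vanish on exit.
bash "$tmp_dir/demo_setup.sh"
after_exec="${DEMO_ENV_READY:-unset}"

# Sourcing runs the same lines in the current shell, so the export persists.
source "$tmp_dir/demo_setup.sh"
after_source="${DEMO_ENV_READY:-unset}"

echo "after executing: $after_exec"
echo "after sourcing:  $after_source"
rm -rf "$tmp_dir"
```

The first line printed reports the variable as unset and the second reports it as yes, which mirrors what happens to the loaded modules and the virtual environment.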
Running Jobs via Slurm
Setting up the modules and virtual environment, i.e., source setup_environment.sh, can be placed directly in sbatch script files, for example:
#!/bin/bash
#SBATCH --account=my-account-number
#SBATCH --partition=alvis
#SBATCH --gpus-per-node=T4:1
#SBATCH --job-name=my-awesome-job
#SBATCH --time=2:00:00
source setup_environment.sh
python my_python_script.py
deactivate
Alternatively, in a less concise way:
#!/bin/bash
#SBATCH --account=my-account-number
#SBATCH --partition=alvis
#SBATCH --gpus-per-node=T4:1
#SBATCH --job-name=my-awesome-job
#SBATCH --time=2:00:00
# Load Python modules
module load Python/3.11.3-GCCcore-12.3.0
module load PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
# Activate environment
source path/to/my/environment/my-env-name/bin/activate
# Run my stuff
python my_python_script.py
# Deactivate the environment
deactivate
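Once such a script is saved to a file, it is submitted with sbatch. The following is a minimal sketch of a wrapper that does so; the function name submit_job and the file name my_job.sh are assumptions made for illustration, while sbatch --parsable and squeue are standard Slurm commands.

```shell
#!/bin/bash
# Hypothetical helper: submit a job script and print the resulting job ID.
submit_job() {
    local job_script="$1"
    if [ ! -f "$job_script" ]; then
        echo "Job script not found: $job_script" >&2
        return 1
    fi
    if ! command -v sbatch > /dev/null 2>&1; then
        echo "sbatch not found; this must be run on a cluster login node." >&2
        return 2
    fi
    # --parsable makes sbatch print only the job ID
    # (optionally followed by ";<cluster name>").
    sbatch --parsable "$job_script"
}

# Example usage on a login node (my_job.sh is a placeholder name):
#   job_id="$(submit_job my_job.sh)" && squeue -j "${job_id%%;*}"
```

Checking the queue right after submission is a quick way to see whether the job is pending or already running.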
Using Jupyter Notebooks
To run Jupyter Notebooks on Alvis using the module system and virtual environments, first create a virtual environment as described above, install Jupyter via pip install jupyter, and then run:
source setup_environment.sh
jupyter notebook --no-browser
Copy the link printed by Jupyter Notebook and open it in your local browser to access the notebooks.
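Depending on the network setup, the local browser may not be able to reach the Jupyter server directly, in which case an SSH tunnel is one common workaround. The sketch below only prints the ssh command to run on the local machine. The helper name tunnel_command, the user name, and the login host name are assumptions to be replaced with your own values; 8888 is the default Jupyter port.

```shell
#!/bin/bash
# Hypothetical helper: print the ssh command that forwards a local port to a
# Jupyter server listening on the same port on the remote login node.
tunnel_command() {
    local user="$1"
    local host="$2"
    local port="${3:-8888}"   # 8888 is Jupyter's default port
    # -N: do not run a remote command; -L: forward local port to remote port
    echo "ssh -N -L ${port}:localhost:${port} ${user}@${host}"
}

# Placeholder user and host; run the printed command on your local machine.
tunnel_command "my-username" "alvis1.c3se.chalmers.se"
```

With the tunnel open, the 127.0.0.1 variant of the link printed by Jupyter can be pasted into the local browser.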
The Containers Approach
The idea is to have a container that contains all the dependencies needed for running the code, and use that to develop code and run jobs on Alvis.
Very roughly speaking, a container is like a virtual machine: I can open “its terminal” and run commands inside it, but the code in it can still access and interact with the host filesystem (i.e., the files on Alvis).
Alvis uses Apptainer for containerization. Apptainer is the “manager” running the containers, but one can have different container files, each built in a different way (e.g., with different pip-installed packages).
Create the Container
To create a container, Alvis already provides some starting points, which are available in /apps/containers/Conda/. For example, one can use the miniconda-22.11.1.sif file as a base image. Below is an example of the contents of a recipe file (container_file.def) used to create a “container file”:
Bootstrap: localimage
From: /apps/containers/Conda/miniconda-22.11.1.sif
# NOTE: The line above can be modified to use a different base image if needed,
# for example: "From <my/local/folders/container/file>.sif"
%post
# Put all your installation commands here, all the following commands will
# be executed when building the container. None of them are required, it's
# just an example of how to install packages and libraries.
apt-get -y update
apt-get -y install git-lfs
/opt/conda/bin/conda install -y pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
/opt/conda/bin/conda install -y -c huggingface transformers tokenizers datasets
/opt/conda/bin/conda install -y -c conda-forge accelerate pandas evaluate tensorboard tabulate scikit-learn
pip install pynvml tqdm jsonargparse nltk rouge_score
conda install -y -c conda-forge tabulate
conda install -y -c conda-forge scikit-learn
conda install -y -c anaconda scikit-learn
pip install rdkit
pip install -U "huggingface_hub[cli]"
conda remove transformers -y
pip install transformers
conda remove datasets -y
pip install datasets
pip install evaluate
pip install peft
[...]
To build the container, run the following command:
apptainer build <path/to/destination/container/file>.sif <path/to/recipe/file>.def
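After the build, it can be worth checking that the image actually contains what the recipe was supposed to install. The sketch below is one possible smoke test, assuming the recipe installed PyTorch as in the example above; the function name check_container is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: run a quick import test inside a freshly built image.
check_container() {
    local image="$1"
    if [ ! -f "$image" ]; then
        echo "Image not found: $image" >&2
        return 1
    fi
    if ! command -v apptainer > /dev/null 2>&1; then
        echo "apptainer not found on this machine." >&2
        return 2
    fi
    # Assumes the recipe installed PyTorch; adapt the import to your packages.
    apptainer exec "$image" python -c "import torch; print(torch.__version__)"
}

# Example usage (the path is a placeholder):
#   check_container ~/containers/container_file.sif
```

If the import fails, it is usually faster to fix the recipe and rebuild than to debug the problem later inside a queued job.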
Using Jupyter Notebooks on Alvis
The following Bash script (saved as, for example, run_jupyter_in_apptainer.sh) runs a Jupyter Notebook from a container:
#!/bin/bash
# Check that the image path (and optionally a work directory) is provided
if [ "$#" -lt 1 ] || [ "$#" -gt 2 ]; then
    echo "Usage: $0 <path-to-apptainer-image> [path-to-work-directory]"
    exit 1
fi
# Absolute path to the Apptainer image, resolved before any "cd" below so
# that a relative image path keeps working
IMAGE_PATH="$(realpath "$1")"
# Optional path to the work directory
if [ -n "$2" ]; then
    WORK_DIR="$2"
    # Check if directory exists
    if [ ! -d "$WORK_DIR" ]; then
        echo "The specified directory does not exist: $WORK_DIR"
        exit 1
    fi
    # Change to the specified directory
    cd "$WORK_DIR" || exit 1
fi
# Check if Apptainer is installed
if ! command -v apptainer &> /dev/null; then
    echo "Apptainer is not installed. Please install it to continue."
    exit 1
fi
# Run the Apptainer image with Jupyter Notebook
apptainer exec "$IMAGE_PATH" jupyter notebook --no-browser --ip=0.0.0.0 --port=8889 --allow-root
Example of usage: ./run_jupyter_in_apptainer.sh ./containers/container_file.sif
It will output something like the following:
[ribes@alvis2-02 ~]$ ./run_jupyter_in_apptainer.sh ./containers/container_file.sif
[I 2025-03-27 09:21:33.125 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2025-03-27 09:21:33.129 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2025-03-27 09:21:33.133 ServerApp] jupyterlab | extension was successfully linked.
[I 2025-03-27 09:21:33.137 ServerApp] notebook | extension was successfully linked.
[I 2025-03-27 09:21:33.448 ServerApp] notebook_shim | extension was successfully linked.
[I 2025-03-27 09:21:33.469 ServerApp] notebook_shim | extension was successfully loaded.
[I 2025-03-27 09:21:33.471 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2025-03-27 09:21:33.472 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2025-03-27 09:21:33.474 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.10/site-packages/jupyterlab
[I 2025-03-27 09:21:33.474 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 2025-03-27 09:21:33.474 LabApp] Extension Manager is 'pypi'.
[I 2025-03-27 09:21:33.483 ServerApp] jupyterlab | extension was successfully loaded.
[I 2025-03-27 09:21:33.486 ServerApp] notebook | extension was successfully loaded.
[I 2025-03-27 09:21:33.487 ServerApp] Serving notebooks from local directory: /cephyr/users/ribes/Alvis
[I 2025-03-27 09:21:33.487 ServerApp] Jupyter Server 2.14.2 is running at:
[I 2025-03-27 09:21:33.487 ServerApp] http://alvis2-02:8889/tree?token=d263b3a10fa6558d1f673a646128a5d847948f6972df27d8
[I 2025-03-27 09:21:33.487 ServerApp] http://127.0.0.1:8889/tree?token=d263b3a10fa6558d1f673a646128a5d847948f6972df27d8
[I 2025-03-27 09:21:33.487 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2025-03-27 09:21:33.501 ServerApp]
To access the server, open this file in a browser:
file:///cephyr/users/ribes/Alvis/.local/share/jupyter/runtime/jpserver-65940-open.html
Or copy and paste one of these URLs:
http://alvis2-02:8889/tree?token=d263b3a10fa6558d1f673a646128a5d847948f6972df27d8
http://127.0.0.1:8889/tree?token=d263b3a10fa6558d1f673a646128a5d847948f6972df27d8
[I 2025-03-27 09:21:33.517 ServerApp] Skipped non-installed server(s): bash-language-server, dockerfile-language-server-nodejs, javascript-typescript-langserver, jedi-language-server, julia-language-server, pyright, python-language-server, python-lsp-server, r-languageserver, sql-language-server, texlab, typescript-language-server, unified-language-server, vscode-css-languageserver-bin, vscode-html-languageserver-bin, vscode-json-languageserver-bin, yaml-language-server
In VSCode, with a notebook open, click the Select kernel button in the top-right corner to connect to the already running Jupyter server by pasting the link printed when the server started. In practice, look in the output for a link like this one: http://alvis2-02:8889/tree?token=d263b3a10fa6558d1f673a646128a5d847948f6972df27d8 (it’s in the example above).
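Finding the link in a long log can be tedious, so it can also be extracted with grep. The sketch below assumes the log format shown in the example output above; the function name extract_jupyter_url is hypothetical, and the sample log lines are copied from that output.

```shell
#!/bin/bash
# Hypothetical helper: print the first tokenized Jupyter URL found on stdin.
extract_jupyter_url() {
    grep -o 'http://[^[:space:]]*token=[0-9a-f]*' | head -n 1
}

# Two lines taken from the example output above.
sample_log='[I 2025-03-27 09:21:33.487 ServerApp] http://alvis2-02:8889/tree?token=d263b3a10fa6558d1f673a646128a5d847948f6972df27d8
[I 2025-03-27 09:21:33.487 ServerApp]     http://127.0.0.1:8889/tree?token=d263b3a10fa6558d1f673a646128a5d847948f6972df27d8'

printf '%s\n' "$sample_log" | extract_jupyter_url
```

To use it on a live server, one option is to save the server output to a file (for example by appending 2>&1 | tee jupyter.log to the launch command, since Jupyter usually logs to stderr) and then run extract_jupyter_url < jupyter.log from a second terminal.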
To get a Jupyter notebook with GPU support, create a session on Alvis OnDemand, then open a terminal there, and finally follow the same steps as above.
Running Jobs via Slurm
Example of sbatch script (CPU-only):
#!/usr/bin/env bash
#SBATCH -A NAISS2024-5-630 -p alvis
#SBATCH -N 1
#SBATCH -C NOGPU
#SBATCH --cpus-per-task=32
#SBATCH -t 0-2:00:00
#SBATCH -J "slurm-score-predictions"
#SBATCH -o slurm-score-predictions.log
cd $SLURM_SUBMIT_DIR
echo "Running score_predictions.py"
export PYTHONPATH=$PYTHONPATH:/my/local/dir/containing/code
apptainer exec ~/containers/container_file.sif python $SLURM_SUBMIT_DIR/scripts/score_predictions.py --num_proc=32 --skip_if_log_exists
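One detail of the export line in the script above: when PYTHONPATH is initially unset, appending with a colon leaves an empty leading entry, which Python may treat as the current directory. The sketch below shows a slightly more defensive way to extend the variable; the function name append_pythonpath and the second directory are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical helper: append a directory to PYTHONPATH without leaving an
# empty entry when PYTHONPATH was previously unset or empty.
append_pythonpath() {
    local dir="$1"
    # ${VAR:+text} expands to "text" only if VAR is set and non-empty.
    export PYTHONPATH="${PYTHONPATH:+${PYTHONPATH}:}${dir}"
}

unset PYTHONPATH
append_pythonpath "/my/local/dir/containing/code"
first_value="$PYTHONPATH"
append_pythonpath "/another/dir"
second_value="$PYTHONPATH"

echo "$first_value"
echo "$second_value"
```

The first line printed contains only the single directory, and the second contains both directories separated by one colon.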
Example of sbatch script with GPU:
#!/usr/bin/env bash
#SBATCH -A NAISS2024-5-630 -p alvis
#SBATCH -N 1 --gpus-per-node=T4:1
#SBATCH -t 0-2:00:00
#SBATCH -J "slurm-predict-MyModel"
# Output log file
#SBATCH -o slurm-predict-MyModel.log
echo "Starting at `date`"
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running $SLURM_NTASKS tasks."
echo "Job id is $SLURM_JOBID"
echo "Job submission directory is: $SLURM_SUBMIT_DIR"
cd $SLURM_SUBMIT_DIR
local_dir=my/local/dir/containing/code
# echo "-------------------------------------------------------------------------"
# apptainer exec ~/containers/container_file.sif accelerate env
# echo "-------------------------------------------------------------------------"
export PYTHONPATH=$PYTHONPATH:my/local/dir/containing/code/
apptainer exec ~/containers/container_file.sif python ${local_dir}/scripts/collect_llm_predictions.py \
--model_name=MyModel \
--batch_size=32 \
--num_proc=16 \
--eval_gen_strategies=true