Cerebras CS-2
Introduction
The Cerebras CS-2 Wafer-Scale Cluster (WSC) uses the Ultra2 system as its host. Ultra2 provides login services, access to files, the SLURM batch system, etc.
Connecting to the cluster
To gain access to the CS-2 WSC you need to log in to the host system, Ultra2. See the documentation for Ultra2.
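For example, assuming SSH key authentication is already set up, a login from your local machine looks like the following, where the username and the Ultra2 hostname are placeholders to be taken from the Ultra2 documentation:
ssh username@<ultra2-hostname>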
Running Jobs
All jobs must be run via SLURM to avoid inconveniencing other users of the system. An example job is shown below.
SLURM example
This is based on the sample job from the Cerebras documentation: Execute your job.
#!/bin/bash
#SBATCH --job-name=Example # Job name
#SBATCH --cpus-per-task=2 # Request 2 cores
#SBATCH --output=example_%j.log # Standard output and error log
#SBATCH --time=01:00:00 # Set time limit for this job to 1 hour
#SBATCH --gres=cs:1 # Request CS-2 system
source venv_cerebras_pt/bin/activate
python run.py \
       CSX \
       --params params.yaml \
       --num_csx=1 \
       --model_dir model_dir \
       --mode {train,eval,eval_all,train_and_eval} \
       --mount_dirs {paths to modelzoo and to data} \
       --python_paths {paths to modelzoo and other python code if used}
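Assuming the script above has been saved as example.sbatch (the filename is illustrative), it can be submitted and monitored with the standard SLURM commands:
sbatch example.sbatch   # submit the job to the queue
squeue -u $USER         # check the state of your jobs
scancel <jobid>         # cancel a queued or running job if required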
See the 'Troubleshooting' section below for known issues.
Creating an environment
To run a job on the cluster, you must first create a Python virtual environment (venv) and install the dependencies. The Cerebras documentation contains generic instructions for this (see the Cerebras setup environment docs); however, our host system is slightly different, so we recommend the following:
Create the venv
python3.8 -m venv venv_cerebras_pt
Install the dependencies
source venv_cerebras_pt/bin/activate
pip install --upgrade pip
pip install cerebras_pytorch==2.2.1
Validate the setup
source venv_cerebras_pt/bin/activate
cerebras_install_check
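If you will be running models from the Cerebras Model Zoo, which provides the run.py script and params.yaml files referred to above, you will also need a copy of it on the host. A minimal sketch, assuming you have git access to GitHub and check out the release matching the installed cerebras_pytorch version (2.2.1 here):
git clone https://github.com/Cerebras/modelzoo.git
cd modelzoo
git checkout <release tag matching cerebras_pytorch 2.2.1>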
Paths, PYTHONPATH and mount_dirs
There can be some confusion over the correct use of the parameters supplied to the run.py script. Cerebras provides a helpful page, Python, paths and mount directories, which explains these parameters and how they should be used.
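As an illustration only, with hypothetical paths that should be replaced by your own Model Zoo checkout and data directory, a training run might be invoked as:
source venv_cerebras_pt/bin/activate
python run.py \
       CSX \
       --params params.yaml \
       --num_csx=1 \
       --model_dir model_dir \
       --mode train \
       --mount_dirs /home/username/modelzoo /home/username/data \
       --python_paths /home/username/modelzoo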