# Running a PyTorch task

## Requirements
It is recommended that users complete Getting started with Kubernetes and Requesting persistent volumes with Kubernetes before proceeding with this tutorial.
## Overview

In the following lesson, we'll build a convolutional neural network (CNN) and train it using the EIDF GPU Service.
The model is taken from the PyTorch Tutorials.
The lesson will be split into three parts:
- Requesting a persistent volume and transferring code/data to it
- Creating a pod with a PyTorch container downloaded from DockerHub
- Submitting a job to the EIDF GPU Service and retrieving the results
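The training script downloaded later in this lesson follows the PyTorch "Training a Classifier" tutorial. As a rough sketch only (the exact layers in `example_pytorch_code.py` may differ), the tutorial's CNN looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """CNN for 32x32 RGB images (e.g. CIFAR-10), as in the PyTorch tutorial."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)    # 3 input channels -> 6 feature maps
        self.pool = nn.MaxPool2d(2, 2)     # halves the spatial dimensions
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)       # 10 output classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)            # flatten all dims except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

net = Net()
out = net(torch.rand(1, 3, 32, 32))  # one random image through the network
```

Running this locally first is a quick way to confirm the model definition before submitting anything to the cluster.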
## Load training data and ML code into a persistent volume

### Create a persistent volume

Request storage from the Ceph server by submitting a PVC to Kubernetes (an example PVC specification is given below).

```bash
kubectl -n <project-namespace> create -f <pvc-spec-yaml>
```
### Example PyTorch PersistentVolumeClaim

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pytorch-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: csi-rbd-sc
```
### Transfer code/data to persistent volume

1. Check that the PVC has been created

    ```bash
    kubectl -n <project-namespace> get pvc <pvc-name>
    ```

1. Create a lightweight job with a pod that has the PV mounted (example job below)

    ```bash
    kubectl -n <project-namespace> create -f lightweight-pod-job.yaml
    ```

1. Download the PyTorch code

    ```bash
    wget https://github.com/EPCCed/eidf-docs/raw/main/docs/services/gpuservice/training/resources/example_pytorch_code.py
    ```

1. Copy the Python script into the PV

    ```bash
    kubectl -n <project-namespace> cp example_pytorch_code.py lightweight-job-<identifier>:/mnt/ceph_rbd/
    ```

1. Check that the files were transferred successfully

    ```bash
    kubectl -n <project-namespace> exec lightweight-job-<identifier> -- ls /mnt/ceph_rbd
    ```

1. Delete the lightweight job

    ```bash
    kubectl -n <project-namespace> delete job lightweight-job-<identifier>
    ```
Example lightweight job specification
apiVersion: batch/v1
kind: Job
metadata:
name: lightweight-job
labels:
kueue.x-k8s.io/queue-name: <project namespace>-user-queue
spec:
completions: 1
template:
metadata:
name: lightweight-pod
spec:
containers:
- name: data-loader
image: busybox
args: ["sleep", "infinity"]
resources:
requests:
cpu: 1
memory: '1Gi'
limits:
cpu: 1
memory: '1Gi'
volumeMounts:
- mountPath: /mnt/ceph_rbd
name: volume
restartPolicy: Never
volumes:
- name: volume
persistentVolumeClaim:
claimName: pytorch-pvc
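The pod created by this job gets a generated name (`lightweight-job-<identifier>`), which the `cp` and `exec` commands above need. Assuming the standard `job-name` label that Kubernetes attaches to job pods, one way to look it up is:

```bash
kubectl -n <project-namespace> get pods -l job-name=lightweight-job \
    -o jsonpath='{.items[0].metadata.name}'
```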
## Creating a Job with a PyTorch container

We will use the pre-made PyTorch Docker image available on Docker Hub to run the PyTorch ML model.
The PyTorch container will be held within a pod that has the persistent volume mounted and has access to a MIG GPU.
Submit the specification file below to Kubernetes to create the job, replacing the queue name with your project namespace queue name.

```bash
kubectl -n <project-namespace> create -f <pytorch-job-yaml>
```
### Example PyTorch Job Specification File

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-job
  labels:
    kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
spec:
  completions: 1
  template:
    metadata:
      name: pytorch-pod
    spec:
      restartPolicy: Never
      containers:
        - name: pytorch-con
          image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
          command: ["python3"]
          args: ["/mnt/ceph_rbd/example_pytorch_code.py"]
          volumeMounts:
            - mountPath: /mnt/ceph_rbd
              name: volume
          resources:
            requests:
              cpu: 2
              memory: "1Gi"
            limits:
              cpu: 4
              memory: "4Gi"
              nvidia.com/gpu: 1
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: pytorch-pvc
```
## Reviewing the results of the PyTorch model

This is not intended to be an introduction to PyTorch; please see the online tutorial for details about the model.

1. Check that the model ran to completion

    ```bash
    kubectl -n <project-namespace> logs <pytorch-pod-name>
    ```

1. Spin up a lightweight pod to retrieve the results

    ```bash
    kubectl -n <project-namespace> create -f lightweight-pod-job.yaml
    ```

1. Copy the trained model back to your access VM

    ```bash
    kubectl -n <project-namespace> cp lightweight-job-<identifier>:mnt/ceph_rbd/model.pth model.pth
    ```
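Once `model.pth` is on the access VM, it can be reloaded with `torch.load`. A minimal sketch of the save/load round trip, assuming the training script saved the model's `state_dict` (the usual pattern in the PyTorch tutorials); the tiny `nn.Linear` here is a stand-in for the trained CNN:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model; in practice this must be the same architecture that was trained.
model = nn.Linear(4, 2)

# Save the state_dict, as torch.save(model.state_dict(), "model.pth") would.
path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), path)

# Reload into a fresh instance of the same architecture.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()  # switch to inference mode

# The restored weights match the originals exactly.
same = torch.equal(model.weight, restored.weight)
```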
## Using a Kubernetes job to train the PyTorch model multiple times

A common ML training workflow may consist of training multiple iterations of a model, such as models with different hyperparameters or models trained on multiple different data sets.
A Kubernetes job can create and manage multiple pods with identical or different initial parameters.
NVIDIA provide a detailed tutorial on how to conduct an ML hyperparameter search with a Kubernetes job.
Below is an example job yaml for running the PyTorch model which will continue to create pods until three have successfully completed the task of training the model.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-job
  labels:
    kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
spec:
  completions: 3
  template:
    metadata:
      name: pytorch-pod
    spec:
      restartPolicy: Never
      containers:
        - name: pytorch-con
          image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
          command: ["python3"]
          args: ["/mnt/ceph_rbd/example_pytorch_code.py"]
          volumeMounts:
            - mountPath: /mnt/ceph_rbd
              name: volume
          resources:
            requests:
              cpu: 2
              memory: "1Gi"
            limits:
              cpu: 4
              memory: "4Gi"
              nvidia.com/gpu: 1
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: pytorch-pvc
```
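The three completions above all run the same script with the same parameters. One way to vary the behaviour per pod, sketched here as an assumption rather than part of this service's documented workflow, is an Indexed job (Kubernetes 1.21+): each pod automatically receives a `JOB_COMPLETION_INDEX` environment variable, which the training script could read to select a hyperparameter set.

```yaml
# Fragment: fields to change in the job spec above (not a complete job).
spec:
  completions: 3
  parallelism: 3            # optional: run the pods concurrently, if your
                            # queue's GPU quota allows three at once
  completionMode: Indexed   # pods receive completion indices 0, 1, 2
  # Each pod's containers get the environment variable JOB_COMPLETION_INDEX,
  # which the script could read, e.g. int(os.environ["JOB_COMPLETION_INDEX"]),
  # to pick a learning rate, data set, etc.
```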
## Clean up

```bash
kubectl -n <project-namespace> delete job pytorch-job
kubectl -n <project-namespace> delete pvc pytorch-pvc
```