# Template workflow
## Requirements
It is recommended that users complete Getting started with Kubernetes and Requesting persistent volumes with Kubernetes before proceeding with this tutorial.
## Overview
An example workflow for code development using K8s is outlined below.
In theory, users can create Docker images with all the code, software and data included to complete their analysis.
In practice, Docker images with the required software can be several gigabytes in size, which can lead to unacceptable download times when ~100 GB of data and code is then added.
Therefore, it is recommended to separate code, software, and data preparation into distinct steps:
- Data loading: loading large data sets asynchronously.
- Developing a Docker environment: manually or automatically building Docker images.
- Code development with K8s: iteratively changing and testing code in a job.
The workflow describes different strategies for tackling these three common stages of code development and analysis using the EIDF GPU Service.
The three stages can be approached in any order and may not all be relevant to every project.
Some strategies in the workflow require a GitHub account and a Docker Hub account for automatic building (this can be adapted for other platforms such as GitLab).
## Data loading
The EIDF GPU Service contains GPUs with 40 GB/80 GB of on-board memory, and it is expected that data sets of over 100 GB will be loaded onto the service to utilise this hardware.
Persistent volume claims need to be of sufficient size to hold the input data, any expected output data, and a small amount of additional empty space to facilitate IO.
Read the requesting persistent volumes with Kubernetes lesson to learn how to request and mount persistent volumes to pods.
It often takes several hours or days to download data sets of 0.5 TB or more to a persistent volume.
Therefore, the data download step needs to be completed asynchronously, as maintaining a connection to the server for long periods of time can be unreliable.
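For reference, a minimal sketch of a PVC manifest is shown below. The `storageClassName` and the 600 GB size are illustrative assumptions, not service-specific values; see the persistent volumes lesson for the correct values for your project.

```yaml
# Illustrative PVC sketch - the storageClassName and storage size are
# assumptions; confirm the correct values for your project.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: template-workflow-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 600Gi
  storageClassName: csi-rbd-sc
```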
### Asynchronous data downloading with a lightweight job
- Check a PVC has been created.

    ```bash
    kubectl -n <project-namespace> get pvc template-workflow-pvc
    ```
- Write a job yaml with the PV mounted and a command to download the data. Change the curl URL to your data set of interest.

    ```yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: lightweight-job
      labels:
        kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
      completions: 1
      parallelism: 1
      template:
        metadata:
          name: lightweight-job
        spec:
          restartPolicy: Never
          containers:
            - name: data-loader
              image: alpine/curl:latest
              command: ['sh', '-c', "cd /mnt/ceph_rbd; curl https://archive.ics.uci.edu/static/public/53/iris.zip -o iris.zip"]
              resources:
                requests:
                  cpu: 1
                  memory: "1Gi"
                limits:
                  cpu: 1
                  memory: "1Gi"
              volumeMounts:
                - mountPath: /mnt/ceph_rbd
                  name: volume
          volumes:
            - name: volume
              persistentVolumeClaim:
                claimName: template-workflow-pvc
    ```
- Run the data download job.

    ```bash
    kubectl -n <project-namespace> create -f lightweight-pod.yaml
    ```
- Check if the download has completed.

    ```bash
    kubectl -n <project-namespace> get jobs
    ```
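    While the job is running, any curl output can also be inspected via the pod logs; the `job/` prefix selects the pod created by the job:

    ```bash
    kubectl -n <project-namespace> logs job/lightweight-job
    ```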
- Delete the lightweight job once completed.

    ```bash
    kubectl -n <project-namespace> delete job lightweight-job
    ```
### Asynchronous data downloading within a screen session
Screen is a window manager available in Linux that allows you to create multiple interactive shells and swap between them.
Screen has the added benefit that if your remote session is interrupted, the screen session persists and can be reattached when you manage to reconnect.
This allows you to start a task, such as downloading a data set, and check in on it asynchronously.
Once you have started a screen session, you can create a new window with `ctrl-a c`, swap between windows with `ctrl-a 0-9` and exit screen (but keep any task running) with `ctrl-a d`.
Using screen rather than a single download job can be helpful if downloading multiple data sets or if you intend to do some simple QC or tidying up before/after downloading.
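As a usage sketch, screen sessions can also be given explicit names, which makes reattaching easier (`-S` names the session, `-r` reattaches it):

```bash
screen -S data-download    # start a named session
# start the download, detach with ctrl-a d, disconnect ...
screen -r data-download    # reattach after reconnecting
```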
- Start a screen session.

    ```bash
    screen
    ```
- Create an interactive lightweight job session.

    ```yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: lightweight-job
      labels:
        kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
      completions: 1
      parallelism: 1
      template:
        metadata:
          name: lightweight-pod
        spec:
          restartPolicy: Never
          containers:
            - name: data-loader
              image: alpine/curl:latest
              command: ['sleep','infinity']
              resources:
                requests:
                  cpu: 1
                  memory: "1Gi"
                limits:
                  cpu: 1
                  memory: "1Gi"
              volumeMounts:
                - mountPath: /mnt/ceph_rbd
                  name: volume
          volumes:
            - name: volume
              persistentVolumeClaim:
                claimName: template-workflow-pvc
    ```
- Download the data set. Change the curl URL to your data set of interest.

    ```bash
    kubectl -n <project-namespace> exec <lightweight-pod-name> -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip
    ```
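    The pod name can be listed with:

    ```bash
    kubectl -n <project-namespace> get pods
    ```

    Job pods are named after the job, so look for a pod beginning with `lightweight-job-`.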
- Exit the remote session, either by ending the session or with `ctrl-a d`.
- Reconnect at a later time and reattach the screen window.

    ```bash
    screen -list
    screen -r <session-name>
    ```
- Check the download was successful and delete the job.

    ```bash
    kubectl -n <project-namespace> exec <lightweight-pod-name> -- ls /mnt/ceph_rbd/
    kubectl -n <project-namespace> delete job lightweight-job
    ```
- Exit the screen session.

    ```bash
    exit
    ```
## Preparing a custom Docker image
Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub.
It does not provide functionality to build images and create pods from Dockerfiles.
However, use cases may require some custom modifications of a base image, such as adding a Python library.
These custom images need to be built locally (using Docker) or online (using a GitHub/GitLab worker) and pushed to a repository such as Docker Hub.
This is not an introduction to building Docker images; please see the Docker tutorial for a general overview.
### Manually building a Docker image locally
- Select a suitable base image (the NVIDIA container catalog is often a useful starting place for GPU-accelerated tasks). We'll use the base RAPIDS image.
- Create a Dockerfile to add any additional packages required to the base image.

    ```dockerfile
    FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10
    RUN pip install pandas
    RUN pip install plotly
    ```
- Build the Docker container locally (you will need to install Docker).

    ```bash
    cd <dockerfile-folder>
    docker build . -t <docker-hub-username>/template-docker-image:latest
    ```
**Building images for different CPU architectures**

Be aware that Docker images built for Apple ARM64 architectures will not function optimally on the EIDF GPU Service's AMD64-based architecture.
If building Docker images locally on an Apple device, you must tell the Docker daemon to use AMD64-based images by passing the `--platform linux/amd64` flag to the build command.
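For example, the build command from the previous step becomes:

```bash
docker build . --platform linux/amd64 -t <docker-hub-username>/template-docker-image:latest
```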
- Create a repository to hold the image on Docker Hub (you will need to create and set up an account).
- Push the Docker image to the repository.

    ```bash
    docker push <docker-hub-username>/template-docker-image:latest
    ```
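    If the push is rejected, sign in to Docker Hub from the command line first:

    ```bash
    docker login -u <docker-hub-username>
    ```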
- Finally, specify your Docker image in the `image:` tag of the job specification yaml file.

    ```yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: template-workflow-job
      labels:
        kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
      completions: 1
      parallelism: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: template-docker-image
              image: <docker-hub-username>/template-docker-image:latest
              command: ["sleep", "infinity"]
              resources:
                requests:
                  cpu: 1
                  memory: "4Gi"
                limits:
                  cpu: 1
                  memory: "8Gi"
    ```
### Automatically building Docker images using GitHub Actions
In cases where the Docker image needs to be built and tested iteratively (i.e. to check for compatibility issues), git version control and GitHub Actions can simplify the build process.
A GitHub Action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the Dockerfile in a git repo.
This process requires you to already have a GitHub account and a Docker Hub account.
- Create an access token on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo.
- Create two GitHub secrets, `DOCKERHUB_USERNAME` and `DOCKERHUB_TOKEN`, to securely provide your Docker Hub username and access token to the workflow.
- Add the Dockerfile to a `code/docker` folder within an active GitHub repo.
- Add the GitHub Action yaml file below to the `.github/workflows` folder to automatically push a new image to Docker Hub if any changes to files in the `code/docker` folder are detected.

    ```yaml
    name: ci

    on:
      push:
        paths:
          - 'code/docker/**'

    jobs:
      docker:
        runs-on: ubuntu-latest
        steps:
          - name: Set up QEMU
            uses: docker/setup-qemu-action@v3
          - name: Set up Docker Buildx
            uses: docker/setup-buildx-action@v3
          - name: Login to Docker Hub
            uses: docker/login-action@v3
            with:
              username: ${{ secrets.DOCKERHUB_USERNAME }}
              password: ${{ secrets.DOCKERHUB_TOKEN }}
          - name: Build and push
            uses: docker/build-push-action@v5
            with:
              context: "{{defaultContext}}:code/docker"
              push: true
              tags: <target-dockerhub-image-name>
    ```
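    Note that the `tags:` value should match the image repository created earlier, e.g. `<docker-hub-username>/template-docker-image:latest`.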
- Push a change to the Dockerfile and check that the Docker Hub image is updated.
## Code development with K8s
Production code can be included within a Docker image to aid reproducibility, as the specific software versions required to run the code are packaged together.
However, binding the code to the Docker image during development can delay the testing cycle, as re-downloading all of the software for every change in a code block can take time.
If the Docker image is consistent across tests, then it can be cached locally on the EIDF GPU Service instead of being re-downloaded (this occurs automatically, although the cache is node specific and is not shared across nodes).
A pod yaml file can be defined to automatically pull the latest code version before running any tests.
Reducing the download time to fractions of a second allows rapid testing to be completed on the cluster with just the `kubectl create` command.
You must already have a GitHub account to follow this process.
This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab).
A template GitHub repo with sample code, k8s yaml files and a Docker build GitHub Action is available here.
### Create a job that downloads and runs the latest code version at runtime
- Write a standard yaml file for a k8s job with the required resources and custom Docker image (example below).

    ```yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: template-workflow-job
      labels:
        kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
      completions: 1
      parallelism: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: template-docker-image
              image: <docker-hub-username>/template-docker-image:latest
              command: ["sleep", "infinity"]
              resources:
                requests:
                  cpu: 1
                  memory: "4Gi"
                limits:
                  cpu: 1
                  memory: "8Gi"
              volumeMounts:
                - mountPath: /mnt/ceph_rbd
                  name: volume
          volumes:
            - name: volume
              persistentVolumeClaim:
                claimName: template-workflow-pvc
    ```
- Add an init container that runs before the main container to download the latest version of the code.

    ```yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: template-workflow-job
      labels:
        kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
      completions: 1
      parallelism: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: template-docker-image
              image: <docker-hub-username>/template-docker-image:latest
              command: ["sleep", "infinity"]
              resources:
                requests:
                  cpu: 1
                  memory: "4Gi"
                limits:
                  cpu: 1
                  memory: "8Gi"
              volumeMounts:
                - mountPath: /mnt/ceph_rbd
                  name: volume
                - mountPath: /code
                  name: github-code
          initContainers:
            - name: lightweight-git-container
              image: cicirello/alpine-plus-plus
              command: ['sh', '-c', "cd /code; git clone <target-repo>"]
              resources:
                requests:
                  cpu: 1
                  memory: "4Gi"
                limits:
                  cpu: 1
                  memory: "8Gi"
              volumeMounts:
                - mountPath: /code
                  name: github-code
          volumes:
            - name: volume
              persistentVolumeClaim:
                claimName: template-workflow-pvc
            - name: github-code
              emptyDir:
                sizeLimit: 1Gi
    ```
- Change the command argument in the main container to run the code once started. Add the URL of the GitHub repo of interest to the `initContainers: command:` tag.

    ```yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: template-workflow-job
      labels:
        kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
      completions: 1
      parallelism: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: template-docker-image
              image: <docker-hub-username>/template-docker-image:latest
              command: ['sh', '-c', "python3 /code/<python-script>"]
              resources:
                requests:
                  cpu: 10
                  memory: "40Gi"
                limits:
                  cpu: 10
                  memory: "80Gi"
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/ceph_rbd
                  name: volume
                - mountPath: /code
                  name: github-code
          initContainers:
            - name: lightweight-git-container
              image: cicirello/alpine-plus-plus
              command: ['sh', '-c', "cd /code; git clone <target-repo>"]
              resources:
                requests:
                  cpu: 1
                  memory: "4Gi"
                limits:
                  cpu: 1
                  memory: "8Gi"
              volumeMounts:
                - mountPath: /code
                  name: github-code
          volumes:
            - name: volume
              persistentVolumeClaim:
                claimName: template-workflow-pvc
            - name: github-code
              emptyDir:
                sizeLimit: 1Gi
    ```
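    If you want to test a specific branch rather than the default, one option is to pass git's `-b` flag in the init container; a sketch of the changed line (the branch name is a placeholder):

    ```yaml
    # Illustrative variation: clone a specific branch.
    command: ['sh', '-c', "cd /code; git clone -b <branch-name> <target-repo>"]
    ```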
- Submit the yaml file to Kubernetes.

    ```bash
    kubectl -n <project-namespace> create -f <job-yaml-file>
    ```
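Once submitted, the job's progress can be followed with standard kubectl commands, for example:

```bash
kubectl -n <project-namespace> get jobs
kubectl -n <project-namespace> logs job/template-workflow-job
```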