GPU Service Policies

Namespaces

Each project will be given a namespace which will have an applied quota.

Default Quota:

CPU: 100 Cores
Memory: 1TiB
GPU: 12

Kubeconfig

Each project will be assigned a kubeconfig file for access to the service which will allow operation in the assigned namespace and access to exposed service operators, for example the GPU and CephRBD operators.

Kubernetes Job Names

All Kubernetes Jobs submitted will need to use the metadata.generateName field instead of the metadata.name field. This is to ensure jobs can be identified for purporses of maintenance and troubleshooting.

Submitting jobs with name only would allow several jobs to have the same name, potentially blocking you from submitting the job until the previous one was deleted. Support would have difficulties troubleshooting as the name remains the same, but execution results can be different each time.

Important

This policy is automated, but users will need to change their job template to use the new field for the submission to work.

Kubernetes Job Time to Live

All Kubernetes Jobs submitted to the service will have a Time to Live (TTL) applied via spec.ttlSecondsAfterFinished automatically. The default TTL for jobs using the service will be 1 week (604800 seconds). A completed job (in success or error state) will be deleted from the service once one week has elapsed after execution has completed. This will reduce excessive object accumulation on the service.

Important

This policy is automated and does not require users to change their job specifications.

Important

We recommend setting a lower value, unless you absolutely need the job to remain for debugging. Completed jobs serve no other purpose and can potentially make identifying your workloads more difficult.

Kubernetes Active Deadline Seconds

All Kubernetes User Pods submitted to the service will have an Active Deadline Seconds (ADS) applied via spec.spec.activeDeadlineSeconds automatically. The default ADS for pods using the service will be 5 days (432000 seconds). A pod will be terminated 5 days after execution has begun. This will reduce the number of unused pods remaining on the service.

Important

This policy is automated and does not require users to change their job or pod specifications.

Important

The preference would be, that you lower this number unless you are confident you need the workload to run for the maximum duration. Any configuration errors in your code can lead to the container running for the whole duration, but not yielding a result and taking cluster resources away from other users.

Kueue

All jobs will be managed through the Kueue scheduling system. All pods will be required to be owned by a Kubernetes workload.

Each project will have a local user queue in their namespace. This will provide access to their cluster queue. To enable the use of the queue in your job definitions, the following will need to be added to the job specification file as part of the metadata:

   labels:
      kueue.x-k8s.io/queue-name:  <project namespace>-user-queue

Workloads without this queue name tag will be rejected.