Profiling with PopVision

Graphcore provides various tools for profiling, debugging, and instrumenting programs run on IPUs. In this tutorial we will briefly demonstrate an example using the PopVision Graph Analyser. For more information, see Profiling and Debugging and PopVision Graph Analyser User Guide.

We will reuse the same PyTorch MNIST example from lesson 1 (from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/pytorch/mnist).

To enable profiling and create IPU reports, we need to add the following line to the training script mnist_poptorch_code_only.py :

training_opts = training_opts.enableProfiling()

(for details the API, see API reference)

Save and run kubectl create -f <yaml-file> on the following:

apiVersion: graphcore.ai/v1alpha1
kind: IPUJob
metadata:
  generateName: mnist-training-profiling-
spec:
  jobInstances: 1
  ipusPerJobInstance: "1"
  workers:
    template:
      spec:
        containers:
        - name: mnist-training-profiling
          image: graphcore/pytorch:3.3.0
          command: [/bin/bash, -c, --]
          args:
            - |
              cd;
              mkdir build;
              cd build;
              git clone https://github.com/graphcore/examples.git;
              cd examples/tutorials/simple_applications/pytorch/mnist;
              python -m pip install -r requirements.txt;
              sed -i '131i training_opts = training_opts.enableProfiling()' mnist_poptorch_code_only.py;
              python mnist_poptorch_code_only.py --epochs 1;
              echo 'RUNNING ls ./training';
              ls training
          resources:
            limits:
              cpu: 32
              memory: 200Gi
          securityContext:
            capabilities:
              add:
              - IPC_LOCK
          volumeMounts:
          - mountPath: /dev/shm
            name: devshm
        restartPolicy: Never
        hostIPC: true
        volumes:
        - emptyDir:
            medium: Memory
            sizeLimit: 10Gi
          name: devshm

After completion, using kubectl logs <pod-name>, we can see the following result

...
Accuracy on test set: 96.69%
RUNNING ls ./training
archive.a
profile.pop

We can see that the training has created two Poplar report files: archive.a which is an archive of the ELF executable files, one for each tile; and profile.pop, the poplar profile, which contains compile-time and execution information about the Poplar graph.

Downloading the profile reports

To download the traing profiles to your local environment, you can use kubectl cp. For example, run

kubectl cp <pod-name>:/root/build/examples/tutorials/simple_applications/pytorch/mnist/training .

Once you have downloaded the profile report files, you can view the contents locally using the PopVision Graph Analyser tool, which is available for download here https://www.graphcore.ai/developer/popvision-tools.

From the Graph Analyser, you can analyse information including memory usage, execution trace and more.