Kubernetes and GPU nodes on AWS

April 26, 2021 - Last updated: September 18, 2021

Time for GPUs and Kubernetes: the idea is to provide an instance group of Kubernetes nodes with GPUs and, to save some money, to let the autoscaling group start from 0 instances.

The environment

Instance group

Create an instance group called gpu-nodes and add a taint to these nodes so that pods that don't need GPU cores are not scheduled on them.

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: gpu-nodes
  labels:
    kops.k8s.io/cluster: <YOUR CLUSTER NAME>
spec:
  role: Node
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>: "owned"
    k8s.io/cluster-autoscaler/node-template/label/zone: "eu-west-1a"
    k8s.io/cluster-autoscaler/node-template/label/instance-type: "gpu"
    k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: "true"
    k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1" # Amount of GPU on each instance
  nodeLabels:
    zone: "eu-west-1a"
    instance-type: "gpu"
  machineType: g4dn.xlarge
  minSize: 0
  maxSize: 10
  subnets: # subnet/zone for this group, matching the zone label above
  - eu-west-1a
  taints:
  - dedicated=gpu:NoSchedule
...
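
The node-template cloud labels are what make scaling from zero possible: when the group has no instances, the Cluster Autoscaler cannot inspect a live node, so it reads those Auto Scaling group tags to learn that a new node will carry the instance-type label and one nvidia.com/gpu. The manifest above is a kops InstanceGroup, so it can be created roughly like this (a sketch; the file name and state store are placeholders):

$ kops create -f gpu-nodes.yaml --state s3://<YOUR STATE STORE>
$ kops update cluster <YOUR CLUSTER NAME> --state s3://<YOUR STATE STORE> --yes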

GPU Operator

The GPU Operator is a toolkit that installs and validates the NVIDIA software stack on Kubernetes nodes. The idea is that, as soon as a new GPU node is provisioned, the GPU Operator installs the NVIDIA driver on it.

To provision GPU worker nodes in a Kubernetes cluster, the following NVIDIA software components are required – the driver, container runtime, device plugin and monitoring. These components need to be manually provisioned before GPU resources are available to the cluster and also need to be managed during the operation of the cluster. The GPU Operator simplifies both the initial deployment and management of the components by containerizing all the components and using standard Kubernetes APIs for automating and managing these components including versioning and upgrades.

I use Helm to deploy the GPU Operator; this is the values.yaml for my cluster.

nfd:
  enabled: true

operator:
  tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"

driver:
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  repoConfig:
    configMapName: repo-config
    destinationDir: /etc/yum.repos.d

toolkit:
  version: 1.4.0-ubi8
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"

devicePlugin:
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"

dcgmExporter:
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"

gfd:
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"

node-feature-discovery:
  worker:
    nodeSelector:
      instance-type: gpu
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"

My infrastructure runs in a private network, so I need to point the driver's YUM repositories at an internal Artifactory. If you don't need this, remove the repoConfig field and its two lines from the values above and skip the following ConfigMap.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: repo-config
  namespace: gpu-operator-resources
data:
  base.repo: |
    [base]
    baseurl = https://yum.dev/centos/$releasever/os/$basearch/
    enabled = 1
    gpgcheck = 1
    gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
    name = CentOS-$releasever - Base
...
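
The ConfigMap just needs to exist in the gpu-operator-resources namespace before the driver pods start, e.g. (the file name is a placeholder):

$ kubectl apply -f repo-config.yaml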

Example: Pod with GPU request

The following example is an application that does some math on the GPU and returns the result. When the pod is deployed, the Cluster Autoscaler detects that there are not enough GPU instances and requests a new one from the instance group; once the pod completes its job it dies, the Cluster Autoscaler detects that the GPU node is idle, and the node is terminated.

---
apiVersion: v1
kind: Pod
metadata:
  name: diego-cuda-vector-add
spec:
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  restartPolicy: OnFailure
  containers:
    - name: gpu
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
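
Applying the manifest is all it takes to kick off the flow described above (the file name is a placeholder):

$ kubectl apply -f cuda-vector-add.yaml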

Here are some logs from the Kubernetes events and the Cluster Autoscaler while the pod is waiting for a GPU node.

$ kubectl get events

LAST SEEN   TYPE      REASON                 OBJECT                           MESSAGE
27s      Warning    FailedScheduling     pod/diego-cuda-vector-add        0/150 nodes are available: 150 Insufficient nvidia.com/gpu, 2 Insufficient cpu.
1m27s    Normal     TriggeredScaleUp     pod/diego-cuda-vector-add        pod triggered scale-up: [{gpu-nodes.dev 0->1 (max: 10)}]
$ kubectl logs -f cluster-autoscaler-xxxxx -n kube-system

...
I0426 15:55:28.597388       1 event.go:278] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"system", Name:"cluster-autoscaler-status", UID:"6ff5949a-df61-4e34-8d18-c643a16083e1", APIVersion:"v1", ResourceVersion:"107891067", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group gpu-nodes.dev size set to 1
I0426 15:55:28.597471       1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"diego-cuda-vector-add", UID:"50d113f2-9e8c-4590-98cd-739587f58578", APIVersion:"v1", ResourceVersion:"107890721", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{gpu-nodes.dev 0->1 (max: 10)}]
...
$ kubectl logs diego-cuda-vector-add

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
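
To double-check that the new node advertises the GPU resource, something like this works (a sketch that reuses the instance-type node label from the instance group):

$ kubectl describe nodes -l instance-type=gpu | grep nvidia.com/gpu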
