Time for GPUs and Kubernetes. The idea is to provide an instance group of Kubernetes nodes with GPUs and, to save some money, start the autoscaling group from 0 instances.
The environment
- AWS
- G4DN instances (also tested on P2 instances)
- Kubernetes 1.18 deployed via Kops
- Cluster Autoscaler 1.18
- GPU Operator
Instance group
Create an instance group called gpu-nodes and add a taint to these nodes to avoid scheduling pods that don't need GPU cores.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: gpu-nodes
spec:
  role: Node
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>: "owned"
    k8s.io/cluster-autoscaler/node-template/label/zone: "eu-west-1a"
    k8s.io/cluster-autoscaler/node-template/label/instance-type: "gpu"
    k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: "true"
    k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1" # Number of GPUs on each instance
  nodeLabels:
    zone: "eu-west-1a"
    instance-type: "gpu"
  machineType: g4dn.xlarge
  minSize: 0
  maxSize: 10
  taints:
    - dedicated=gpu:NoSchedule
...
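A minimal sketch of how the instance group could be registered with the kops CLI, assuming the manifest above is saved as gpu-nodes.yaml and the cluster name and state store are already set in your environment:

# Register the new instance group and apply the change to the cluster
$ kops create -f gpu-nodes.yaml
$ kops update cluster --yes

Since minSize is 0, no GPU instance is started until the Cluster Autoscaler sees a pending pod that requests nvidia.com/gpu; the node-template entries under cloudLabels are what let it scale the group up from zero.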
GPU Operator
The GPU Operator is a toolkit that installs and validates the NVIDIA software stack on the Kubernetes nodes. The idea is that when a new node is provisioned, the GPU Operator installs the NVIDIA driver on it.
To provision GPU worker nodes in a Kubernetes cluster, the following NVIDIA software components are required – the driver, container runtime, device plugin and monitoring. These components need to be manually provisioned before GPU resources are available to the cluster and also need to be managed during the operation of the cluster. The GPU Operator simplifies both the initial deployment and management of the components by containerizing all the components and using standard Kubernetes APIs for automating and managing these components including versioning and upgrades.
I use Helm to deploy the GPU Operator; these are the values (values.yaml) for my cluster.
nfd:
  enabled: true
operator:
  tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"
driver:
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  repoConfig:
    configMapName: repo-config
    destinationDir: /etc/yum.repos.d
toolkit:
  version: 1.4.0-ubi8
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
devicePlugin:
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
dcgmExporter:
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
gfd:
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
node-feature-discovery:
  worker:
    nodeSelector:
      instance-type: gpu
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
My infrastructure runs in a private network, so I need to point some YUM repositories to an internal Artifactory. You can remove the two lines under the repoConfig field and skip the following ConfigMap if you don't need it.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: repo-config
  namespace: gpu-operator-resources
data:
  base.repo: |
    [base]
    baseurl = https://yum.dev/centos/$releasever/os/$basearch/
    enabled = 1
    gpgcheck = 1
    gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
    name = CentOS-$releasever - Base
...
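With values.yaml (and the optional repo-config ConfigMap) in place, the GPU Operator can be installed with Helm. A minimal sketch, assuming NVIDIA's public chart repository and an illustrative release name of gpu-operator:

# Add NVIDIA's Helm repository and install the operator with the custom values
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm install gpu-operator nvidia/gpu-operator -f values.yaml
# The operands (driver, toolkit, device plugin, DCGM exporter, GFD) show up here once a GPU node joins
$ kubectl get pods -n gpu-operator-resources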
Example: Pod with GPU request
The following example is an application that does some math on the GPU and returns the results. When the pod is deployed, the Cluster Autoscaler detects that there are not enough GPU instances and requests a new instance from the instance group. After the pod completes the job it dies, and the Cluster Autoscaler detects that the GPU node is idle and terminates it.
---
apiVersion: v1
kind: Pod
metadata:
  name: diego-cuda-vector-add
spec:
  nodeSelector:
    instance-type: gpu
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  restartPolicy: OnFailure
  containers:
    - name: gpu
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
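Assuming the manifest is saved as cuda-vector-add.yaml, the pod can be created and the scheduling watched like this:

$ kubectl apply -f cuda-vector-add.yaml
# The pod stays Pending until the autoscaler brings up a GPU node
$ kubectl get pods -w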
Here are some logs from the Kubernetes events and the Cluster Autoscaler while the pod is requesting the GPU.
$ kubectl get events
LAST SEEN   TYPE      REASON             OBJECT                      MESSAGE
27s         Warning   FailedScheduling   pod/diego-cuda-vector-add   0/150 nodes are available: 150 Insufficient nvidia.com/gpu, 2 Insufficient cpu.
1m27s       Normal    TriggeredScaleUp   pod/diego-cuda-vector-add   pod triggered scale-up: [{gpu-nodes.dev 0->1 (max: 10)}]
$ kubectl logs -f cluster-autoscaler-xxxxx -n kube-system
...
I0426 15:55:28.597388 1 event.go:278] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"system", Name:"cluster-autoscaler-status", UID:"6ff5949a-df61-4e34-8d18-c643a16083e1", APIVersion:"v1", ResourceVersion:"107891067", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group gpu-nodes.dev size set to 1
I0426 15:55:28.597471 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"diego-cuda-vector-add", UID:"50d113f2-9e8c-4590-98cd-739587f58578", APIVersion:"v1", ResourceVersion:"107890721", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{gpu-nodes.dev 0->1 (max: 10)}]
...
$ kubectl logs diego-cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
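After the pod completes, the node is scaled down again. A quick way to confirm it, assuming the default kube-system namespace for the Cluster Autoscaler status ConfigMap:

# Watch the GPU node disappear after the scale-down delay
$ kubectl get nodes -l instance-type=gpu -w
# The autoscaler also reports the state of each node group in its status ConfigMap
$ kubectl -n kube-system describe configmap cluster-autoscaler-status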