GCP - Kubernetes

This guide shows you how to set up Lumeo Gateways to run in a GCP Kubernetes cluster.

Overview

This guide contains instructions for running Lumeo Gateway containers in a Kubernetes cluster using Google Kubernetes Engine (GKE) in GCP.

Kubernetes Cluster Setup

Set default project

This guide assumes you are using a GCP project named lumeo-kubernetes.

gcloud config set project lumeo-kubernetes
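
To confirm which project is active, you can run, for example:

gcloud config get-value project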

Set up the Kubernetes cluster

Helpful link: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus

🚧

Private vs Public GKE Cluster

If you plan to process RTSP streams with the Lumeo Gateways in your cluster, you must create your GKE cluster as a private cluster (i.e. nodes are not auto-assigned public IPs). A private cluster lets you use Google Cloud NAT, which is required for RTSP streaming to work (Cloud NAT does not apply to nodes with public IPs).

Note that a GKE cluster cannot be switched between private and public mode after it is created.

Private cluster (use this if you need Cloud NAT for RTSP streams):

gcloud beta container clusters create "lumeo-gateways" \
      --zone "us-central1-a" \
      --machine-type "e2-medium" \
      --image-type "COS_CONTAINERD" \
      --disk-type "pd-standard" --disk-size "75" \
      --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
      --max-pods-per-node "48" \
      --num-nodes "1" \
      --enable-private-nodes --master-ipv4-cidr "172.17.0.0/28" --enable-master-global-access \
      --enable-ip-alias \
      --network "projects/lumeo-kubernetes/global/networks/default" \
      --subnetwork "projects/lumeo-kubernetes/regions/us-central1/subnetworks/default" \
      --no-enable-intra-node-visibility \
      --no-enable-master-authorized-networks \
      --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
      --enable-autoupgrade --enable-autorepair \
      --enable-shielded-nodes \
      --node-locations "us-central1-a"

Public cluster (nodes get public IPs; not compatible with Cloud NAT):

gcloud beta container clusters create "lumeo-gateways" \
      --zone "us-central1-a" \
      --machine-type "e2-medium" \
      --image-type "COS_CONTAINERD" \
      --disk-type "pd-standard" --disk-size "75" \
      --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
      --max-pods-per-node "48" \
      --num-nodes "1" \
      --enable-ip-alias \
      --network "projects/lumeo-kubernetes/global/networks/default" \
      --subnetwork "projects/lumeo-kubernetes/regions/us-central1/subnetworks/default" \
      --no-enable-intra-node-visibility \
      --no-enable-master-authorized-networks \
      --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
      --enable-autoupgrade --enable-autorepair \
      --enable-shielded-nodes \
      --node-locations "us-central1-a"
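
Once creation completes, you can verify the cluster is up with, for example:

gcloud container clusters list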

Set up Google Cloud NAT

This step is required only if you created a private GKE cluster in the previous step.

Reference: https://cloud.google.com/nat/docs/gke-example (skip forward to Step 6)

Create a Cloud Router

gcloud compute routers create lumeo-gateways-nat-router --network default --region us-central1

Configure the Router

gcloud compute routers nats create lumeo-gateways-nat-config \
   --router-region us-central1 \
   --router lumeo-gateways-nat-router \
   --nat-all-subnet-ip-ranges \
   --auto-allocate-nat-external-ips \
   --enable-dynamic-port-allocation
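
You can inspect the router and its attached NAT configuration with, for example:

gcloud compute routers describe lumeo-gateways-nat-router --region us-central1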

Create node pool with GPUs

Make sure that you have enough GPU quota approved by GCP.
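
Before creating the pool, you can check the region's current T4 GPU quota (assuming the quota metric is named NVIDIA_T4_GPUS) with, for example:

gcloud compute regions describe us-central1 | grep -B 1 -A 1 NVIDIA_T4_GPUS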

gcloud container node-pools create "gpu-pool" \
   --zone us-central1-a --cluster lumeo-gateways \
   --machine-type "n1-standard-8" \
   --accelerator "type=nvidia-tesla-t4,count=1" \
   --disk-type "pd-standard" --disk-size "75" \
   --enable-autoupgrade --enable-autorepair \
   --enable-autoscaling --num-nodes 1 --min-nodes 1 --max-nodes 5
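
You can confirm the node pool was created with, for example:

gcloud container node-pools list --cluster lumeo-gateways --zone us-central1-a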

Set up kubectl

gcloud container clusters get-credentials lumeo-gateways --zone us-central1-a 
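
To verify that kubectl now points at the new cluster, you can list its nodes, for example:

kubectl get nodes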

Set up the cluster to install GPU drivers

Install GCP's Default Drivers (v535)

This is Google's recommended approach.

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
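
You can watch the driver installer pods (labeled k8s-app=nvidia-driver-installer, as in the manifest further below) with, for example:

kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer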

Install v525 Nvidia Drivers

Use this approach only if GCP default drivers do not work.

Save the manifest below as daemonset-nvidia-driver-installer.yaml, then run:

kubectl apply -f daemonset-nvidia-driver-installer.yaml

# Copyright 2022 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The Dockerfile and other source for this daemonset are in
# https://cos.googlesource.com/cos/tools/+/refs/heads/master/src/cmd/cos_gpu_installer/
#
# This is the same as ../../daemonset.yaml except that it assumes that the
# docker image is present on the node instead of downloading from GCR. This
# allows easier upgrades because GKE can preload the correct image on the
# node and the daemonset can just use that image.

# Lumeo Updates: Update the Nvidia Driver version to 515.86.01 or 525.60.13
# Original file from : https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
# https://storage.googleapis.com/nvidia-drivers-us-public/tesla/525.60.13/NVIDIA-Linux-x86_64-525.60.13.run
# https://storage.googleapis.com/nvidia-drivers-us-public/tesla/515.86.01/NVIDIA-Linux-x86_64-515.86.01.run

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
  labels:
    k8s-app: nvidia-driver-installer
spec:
  selector:
    matchLabels:
      k8s-app: nvidia-driver-installer
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
        k8s-app: nvidia-driver-installer
    spec:
      priorityClassName: system-node-critical
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - operator: "Exists"
      hostNetwork: true
      hostPID: true
      volumes:
      - name: dev
        hostPath:
          path: /dev
      - name: vulkan-icd-mount
        hostPath:
          path: /home/kubernetes/bin/nvidia/vulkan/icd.d
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
      - name: root-mount
        hostPath:
          path: /
      - name: cos-tools
        hostPath:
          path: /var/lib/cos-tools
      - name: nvidia-config
        hostPath:
          path: /etc/nvidia
      initContainers:
      - image: "gcr.io/cos-cloud/cos-gpu-installer:latest" #"cos-nvidia-installer:fixed"
        imagePullPolicy: IfNotPresent
        name: nvidia-driver-installer
        resources:
          requests:
            cpu: "0.15"
        securityContext:
          privileged: true
        env:
        - name: NVIDIA_DRIVER_VERSION
          value: "525.60.13" # or 515.86.01
        - name: NVIDIA_INSTALL_DIR_HOST
          value: /home/kubernetes/bin/nvidia
        - name: NVIDIA_INSTALL_DIR_CONTAINER
          value: /usr/local/nvidia
        - name: VULKAN_ICD_DIR_HOST
          value: /home/kubernetes/bin/nvidia/vulkan/icd.d
        - name: VULKAN_ICD_DIR_CONTAINER
          value: /etc/vulkan/icd.d
        - name: ROOT_MOUNT_DIR
          value: /root
        - name: COS_TOOLS_DIR_HOST
          value: /var/lib/cos-tools
        - name: COS_TOOLS_DIR_CONTAINER
          value: /build/cos-tools
        volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        - name: vulkan-icd-mount
          mountPath: /etc/vulkan/icd.d
        - name: dev
          mountPath: /dev
        - name: root-mount
          mountPath: /root
        - name: cos-tools
          mountPath: /build/cos-tools
        #command: ['/cos-gpu-installer', 'install', '--allow-unsigned-driver', '--nvidia-installer-url=https://storage.googleapis.com/nvidia-drivers-us-public/tesla/525.60.13/NVIDIA-Linux-x86_64-525.60.13.run']
      - image: "gcr.io/gke-release/nvidia-partition-gpu@sha256:c54fd003948fac687c2a93a55ea6e4d47ffbd641278a9191e75e822fe72471c2"
        name: partition-gpus
        env:
          - name: LD_LIBRARY_PATH
            value: /usr/local/nvidia/lib64
        resources:
          requests:
            cpu: "0.15"
        securityContext:
          privileged: true
        volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        - name: dev
          mountPath: /dev
        - name: nvidia-config
          mountPath: /etc/nvidia
      containers:
      - image: "gcr.io/google-containers/pause:2.0"
        name: pause
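
Once the installer pods complete, you can check that the GPUs are advertised on the nodes with, for example:

kubectl describe nodes | grep -i "nvidia.com/gpu"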

Deploy Lumeo

Create a secret with your Lumeo App ID and Access Token.

The App ID and Access Token can be found in Workspace settings in Lumeo Console. See API for details.

Warning: Access Tokens start with a '$', so ensure you use single quotes below to prevent shell substitution.

kubectl create secret generic 'replace-with-lumeo-app-id' --from-literal=LUMEO_APP_ID='replace-with-lumeo-app-id' --from-literal=LUMEO_API_KEY='replace-with-lumeo-access-token'
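
You can confirm the secret contains both keys with, for example:

kubectl describe secret 'replace-with-lumeo-app-id'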

Replace the App ID in lumeo-gateway.yaml

Save the YAML template below as lumeo-gateway.yaml, then search for replace-with-lumeo-app-id and replace it with your App ID.

# This Service is required just for K8S to run the StatefulSet.
# See https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#limitations
# Note that Lumeo gateways do not expose any local APIs that can be used directly via a Kubernetes "Service".
apiVersion: v1
kind: Service
metadata:
  name: lumeod
  labels:
    app: lumeod
spec:
  ports:
  clusterIP: None
  selector:
    app: lumeod
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: lumeo-gateway
spec:
  selector:
    matchLabels:
      app: lumeod
  serviceName: "lumeod"
  replicas: 1
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: lumeod
    spec:
      initContainers:
        - name: lumeo-temp-fix-for-gateway-config-paths
          image: busybox
          command:
          - sh
          - -c
          - | 
            if [ ! -d "/var/lib/lumeo/models" ]; then
              chmod -R 777 /var/lib/lumeo && mkdir -p /var/lib/lumeo/upload && mkdir -p /var/lib/lumeo/models && mkdir -p /var/lib/lumeo/media && chmod -R 777 /var/lib/lumeo
            fi
            if [ ! -d "/var/lib/lumeo/tracker_configs" ]; then
              mkdir -p /var/lib/lumeo/tracker_configs && chmod -R 777 /var/lib/lumeo/tracker_configs
            fi
          volumeMounts:
          - name: lumeo-gateway-config
            mountPath: /var/lib/lumeo
      containers:
        - name: lumeo
          image: 'lumeo/gateway-nvidia-dgpu:latest'
          imagePullPolicy: Always
          envFrom:
          - secretRef:
              name: 'replace-with-lumeo-app-id'
          env:
            - name: CONTAINER_MODEL
              value: 'Kubernetes'              
          volumeMounts:
            - name: lumeo-gateway-config
              mountPath: /var/lib/lumeo
          resources:
            limits:
              nvidia.com/gpu: 1    
  volumeClaimTemplates:
  - metadata:
      name: lumeo-gateway-config
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi # Increase this if you intend to deploy large models or generate local media.

Create the StatefulSet

This creates Lumeo gateways in the Lumeo workspace specified via the App ID above.

kubectl apply -f lumeo-gateway.yaml

Monitor StatefulSet creation with:

kubectl get statefulsets

Once created, these gateways will appear in your Lumeo account with names starting from lumeo-gateway-0.
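
You can also check that the gateway pods are running (they carry the app=lumeod label from the template above) with, for example:

kubectl get pods -l app=lumeod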

Scaling the Cluster

kubectl scale statefulsets lumeo-gateway --replicas=2

Note:

  • Scaling up will create new gateways with consecutive names (lumeo-gateway-0, lumeo-gateway-1, ...) if they did not exist previously.
  • Scaling down removes the highest-numbered gateways from the set. Those gateways will go offline in the Lumeo Console.
  • When scaling up after scaling down, new gateways will NOT be created for instances that existed before; those gateways will simply come back online in the Lumeo Console.
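
To watch pods being added or removed as you scale, you can run, for example:

kubectl get pods -l app=lumeod -w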

Updating lumeod versions

Replace the version below with the version you wish to deploy.

kubectl set image statefulset/lumeo-gateway --selector app=lumeod  lumeo=lumeo/gateway-nvidia-dgpu:1.3.29
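
You can monitor the rolling update with, for example:

kubectl rollout status statefulset/lumeo-gateway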