InferenceService Node Scheduling¶
Setup¶
The InferenceService spec supports node selector, node affinity and tolerations. To enable these features we must enable the corresponding Knative feature flags (see the Install Knative Serving note).
Option 1: Pre-Kubeflow Install Feature Flags Setup¶
If we install KServe as part of the Kubeflow manifests and would like to enable the feature flags before installing Kubeflow, we can do so by editing the file manifests/common/knative/knative-serving/base/upstream/serving-core.yaml. This is a common approach that gives a reproducible configuration, since the feature flags will be enabled every time we install Kubeflow.
            
- Enable kubernetes.podspec-affinity:
  kubernetes.podspec-affinity: "enabled"
- Enable kubernetes.podspec-nodeselector:
  kubernetes.podspec-nodeselector: "enabled"
- Enable kubernetes.podspec-tolerations:
  kubernetes.podspec-tolerations: "enabled"
With all features enabled we should have a data section that looks like this:
            
data:
  kubernetes.podspec-affinity: "enabled"
  kubernetes.podspec-nodeselector: "enabled"
  kubernetes.podspec-tolerations: "enabled"
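In serving-core.yaml these keys live in the config-features ConfigMap of the knative-serving namespace, so the edited resource would look roughly like the sketch below (other fields and annotations of the ConfigMap omitted):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-features
  namespace: knative-serving
data:
  kubernetes.podspec-affinity: "enabled"
  kubernetes.podspec-nodeselector: "enabled"
  kubernetes.podspec-tolerations: "enabled"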
Option 2: Post-Kubeflow Install Feature Flags Setup¶
If we don't want to enable the flags before installing Kubeflow, we can enable them after installing Kubeflow by editing the configuration with:
kubectl edit configmap config-features -n knative-serving
Then add the flags to the data section like it was done for the pre-Kubeflow install setup.
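If we prefer not to open an editor, a non-interactive kubectl patch should achieve the same result; this is a sketch rather than a command taken from the Kubeflow manifests:

kubectl patch configmap config-features -n knative-serving --type merge \
  -p '{"data":{"kubernetes.podspec-affinity":"enabled","kubernetes.podspec-nodeselector":"enabled","kubernetes.podspec-tolerations":"enabled"}}'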
            
Usage¶
To use node selector, node affinity and tolerations, we can specify them directly in the InferenceService custom resource.
            
Node Selector¶
Here is an example using a node selector, where myLabelName should be replaced by the name of the label carried by the node we want to target, and myLabelValue by its value.
            
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  predictor:
    nodeSelector:
      myLabelName: "myLabelValue"
    triton:
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 21.08-py3
      env:
      - name: OMP_NUM_THREADS
        value: "1"
Here is the equivalent for a transformer; we simply add it under the transformer spec:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  transformer:
    nodeSelector:
      myLabelName: "myLabelValue"
    containers:
    - image: kfserving/image-transformer-v2:latest
      name: kfserving-container
      command:
      - "python"
      - "-m"
      - "image_transformer_v2"
      args:
      - --model_name
      - cifar10
      - --protocol
      - v2
  predictor:
    triton:
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 21.08-py3
      env:
      - name: OMP_NUM_THREADS
        value: "1"
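For either of these specs to be schedulable, the target node must actually carry the label. A minimal sketch of labeling a node and verifying it, where <your-node-name> is a placeholder:

# <your-node-name> is a placeholder for the name of the node to target
kubectl label nodes <your-node-name> myLabelName=myLabelValue
kubectl get nodes -l myLabelName=myLabelValue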
GPU Node Label Selector Example¶
In this example, our predictor will only run on a node with the label k8s.amazonaws.com/accelerator set to the value "nvidia-tesla-t4". You can learn more about recommended label names for GPU nodes when using the Kubernetes autoscaler by checking your cloud provider's documentation.
            
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  predictor:
    nodeSelector:
      k8s.amazonaws.com/accelerator: "nvidia-tesla-t4"
    triton:
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 21.08-py3
      env:
      - name: OMP_NUM_THREADS
        value: "1"
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
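The accelerator label key and value depend on your cloud provider and on how the GPU node group was created, so it's worth checking what your nodes actually report; the -L flag prints the label value as an extra column:

kubectl get nodes -L k8s.amazonaws.com/accelerator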
Tolerations¶
This example shows how to add a toleration to our predictor. It makes it possible (but not mandatory) for the predictor pod to be scheduled on any node with the matching taint. You can replace yourTaintKeyHere with the taint key from your node's taint.
            
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  predictor:
    tolerations:
      - key: "yourTaintKeyHere"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    triton:
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 21.08-py3
      env:
      - name: OMP_NUM_THREADS
        value: "1"
Here is the equivalent for a transformer:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchscript-cifar"
spec:
  transformer:
    tolerations:
      - key: "yourTaintKeyHere"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    containers:
    - image: kfserving/image-transformer-v2:latest
      name: kfserving-container
      command:
      - "python"
      - "-m"
      - "image_transformer_v2"
      args:
      - --model_name
      - cifar10
      - --protocol
      - v2
  predictor:
    triton:
      storageUri: "gs://kfserving-examples/models/torchscript"
      runtimeVersion: 21.08-py3
      env:
      - name: OMP_NUM_THREADS
        value: "1"
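The toleration only has an effect if the node actually carries a matching taint. A minimal sketch of tainting a node so that only pods tolerating it can be scheduled there, where <your-node-name> is a placeholder:

# <your-node-name> is a placeholder; the trailing "-" form removes the taint again
kubectl taint nodes <your-node-name> yourTaintKeyHere=true:NoSchedule
kubectl taint nodes <your-node-name> yourTaintKeyHere=true:NoSchedule-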
Important Note On Tolerations for GPU Nodes¶
It's important to use the conventional taint nvidia.com/gpu for NVIDIA GPU nodes, because with a custom taint the nvidia-device-plugin cannot be scheduled on the GPU node. The node would then be unable to expose its GPUs to Kubernetes, making it a plain CPU-only node and preventing us from scheduling any GPU workload on it.
The nvidia-device-plugin automatically tolerates the nvidia.com/gpu taint, see this commit. Therefore, by using this conventional taint, we ensure that the nvidia-device-plugin will work and allow our node to expose its GPUs.
Using this taint on a GPU node also has the advantage that every pod scheduled on the node automatically gets the toleration for this taint if it requests GPU resources. For instance, if we deploy an InferenceService with a predictor that requests 1 GPU, Kubernetes detects the GPU request and adds the nvidia.com/gpu toleration to the predictor pod automatically. If, on the other hand, our predictor (or another pod spec like the transformer) does not request GPUs but has a node affinity/node selector for the GPU node, the nvidia.com/gpu toleration will not be added to the pod, since it did not request GPUs. This prevents CPU-only workloads from, for instance, blocking the GPU node from scaling down. Note that this feature of automatically adding tolerations to pods requesting GPU resources is provided by the ExtendedResourceToleration admission controller, which was added in Kubernetes 1.9.
You can learn more about dedicated node pools and the ExtendedResourceToleration admission controller here.
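If we manage the taint ourselves (many managed GPU node pools apply an equivalent taint automatically), applying the conventional key could look like the sketch below; <your-gpu-node> and the value "present" are placeholders, only the key nvidia.com/gpu matters:

# <your-gpu-node> is a placeholder; the nvidia-device-plugin and the
# ExtendedResourceToleration admission controller match on the taint key, not the value
kubectl taint nodes <your-gpu-node> nvidia.com/gpu=present:NoSchedule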
            
Node Selector + Tolerations¶
As described in the Overview, we can combine node selector/node affinity and tolerations to force a pod to be scheduled on a node and to force a node to only accept pods with a matching toleration.
Here is an example where we want our transformer to run on a node with the label myLabel1=true and tolerate nodes with the taint myTaint1, and our predictor to run on a node with the label myLabel2=true and tolerate nodes with the taint myTaint2.
            
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-transformer
spec:
  transformer:
    nodeSelector:
      myLabel1: "true"
    tolerations:
      - key: "myTaint1"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    containers:
    - image: kfserving/image-transformer-v2:latest
      name: kfserving-container
      command:
      - "python"
      - "-m"
      - "image_transformer_v2"
      args:
      - --model_name
      - cifar10
      - --protocol
      - v2
  predictor:
    nodeSelector:
      myLabel2: "true"
    tolerations:
      - key: "myTaint2"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    triton:
      storageUri: gs://kfserving-examples/models/torchscript
      runtimeVersion: 20.10-py3
      env:
      - name: OMP_NUM_THREADS
        value: "1"
      args:
      - --log-verbose=1
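For this spec to schedule as described, both nodes need the corresponding labels and taints. A hypothetical setup, where <node-a> and <node-b> are placeholders:

# node for the transformer
kubectl label nodes <node-a> myLabel1=true
kubectl taint nodes <node-a> myTaint1=true:NoSchedule
# node for the predictor
kubectl label nodes <node-b> myLabel2=true
kubectl taint nodes <node-b> myTaint2=true:NoSchedule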
GPU Example¶
This also applies to other pod specs like the transformer, but if we want our predictor to run on a GPU node and the predictor requests GPUs, then we should make sure our GPU node has the taint nvidia.com/gpu. As described earlier, this allows us to leverage the Kubernetes ExtendedResourceToleration admission controller and simply omit the toleration for our GPU pod, given that we have a Kubernetes version that supports it.
The result is the same as before, but we removed the toleration for the pod requesting GPUs (here the predictor):
            
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torch-transformer
spec:
  transformer:
    nodeSelector:
      myLabel1: "true"
    tolerations:
      - key: "myTaint1"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    containers:
    - image: kfserving/image-transformer-v2:latest
      name: kfserving-container
      command:
      - "python"
      - "-m"
      - "image_transformer_v2"
      args:
      - --model_name
      - cifar10
      - --protocol
      - v2
  predictor:
    nodeSelector:
      myLabel2: "true"
    triton:
      storageUri: gs://kfserving-examples/models/torchscript
      runtimeVersion: 20.10-py3
      env:
      - name: OMP_NUM_THREADS
        value: "1"
      args:
      - --log-verbose=1
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
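Once the InferenceService is up, we can double-check that Kubernetes injected the nvidia.com/gpu toleration into the predictor pod automatically; <predictor-pod-name> is a placeholder for the actual pod name:

# <predictor-pod-name> is a placeholder for the predictor pod
kubectl get pod <predictor-pod-name> -o jsonpath='{.spec.tolerations}'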