Create and manage instance types for efficient use of compute resources - Azure Machine Learning

What are instance types?

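Instance types are an Azure Machine Learning concept that lets you target certain types of compute nodes for training and inference workloads. For an Azure VM, an example of an instance type is STANDARD_D2_V3.

In Kubernetes clusters, instance types are represented in a custom resource definition (CRD) that is installed with the Azure Machine Learning extension. Two elements in the Azure Machine Learning extension represent an instance type: nodeSelector and resources.

In short, nodeSelector lets you specify which node a pod should run on. The node must have a corresponding label. In the resources section, you can set the compute resources (CPU, memory, and NVIDIA GPU) for the pod.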

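If you specify a nodeSelector when you deploy the Azure Machine Learning extension, that nodeSelector is applied to all instance types. This means that:

For each instance type that you create, the specified nodeSelector should be a subset of the extension-specified nodeSelector.

If you use an instance type with a nodeSelector, the workload runs on any node that matches both the extension-specified nodeSelector and the instance-type-specified nodeSelector.

If you use an instance type without a nodeSelector, the workload runs on any node that matches the extension-specified nodeSelector.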
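
The matching rules above can be illustrated with a small Python sketch. It is not part of the Azure Machine Learning extension; the function names, node names, and labels are hypothetical, and it assumes the usual Kubernetes meaning of nodeSelector (every key/value pair in the selector must appear among the node's labels).

```python
def matches(node_labels, selector):
    """Return True when every key/value pair in selector is present in node_labels."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def eligible_nodes(nodes, extension_selector, instance_type_selector=None):
    """List the names of nodes that satisfy both selectors.

    nodes maps a node name to its label dict. An instance type without a
    nodeSelector adds no constraints beyond the extension's selector.
    """
    instance_type_selector = instance_type_selector or {}
    return [
        name
        for name, labels in nodes.items()
        if matches(labels, extension_selector)
        and matches(labels, instance_type_selector)
    ]


# Hypothetical cluster: three nodes with illustrative labels.
nodes = {
    "node-a": {"pool": "ml", "gpu": "true"},
    "node-b": {"pool": "ml"},
    "node-c": {"pool": "web", "gpu": "true"},
}
extension_selector = {"pool": "ml"}

# Instance type with its own nodeSelector: both selectors must match.
print(eligible_nodes(nodes, extension_selector, {"gpu": "true"}))  # ['node-a']

# Instance type without a nodeSelector: only the extension's selector applies.
print(eligible_nodes(nodes, extension_selector))  # ['node-a', 'node-b']
```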

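Default instance type

By default, an instance type named defaultinstancetype is created when you attach a Kubernetes cluster to an Azure Machine Learning workspace. It uses the following definition: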

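If no nodeSelector is applied, the pod can be scheduled on any node.

For requests, the workload's pods are assigned default resources of 0.1 CPU cores, 2 GB of memory, and 0 GPUs.

The resources that the workload's pods use are limited to 2 CPU cores and 8 GB of memory: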

resources:
  requests:
    cpu: "100m"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "2Gi"
    nvidia.com/gpu: null
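The default instance type purposefully uses few resources. To ensure that all machine learning workloads run with appropriate resources (for example, GPU resources), we highly recommend that you create custom instance types.
When you run the command kubectl get instancetype, defaultinstancetype doesn't appear as an InstanceType custom resource in the cluster, but it does appear in all clients (UI, CLI, SDK).
You can override defaultinstancetype with a custom instance type definition that has the same name (see the Create custom instance types section).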
Create custom instance types
To create a new instance type, create a new custom resource for the instance type CRD. For example:
kubectl apply -f my_instance_type.yaml
With my_instance_type.yaml:
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: myinstancetypename
spec:
  nodeSelector:
    mylabel: mylabelvalue
  resources:
    limits:
      cpu: "1"
      nvidia.com/gpu: 1
      memory: "2Gi"
    requests:
      cpu: "700m"
      memory: "1500Mi"
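The preceding definition creates an instance type with the following behavior:
Pods are scheduled only on nodes that have the label mylabel: mylabelvalue.
Pods are assigned resource requests of 700m CPU and 1500Mi of memory.
Pods are assigned resource limits of 1 CPU, 2Gi of memory, and 1 NVIDIA GPU.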
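
If you prefer to generate the manifest rather than write it by hand, the same structure can be built in Python. The following is a minimal sketch: build_instance_type is a hypothetical helper, not an Azure Machine Learning API, and it relies on the fact that kubectl apply -f accepts JSON manifests as well as YAML.

```python
import json


def build_instance_type(name, requests, limits, node_selector=None):
    """Return an InstanceType manifest as a plain dict (hypothetical helper)."""
    spec = {"resources": {"limits": limits, "requests": requests}}
    if node_selector:
        # nodeSelector is optional; without it, pods are not pinned to labeled nodes.
        spec["nodeSelector"] = node_selector
    return {
        "apiVersion": "amlarc.azureml.com/v1alpha1",
        "kind": "InstanceType",
        "metadata": {"name": name},
        "spec": spec,
    }


manifest = build_instance_type(
    name="myinstancetypename",
    requests={"cpu": "700m", "memory": "1500Mi"},
    limits={"cpu": "1", "nvidia.com/gpu": 1, "memory": "2Gi"},
    node_selector={"mylabel": "mylabelvalue"},
)

# Print the manifest as JSON; save the output to a file to use it with kubectl apply -f.
print(json.dumps(manifest, indent=2))
```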
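Custom instance types must meet the following parameter and definition rules, or instance type creation fails:
To create several instance types at once, define them in a list. With my_instance_type_list.yaml: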
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceTypeList
items:
  - metadata:
      name: cpusmall
    spec:
      resources:
        requests:
          cpu: "100m"
          memory: "100Mi"
        limits:
          cpu: "1"
          nvidia.com/gpu: 0
          memory: "1Gi"
  - metadata:
      name: defaultinstancetype
    spec:
      resources:
        requests:
          cpu: "1"
          memory: "1Gi" 
        limits:
          cpu: "1"
          nvidia.com/gpu: 0
          memory: "1Gi"
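The preceding example creates two instance types: cpusmall and defaultinstancetype. This defaultinstancetype definition overrides the defaultinstancetype definition that was created when the Kubernetes cluster was attached to the Azure Machine Learning workspace.
If you submit a training or inference workload without an instance type, it uses defaultinstancetype. To specify a default instance type for a Kubernetes cluster, create an instance type with the name defaultinstancetype. It's automatically recognized as the default.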
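
The fallback and override behavior described above can be summarized in a short Python sketch. It only illustrates the lookup order; resolve_instance_type and the dictionaries are hypothetical and do not reflect how the service is implemented.

```python
DEFAULT_NAME = "defaultinstancetype"


def resolve_instance_type(requested, custom_types, built_in_default):
    """Pick the instance type definition for a workload.

    requested is the name given by the workload, or None when it is omitted.
    A custom type named defaultinstancetype takes precedence over the built-in one.
    """
    name = requested or DEFAULT_NAME
    if name in custom_types:
        return name, custom_types[name]
    if name == DEFAULT_NAME:
        return name, built_in_default
    raise KeyError(f"unknown instance type: {name}")


# Placeholder for the definition created when the cluster is attached.
built_in_default = {"origin": "built-in"}

# Custom types taken from my_instance_type_list.yaml (requests only, for brevity).
custom_types = {
    "cpusmall": {"requests": {"cpu": "100m", "memory": "100Mi"}},
    "defaultinstancetype": {"requests": {"cpu": "1", "memory": "1Gi"}},
}

# No instance type given: the custom defaultinstancetype overrides the built-in one.
print(resolve_instance_type(None, custom_types, built_in_default))

# No custom types defined: the built-in default is used.
print(resolve_instance_type(None, {}, built_in_default))
```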
Select an instance type to submit a training job
To select an instance type for a training job by using the CLI (v2), specify its name as part of the resources properties section in the job YAML. For example:
command: python -c "print('Hello world!')"
environment:
  image: library/python:latest
compute: azureml:<Kubernetes-compute_target_name>
resources:
  instance_type: <instance_type_name>
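To select an instance type for a training job by using the SDK (v2), pass its name as the instance_type argument of command. For example:
from azure.ai.ml import command

# Define the command job; the inner double quotes are escaped so the string is valid Python
command_job = command(
    command="python -c \"print('Hello world!')\"",
    environment="AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu@latest",
    compute="<Kubernetes-compute_target_name>",
    instance_type="<instance_type_name>",
)
In the preceding examples, replace <Kubernetes-compute_target_name> with the name of your Kubernetes compute target, and replace <instance_type_name> with the name of the instance type that you want to select. If you don't specify an instance_type property, the system uses defaultinstancetype to submit the job.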
Select an instance type to deploy a model
To select an instance type for a model deployment by using the CLI (v2), specify its name for the instance_type property in the deployment YAML. For example:
name: blue
app_insights_enabled: true
endpoint_name: <endpoint name>
model: 
  path: ./model/sklearn_mnist_model.pkl
code_configuration:
  code: ./script/
  scoring_script: score.py
instance_type: <instance type name>
environment: 
  conda_file: file:./model/conda.yml
  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210727.v1
To select an instance type for a model deployment by using the SDK (v2), specify its name for the instance_type property in the KubernetesOnlineDeployment class. For example:
from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    KubernetesOnlineDeployment,
    Model,
)

model = Model(path="./model/sklearn_mnist_model.pkl")
env = Environment(
    conda_file="./model/conda.yml",
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210727.v1",
)

# Define the deployment and pin it to the chosen instance type
blue_deployment = KubernetesOnlineDeployment(
    name="blue",
    endpoint_name="<endpoint name>",
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="./script/", scoring_script="score.py"
    ),
    instance_count=1,
    instance_type="<instance type name>",
)
In the preceding examples, replace <instance type name> with the name of the instance type that you want to select. If you don't specify an instance_type property, the system uses defaultinstancetype to deploy the model.
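Resource section validation
You can use the resources section to define the resource requests and limits of a model deployment. The following examples show the Azure CLI deployment YAML first, followed by the Python SDK equivalent: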
environment: 
  conda_file: file:./model/conda.yml
  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210727.v1
resources:
  requests:
    cpu: "0.1"
    memory: "0.2Gi"
  limits:
    cpu: "0.2"
    #nvidia.com/gpu: 0
    memory: "0.5Gi"
instance_type: <instance type name>
# Python SDK (v2)
from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    KubernetesOnlineDeployment,
    Model,
    ResourceRequirementsSettings,
    ResourceSettings,
)

model = Model(path="./model/sklearn_mnist_model.pkl")
env = Environment(
    conda_file="./model/conda.yml",
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210727.v1",
)

# Resource requests and limits for the deployment's pods
requests = ResourceSettings(cpu="0.1", memory="0.2G")
limits = ResourceSettings(cpu="0.2", memory="0.5G", nvidia_gpu="1")
resources = ResourceRequirementsSettings(requests=requests, limits=limits)

# Define the deployment
blue_deployment = KubernetesOnlineDeployment(
    name="blue",
    endpoint_name="<endpoint name>",
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="./script/", scoring_script="score.py"
    ),
    resources=resources,
    instance_count=1,
    instance_type="<instance type name>",
)
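Parameter | Required or optional | Description
limits: cpu: | Optional (required only when you need a GPU) | String value, which can't be 0 or empty. You can specify CPU in millicores, for example 100m, or as a whole number, for example "1", which is equivalent to 1000m.
limits: memory: | Optional (required only when you need a GPU) | String value, which can't be 0 or empty. You can specify memory as a whole number plus a suffix, for example 1024Mi for 1,024 MiB.
limits: nvidia.com/gpu: | Optional (required only when you need a GPU) | Integer value, which can't be empty and can be specified only in the limits section. For more information, see the Kubernetes documentation. If you need only CPU, you can omit the entire limits section.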
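
To make the quantity notation concrete, the following Python sketch converts CPU and memory strings into plain numbers. It is an illustration only: parse_cpu and parse_memory are hypothetical helpers, and they handle just the common Kubernetes suffixes (m for millicores; Ki, Mi, Gi, Ti and k, M, G, T for memory), not the full quantity grammar.

```python
from decimal import Decimal

# Binary (power-of-two) and decimal (power-of-ten) memory suffixes.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity):
    """Convert a CPU quantity such as "100m" or "1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def parse_memory(quantity):
    """Convert a memory quantity such as "1024Mi" to bytes."""
    # Check two-letter suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = Decimal(quantity[: -len(suffix)])
            return int(number * _MEMORY_SUFFIXES[suffix])
    return int(Decimal(quantity))  # no suffix: plain bytes


print(parse_cpu("100m"))       # 100
print(parse_cpu("1"))          # 1000
print(parse_memory("1024Mi"))  # 1073741824
```

Converting both sides to the same unit in this way makes it straightforward to compare a requested value with a limit.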
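An instance type is required for model deployment. If you define the resources section, it is validated against the instance type according to the following rules:
With a valid resources section definition, the resource limits must be less than the instance type limits. Otherwise, the deployment fails.
If you don't define an instance type, defaultinstancetype is used for validation against the resources section.
If you don't define the resources section, the instance type is used to create the deployment.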
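
The three rules can be expressed as a short Python sketch. This is a literal reading of the rules rather than the service's actual validation logic: the function names are hypothetical, values are assumed to be pre-converted to numbers in a common unit (millicores and bytes), and "less than" is interpreted strictly, which is an assumption.

```python
def validate_limits(resource_limits, instance_type_limits):
    """Raise ValueError when a requested limit is not below the instance type's limit."""
    for name, value in resource_limits.items():
        ceiling = instance_type_limits.get(name)
        if ceiling is not None and value >= ceiling:
            raise ValueError(
                f"{name} limit {value} is not less than instance type limit {ceiling}"
            )


def effective_limits(resource_limits, instance_type_name, instance_types):
    """Return the limits a deployment would use under the three rules."""
    # Rule 2: fall back to defaultinstancetype when no instance type is given.
    instance_type = instance_types[instance_type_name or "defaultinstancetype"]
    # Rule 3: without a resources section, the instance type's limits apply.
    if not resource_limits:
        return instance_type["limits"]
    # Rule 1: a resources section must stay below the instance type's limits.
    validate_limits(resource_limits, instance_type["limits"])
    return resource_limits


# Illustrative instance types; CPU in millicores, memory in bytes.
instance_types = {
    "defaultinstancetype": {"limits": {"cpu": 2000, "memory": 2 * 2**30}},
    "cpusmall": {"limits": {"cpu": 1000, "memory": 2**30}},
}

# Passes validation: 200m CPU and 512 MiB are below the cpusmall limits.
print(effective_limits({"cpu": 200, "memory": 512 * 2**20}, "cpusmall", instance_types))

# No resources section: the instance type's own limits are used.
print(effective_limits(None, "cpusmall", instance_types))
```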
Azure Machine Learning inference router and connectivity requirements
Secure the AKS inferencing environment