Deploy Amazon SageMaker pipelines using AWS Controllers for Kubernetes

Kubernetes is a popular orchestration platform for managing containers. Its scalability and load-balancing capabilities make it ideal for handling the variable workloads typical of machine learning (ML) applications. DevOps engineers often use Kubernetes to manage and scale ML applications, but before an ML model is available, it must be trained and evaluated and, if the quality of the obtained model is satisfactory, uploaded to a model registry.

Amazon SageMaker provides capabilities to remove the undifferentiated heavy lifting of building and deploying ML models. SageMaker simplifies the process of managing dependencies, container images, auto scaling, and monitoring. Specifically for the model building stage, Amazon SageMaker Pipelines automates the process by managing the infrastructure and resources needed to process data, train models, and run evaluation tests.

A challenge for DevOps engineers is the additional complexity that comes from using Kubernetes to manage the deployment stage while resorting to other tools (such as the AWS SDK or AWS CloudFormation) to manage the model building pipeline. One alternative to simplify this process is to use AWS Controllers for Kubernetes (ACK) to manage and deploy a SageMaker training pipeline. ACK allows you to take advantage of managed model building pipelines without needing to define resources outside of the Kubernetes cluster.

In this post, we introduce an example to help DevOps engineers manage the entire ML lifecycle—including training and inference—using the same toolkit.

Solution overview

We consider a use case in which an ML engineer configures a SageMaker model building pipeline using a Jupyter notebook. This configuration takes the form of a Directed Acyclic Graph (DAG) represented as a JSON pipeline definition. The JSON document can be stored and versioned in an Amazon Simple Storage Service (Amazon S3) bucket. If encryption is required, it can be implemented using an AWS Key Management Service (AWS KMS) managed key for Amazon S3. A DevOps engineer with access to fetch this definition file from Amazon S3 can load the pipeline definition into an ACK service controller for SageMaker, which is running as part of an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The DevOps engineer can then use the Kubernetes APIs provided by ACK to submit the pipeline definition and initiate one or more pipeline runs in SageMaker. This entire workflow is shown in the following solution diagram.

Prerequisites

To follow along, you should have the following prerequisites:

An EKS cluster where the ML pipeline will be created.
A user with access to an AWS Identity and Access Management (IAM) role that has IAM permissions (iam:CreateRole, iam:AttachRolePolicy, and iam:PutRolePolicy) to allow creating roles and attaching policies to roles.
The following command line tools on the local machine or cloud-based development environment used to access the Kubernetes cluster:

Install the SageMaker ACK service controller

The SageMaker ACK service controller makes it straightforward for DevOps engineers to use Kubernetes as their control plane to create and manage ML pipelines. To install the controller in your EKS cluster, complete the following steps:

Configure IAM permissions to make sure the controller has access to the appropriate AWS resources.
Install the controller using a SageMaker Helm Chart to make it available on the client machine.

The following tutorial provides step-by-step instructions with the required commands to install the ACK service controller for SageMaker.

Generate a pipeline JSON definition

In most companies, ML engineers are responsible for creating the ML pipeline in their organization. They often work with DevOps engineers to operate those pipelines. In SageMaker, ML engineers can use the SageMaker Python SDK to generate a pipeline definition in JSON format. A SageMaker pipeline definition must follow the provided schema, which includes base images, dependencies, steps, and instance types and sizes that are needed to fully define the pipeline. This definition then gets retrieved by the DevOps engineer for deploying and maintaining the infrastructure needed for the pipeline.

The following is a sample pipeline definition with one training step:

{
  "Version": "2020-12-01",
  "Steps": [
  {
    "Name": "AbaloneTrain",
    "Type": "Training",
    "Arguments": {
      "RoleArn": "<<YOUR_SAGEMAKER_ROLE_ARN>>",
      "HyperParameters": {
        "max_depth": "5",
        "gamma": "4",
        "eta": "0.2",
        "min_child_weight": "6",
        "objective": "multi:softmax",
        "num_class": "10",
        "num_round": "10"
     },
     "AlgorithmSpecification": {
     "TrainingImage": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
     "TrainingInputMode": "File"
   },
   "OutputDataConfig": {
     "S3OutputPath": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/"
   },
   "ResourceConfig": {
     "InstanceCount": 1,
     "InstanceType": "ml.m4.xlarge",
     "VolumeSizeInGB": 5
   },
   "StoppingCondition": {
     "MaxRuntimeInSeconds": 86400
   },
   "InputDataConfig": [
   {
     "ChannelName": "train",
     "DataSource": {
       "S3DataSource": {
         "S3DataType": "S3Prefix",
         "S3Uri": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/xgboost/train/",
         "S3DataDistributionType": "
       }
     },
     "ContentType": "text/libsvm"
   },
   {
     "ChannelName": "validation",
     "DataSource": {
       "S3DataSource": {
         "S3DataType": "S3Prefix",
         "S3Uri": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/xgboost/validation/",
         "S3DataDistributionType": "FullyReplicated"
       }
     },
     "ContentType": "text/libsvm"
   }]
  }
 }]
}

With SageMaker, ML model artifacts and other system artifacts are encrypted in transit and at rest. SageMaker encrypts these by default using the AWS managed key for Amazon S3. You can optionally specify a custom key using the KmsKeyId property of the OutputDataConfig argument. For more information on how SageMaker protects data, see Data Protection in Amazon SageMaker.

Furthermore, we recommend securing access to the pipeline artifacts, such as model outputs and training data, to a specific set of IAM roles created for data scientists and ML engineers. This can be achieved by attaching an appropriate bucket policy. For more information on best practices for securing data in Amazon S3, see Top 10 security best practices for securing data in Amazon S3.

Create and submit a pipeline YAML specification

In the Kubernetes world, objects are the persistent entities in the Kubernetes cluster used to represent the state of your cluster. When you create an object in Kubernetes, you must provide the object specification that describes its desired state, as well as some basic information about the object (such as a name). Then, using tools such as kubectl, you provide the information in a manifest file in YAML (or JSON) format to communicate with the Kubernetes API.

Refer to the following Kubernetes YAML specification for a SageMaker pipeline. DevOps engineers need to modify the .spec.pipelineDefinition key in the file and add the ML engineer-provided pipeline JSON definition. They then prepare and submit a separate pipeline execution YAML specification to run the pipeline in SageMaker. There are two ways to submit a pipeline YAML specification:

Pass the pipeline definition inline as a JSON object to the pipeline YAML specification.
Convert the JSON pipeline definition into String format using the command line utility jq. For example, you can use the following command to convert the pipeline definition to a JSON-encoded string:

jq -r tojson <pipeline-definition.json>

In this post, we use the first option and prepare the YAML specification (my-pipeline.yaml) as follows:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Pipeline
metadata:
  name: my-kubernetes-pipeline
spec:
  parallelismConfiguration:
  	maxParallelExecutionSteps: 2
  pipelineName: my-kubernetes-pipeline
  pipelineDefinition: |
  {
    "Version": "2020-12-01",
    "Steps": [
    {
      "Name": "AbaloneTrain",
      "Type": "Training",
      "Arguments": {
        "RoleArn": "<<YOUR_SAGEMAKER_ROLE_ARN>>",
        "HyperParameters": {
          "max_depth": "5",
          "gamma": "4",
          "eta": "0.2",
          "min_child_weight": "6",
          "objective": "multi:softmax",
          "num_class": "10",
          "num_round": "30"
        },
        "AlgorithmSpecification": {
          "TrainingImage": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
          "TrainingInputMode": "File"
        },
        "OutputDataConfig": {
          "S3OutputPath": "s3://<<YOUR_S3_BUCKET>>/sagemaker/"
        },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m4.xlarge",
          "VolumeSizeInGB": 5
        },
        "StoppingCondition": {
          "MaxRuntimeInSeconds": 86400
        },
        "InputDataConfig": [
        {
          "ChannelName": "train",
          "DataSource": {
            "S3DataSource": {
              "S3DataType": "S3Prefix",
              "S3Uri": "s3://<<YOUR_S3_BUCKET>>/sagemaker/xgboost/train/",
              "S3DataDistributionType": "FullyReplicated"
            }
          },
          "ContentType": "text/libsvm"
        },
        {
          "ChannelName": "validation",
          "DataSource": {
            "S3DataSource": {
              "S3DataType": "S3Prefix",
              "S3Uri": "s3://<<YOUR_S3_BUCKET>>/sagemaker/xgboost/validation/",
              "S3DataDistributionType": "FullyReplicated"
            }
          },
          "ContentType": "text/libsvm"
        }
      ]
    }
  }
]}
pipelineDisplayName: my-kubernetes-pipeline
roleARN: <<YOUR_SAGEMAKER_ROLE_ARN>>

Submit the pipeline to SageMaker

To submit your prepared pipeline specification, apply the specification to your Kubernetes cluster as follows:

kubectl apply -f my-pipeline.yaml

Create and submit a pipeline execution YAML specification

Refer to the following Kubernetes YAML specification for a SageMaker pipeline. Prepare the pipeline execution YAML specification (pipeline-execution.yaml) as follows:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: PipelineExecution
metadata:
  name: my-kubernetes-pipeline-execution
spec:
  parallelismConfiguration:
  	maxParallelExecutionSteps: 2
  pipelineExecutionDescription: "My first pipeline execution via Amazon EKS cluster."
  pipelineName: my-kubernetes-pipeline

To start a run of the pipeline, use the following code:

kubectl apply -f pipeline-execution.yaml

Review and troubleshoot the pipeline run

To list all pipelines created using the ACK controller, use the following command:

To list all pipeline runs, use the following command:

kubectl get pipelineexecution

To get more details about the pipeline after it’s submitted, like checking the status, errors, or parameters of the pipeline, use the following command:

kubectl describe pipeline my-kubernetes-pipeline

To troubleshoot a pipeline run by reviewing more details about the run, use the following command:

kubectl describe pipelineexecution my-kubernetes-pipeline-execution

Clean up

Use the following command to delete any pipelines you created:

Use the following command to cancel any pipeline runs you started:

kubectl delete pipelineexecution

Conclusion

In this post, we presented an example of how ML engineers familiar with Jupyter notebooks and SageMaker environments can efficiently work with DevOps engineers familiar with Kubernetes and related tools to design and maintain an ML pipeline with the right infrastructure for their organization. This enables DevOps engineers to manage all the steps of the ML lifecycle with the same set of tools and environment they are used to, which enables organizations to innovate faster and more efficiently.

Explore the GitHub repository for ACK and the SageMaker controller to start managing your ML operations with Kubernetes.

About the Authors

Pratik Yeole is a Senior Solutions Architect working with global customers, helping customers build value-driven solutions on AWS. He has expertise in MLOps and containers domains. Outside of work, he enjoys time with friends, family, music, and cricket.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.