
We’re thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability allows for the seamless addition of SageMaker HyperPod managed compute to EKS clusters, using automatic node and job resiliency features for foundation model (FM) development.

FMs are often trained on large-scale compute clusters with hundreds or thousands of accelerators. Under such circumstances, hardware failures pose a significant challenge, because a single accelerator failure among thousands can halt the entire training process. For example, Meta Llama 3 405B pre-training over 54 days on 16K NVIDIA H100 Tensor Core GPUs experienced 419 unexpected interruptions, with 78% attributed to confirmed or suspected hardware issues, and with 58.7% of those interruptions being GPU-related problems, including NVLink failures and HBM3 memory failures.

Since its inception, SageMaker HyperPod was designed with a focus on managed resiliency features to mitigate such hardware failures, enabling FM builders such as Thomson Reuters, Perplexity AI, and Hugging Face to scale their FM training and inference on Slurm clusters. With the EKS support in HyperPod, you can now also benefit from the resiliency features on Kubernetes clusters by managing machine learning (ML) workloads using the HyperPod compute and managed Kubernetes control plane on the EKS cluster.

AI startups like Observea and Articul8, and enterprises like Thomson Reuters use this new feature set to manage their ML model development lifecycle:

“Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With Amazon EKS support in SageMaker HyperPod, we can reduce the time we spent on undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.”

– Observea

“As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”

– Articul8 AI

This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.

The post is organized into the following three sections:

  • Overview of Amazon EKS support in SageMaker HyperPod – This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introducing three key resiliency features HyperPod compute provides on the EKS cluster. Additionally, this section explains how HyperPod provides a simple developer experience for admins and scientists.
  • HyperPod cluster setup and node resiliency features – This section provides a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, emphasizing how its built-in resiliency features provide infrastructure stability. This section is especially useful for admins.
  • Training job resiliency with the job auto resume functionality – In this section, we demonstrate how scientists can submit and manage their distributed training jobs using either the native Kubernetes CLI (kubectl) or optionally the new HyperPod CLI (hyperpod) with automatic job recovery enabled.

Overview of Amazon EKS support in SageMaker HyperPod

This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introduces three key resiliency features HyperPod compute provides on the EKS cluster, and discusses how SageMaker HyperPod provides simple user experiences for admins and scientists.

Architecture overview

Amazon EKS support in HyperPod supports a 1-to-1 mapping between an EKS cluster (serving as a Kubernetes control plane) and HyperPod compute (attached as a group of worker nodes). You have three virtual private clouds (VPCs) in this architecture, hosting different types of resources:

  • Amazon EKS VPC – An AWS managed VPC hosts the EKS control plane. This VPC doesn’t appear in the customer account. Amazon EKS creates a highly available endpoint for the managed Kubernetes API server that you use to communicate with your cluster (using tools like kubectl). The managed endpoint uses Network Load Balancer to load balance Kubernetes API servers.
  • HyperPod VPC – An AWS managed VPC hosts the HyperPod compute. This VPC doesn’t appear in the customer account. The nodes connect to the EKS control plane through a cross-account elastic network interface (ENI).
  • SageMaker user VPC – A user-managed VPC hosts resources such as Amazon FSx for Lustre, which is optionally associated with Amazon Simple Storage Service (Amazon S3) using a data repository association, in your account.

Cross-account ENIs also bridge communication between HyperPod compute instances and other AWS services in your account, such as Amazon Elastic Container Registry (Amazon ECR) and Amazon CloudWatch.
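To confirm that your environment can reach the managed API server endpoint, standard tooling is sufficient. The following is a minimal sketch; the cluster name is a placeholder for your own EKS cluster:

# Print the managed Kubernetes API server endpoint
aws eks describe-cluster --name my-eks-cluster \
    --query 'cluster.endpoint' --output text

# Confirm kubectl can communicate with the control plane
kubectl cluster-info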

The following diagram illustrates the high-level architecture of Amazon EKS support in HyperPod.

HyperPod-managed resiliency features

Amazon EKS support in HyperPod provides the following three capabilities to make sure the cluster stays healthy and training jobs continue under unexpected interruptions:

  • Deep health checks – This is a managed health check for stress testing GPUs and AWS Trainium instances, as well as performing Elastic Fabric Adapter (EFA) checks. These checks can be run during the cluster creation, update, or node replacement phases, and can be enabled or disabled through HyperPod APIs.
  • Automatic node recovery – HyperPod performs managed, lightweight, and non-invasive checks, coupled with automated node replacement capability. The HyperPod monitoring agent continuously monitors and detects potential issues, including memory exhaustion, disk failures, GPU anomalies, kernel deadlocks, container runtime issues, and out-of-memory (OOM) crashes. Based on the underlying issue, the monitoring agent either replaces or reboots the node.
  • Job auto resume – SageMaker HyperPod provides a job auto resume capability using the Kubeflow Training Operator for PyTorch to provide recovery and continuation of training jobs in the event of interruptions or failures. The extension makes sure the job waits and restarts after the node is replaced.

User experiences

In addition to the aforementioned managed resiliency features, SageMaker HyperPod provides simple user experiences for both admins and scientists that are essential for managing a large cluster and running large-scale training jobs on it as part of the Amazon EKS integration:

  • Admin experience – SageMaker HyperPod provides APIs and a console experience to create and manage node groups in the EKS cluster, along with the ability to SSH into the cluster nodes. SageMaker HyperPod also provides a mechanism to install additional dependencies on the cluster nodes using lifecycle scripts, and an API-based mechanism to provide cluster software updates and improve overall observability.
  • Scientist experience – Along with enabling scientists to train FMs using Amazon EKS as the orchestrator, SageMaker HyperPod provides additional capabilities for scientists to effortlessly train models. With the HyperPod CLI, scientists can submit training jobs by providing a .yaml file and manage jobs (list, describe, view, cancel) without needing to use kubectl; a brief sketch follows this list. Scientists can use open source tools like Kueue (a Kubernetes tool for job queuing) and adjacent SageMaker capabilities like managed MLflow to manage their experiments and training runs. Scientists can also access native SageMaker distributed training libraries that provide performance improvements of up to 20%. You can also use SageMaker HyperPod compute with Amazon EKS support using third-party tools like KubeRay, which runs on the Kubernetes API. This lets you bring your preferred job submission and management capabilities used with other Kubernetes clusters into your HyperPod environment.
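The following is a minimal sketch of that workflow. The subcommand names are assumptions based on the open source HyperPod CLI and may differ by version, so verify them with hyperpod --help; the job name is a placeholder:

# List training jobs running on the HyperPod compute
hyperpod list-jobs

# Describe a single job (job name is a placeholder)
hyperpod get-job --job-name fsdpjob

# Cancel a running job
hyperpod cancel-job --job-name fsdpjob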

HyperPod compute setup and node resiliency features

In this section, we provide a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, and discuss how its built-in resiliency features provide infrastructure stability.

Prerequisites

You need to have the following in place prior to the HyperPod compute deployment:

  • EKS cluster – You can associate HyperPod compute with an existing EKS cluster that satisfies the set of prerequisites. Alternatively, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the architecture guide for step-by-step setup instructions.
  • Custom resources – Running multi-node distributed training requires various components, such as device plugins, CSI drivers, and Training Operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. HyperPodHelmCharts simplify the process using Helm, one of the most commonly used package managers for Kubernetes; a sketch of the Helm workflow follows this list. Refer to the developer guide for installation.
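The Helm workflow typically looks like the following sketch. The repository URL and chart path are assumptions based on the HyperPodHelmCharts project, so confirm them against the developer guide:

git clone https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli/helm_chart

# Fetch chart dependencies, then install them into the EKS cluster
helm dependencies update HyperPodHelmChart
helm install dependencies HyperPodHelmChart --namespace kube-system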

HyperPod compute setup

With the aforementioned resources successfully deployed, you’re now ready to create the HyperPod compute. The cluster configuration is specified using a JSON file; the following code provides an example:

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "NodeRecovery": "Computerized"
}
EOL

The provided configuration file contains two key highlights:

  • "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
  • "NodeRecovery": "Automatic" – Enables HyperPod’s automatic node recovery functionality
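The configuration file also references several shell variables. The following is a minimal sketch of how you might define them before generating the file; all values are placeholders for your own resources:

export EKS_CLUSTER_ARN="arn:aws:eks:us-east-2:111111111111:cluster/my-eks-cluster"
export BUCKET_NAME="my-lifecycle-scripts-bucket"   # S3 bucket containing on_create.sh
export EXECUTION_ROLE="arn:aws:iam::111111111111:role/MyHyperPodExecutionRole"
export SECURITY_GROUP="sg-0123456789abcdef0"
export SUBNET_ID="subnet-0123456789abcdef0"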

You can create a HyperPod compute with the following aws command (you need version 2.17.47 or newer):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json

{
    "ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

-----------------------------------------------------------------------------------------------------------------------
|                                                    ListClusters                                                     |
+---------------------------------------------------------------------------------------------------------------------+
||                                                 ClusterSummaries                                                  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||                           ClusterArn                           | ClusterName  | ClusterStatus  |  CreationTime    ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||  arn:aws:sagemaker:us-east-2:111111111111:cluster/wccy5z4n4m49 |  ml-cluster  |  Creating      |  1723724079.337  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|

Alternatively, you can verify the cluster status through the SageMaker console. After a brief period, you can observe that the status for all nodes transitions to Running.
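You can also inspect the cluster and its nodes from the AWS CLI. The query expressions below are a sketch based on the documented describe-cluster and list-cluster-nodes output shapes; adjust them if the fields differ in your CLI version:

# High-level cluster status
aws sagemaker describe-cluster --cluster-name ml-cluster --query 'ClusterStatus'

# Status of each node in the cluster
aws sagemaker list-cluster-nodes --cluster-name ml-cluster \
    --query 'ClusterNodeSummaries[].[InstanceId,InstanceStatus.Status]' \
    --output table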


Node resiliency features

To gain further insight into the instances, you can use kubectl get nodes and examine the node labels. The sagemaker.amazonaws.com/node-health-status label shows the life stage of each node. For instance, nodes with the ml.m5.2xlarge instance type are labeled as Schedulable, indicating that they have successfully passed the regular HyperPod health check. Conversely, nodes with the ml.p5.48xlarge instance type are labeled as Unschedulable, indicating that they have entered the initial deep health checks. The following code shows an example:

# kubectl get nodes --show-labels=true
NAME                         ...  LABELS
hyperpod-i-023cfe933b3b34369 ...  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  ...
hyperpod-i-045961b6424401838 ...  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, ...
hyperpod-i-074b81fdb5bf52e19 ...  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, ...
hyperpod-i-0ae97710b3033cdb1 ...  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  ...
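A label selector narrows the output to nodes that are ready for work; this is standard kubectl usage rather than a HyperPod-specific interface:

# Show only nodes that have passed health checks and accept pods
kubectl get nodes -l sagemaker.amazonaws.com/node-health-status=Schedulable

# Dump all labels on a single node (node name is a placeholder)
kubectl get node hyperpod-i-023cfe933b3b34369 -o jsonpath='{.metadata.labels}'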

The deep health check logs are stored in the CloudWatch log group at /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>. The log streams are logged at DeepHealthCheckResults/<log_stream_id>. When the deep health checks identify an issue, the output log provides detailed information, including the instance ID that failed the deep health checks and the specific failure reason. For example:

# Example1
{
"level": "error",
"ts": "2024-08-15T21:15:22Z",
"msg": "Encountered FaultyInstance. Replace the Instance. Region: us-east-2,
InstanceType: p5.48xlarge. ERROR: Bandwidth is less than threshold: Expected minimum
threshold: 80, NCCL Test output Bw: 30"
}
# Example2
{
"level": "error",
"ts": "2024-08-15T21:15:22Z",
"msg": "Encountered Unknownerror. Replace the Instance. Region: us-east-2,
InstanceType: p5.48xlarge. ERROR: Crash detected in dcgm test"
}
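You can retrieve these results from the AWS CLI as well. The following sketch uses the log group format described above with placeholder cluster name and ID:

aws logs filter-log-events \
    --log-group-name "/aws/sagemaker/Clusters/ml-cluster/wccy5z4n4m49" \
    --log-stream-name-prefix "DeepHealthCheckResults" \
    --filter-pattern "error"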

You can check the progress of the deep health check with the following values for the sagemaker.amazonaws.com/deep-health-check label on each node:

  • sagemaker.amazonaws.com/deep-health-check: InProgress
  • sagemaker.amazonaws.com/deep-health-check: Passed
  • sagemaker.amazonaws.com/deep-health-check: Failed

If a node fails the deep health checks, it will be replaced. Otherwise, it will be marked with the Schedulable label:

sagemaker.amazonaws.com/node-health-status: Schedulable

When you want to manually replace a specific node in your cluster, you can do so by manually modifying the label.
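One way to do this is with kubectl label. The label value below is the pending-replacement state described later in this post; treat this as a sketch and confirm the exact value against the AWS documentation:

# Mark a node for replacement (node name is a placeholder)
kubectl label node hyperpod-i-023cfe933b3b34369 \
    sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace \
    --overwrite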

For the full list of resilience-related Kubernetes labels, refer to the AWS documentation.

Even after the initial deep health checks, HyperPod periodically runs regular health checks. To view the health events detected by the HyperPod health monitoring agent, you can check the CloudWatch log stream:

  • Example log group name: /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>
  • Example log stream name: SagemakerHealthMonitoringAgent/<your_node_group_name>/<instance_id>

The SagemakerHealthMonitoringAgent log stream for each node contains only the detection events from the health monitoring agent. For example:

# Example1
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "KernelDeadlock",
    "with condition details ": {
        "type": "KernelDeadlock",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932213Z",
        "reason": "KernelHasNoDeadlock",
        "message": "kernel has no deadlock"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
# Example2
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "NvidiaErrorTerminate",
    "with condition details ": {
        "type": "NvidiaErrorTerminate",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932283Z",
        "reason": "NvidiaNoErrorRequiredTerminate",
        "message": "Nvidia no error required terminate"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
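To follow these events live from the AWS CLI, aws logs tail works against the same log group; the cluster name, cluster ID, and node group below are placeholders:

aws logs tail "/aws/sagemaker/Clusters/ml-cluster/wccy5z4n4m49" \
    --log-stream-name-prefix "SagemakerHealthMonitoringAgent/worker-group-1" \
    --follow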

When the deep health checks or the health monitoring agent identify issues in a certain node, the node is labeled with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace:NoSchedule to avoid scheduling pods, and then the node is replaced or rebooted.

You can monitor the health status of HyperPod nodes through CloudWatch Container Insights, now with enhanced observability for Amazon EKS. Container Insights helps collect, aggregate, and summarize metrics and logs from containerized applications and microservices, providing detailed insights into performance, health, and status metrics for CPU, GPU, Trainium, EFA, and file system down to the container level. For the complete list of metrics tracked, see Amazon EKS and Kubernetes Container Insights metrics. With the Container Insights integration with SageMaker HyperPod, you can also check the individual node health status and the total number of schedulable and unschedulable nodes, as shown in the following screenshots.

You can find the Container Insights setup guide in the Amazon EKS Support in Amazon SageMaker HyperPod Workshop.

Training job resiliency with the job auto resume functionality

In addition to infrastructure resiliency features, you can use the job auto resume capability using the Kubeflow Training Operator for PyTorch to maintain the recovery and continuation of training jobs in the event of interruptions or failures. The job auto resume feature attempts to continue the job, whereas the HyperPod node auto recovery functionality works on resolving node failures (node reboot or replacement as needed) to minimize training downtime. This section demonstrates the job auto resume feature using a PyTorch FSDP example on the awsome-distributed-training repository.

To enable the job auto resume feature, you create a PyTorchJob with the fsdp.yaml manifest, which includes the following annotations and nodeSelector:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
    name: fsdpjob
    namespace: kubeflow
    # config for HyperPod job auto-resume
    annotations: {
        sagemaker.amazonaws.com/enable-job-auto-resume: "true",
        sagemaker.amazonaws.com/job-max-retry-count: "2"
    }
spec:
  pytorchReplicaSpecs:
    ......
    Worker:
      replicas: 10
      restartPolicy: OnFailure

      template:
          spec:
            nodeSelector:
              sagemaker.amazonaws.com/node-health-status: Schedulable
......

With the annotations sagemaker.amazonaws.com/enable-job-auto-resume: "true" and sagemaker.amazonaws.com/job-max-retry-count: "2", SageMaker HyperPod resumes interrupted training jobs up to two times and schedules the resumed jobs onto healthy nodes. These healthy nodes are identified by the node selector label sagemaker.amazonaws.com/node-health-status: Schedulable, ensuring that only nodes that have passed basic health checks and are available for running workloads are used for resumed jobs.

Submit the PyTorchJob using the kubectl command:

kubectl apply -f fsdp.yaml
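After submission, standard kubectl commands let you watch the job’s pods get scheduled onto healthy nodes:

# Watch master and worker pods as they start
kubectl get pods -n kubeflow -w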

With the job auto resume feature enabled, if a job fails due to a hardware failure or any transient issues during training, SageMaker HyperPod initiates the node replacement workflow and restarts the job after the faulty nodes are replaced. You can verify the status of job auto resume by describing the PyTorchJob:

kubectl describe pytorchjob -n kubeflow <job-name>

In the event of a hardware failure, the Kubeflow training job restarts as follows:

Start Time: 2024-07-11T05:53:10Z
Enable job auto-resume 27

Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.

When you submit a training job with the HyperPod CLI, you can also request the job to be auto resumed in the following manner:

hyperpod start-job \
    --config-file ./config.yaml \
    --auto-resume true \
    --max-retry 2

Refer to config.yaml for the full configuration. For other CLI options, refer to the documentation in the GitHub repository.

Clean up

To delete your SageMaker HyperPod compute, use either the SageMaker console or the following AWS Command Line Interface (AWS CLI) command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion can take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker console.
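From the AWS CLI, a simple polling loop confirms the same thing, because describe-cluster starts failing once the cluster is gone; the cluster name is a placeholder:

# Poll until the cluster no longer exists
while aws sagemaker describe-cluster --cluster-name ml-cluster >/dev/null 2>&1; do
    echo "Cluster still deleting..."
    sleep 30
done
echo "Cluster deleted."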

Conclusion

With the support for Amazon EKS in SageMaker HyperPod, customers who have standardized their FM development workflows on Kubernetes can adopt SageMaker HyperPod and manage their cluster resources using a familiar Kubernetes interface in SageMaker HyperPod. When training an FM, SageMaker HyperPod automatically monitors cluster health, and when an infrastructure fault such as a GPU failure occurs, SageMaker HyperPod automatically remediates the issue and restarts the training process from the last saved checkpoint, without any human intervention. Amazon EKS further enhances this capability by running deep health checks. Whenever a new instance is added to the SageMaker HyperPod compute, it undergoes a deep health check process to identify and replace potentially problematic instances. SageMaker HyperPod then automatically replaces or reboots nodes identified as faulty and resumes training processes in the event of unexpected interruptions, involving node replacement and job resubmission.

For an end-to-end tutorial on cluster management and FM training, visit the Amazon EKS Support in Amazon SageMaker HyperPod Workshop. For more information on infrastructure deployment and additional distributed training test cases, refer to the awsome-distributed-training repository. If you’re interested in deploying HyperPod with step-by-step commands, you can start from the aws-do-hyperpod repository.


About the authors

Keita Watanabe is a Senior GenAI Specialist Solutions Architect in the worldwide specialist organization at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect in the worldwide specialist organization at AWS. In his role, he focuses on helping customers with the orchestration and scaling of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing generative AI and ML, fighting climate change, and making travel safer, healthcare better, and energy smarter.

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Manoj Ravi is a Senior Product Manager on the Amazon SageMaker team. He is passionate about building next-gen AI products and works on applications and tools to make foundation model development and deployment simple for customers. He holds an MBA from the Haas School of Business and a master’s degree from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.
