Use Amazon SageMaker HyperPod task governance to schedule topology-aware workloads

by root September 15, 2025

Today, we're happy to announce new capabilities in Amazon SageMaker HyperPod task governance to help optimize training efficiency and network latency for AI workloads. SageMaker HyperPod task governance streamlines resource allocation and facilitates efficient use of compute resources across teams and projects on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. Administrators can govern accelerated compute allocation, enforce task priority policies, and improve resource utilization. This helps organizations focus on accelerating generative AI innovation and reducing time to market rather than on coordinating resource allocation and replanning tasks. For more information, see Best practices for Amazon SageMaker HyperPod task governance.

Generative AI workloads typically require extensive network communication across Amazon Elastic Compute Cloud (Amazon EC2) instances, so network bandwidth affects both workload runtime and processing latency. The network latency of this communication depends on the physical placement of the instances within the hierarchical infrastructure of the data center. Data centers can be organized into nested organizational units such as network nodes and node sets, with multiple instances per network node and multiple network nodes per node set. For example, instances within the same organizational unit see faster processing times than instances in different units. In other words, fewer network hops between instances result in lower communication latency.

To optimize the placement of generative AI workloads on SageMaker HyperPod clusters, you can use EC2 network topology information during job submission so that the physical and logical placement of resources is taken into account. The topology of an EC2 instance is described by a set of nodes, with one node in each layer of the network. For more information about how EC2 topology is arranged, see How Amazon EC2 instance topology works. Network topology labels offer the following key benefits:

Reduced latency by minimizing network hops and routing traffic to nearby instances
Improved training efficiency by optimizing workload placement across network resources

With topology-aware scheduling in SageMaker HyperPod task governance, you can use network topology labels to schedule jobs with optimized network communication, improving task efficiency and resource utilization for AI workloads.

This post introduces topology-aware scheduling with SageMaker HyperPod task governance by submitting jobs that represent hierarchical network information. It also provides details on how to use SageMaker HyperPod task governance to optimize job efficiency.

Solution overview

Data scientists interact with SageMaker HyperPod clusters. They are responsible for training, fine-tuning, and deploying models on accelerated compute instances. It is important to make sure that data scientists have the necessary capacity and permissions when interacting with clusters of GPUs.

To implement topology-aware scheduling, you first check the topology information for all nodes in the cluster, then run a script that tells you which instances are on the same network nodes, and finally schedule a topology-aware training task on the cluster. This workflow provides greater visibility into, and control over, the placement of training instances.

In this post, you view node topology information and submit topology-aware tasks to the cluster. For reference, an instance's topology is described by its network node set. In each network node set, three layers make up the hierarchical view of the topology for that instance. Instances that are closest to each other share the same layer 3 network node. If two instances have no common network node in the bottom layer (layer 3), check whether they have one in common at layer 2.
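The layer-by-layer comparison described above can be sketched as a small POSIX shell function. This is only an illustration: the function name deepest_shared_layer and the network node IDs in the example calls are hypothetical, and the function assumes you already have the three layer labels for each of the two instances.

```shell
# Hypothetical helper: print the deepest topology layer that two instances share.
#   $1 $2 $3 = layer 1, 2, 3 network node IDs of instance A
#   $4 $5 $6 = layer 1, 2, 3 network node IDs of instance B
# Prints 3, 2, or 1 for the deepest common layer, or 0 if none is shared.
deepest_shared_layer() {
  if [ -n "$3" ] && [ "$3" = "$6" ]; then
    echo 3   # same layer 3 network node: closest possible placement
  elif [ -n "$2" ] && [ "$2" = "$5" ]; then
    echo 2   # different layer 3 nodes, but a common layer 2 node
  elif [ -n "$1" ] && [ "$1" = "$4" ]; then
    echo 1   # only the top layer is shared
  else
    echo 0   # no shared network node in any layer
  fi
}

# Example with made-up IDs: the instances differ at layer 3 but share layer 2.
deepest_shared_layer nn-1a nn-2a nn-3a nn-1a nn-2a nn-3b   # prints 2
```

A higher result means fewer network hops between the two instances, which is the property topology-aware scheduling tries to preserve.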

Prerequisites

To get started with topology-aware scheduling, you need the following prerequisites:

An EKS cluster
A SageMaker HyperPod cluster with instances enabled for topology information
The SageMaker HyperPod task governance add-on installed (version 1.2.2 or later)
kubectl installed
(Optional) The SageMaker HyperPod CLI installed

Get node topology information

Run the following commands to show the node labels in your cluster. These commands provide network topology information for each instance.

kubectl get nodes -L topology.k8s.aws/network-node-layer-1
kubectl get nodes -L topology.k8s.aws/network-node-layer-2
kubectl get nodes -L topology.k8s.aws/network-node-layer-3

Instances with the same network node layer 3 are as close as possible, following the EC2 topology hierarchy. You should see a list of node labels that look like the following:

topology.k8s.aws/network-node-layer-3: nn-33333example

Run the following script to show the nodes in your cluster that are on the same layer 1, 2, and 3 network nodes:

git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/task-governance 
chmod +x visualize_topology.sh
bash visualize_topology.sh

The output of this script prints a flow chart that you can paste into a flow diagram editor such as Mermaid.js.org to visualize the node topology of your cluster. The following diagram shows an example cluster topology for a seven-instance cluster.

[Diagram: example cluster topology for a seven-instance cluster]

Submit tasks

SageMaker HyperPod task governance provides two ways to submit tasks using topology awareness. This section discusses these two options and a third alternative to task governance.

Modify your Kubernetes manifest file

First, you can modify an existing Kubernetes manifest file to include one of two annotation options:

kueue.x-k8s.io/podset-required-topology - Use this option if all pods must be scheduled on nodes in the same network node layer for the job to start
kueue.x-k8s.io/podset-preferred-topology - Use this option if you would ideally like all pods to be scheduled on nodes in the same network node layer, but you have flexibility

The following code is an example of using the kueue.x-k8s.io/podset-required-topology annotation to schedule pods that share the same layer 3 network node:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: hyperpod-ns-team-a
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
    kueue.x-k8s.io/priority-class: inference-priority
spec:
  parallelism: 10
  completions: 10
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue
      annotations:
        kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
    spec:
      containers:
        - name: dummy-job
          image: public.ecr.aws/docker/library/alpine:latest
          command: ["sleep", "3600s"]
          resources:
            requests:
              cpu: "1"
      restartPolicy: Never

To see which node each pod is running on, use the following command to view the node ID for each pod:

kubectl get pods -n hyperpod-ns-team-a -o wide
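To go one step further and confirm that the pods actually landed on a single layer 3 network node, you can join the pod-to-node mapping with the node-to-label mapping. The sketch below is an assumption-laden illustration, not part of task governance: the function name count_layer3_nodes and the file names are hypothetical, and the commented kubectl commands list every pod in the namespace, not only the pods of one job.

```shell
# Hypothetical helper: count the distinct layer 3 network nodes hosting a set of pods.
#   $1 = pods file,  lines of "<pod-name> <node-name>"
#   $2 = nodes file, lines of "<node-name> <layer-3-id>"
# A result of 1 means every listed pod shares the same layer 3 network node.
count_layer3_nodes() {
  awk '
    NR == FNR { layer[$1] = $2; next }   # first file read: node -> layer 3 ID
    { seen[layer[$2]] = 1 }              # second file read: record each pod node ID
    END { n = 0; for (id in seen) n++; print n }
  ' "$2" "$1"
}

# Usage against a live cluster (assumes kubectl access; file names are arbitrary):
# kubectl get pods -n hyperpod-ns-team-a --no-headers \
#   -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' > pods.txt
# kubectl get nodes --no-headers \
#   -o custom-columns='NAME:.metadata.name,L3:.metadata.labels.topology\.k8s\.aws/network-node-layer-3' > nodes.txt
# count_layer3_nodes pods.txt nodes.txt
```

Pods that are not yet scheduled have no matching node entry and are counted as their own group, so a result greater than 1 can also mean that some pods are still pending.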

Use the SageMaker HyperPod CLI

The second way to submit a job is through the SageMaker HyperPod CLI. To use topology-aware scheduling, make sure to install the latest version (pending version). You can then include either the --preferred-topology parameter or the --required-topology parameter in your create job command.

The following code is an example command to start a topology-aware MNIST training job using the SageMaker HyperPod CLI. Replace XXXXXXXXXXXX with your AWS account ID.

hyp create hyp-pytorch-job \
--job-name test-pytorch-job-cli \
--image XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist \
--pull-policy "Always" \
--tasks-per-node 1 \
--max-retry 1 \
--preferred-topology topology.k8s.aws/network-node-layer-3
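If you switch between the strict and flexible modes often, a thin wrapper can make the choice explicit. The following is a hypothetical dry-run sketch: the function name hyp_topology_cmd is made up, it only prints the command rather than running it, and it reuses the flags and placeholder image from the example above without adding any others.

```shell
# Hypothetical dry-run helper: print the hyp command for a given topology mode.
#   $1 = "required" or "preferred"
#   $2 = topology label, for example topology.k8s.aws/network-node-layer-3
hyp_topology_cmd() {
  mode=$1
  layer=$2
  case "$mode" in
    required|preferred) ;;   # the two topology parameters named in this post
    *) echo "mode must be 'required' or 'preferred'" >&2; return 1 ;;
  esac
  printf '%s\n' "hyp create hyp-pytorch-job --job-name test-pytorch-job-cli --image XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist --pull-policy Always --tasks-per-node 1 --max-retry 1 --${mode}-topology ${layer}"
}

# Print the strict variant of the example command (nothing is submitted).
hyp_topology_cmd required topology.k8s.aws/network-node-layer-3
```

Printing the command first lets you review which topology parameter will be used before you submit the job yourself.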

Clean up

If you deployed new resources while following this post, see the Clean up section of the SageMaker HyperPod EKS workshop to make sure you don't incur unnecessary charges.

Conclusion

During large language model (LLM) training, pod-to-pod communication distributes the model across multiple instances, which requires frequent data exchange between those instances. In this post, we discussed how SageMaker HyperPod task governance can help you schedule workloads to improve job efficiency by optimizing throughput and latency. We also explained how to schedule jobs using SageMaker HyperPod network topology information to reduce network communication latency for AI tasks.

We encourage you to try this solution and share your feedback in the comments section.

About the authors

Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large-scale distributed training and inference on AWS. Before her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.

Siamak Nariman is a Senior Product Manager at AWS. He focuses on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying a wide range of technologies.

Zican Li is a Senior Software Engineer at Amazon Web Services (AWS), where he leads software development for task governance on SageMaker HyperPod. In his role, he focuses on empowering customers with advanced AI capabilities while fostering an environment that maximizes efficiency and productivity for their engineering teams.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and to lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large enterprises, focusing primarily on silicon and system architecture for AI infrastructure.
