Monitor and debug generative AI inference with SageMaker detailed metrics and Insights dashboard on CloudWatch

by root June 19, 2026


Monitoring and troubleshooting generative AI inference endpoints running at scale is difficult. When your large language model (LLM) endpoint’s P99 latency spikes, you need to determine in minutes whether the root cause is GPU memory pressure, a saturated KV cache, unbalanced traffic across Availability Zones, or an auto scaling policy that hasn’t triggered. The shift from training to serving is reshaping how teams deploy LLMs and other generative AI models in production. Machine learning (ML) platform engineers, MLOps teams, and site reliability engineers (SREs) must keep inference endpoints healthy, responsive, and cost-efficient, often across dozens of models and hundreds of GPU instances.

Amazon SageMaker AI provides fully managed real-time inference hosting for machine learning models. You deploy a model to a SageMaker endpoint backed by one or more compute instances, and SageMaker handles provisioning and scaling. SageMaker supports multiple endpoint architectures. This post focuses on the two most relevant to generative AI workloads with detailed observability:

Single-model endpoints (SME) – Each endpoint hosts one model on dedicated instances. SMEs are simple to set up and reason about, but each model requires its own fleet of GPU instances.
Inference component (IC) endpoints – Multiple models share the same set of instances through inference components. Each inference component defines a model, its resource requirements (CPU, GPU, memory), and its scaling policy. IC endpoints are the recommended architecture for production generative AI workloads because they support multi-model hosting on shared GPU infrastructure, independent scaling per model, and high availability (HA) through copy distribution across AZs.

SageMaker endpoints emit metrics like invocation counts, model latency, and overhead latency to Amazon CloudWatch. These aggregate metrics are useful for understanding overall endpoint health. As teams scale to multi-model deployments on GPU fleets, they need deeper signals. Amazon SageMaker AI now emits over 100 detailed inference metrics. These cover GPU health, token-level latency, KV cache pressure, traffic distribution across AZs, inference component placement, and cold start diagnostics. These metrics flow to a built-in SageMaker Insights dashboard in Amazon CloudWatch, a fully managed observability solution that removes the need for custom Grafana dashboards and Prometheus configuration. The SageMaker Insights dashboard supports both endpoint types and automatically shows IC-specific panels when inference components are detected.

For more details on SageMaker inference, see Deploy models for real-time inference.

In this post, you’ll learn how to:

Turn on detailed observability metrics on new and existing SageMaker inference endpoints.
Navigate the SageMaker Insights dashboard to monitor fleet health across Performance, Capacity, and Reliability views.
Connect the metrics to your own observability tool (Grafana, Datadog) through the PromQL-compatible endpoint.

SageMaker inference observability overview

SageMaker inference endpoints emit native OpenTelemetry metrics to CloudWatch. The SageMaker Insights dashboard is located in the CloudWatch console under Infrastructure Monitoring → SageMaker Insights. It queries these metrics using PromQL and renders visualizations at the fleet, endpoint, and inference-component level across three tabs: Performance, Capacity, and Reliability.

Performance – Fleet health, token latency, throughput, errors, engine pressure.
Capacity – GPU, CPU, and memory utilization of the fleet.
Reliability – Availability Zone distribution, scaling events, cold start anatomy, and insufficient capacity errors.

SageMaker Insights dashboard in CloudWatch showing the Performance, Capacity, and Reliability tabs

Key services

Amazon SageMaker AI – Managed inference with endpoints and inference components.
Amazon CloudWatch – Native support for OpenTelemetry metrics and PromQL queries through SageMaker Insights.

For background on the OpenTelemetry and PromQL support in CloudWatch, see Introducing OpenTelemetry PromQL support in Amazon CloudWatch.

Prerequisites

You must have the following to follow along with this post.

An AWS account with at least one SageMaker real-time inference endpoint.
AWS Identity and Access Management (IAM) permissions: sagemaker:CreateEndpointConfig, sagemaker:UpdateEndpoint, and cloudwatch:GetMetricData.
vLLM or SGLang container framework (required for token-level metrics like TTFT and ITL).

GPU instances receive per-accelerator utilization metrics in addition to the CPU and memory metrics available on all instance types. For the full setup guide, see Getting started with detailed observability.
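The IAM permissions listed in the prerequisites can be written down as a policy document. The following is a minimal sketch, not an official policy: the Sid is a made-up label, and the wildcard resource is an assumption made for brevity that you would normally narrow to specific ARNs.

```python
import json

# Actions named in the prerequisites list of this post.
REQUIRED_ACTIONS = [
    "sagemaker:CreateEndpointConfig",
    "sagemaker:UpdateEndpoint",
    "cloudwatch:GetMetricData",
]


def build_policy(actions, resource="*"):
    """Return an IAM policy document that allows the given actions.

    resource="*" is an assumption for brevity; scope it down where
    the action supports resource-level permissions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ObservabilityWalkthrough",  # hypothetical label
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": resource,
            }
        ],
    }


policy = build_policy(REQUIRED_ACTIONS)
print(json.dumps(policy, indent=2))
```

You can paste the printed JSON into the IAM console or attach it with your usual infrastructure tooling.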

Turn on detailed metrics on your endpoints

New endpoints: Automatic (default-on)

For any new endpoint configurations you create, detailed metrics are turned on by default. The EnableDetailedObservability parameter in your endpoint configuration defaults to true. No further code is required.

import boto3

sm = boto3.client("sagemaker")

# Create endpoint config: detailed observability is turned on by default
response = sm.create_endpoint_config(
    EndpointConfigName="my-llm-config",
    ProductionVariants=[{
        "VariantName": "primary",
        "InstanceType": "ml.g6.4xlarge",
        "InitialInstanceCount": 2,
        "ManagedInstanceScaling": {
            "Status": "ENABLED",
            "MinInstanceCount": 2,
            "MaxInstanceCount": 8
        }
    }],
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole"
)

You can also explicitly set the publishing frequency using MetricsPublishFrequencyInSeconds in MetricsConfig. The default is 60 seconds. For workloads that need near real-time monitoring, you can set it to less than a minute.

# Create endpoint
sm.create_endpoint(
    EndpointName="my-llm-endpoint",
    EndpointConfigName="my-llm-config"
)

Within 2 minutes of the endpoint reaching InService, the OpenTelemetry format metrics begin flowing to CloudWatch.
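To set the publishing frequency explicitly, you pass a MetricsConfig block when you create the endpoint configuration. The helper below is a small sketch that only builds that block. The field names are the ones used in this post, the helper name is hypothetical, and because the valid frequency values are not listed here, the check is limited to "positive integer".

```python
def build_metrics_config(publish_frequency_seconds=None):
    """Build a MetricsConfig dict for create_endpoint_config.

    Leaves the frequency out when it is None so the service default applies.
    """
    config = {"EnableDetailedObservability": True}
    if publish_frequency_seconds is not None:
        if not isinstance(publish_frequency_seconds, int) or publish_frequency_seconds <= 0:
            raise ValueError("publish frequency must be a positive integer")
        config["MetricsPublishFrequencyInSeconds"] = publish_frequency_seconds
    return config


# 30 is an illustrative sub-minute value, not a documented one.
metrics_config = build_metrics_config(publish_frequency_seconds=30)
print(metrics_config)

# Usage (needs AWS credentials), with the other arguments as shown earlier:
# sm.create_endpoint_config(..., MetricsConfig=metrics_config)
```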

Existing endpoints: Opt-in

Existing endpoints require an explicit opt-in. Create a new endpoint configuration with the MetricsConfig flag, then update your endpoint. This follows the same pattern as any endpoint configuration change.

SageMaker console showing the Enable detailed observability option in endpoint configuration

# Step 1: Create new config with detailed observability turned on
sm.create_endpoint_config(
    EndpointConfigName="my-existing-config-v2",
    ProductionVariants=[{
        "VariantName": "primary",
        "ModelName": "my-existing-model",
        "InstanceType": "ml.g6.4xlarge",
        "InitialInstanceCount": 2
    }],
    MetricsConfig={"EnableDetailedObservability": True},
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole"
)

# Step 2: Update endpoint
sm.update_endpoint(
    EndpointName="my-existing-endpoint",
    EndpointConfigName="my-existing-config-v2"
)
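The two steps above hard-code the variant definition. If you would rather derive the new configuration from the one already in use, a sketch like the following copies selected fields from a describe_endpoint_config response and adds the opt-in flag. The function name and the list of copied fields are assumptions for illustration, and the list is deliberately partial.

```python
# Request fields carried over when present. This is a partial list chosen
# for illustration; extend it to match your own configuration.
COPYABLE_FIELDS = (
    "ProductionVariants",
    "DataCaptureConfig",
    "KmsKeyId",
    "ExecutionRoleArn",
    "VpcConfig",
    "EnableNetworkIsolation",
)


def build_opt_in_request(described_config, new_name):
    """Turn a describe_endpoint_config response into create_endpoint_config
    keyword arguments with detailed observability turned on."""
    request = {k: described_config[k] for k in COPYABLE_FIELDS if k in described_config}
    request["EndpointConfigName"] = new_name
    request["MetricsConfig"] = {"EnableDetailedObservability": True}
    return request


# Trimmed, made-up response used only to exercise the helper.
current = {
    "EndpointConfigName": "my-existing-config",
    "EndpointConfigArn": "arn:aws:sagemaker:us-east-1:123456789012:endpoint-config/my-existing-config",
    "ProductionVariants": [{
        "VariantName": "primary",
        "ModelName": "my-existing-model",
        "InstanceType": "ml.g6.4xlarge",
        "InitialInstanceCount": 2,
    }],
}
request = build_opt_in_request(current, "my-existing-config-v2")
print(sorted(request))

# Usage (needs AWS credentials):
# current = sm.describe_endpoint_config(EndpointConfigName="my-existing-config")
# sm.create_endpoint_config(**build_opt_in_request(current, "my-existing-config-v2"))
```

Response-only fields such as the ARN are dropped because the create call does not accept them.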

The SageMaker console also provides a guided three-step wizard after you choose Enable detailed observability: learn about the metrics, turn on OTel enrichment, and select which endpoints to opt in.

Three-step wizard in the SageMaker console for enabling detailed observability on existing endpoints

Enable OTel enrichment for classic CloudWatch metrics

Native OpenTelemetry metrics flow automatically to CloudWatch after enablement. However, existing classic metrics (Invocations, ModelLatency, OverheadLatency) require OTel enrichment to be visible in the SageMaker Insights dashboard and queryable with PromQL.

Navigate to the CloudWatch console, then Settings, and turn on OTel metric enrichment and Resource tags for telemetry. This is a one-time setting at the account and AWS Region level.

CloudWatch Settings page with OTel metric enrichment and Resource tags for telemetry options selected

Navigate to the SageMaker Insights dashboard from the SageMaker console

You can access the SageMaker Insights dashboard through either the SageMaker console or the CloudWatch console. Within SageMaker, there are three entry points, each pre-filtered to its context:

#	Entry Point	Filter Applied	Use Case
1	Endpoints list page → “Open SageMaker Insights”	Fleet-level (all endpoints)	“Give me the big picture”
2	Endpoint detail page → “View in SageMaker Insights”	Filtered to that endpoint	“Drill into this specific endpoint”
3	IC tab → per-IC “Metrics” link	Filtered to endpoint + IC	“Debug this inference component”

Every path deep-links with pre-applied filters, so you won’t land on a blank dashboard searching for your resources.

SageMaker console with three deep-link entry points to the SageMaker Insights dashboard

Performance tab: Monitoring fleet health and debugging latency

The Performance tab is where most customers spend their time. It answers questions like “Is everything running well?” and “If not, which component is the problem?” The Performance tab includes several time-series panels that work together to pinpoint latency issues.

Performance health and instance performance table

Color-coded hexagons visualize every resource in your fleet. Toggle between Instances, IC Copies, and Endpoints views. The hexagon color indicates state:

Green for OK.
White for no alarms detected.
Red for in alarm.

Hover over any hexagon to see instance type, TTFT, output TPS, concurrent requests, KV cache utilization, and CloudWatch alarm status. Choose Filter by this instance to drill down. Every panel on the page updates to show only that instance’s data.

Honeycomb hexagon visualization with a hover card showing per-instance performance metrics

The table shows every instance with performance metrics side by side. Use this table to spot outliers in TTFT, output TPS, and concurrent requests. The TTFT, Output TPS, Concurrent Requests, and KV Cache columns show data emitted by the vLLM and SGLang frameworks only.

The Token streaming panel plots Time to First Token (TTFT) and Inter-Token Latency (ITL) over time with a P50/P99 toggle. TTFT measures how long users wait before seeing the first response character. ITL measures time between consecutive tokens, which directly affects streaming smoothness. You can filter by endpoint, inference component name, or model to isolate which component contributes to latency.
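To make the two definitions concrete, the sketch below computes TTFT and mean ITL for a single streamed response from token arrival timestamps. The function name and the timestamps are made up for illustration. The dashboard reports percentiles across many requests, which this example does not attempt.

```python
from statistics import mean


def ttft_and_itl(request_start, token_times):
    """Return (TTFT, mean ITL) in seconds for one streamed response.

    request_start and token_times are timestamps in seconds.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    # Gaps between each pair of consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl


# Made-up timestamps: request sent at t=10.0, four tokens received.
ttft, itl = ttft_and_itl(10.0, [10.5, 10.75, 11.0, 11.25])
print(f"TTFT={ttft:.3f}s, mean ITL={itl:.3f}s")
```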

When you identify a TTFT spike, the Latency breakdown panel helps you attribute it. This panel separates total latency into Model Latency (time the model spends processing) and Overhead Latency (time the platform spends routing and scheduling). An Invoke tab shows the full request path, and a Streaming tab shows time-to-first-chunk specifically. If both Model Latency and Overhead Latency are normal but TTFT is still elevated, the model’s inference engine might be holding requests in its internal queue, for example, waiting for KV cache slots. Check the Engine and request pressure panel to confirm.
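The attribution reasoning above can be sketched as a small decision function. Everything here is illustrative: the function name, the baseline dictionary, and the 1.5x tolerance are assumptions, and the dashboard itself does not expose such a function.

```python
def attribute_ttft_spike(ttft, model_latency, overhead_latency, baseline, tolerance=1.5):
    """Suggest where to look first when TTFT is elevated.

    All latencies share one unit. baseline is a dict with the keys
    'ttft', 'model', and 'overhead'. A value counts as elevated when it
    exceeds tolerance * baseline (the 1.5 default is arbitrary).
    """
    def elevated(value, key):
        return value > tolerance * baseline[key]

    if not elevated(ttft, "ttft"):
        return "TTFT within baseline"
    if elevated(model_latency, "model"):
        return "Model Latency elevated: inspect model processing"
    if elevated(overhead_latency, "overhead"):
        return "Overhead Latency elevated: inspect routing and scheduling"
    return "Both normal: check Engine and request pressure for engine queueing"


# Made-up numbers in seconds: TTFT is high, the other two are near baseline.
baseline = {"ttft": 0.2, "model": 1.0, "overhead": 0.05}
verdict = attribute_ttft_spike(0.9, 1.1, 0.05, baseline)
print(verdict)
```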

The Traffic distribution panel shows per-instance or per-inference-component request flow with Availability Zone filtering. Toggle the AZ dropdown to isolate traffic by zone. If one AZ shows zero traffic while others are loaded, that indicates a routing or placement issue. You can use the instance/IC toggle to switch between “Which machines handle traffic?” and “Which models handle traffic?” views.
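If you export per-AZ request counts, the zero-traffic check is easy to automate. The following sketch assumes a plain dictionary of request counts keyed by AZ name. The function name and the sample counts are hypothetical.

```python
def idle_zones(requests_by_az):
    """Return AZs that receive zero requests while at least one other AZ is loaded."""
    if not any(count > 0 for count in requests_by_az.values()):
        return []  # No traffic anywhere, so there is no imbalance to report
    return sorted(az for az, count in requests_by_az.items() if count == 0)


# Made-up request counts for one time window.
window = {"us-east-1a": 1200, "us-east-1b": 0, "us-east-1c": 1150}
print(idle_zones(window))
```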

Finally, the Token throughput panel measures actual tokens processed per second, broken down by input/output, percentiles, or by instance. This directly measures inference efficiency. For example, if your ml.g6.4xlarge delivers 150 tokens per second output when the model benchmark shows 500, that indicates a resource constraint, configuration issue, or KV cache pressure. The multi-framework legend (SGLang, vLLM, DJL) lets multi-model endpoints compare throughput across inference engines.
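The comparison in the example above is a simple ratio of observed to benchmark throughput. As a rough sketch, with a hypothetical function name and an arbitrary 50 percent alert floor:

```python
def throughput_ratio(observed_tps, benchmark_tps):
    """Return observed output throughput as a fraction of the benchmark."""
    if benchmark_tps <= 0:
        raise ValueError("benchmark_tps must be positive")
    return observed_tps / benchmark_tps


ALERT_FLOOR = 0.5  # arbitrary cutoff chosen for this sketch

ratio = throughput_ratio(150, 500)  # figures from the example above
needs_investigation = ratio < ALERT_FLOOR
print(f"{ratio:.0%} of benchmark, investigate: {needs_investigation}")
```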

Engine and request pressure

The Engine and request pressure panel is your early warning system for preventing outages.

Engine and request pressure panel showing KV cache utilization, running requests, and waiting requests over time

The time-series view shows the per-framework breakdown, with tooltips that show exact values at any timestamp. If you see KV cache repeatedly climbing to 40–50 percent during business hours, configure auto scaling to trigger at a threshold value before customers feel the impact.
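Before you pick a scaling threshold, it can help to check whether the climbs are sustained or just momentary. The sketch below looks for a run of consecutive samples at or above a threshold. The function name, the sample series, and the threshold are illustrative assumptions, and this is not how auto scaling itself evaluates a policy.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if samples contains at least min_consecutive
    consecutive values at or above threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Made-up KV cache utilization readings in percent, one per interval.
kv_cache_pct = [22, 31, 41, 44, 47, 49, 38, 25]
print(sustained_breach(kv_cache_pct, threshold=40, min_consecutive=3))
```

A series that only touches the threshold for one interval would return False, which helps avoid scaling on noise.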

Capacity tab: Planning deployments and resource management

The Capacity tab answers questions like “Do I have enough resources?”, “Where is there headroom?”, and “Can I fit another model?”

Capacity health

The same honeycomb visualization from Performance reappears here, with resource utilization percentages in the hover card: GPU, GPU memory, CPU, CPU memory, and Disk.

Capacity health honeycomb view with a hover card showing GPU, GPU memory, CPU, CPU memory, and disk utilization

Before you deploy a new model or scale copies, hover over instances in your target endpoint. If GPU memory is at 89 percent, there’s limited VRAM headroom for additional model weights.

Fleet utilization over time

This panel shows resource consumption trends with toggles for Instance, IC copies, and Endpoint aggregation. Key signals include the following:

GPU Memory trending upward over days indicates that you’re approaching capacity limits. Add instances before utilization reaches the limit.
GPU Memory dropping suddenly indicates that a model crashed or was unloaded. Investigate.
Disk spikes that recur periodically correlate with model downloads during cold starts.
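The first two signals in the list above can be approximated with simple arithmetic on a utilization series. This is a rough sketch with hypothetical function names and made-up readings. The trend estimate uses only the first and last points, so treat it as a back-of-the-envelope figure rather than a forecast.

```python
def days_until_limit(daily_values, limit):
    """Estimate days until limit from the average daily increase.

    Returns None when usage is flat or falling.
    """
    if len(daily_values) < 2:
        raise ValueError("need at least two daily values")
    slope = (daily_values[-1] - daily_values[0]) / (len(daily_values) - 1)
    if slope <= 0:
        return None
    return (limit - daily_values[-1]) / slope


def sudden_drops(series, min_drop):
    """Return indexes where the value fell by at least min_drop
    compared with the previous sample."""
    return [i for i in range(1, len(series)) if series[i - 1] - series[i] >= min_drop]


# Made-up GPU memory utilization in percent.
daily_gpu_mem = [70, 72, 74, 76, 78]
print(days_until_limit(daily_gpu_mem, limit=90))

minute_gpu_mem = [78, 79, 41, 42]
print(sudden_drops(minute_gpu_mem, min_drop=20))
```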

Fleet utilization time series showing GPU, GPU memory, CPU, memory, and disk consumption with Instance, IC copies, and Endpoint toggles

Reliability tab: Supporting high availability and resilience

The Reliability tab answers questions like “If an AZ goes down, will my inference fleet survive?”, “Are scaling events working?”, and “Why are cold starts slow?”

Availability Zone distribution

A bar chart shows instance and IC copy counts per AZ. This view shows your high availability posture.

Bar chart of instance and IC copy counts per Availability Zone with Instances and IC Copies toggle

Distribution	Risk	Action
Even across 3+ AZs	Low	No action
Concentrated in 1-2 AZs	Medium	Rebalance
0 instances in any AZ	High	Single AZ failure takes you offline

Toggle between Instances and IC Copies. Instances might be balanced, but IC copies could be concentrated on a few machines.
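The risk table above can be turned into a small classifier if you want to check distribution in a script. This sketch makes two assumptions that the table does not spell out: the input lists every AZ you expect to use (including those with a count of zero), and "even" means the largest and smallest counts differ by at most one. The function name is hypothetical.

```python
def az_risk(counts_by_az):
    """Classify high availability risk from per-AZ instance or IC copy counts."""
    counts = list(counts_by_az.values())
    if not counts:
        raise ValueError("need at least one AZ")
    if any(count == 0 for count in counts):
        return "High"
    # Assumption: 'even' means max and min differ by at most one.
    if len(counts) >= 3 and max(counts) - min(counts) <= 1:
        return "Low"
    return "Medium"


# Made-up counts for three fleets.
print(az_risk({"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 2}))
print(az_risk({"us-east-1a": 3, "us-east-1b": 3}))
print(az_risk({"us-east-1a": 5, "us-east-1b": 1, "us-east-1c": 0}))
```

Run it once with instance counts and once with IC copy counts, because the two can disagree.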

Cold start anatomy

Stacked bar chart breaking down each IC provisioning event into model download, GPU load, container start, and health check phases

Every IC provisioning event is displayed as a horizontal stacked bar with four phases:

Phase	Color	What it measures	Optimization
Model download	Blue	Pull model weights from Amazon Simple Storage Service (Amazon S3)	Compress artifacts, use Amazon Elastic File System (Amazon EFS) caching
GPU load	Purple	Load weights onto GPU	Smaller quantization, pre-warming
Container start	Orange	Container initialization	Reduce dependencies

In the screenshot, gma-ic-vllm took 237.6 seconds, with model download dominating, while gma-rblk-ic-tiny took only 41.4 seconds because it’s a smaller model. This view tells you which phase to optimize for faster scaling response times.
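If you record phase durations yourself, finding the phase to optimize first is a matter of picking the largest share of the total. The sketch below uses invented durations, not the values from the screenshot, and the phase keys and function name are hypothetical.

```python
def dominant_phase(phase_seconds):
    """Return (phase name, share of total) for the longest cold start phase."""
    total = sum(phase_seconds.values())
    if total <= 0:
        raise ValueError("total duration must be positive")
    name = max(phase_seconds, key=phase_seconds.get)
    return name, phase_seconds[name] / total


# Illustrative durations in seconds for one provisioning event.
event = {
    "model_download": 120.0,
    "gpu_load": 40.0,
    "container_start": 30.0,
    "health_check": 10.0,
}
phase, share = dominant_phase(event)
print(f"{phase} accounts for {share:.0%} of the cold start")
```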

ICE diagnostics

The ICE diagnostics view tracks insufficient capacity errors (ICE), which occur when SageMaker can’t provision requested instances. The table shows:

When the failure occurred.
Which endpoint was affected (deep-links to the console).
Which instance type was unavailable.
Which AZ had no capacity.

In the screenshot, all 12 ICE events are for p5.48xlarge across all four AZs, indicating full regional exhaustion for this instance type. You now know to switch to alternative instance types as a fallback.

ICE diagnostics table listing the time, affected endpoint, instance type, and Availability Zone for each insufficient capacity event
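The same conclusion can be reached programmatically by grouping ICE events by instance type and checking whether every AZ you use appears. This is a minimal sketch that assumes each event is a dictionary with hypothetical instance_type and az keys. The dashboard does not necessarily export data in that shape.

```python
from collections import defaultdict


def exhausted_instance_types(ice_events, zones_in_use):
    """Return instance types that hit ICE in every AZ listed in zones_in_use."""
    zones_by_type = defaultdict(set)
    for event in ice_events:
        zones_by_type[event["instance_type"]].add(event["az"])
    required = set(zones_in_use)
    return sorted(t for t, zones in zones_by_type.items() if required <= zones)


# Made-up events: one type fails everywhere, another in a single AZ.
zones = ["az1", "az2", "az3", "az4"]
events = [{"instance_type": "p5.48xlarge", "az": az} for az in zones]
events.append({"instance_type": "g6.4xlarge", "az": "az2"})
print(exhausted_instance_types(events, zones))
```

An instance type that appears in the result is a candidate for a fallback type, while one that fails in a single AZ may only need a different placement.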

For teams with existing Grafana or other PromQL-compatible tools, you can query SageMaker Insights metrics directly from your platform without switching to the CloudWatch console. The following walkthrough demonstrates the setup using Grafana. The same steps apply to self-hosted Grafana or other compatible tools, with minor configuration differences.

SageMaker Insights metrics flowing through the PromQL endpoint to a Grafana dashboard

Step 1: Get the PromQL endpoint URL

Navigate to the SageMaker console, then choose Endpoints. From there, select your endpoint and then choose Connect to your observability tool. Copy the displayed endpoint URL. It follows the format shown in the SageMaker console.

Step 2: Configure your Grafana data source

In Amazon Managed Grafana (Basic CloudWatch 2.4+) or self-hosted Grafana with the Amazon Managed Service for Prometheus plugin (v3.0.0+):

Navigate to Configuration, Data Sources, then Add data source. Select Amazon Managed Service for Prometheus and set the URL to the PromQL endpoint URL from Step 1.
Under Service Provider, enter monitoring.
Configure SigV4 authentication with an IAM role that has the cloudwatch:GetMetricData and cloudwatch:ListMetrics permissions.
Choose Save & Test. You should see Data source is working.

Step 3: Import the pre-built dashboard template

Download the dashboard template JSON from the same Connect to your observability tool page in the SageMaker console. Import the downloaded JSON template into Grafana (Dashboards → Import), select the Prometheus data source you configured in Step 2, and you get pre-configured Performance, Capacity, and Reliability panels matching the SageMaker Insights layout.

Imported Grafana dashboard with pre-configured Performance, Capacity, and Reliability panels matching SageMaker Insights

Step 4: Query metrics with PromQL

With the data source connected, you can write custom PromQL queries. For example:

# KV cache
vllm:kv_cache_usage_perc{"aws.sagemaker.endpoint.name"="ep-prsn-ic","aws.sagemaker.inference_component.name"="ic-qwen3-4b"}

# Active requests
vllm:num_requests_running{"aws.sagemaker.endpoint.name"="ep-prsn-ic","aws.sagemaker.inference_component.name"="ic-qwen3-4b"}

# TTFT P99
histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds{"aws.sagemaker.endpoint.name"="ep-prsn-ic","aws.sagemaker.inference_component.name"="ic-qwen3-4b"}[5m]))
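If you generate queries for many endpoints, building the selector strings in code avoids copy-and-paste mistakes. The helper below is a sketch that only formats strings in the quoted-label style used by the queries above. The function names are hypothetical, and it does not send anything to the PromQL endpoint.

```python
def selector(metric, labels):
    """Build a PromQL selector with quoted label names and values."""
    def quote(text):
        # Escape backslashes first, then double quotes.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    pairs = ",".join(f"{quote(name)}={quote(value)}" for name, value in labels.items())
    return f"{metric}{{{pairs}}}"


def p99(metric, labels, window="5m"):
    """Wrap a histogram metric in a P99 query over the given rate window."""
    return f"histogram_quantile(0.99, rate({selector(metric, labels)}[{window}]))"


labels = {
    "aws.sagemaker.endpoint.name": "ep-prsn-ic",
    "aws.sagemaker.inference_component.name": "ic-qwen3-4b",
}
print(selector("vllm:kv_cache_usage_perc", labels))
print(p99("vllm:time_to_first_token_seconds", labels))
```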

Pricing

SageMaker doesn’t charge separately for emitting detailed observability metrics. The metrics are published to Amazon CloudWatch in OpenTelemetry data format, and standard CloudWatch OpenTelemetry ingestion pricing applies. OpenTelemetry metrics ingested into CloudWatch are charged at $0.50 per GB ingested. If you turn on OTel vended metric enrichment (required to view classic CloudWatch metrics like Invocations and ModelLatency in the Insights dashboard), enriched metrics are also charged at $0.50 per GB. For detailed pricing examples and a cost calculator, see the OpenTelemetry Metrics section on the Amazon CloudWatch pricing page.
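As a quick worked example of the rate quoted above, the following sketch multiplies ingested volume by the per-GB price. The ingestion volumes are invented, and the function name is hypothetical. Use the CloudWatch pricing page for real estimates.

```python
PRICE_PER_GB_USD = 0.50  # rate quoted in this post


def monthly_ingestion_cost(native_gb, enriched_gb=0.0, price_per_gb=PRICE_PER_GB_USD):
    """Return the estimated monthly charge in USD for ingested metric data."""
    if native_gb < 0 or enriched_gb < 0:
        raise ValueError("volumes must be non-negative")
    return (native_gb + enriched_gb) * price_per_gb


# Made-up volumes: 40 GB of native OTel metrics plus 10 GB of enriched metrics.
cost = monthly_ingestion_cost(40, 10)
print(f"${cost:.2f} per month")
```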

Clean up

To avoid ongoing charges, delete test resources in this order:

# Delete inference components first (if IC endpoint)
aws sagemaker delete-inference-component --inference-component-name my-ic

# Delete endpoints
aws sagemaker delete-endpoint --endpoint-name my-endpoint

# Wait for deletion, then delete configs
aws sagemaker delete-endpoint-config --endpoint-config-name my-config

GPU instances are billed per second while endpoints are InService. Delete promptly after testing.
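The same order can be scripted with boto3. The sketch below takes the client as an argument so that it can be exercised with a stand-in that records calls instead of contacting AWS. The clean_up function and the DryRunClient class are illustrative names. Inference component deletion is asynchronous and this sketch does not wait for it, so a production script may need to poll before deleting the endpoint.

```python
def clean_up(sm, endpoint_name, config_name, inference_components=()):
    """Delete resources in the same order as the CLI commands above.

    sm is a boto3 SageMaker client, or any object with the same methods.
    """
    for name in inference_components:
        sm.delete_inference_component(InferenceComponentName=name)
    sm.delete_endpoint(EndpointName=endpoint_name)
    # Block until the endpoint is gone before removing its configuration.
    sm.get_waiter("endpoint_deleted").wait(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=config_name)


class DryRunClient:
    """Stand-in that records calls instead of contacting AWS (illustration only)."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(*args, **kwargs):
            self.calls.append((name, args, kwargs))
            return self  # lets get_waiter(...).wait(...) chain
        return record


dry_run = DryRunClient()
clean_up(dry_run, "my-endpoint", "my-config", inference_components=["my-ic"])
for call in dry_run.calls:
    print(call)

# Real usage (needs AWS credentials):
# import boto3
# clean_up(boto3.client("sagemaker"), "my-endpoint", "my-config", ["my-ic"])
```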

Conclusion

In this post, you enabled SageMaker detailed metrics on inference endpoints and used the built-in SageMaker Insights dashboard to monitor fleet health, debug latency using token-level metrics, validate high availability, and plan capacity for new deployments.

To get started, see the following resources:

Acknowledgments

The SageMaker Insights dashboard and detailed observability metrics are the result of close collaboration between the Amazon SageMaker AI and Amazon CloudWatch teams. We thank the engineering, product, and solutions architecture teams whose work made this launch possible.

We also thank the following contributors for their review and inputs on this blog post:

Felipe Lopez – Principal GenAI/ML Architect, AWS
Sandeep Raveesh-Babu – Sr. Worldwide Specialist SA, GenAI, AWS
Johna Liu – Sr. Software Development Engineer, Amazon SageMaker
Raviprakash Darbha – Sr. Software Development Engineer, Amazon SageMaker
Prajwal Kammardi – Software Development Engineer, Amazon SageMaker
Jiaxi Xu – Software Development Engineer, Amazon SageMaker
Orcun Berkem – Principal Engineer, Observability, Amazon CloudWatch
Steve McCurry – Principal Product Manager, Amazon CloudWatch
