Friday, April 17, 2026

Organizations are increasingly deploying customized large language models (LLMs) on Amazon SageMaker AI real-time endpoints using preferred serving frameworks such as SGLang, vLLM, and TorchServe, giving them more control over their deployments, optimizing costs, and meeting compliance requirements. However, this flexibility poses a significant technical challenge: response format incompatibility with Strands Agents. These custom serving frameworks typically return responses in an OpenAI-compatible format to support a wide range of environments, while Strands Agents expects model responses that follow the Amazon Bedrock Messages API format.

This challenge is especially relevant for models hosted on SageMaker AI real-time endpoints, where Messages API support is not guaranteed. Amazon Bedrock's distributed inference engine supports OpenAI message formats starting in December 2025, but the flexibility of SageMaker AI lets customers host a wide variety of underlying models, some of which require idiosyncratic prompt and response formats that don't conform to standard APIs. This creates a gap between the serving framework's output structure and what Strands expects, preventing seamless integration even though both systems are technically functional. The solution is to implement a custom model parser that extends the SageMakerAIModel class. By converting the model server's response format into the format Strands expects, organizations can use their preferred serving framework without sacrificing compatibility with the Strands Agents SDK.

This post describes how to build a custom model parser for Strands Agents when working with LLMs hosted on SageMaker that don't natively support the Bedrock Messages API format. The following steps walk you through deploying Llama 3.1 on SageMaker using SGLang and awslabs/ml-container-creator, then implementing a custom parser and integrating it with a Strands agent.

Strands custom parser

Strands Agents expects model responses in a specific format aligned with the Bedrock Messages API. When you deploy a model using a custom serving framework such as SGLang, vLLM, or TorchServe, the model typically returns responses in its own format, often OpenAI-compatible to support a wide range of environments. Without a custom parser, you'll receive an error similar to the following:

TypeError: 'NoneType' object will not be subscriptable

This occurs because the default SageMakerAIModel class that Strands Agents is configured with attempts to parse the response by assuming a specific structure that the custom endpoint doesn't provide. The code base associated with this post extends SageMakerAIModel with custom parsing logic that converts the model server's response format into the format Strands expects.
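The failure mode is easy to reproduce in isolation: a parser that assumes a Bedrock-style response shape calls `.get()` on a key the OpenAI-style payload doesn't contain, receives `None`, and then subscripts it. A minimal sketch (the field names here are illustrative, not the SDK's actual internals):

```python
# Illustrative only: shows how parsing an OpenAI-style response with
# Bedrock-style assumptions produces the NoneType error above.
openai_style_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello!"}}
    ]
}

def parse_bedrock_style(response):
    # A Bedrock Messages-style parser expects an "output" block,
    # which the OpenAI-style payload does not contain.
    output = response.get("output")      # -> None
    return output["message"]["content"]  # TypeError: 'NoneType' object is not subscriptable

try:
    parse_bedrock_style(openai_style_response)
except TypeError as err:
    print(err)  # 'NoneType' object is not subscriptable
```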

Implementation overview

Our implementation consists of three layers:

  1. Model deployment layer: Llama 3.1 is served by SGLang on SageMaker and returns OpenAI-compatible responses
  2. Parser layer: A custom LlamaModelProvider class extends SageMakerAIModel to handle Llama 3.1 response formats
  3. Agent layer: A Strands agent uses the custom provider to correctly parse model responses for conversational AI

We'll first use awslabs/ml-container-creator, an open source Yeoman generator from AWS Labs that automates the creation of SageMaker BYOC (Bring Your Own Container) deployment projects. It generates the artifacts required to build the LLM serving container, including Dockerfiles, CodeBuild configurations, and deployment scripts.

Install ml-container-creator

The first step is to build a serving container for your model. You'll use the open source project to build the container and generate deployment scripts for it. The following commands show how to install awslabs/ml-container-creator and its dependencies, npm and Yeoman. For more information, see the project README and wiki to get started.

# Install Yeoman globally
npm install -g yo

# Clone and install ml-container-creator
git clone https://github.com/awslabs/ml-container-creator
cd ml-container-creator
npm install && npm link

# Verify installation
yo --generators # Should show ml-container-creator

Generate the deployment project

Once installed and linked, you can run the installed generator using the yo command. Use yo ml-container-creator to run the generator needed for this exercise.

# Run the generator
yo ml-container-creator

# Configuration options:
# - Framework: transformers
# - Model Server: sglang
# - Model: meta-llama/Llama-3.1-8B-Instruct
# - Deploy Target: codebuild
# - Instance Type: ml.g6.12xlarge (GPU)
# - Region: us-east-1

The generator creates a complete project structure:

<project-directory>/
├── Dockerfile           # Container with SGLang and dependencies
├── buildspec.yml        # CodeBuild configuration
├── code/
│   └── serve            # SGLang server startup script
├── deploy/
│   ├── submit_build.sh  # Triggers CodeBuild
│   └── deploy.sh        # Deploys to SageMaker
└── test/
    └── test_endpoint.sh # Endpoint testing script

Build and deploy

The project built by awslabs/ml-container-creator contains templated build and deployment scripts. The ./deploy/submit_build.sh and ./deploy/deploy.sh scripts are used to build the image, push it to Amazon Elastic Container Registry (Amazon ECR), and deploy it to an Amazon SageMaker AI real-time endpoint.

cd llama-31-deployment

# Build container with CodeBuild (no local Docker required)
./deploy/submit_build.sh

# Deploy to SageMaker
./deploy/deploy.sh arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole

Deployment process:

  1. CodeBuild builds the Docker image with SGLang and Llama 3.1
  2. The image is pushed to Amazon ECR
  3. SageMaker creates the real-time endpoint
  4. SGLang downloads the model from Hugging Face and loads it into GPU memory
  5. The endpoint reaches InService status (roughly 10–15 minutes)

You can test your endpoint using ./test/test_endpoint.sh or with a direct call:

import boto3
import json

runtime_client = boto3.client('sagemaker-runtime', region_name="us-east-1")

payload = {
  "messages": [
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "max_tokens": 100,
  "temperature": 0.7
}

response = runtime_client.invoke_endpoint(
  EndpointName="llama-31-deployment-endpoint",
  ContentType="application/json",
  Body=json.dumps(payload)
)

result = json.loads(response['Body'].read().decode('utf-8'))
print(result['choices'][0]['message']['content'])

Understanding response formats

Llama 3.1 returns OpenAI-compatible responses, while Strands expects model responses that conform to the Bedrock Messages API format. Until late last year, this was a standard compatibility mismatch; starting in December 2025, Amazon Bedrock's distributed inference engine supports OpenAI message formats. Here is a sample OpenAI-compatible response:

{
  "id": "cmpl-abc123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 12,
    "total_tokens": 35
  }
}

However, Messages API support is not guaranteed for models hosted on SageMaker AI real-time endpoints. SageMaker AI lets customers host a wide variety of underlying models on managed GPU-accelerated infrastructure, some of which may require esoteric prompt and response formats. For example, the default SageMakerAIModel uses the legacy Bedrock Messages API format and attempts to access fields that are not present in the standard OpenAI message format, resulting in the TypeError failure.
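Concretely, the parser's job is a format translation. The sketch below converts a non-streaming OpenAI chat completion into a Messages-style structure; the target field names are illustrative (modeled on the streaming events used later in this post), not an exact SDK contract:

```python
def openai_to_strands(response: dict) -> dict:
    """Translate an OpenAI-style chat completion into a Bedrock
    Messages-style structure (illustrative field names)."""
    choice = response["choices"][0]
    return {
        "message": {
            "role": choice["message"]["role"],
            "content": [{"text": choice["message"]["content"]}],
        },
        "stopReason": choice["finish_reason"],
        "usage": response.get("usage", {}),
    }

# Using the sample OpenAI-style response shown above
openai_response = {
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 23, "completion_tokens": 12, "total_tokens": 35},
}

parsed = openai_to_strands(openai_response)
print(parsed["message"]["content"][0]["text"])  # I'm doing well, thank you for asking!
```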

Implementing a customized mannequin parser

A custom model parser extends the Strands Agents SDK, providing strong compatibility and flexibility for customers building LLM-powered agents hosted on SageMaker AI. This section describes how to create a custom provider that extends SageMakerAIModel:

def stream(self, messages: List[Dict[str, Any]], tool_specs: list, system_prompt: Optional[str], **kwargs):
  # Build payload messages
  payload_messages = []
  if system_prompt:
    payload_messages.append({"role": "system", "content": system_prompt})

  # Extract message content from Strands format
  for msg in messages:
    payload_messages.append({"role": "user", "content": msg['content'][0]['text']})

  # Build full payload with streaming enabled
  payload = {
    "messages": payload_messages,
    "max_tokens": kwargs.get('max_tokens', self.max_tokens),
    "temperature": kwargs.get('temperature', self.temperature),
    "top_p": kwargs.get('top_p', self.top_p),
    "stream": True
  }

  try:
    # Invoke SageMaker endpoint with streaming
    response = self.runtime_client.invoke_endpoint_with_response_stream(
      EndpointName=self.endpoint_name,
      ContentType="application/json",
      Accept="application/json",
      Body=json.dumps(payload)
    )

    # Process streaming response
    accumulated_content = ""
    for event in response['Body']:
      chunk = event['PayloadPart']['Bytes'].decode('utf-8')
      if not chunk.strip():
        continue

      # Parse SSE format: "data: {json}\n"
      for line in chunk.split('\n'):
        if line.startswith('data: '):
          try:
            json_str = line.replace('data: ', '').strip()
            if not json_str:
              continue

            chunk_data = json.loads(json_str)
            if 'choices' in chunk_data and chunk_data['choices']:
              delta = chunk_data['choices'][0].get('delta', {})

              # Yield content delta in Strands format
              if 'content' in delta:
                content_chunk = delta['content']
                accumulated_content += content_chunk
                yield {
                  "type": "contentBlockDelta",
                  "delta": {"text": content_chunk},
                  "contentBlockIndex": 0
                }

              # Check for completion
              finish_reason = chunk_data['choices'][0].get('finish_reason')
              if finish_reason:
                yield {
                  "type": "messageStop",
                  "stopReason": finish_reason
                }

            # Yield usage metadata
            if 'usage' in chunk_data:
              yield {
                "type": "metadata",
                "usage": chunk_data['usage']
              }

          except json.JSONDecodeError:
            continue

  except Exception as e:
    yield {
      "type": "error",
      "error": {
        "message": f"Endpoint invocation failed: {str(e)}",
        "type": "EndpointInvocationError"
      }
    }

The stream method overrides SageMakerAIModel, letting the agent parse responses according to the requirements of the underlying model. While the majority of model servers support OpenAI's message API protocol, this capability lets power users run highly specialized LLMs on top of SageMaker AI and power their agent workloads with the Strands Agents SDK. Once your custom model response logic is built, you can easily initialize an agent with the custom model provider using the Strands Agents SDK:

from strands.agent import Agent

# Initialize custom provider
provider = LlamaModelProvider(
  endpoint_name="llama-31-deployment-endpoint",
  region_name="us-east-1",
  max_tokens=1000,
  temperature=0.7
)

# Create agent with custom provider
agent = Agent(
  name="llama-assistant",
  model=provider,
  system_prompt=(
    "You are a helpful AI assistant powered by Llama 3.1, "
    "deployed on Amazon SageMaker. You provide clear, accurate, "
    "and friendly responses to user questions."
  )
)

# Test the agent
response = agent("What are the key benefits of deploying LLMs on SageMaker?")
print(response.content)

A complete implementation of this custom parser, including a Jupyter notebook with detailed instructions and an ml-container-creator deployment project, is available in the companion GitHub repository.

Conclusion

Building a custom model parser for Strands Agents lets users leverage diverse LLM deployments on SageMaker, regardless of response format. By extending SageMakerAIModel and implementing the stream() method, you can integrate custom-hosted models while maintaining Strands' clean agent interface.

Key points:

  1. awslabs/ml-container-creator simplifies SageMaker BYOC deployments with production-ready infrastructure code
  2. Custom parsers bridge the gap between model server response formats and Strands' expectations
  3. The stream() method is the key integration point for custom providers

About the author

Dan Ferguson is a Senior Solutions Architect at AWS based in New York, USA. Dan is a machine learning services expert dedicated to helping customers integrate ML workflows efficiently, effectively, and sustainably.
