Organizations are increasingly deploying customized large language models (LLMs) on Amazon SageMaker AI real-time endpoints using preferred serving frameworks such as SGLang, vLLM, and TorchServe, giving them more control over their deployments, optimizing costs, and meeting compliance requirements. However, this flexibility poses a significant technical challenge: response format incompatibility with Strands Agents. These custom serving frameworks typically return responses in an OpenAI-compatible format to support a wide range of environments, while Strands Agents expects model responses that follow the Bedrock Messages API format.
This challenge is especially relevant for models hosted on SageMaker AI real-time endpoints, where Messages API support is not guaranteed. The Amazon Bedrock Mantle distributed inference engine supports OpenAI message formats as of December 2025, but the flexibility of SageMaker AI lets customers host a wide variety of underlying models, some of which require nonstandard prompt and response formats that don't conform to standard APIs. This creates a gap between the serving framework's output structure and what Strands expects, preventing seamless integration even though both systems are technically functional. The solution is to implement a custom model parser that extends the SageMakerAIModel class: by converting the model server's response format into the format Strands expects, organizations can use their preferred serving framework without sacrificing compatibility with the Strands Agents SDK.
This post describes how to build a custom model parser for Strands Agents when working with LLMs hosted on SageMaker that don't natively support the Bedrock Messages API format. The following steps walk you through deploying Llama 3.1 on SageMaker using SGLang and awslabs/ml-container-creator, then implementing a custom parser and integrating it with a Strands agent.
Strands custom parser
Strands Agents expects model responses in a specific format aligned with the Bedrock Messages API. When you deploy a model using a custom serving framework such as SGLang, vLLM, or TorchServe, the model typically returns responses in its own format, often OpenAI-compatible to support a wide range of environments. Without a custom parser, you'll receive an error similar to the following:
TypeError: 'NoneType' object is not subscriptable
This occurs because the Strands agent's default SageMakerAIModel class attempts to parse the response assuming a specific structure that the custom endpoint doesn't provide. In the code base associated with this post, we extend the SageMakerAIModel class with custom parsing logic that converts the model server's response format into the format expected by Strands.
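To make the mismatch concrete, here is a minimal, self-contained sketch (the payloads and the parsing function are illustrative, not the SDK's exact internals) showing why indexing an OpenAI-style response with Bedrock Messages API keys produces exactly this error:

```python
# Illustrative OpenAI-compatible response, as returned by SGLang/vLLM-style servers
openai_response = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
}

def parse_bedrock_style(response: dict) -> str:
    """Parse assuming the Bedrock Messages API shape: response['output']['message']..."""
    # .get() returns None because the OpenAI payload has no 'output' key;
    # subscripting None then raises: TypeError: 'NoneType' object is not subscriptable
    return response.get("output")["message"]["content"][0]["text"]

try:
    parse_bedrock_style(openai_response)
except TypeError as err:
    print(err)  # 'NoneType' object is not subscriptable
```

The OpenAI payload carries the text under `choices[0].message.content`, while the Bedrock-style parser looks under `output.message.content`, so the lookup dead-ends on `None`.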
Implementation overview
Our implementation consists of three layers:
- Model deployment layer: Llama 3.1 served by SGLang on SageMaker, returning OpenAI-compatible responses
- Parser layer: A custom LlamaModelProvider class that extends SageMakerAIModel to handle the Llama 3.1 response format
- Agent layer: A Strands agent that uses the custom provider to correctly parse model responses
First, we use awslabs/ml-container-creator, an open source Yeoman generator from AWS Labs that automates the creation of SageMaker BYOC (Bring Your Own Container) deployment projects. It generates the artifacts required to build the LLM serving container, including Dockerfiles, CodeBuild configurations, and deployment scripts.
Install ml-container-creator
The first step is to build a serving container for your model. You build the container with the open source project, which also generates deployment scripts for it. The following commands show how to install awslabs/ml-container-creator and its dependencies, npm and Yeoman. For more information, see the project's README and wiki to get started.
Generate the deployment project
Once installed and linked, you can run the generator using the yo command. Run yo ml-container-creator to launch the generator needed for this exercise.
The generator creates a complete project structure.
Build and deploy
Projects created by awslabs/ml-container-creator contain templated build and deployment scripts. The ./deploy/submit_build.sh and ./deploy/deploy.sh scripts are used to build the image, push it to Amazon Elastic Container Registry (Amazon ECR), and deploy it to an Amazon SageMaker AI real-time endpoint.
The deployment process:
- CodeBuild builds a Docker image containing SGLang and Llama 3.1
- The image is pushed to Amazon ECR
- SageMaker creates the real-time endpoint
- SGLang downloads the model from Hugging Face and loads it into GPU memory
- The endpoint reaches InService status (roughly 10-15 minutes)
You can test your endpoint using ./test/test_endpoint.sh, or use a direct call.
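If you prefer a direct call, a minimal boto3 sketch might look like the following. The endpoint name and generation parameters are placeholders, and the request body assumes the SGLang server exposes an OpenAI-compatible chat completions schema:

```python
import json

def build_chat_payload(prompt: str, max_tokens: int = 256) -> str:
    """Build an OpenAI-compatible chat completions request body for the SGLang server."""
    return json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    })

def invoke(endpoint_name: str, prompt: str) -> dict:
    """Send the payload to a SageMaker real-time endpoint and decode the response."""
    import boto3  # imported here so build_chat_payload works without the AWS SDK installed
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_chat_payload(prompt),
    )
    return json.loads(response["Body"].read())

# Example (the endpoint name is a placeholder; use the name created by deploy.sh):
# print(invoke("llama31-sglang-endpoint", "Say hello"))
```

A successful call returns the OpenAI-format response discussed in the next section, which is exactly the shape the custom parser must handle.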
Understand response formats
Llama 3.1 returns OpenAI-compatible responses, while Strands expects model responses that conform to the Bedrock Messages API format. Until late last year, this was a standard compatibility mismatch. As of December 2025, the Amazon Bedrock Mantle distributed inference engine supports OpenAI message formats.
However, Messages API support is not guaranteed for models hosted on SageMaker AI real-time endpoints. SageMaker AI lets customers host a wide variety of underlying models on managed GPU-accelerated infrastructure, some of which may require esoteric prompt and response formats. For example, the default SageMakerAIModel uses the legacy Bedrock Messages API format and attempts to access fields that are not present in the standard OpenAI message format, causing the TypeError shown earlier.
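To make the two shapes concrete, the following sketch maps a non-streaming OpenAI chat completion onto a Bedrock Messages-API-style structure. The field names follow the publicly documented shapes of both APIs, but treat this as an illustration rather than the SDK's exact conversion logic:

```python
def openai_to_messages_format(openai_response: dict) -> dict:
    """Map an OpenAI chat completion onto a Bedrock Messages API-style structure."""
    choice = openai_response["choices"][0]
    usage = openai_response.get("usage", {})
    return {
        "output": {
            "message": {
                "role": choice["message"]["role"],
                # Messages API content is a list of blocks, not a single string
                "content": [{"text": choice["message"]["content"]}],
            }
        },
        "stopReason": "end_turn" if choice.get("finish_reason") == "stop" else "max_tokens",
        "usage": {
            "inputTokens": usage.get("prompt_tokens", 0),
            "outputTokens": usage.get("completion_tokens", 0),
            "totalTokens": usage.get("total_tokens", 0),
        },
    }

# Example OpenAI-compatible payload, as returned by the SGLang endpoint
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}
converted = openai_to_messages_format(sample)
print(converted["output"]["message"]["content"][0]["text"])  # Hello!
```

Note the structural differences: the text moves from `choices[0].message.content` (a string) to `output.message.content` (a list of blocks), and token accounting is renamed from `prompt_tokens`/`completion_tokens` to `inputTokens`/`outputTokens`.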
Implementing a custom model parser
A custom model parser built on the Strands Agents SDK provides strong compatibility and flexibility for customers building LLM-powered agents hosted on SageMaker AI. This section describes how to create a custom provider that extends SageMakerAIModel.
The stream method of SageMakerAIModel lets the agent parse the response based on the requirements of the underlying model. While the majority of models support OpenAI's Messages API protocol, this capability lets power users run highly specialized LLMs on top of SageMaker AI and power their agent workloads with the Strands Agents SDK. Once your custom model response logic is built, you can easily initialize your agent with the custom model provider using the Strands Agents SDK.
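Putting this together, a custom provider might look like the sketch below. The import path, the overridable stream() method signature, and the _invoke_endpoint helper are assumptions for illustration; check the Strands Agents SDK documentation for the real extension points. A stub base class keeps the sketch self-contained when the SDK isn't installed:

```python
import json
from typing import Any, AsyncGenerator

try:
    # Assumed import path; see the Strands Agents SDK docs for the exact location
    from strands.models.sagemaker import SageMakerAIModel
except ImportError:
    class SageMakerAIModel:  # illustrative stub so this sketch is self-contained
        def __init__(self, **kwargs: Any) -> None:
            self.config = kwargs

def openai_chunk_to_event(chunk: dict) -> dict:
    """Convert one OpenAI streaming chunk into a Converse-style contentBlockDelta event."""
    text = chunk["choices"][0].get("delta", {}).get("content") or ""
    return {"contentBlockDelta": {"delta": {"text": text}}}

class LlamaModelProvider(SageMakerAIModel):
    """Custom provider that re-parses the SGLang/OpenAI response format for Strands."""

    async def stream(self, request: dict) -> AsyncGenerator[dict, None]:
        yield {"messageStart": {"role": "assistant"}}
        # self._invoke_endpoint is a hypothetical helper standing in for however your
        # base class exposes the raw SageMaker response stream; each line from the
        # SGLang server is an OpenAI-format streaming chunk
        async for raw_chunk in self._invoke_endpoint(request):
            yield openai_chunk_to_event(json.loads(raw_chunk))
        yield {"messageStop": {"stopReason": "end_turn"}}

# Initializing the agent with the custom provider (endpoint name is a placeholder):
# from strands import Agent
# agent = Agent(model=LlamaModelProvider(endpoint_name="llama31-sglang-endpoint"))
```

The key design point is that the OpenAI-to-Messages translation lives entirely inside the provider, so the agent code on top of it stays unchanged no matter which serving framework backs the endpoint.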
A complete implementation of this custom parser, including a Jupyter notebook with detailed instructions and an ml-container-creator deployment project, is available in the companion GitHub repository.
Conclusion
Building a custom model parser for Strands Agents lets users take advantage of diverse LLM deployments on SageMaker, regardless of response format. By extending SageMakerAIModel and implementing stream(), you can integrate custom-hosted models while maintaining Strands' clean agent interface.
Key points:
- awslabs/ml-container-creator simplifies SageMaker BYOC deployments with production-ready infrastructure code
- Custom parsers bridge the gap between model server response formats and Strands' expectations
- The stream() method is the key integration point for custom providers
About the author
Dan Ferguson is a Senior Solutions Architect at AWS based in New York, USA. Dan is a machine learning services expert dedicated to helping customers integrate ML workflows efficiently, effectively, and sustainably.

