As we speak, Amazon SageMaker AI launched OpenAI-compatible API assist for real-time inference endpoints. In case you use OpenAI SDK, LangChain, or Strands Agent, now you can name your mannequin on SageMaker AI by altering simply the endpoint URL. No customized purchasers, SigV4 wrappers, or code rewrites required.
overview
With this launch, SageMaker AI endpoints are actually /openai/v1 A path that accepts chat completion requests and returns unchanged responses from the container, together with streaming. OpenAI endpoints are enabled for all endpoints and inference elements utilizing the usual SageMaker AI API and SDK.
SageMaker AI routes primarily based on the endpoint title within the URL, so you should use any OpenAI-compatible consumer out of the field. Now you can create time-limited bearer tokens on your endpoints and use them with OpenAI purchasers.
See the accompanying documentation for a working instance together with deployment and invocation. Notebooks on GitHub.
“We run an AI coding agent that makes use of a number of LLM suppliers by an LLM gateway (Bifrost) that speaks the OpenAI Chat Completion Protocol. The bearer token characteristic permits us so as to add SageMaker as a drop-in OpenAI-compatible inference endpoint (no customized SigV4 signing), so it really works natively with our gateway, the Vercel AI SDK, and commonplace OpenAI purchasers.” Giorgio Piatti, AI/ML Engineer – Caffeine.AI)
use case
Agent workflows on owned infrastructure
Once you construct multi-step AI brokers utilizing frameworks like Strands Agent or LangChain, you’ll be able to run their whole workflow by yourself SageMaker AI endpoint. The agent calls the mannequin utilizing the identical OpenAI-compatible interface it was constructed with, however the inference runs on a devoted GPU occasion in your account.
Internet hosting a number of fashions by a single interface
If you wish to run a number of fashions (for instance, Llama for basic duties, a fine-tuned Mistral for domain-specific work, and a smaller mannequin for classification), you’ll be able to host all of them on a single SageMaker AI endpoint utilizing the inference part. Every mannequin has its personal useful resource allocation, and all fashions could be referred to as by the identical OpenAI SDK. You need not write separate API purchasers or routing logic in your utility code.
Ship fine-tuned fashions with out altering code
If you wish to fine-tune open supply fashions for particular use circumstances, you’ll be able to deploy them to SageMaker AI and name them by the identical OpenAI-compatible interfaces that your purposes already use. The one change is the endpoint URL. The remainder of the appliance (SDK calls, streaming logic, immediate format) stays the identical.
Resolution overview
On this put up we’ll cowl:
- How bearer token authentication works with SageMaker AI endpoints.
- Deploying and invoking endpoints for a single mannequin.
- Deploying and invoking inference elements for multi-model deployment.
- Integration with Strands Agent framework.
Conditions
To proceed with this tutorial you will have:
- An AWS account with permissions to create SageMaker AI endpoints.
- SageMaker Python SDK (
pip set up sagemaker). - OpenAI Python SDK (
pip set up openai). - Fashions saved in Amazon Easy Storage Service (Amazon S3). For instance, Qwen3-4B, which I downloaded from Hugging Face.
- An AWS Identification and Entry Administration (IAM) execution function to create the endpoint.
AmazonSageMakerFullAccesscoverage. - IAM execution function
sagemaker:CallWithBearerTokenandsagemaker:InvokeEndpointPermission to name the endpoint.
Authentication with bearer token
SageMaker AI OpenAI suitable endpoints use bearer token authentication. The SageMaker Python SDK features a token generator that creates time-limited tokens (legitimate for as much as 12 hours) out of your current AWS credentials. No extra secrets and techniques or API keys are required.
The token comprises function or consumer credentials and requires the next: sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint Motion authority.
Generate a token
Generate a token utilizing the next Python script.
The token generator makes use of AWS credentials out there in your surroundings: IAM consumer credentials, an occasion profile on Amazon Elastic Compute Cloud (Amazon EC2), or an AWS IAM Identification Middle (SSO) session.
of generate_token The operate generates a time-limited bearer token for authenticating with the SageMaker API. By default, tokens are legitimate for 12 hours, however you’ll be able to override this. expiry parameters utilizing timedelta Values ​​are between 1 second and 12 hours. This operate accepts an elective area. aws_credentials_providerand expiration date. If no AWS Area is specified, reverts to the AWS Area. AWS_REGION environmental variables. If no credential supplier is specified, the default AWS credential chain, which searches a number of sources together with surroundings variables, is used to resolve the credentials. ~/.aws/credentials, ~/.aws/configcontainer credentials, occasion profiles. See the Boto3 Credentials documentation for the whole decision order.
Auto-refresh tokens for long-running purposes
For purposes that run constantly, you’ll be able to implement an automated replace sample utilizing: httpx Ensures {that a} new token is generated for every request.
IAM permissions
The IAM function or consumer that calls the endpoint should have the next permissions:
As a finest observe, all the time restrict. Useful resource to a selected endpoint ARN InvokeEndpoint Fairly than utilizing wildcards. Bearer tokens generated from this function have the identical stage of entry, so the slim scope coverage limits the scope of the explosion if the token is by accident uncovered. word that CallWithBearerToken Wildcard ("*") for Useful resource area. Useful resource stage limits aren’t supported.
How tokens work
The bearer token is a base64 encoded SigV4 signed URL. when making a name generate_tokenthe SageMaker AI SDK constructs requests to SageMaker AI companies. CallWithBearerToken Execute the motion, signal it domestically along with your AWS credentials, and encode the ensuing signed URL as a transportable token string. No community calls are made throughout token era. Signing is finished totally on the consumer facet. Once you current this token to the SageMaker AI endpoint, the service decodes it, validates the SigV4 signature, verifies that the token has not expired, and verifies that the unique IAM id has the required permissions. The token lifetime is the lesser of the expiration worth and the remaining lifetime of the AWS credentials used to signal the token.
Safety finest practices: The bearer token comprises the identical authorization because the underlying AWS credentials used to generate it. Deal with tokens with the identical care as credentials. Restrict the scope of the IAM function used for token era to the minimal needed privileges. sagemaker:InvokeEndpoint and sagemaker:CallWithBearerToken Solely goal endpoint ARNs that the caller must entry. Don’t generate tokens from roles with prolonged privileges, equivalent to these granted by . AdministratorAccess or SageMakerFullAccess Managed coverage.
Don’t retailer tokens on disk, in surroundings variables, in configuration recordsdata, in databases, or in distributed caches. Don’t log tokens and solely ship them over encrypted communication protocols equivalent to HTTPS. Producing a token is an area operation with no community overhead, so we suggest that you simply generate a brand new token on the time of use or use the auto-renew characteristic. httpx.Auth The sample proven within the earlier instance. This avoids the danger of token leakage and lets you use your tokens with most expiry time remaining. As a finest observe, set the token expiration time to the shortest period required by your workload.
Deploy a single mannequin endpoint
A single mannequin endpoint hosts one mannequin and handles requests straight. The next instance deploys Qwen3-4B utilizing the SageMaker AI vLLM Deep Studying Container. ml.g6.2xlarge Examples.
Observe: SageMaker AI endpoints incur costs throughout service, no matter visitors. For extra data, see the Amazon SageMaker AI pricing web page.
The endpoint transitions as follows: InService The standing might be displayed inside a couple of minutes. When you’re prepared, it is suitable with each commonplace SageMaker AI. /invocations Paths and OpenAI Appropriate Paths /openai/v1/chat/completions.
Name endpoint for a single mannequin
As soon as the endpoint is a service, name it utilizing the OpenAI Python SDK. The bottom URL follows this format:
of mannequin Fields are handed to the container. SageMaker AI routes requests primarily based on the endpoint title within the URL, so you’ll be able to depart this area empty or set it to match the mannequin title the container expects.
Deploy the inference part endpoint
Inference elements enable a single endpoint to host a number of fashions, every with devoted computing sources. For inference elements, the mannequin is related to the part somewhat than the endpoint configuration.
You’ll be able to create extra inference elements on the identical endpoint to host a number of fashions with impartial scaling and useful resource allocation.
Name the inference part
To name a selected inference part, embrace its title within the URL path.
The next instance exhibits two inference elements on a shared endpoint. Every part is focused to a separate OpenAI consumer that shares a connection pool.
shared httpx.Shopper Permits each OpenAI consumer situations to reuse the identical TLS session and connection pool.
Integration with Strands agent
Strands Brokers is an open supply SDK for constructing AI brokers. Strands Brokers helps OpenAI-compatible mannequin suppliers, so now you can run multi-agent workflows totally by yourself SageMaker AI infrastructure. This offers you the pliability of an agent utility that may management devoted endpoints. No knowledge leaves your account, and you may select precisely which mannequin variations your brokers run.
cleansing
To keep away from ongoing costs, delete the endpoint and related sources if you’re finished. SageMaker AI endpoints incur prices whereas in service no matter whether or not they’re receiving visitors.
conclusion
With OpenAI-compatible API assist, Amazon SageMaker AI removes the mixing barrier between the place most AI purposes at the moment reside and the infrastructure they should scale. You’ll be able to preserve your current code, use OpenAI-compatible frameworks, and run inference on devoted endpoints with the required GPU, scaling, and knowledge residency controls. First, deploy your mannequin to the SageMaker AI real-time endpoint utilizing a supported container and SageMaker Python SDKand specify the endpoint URL to the OpenAI consumer. For extra data, see Use SageMaker AI with OpenAI-compatible APIs. Amazon SageMaker AI Developer Informationor open the Amazon SageMaker AI console and create your first endpoint.
In regards to the creator

