Announcing OpenAI-compatible API support for Amazon SageMaker AI endpoints

Today, Amazon SageMaker AI launched OpenAI-compatible API support for real-time inference endpoints. If you use the OpenAI SDK, LangChain, or Strands Agents, you can now call your model on SageMaker AI by changing just the endpoint URL. No custom clients, SigV4 wrappers, or code rewrites are required.

Overview

With this launch, SageMaker AI endpoints now expose an /openai/v1 path that accepts chat completion requests and returns responses from the container unchanged, including streaming responses. The OpenAI-compatible path is enabled for all endpoints and inference components that use the standard SageMaker AI API and SDK.

SageMaker AI routes requests based on the endpoint name in the URL, so you can use any OpenAI-compatible client out of the box. You can now create time-limited bearer tokens for your endpoints and use them with OpenAI clients.

For a working example that includes deployment and invocation, see the accompanying documentation and the notebooks on GitHub.

“We run an AI coding agent that uses multiple LLM providers through an LLM gateway (Bifrost) that speaks the OpenAI Chat Completion protocol. The bearer token feature lets us add SageMaker as a drop-in OpenAI-compatible inference endpoint (no custom SigV4 signing), so it works natively with our gateway, the Vercel AI SDK, and standard OpenAI clients.” (Giorgio Piatti, AI/ML Engineer, Caffeine.AI)

Use cases

Agent workflows on owned infrastructure

When you build multi-step AI agents using frameworks like Strands Agents or LangChain, you can run their entire workflow on your own SageMaker AI endpoint. The agent calls the model using the same OpenAI-compatible interface it was built with, but the inference runs on a dedicated GPU instance in your account.

Hosting multiple models through a single interface

If you want to run multiple models (for example, Llama for general tasks, a fine-tuned Mistral for domain-specific work, and a smaller model for classification), you can host all of them on a single SageMaker AI endpoint using inference components. Each model has its own resource allocation, and all models can be called through the same OpenAI SDK. You don't need to write separate API clients or routing logic in your application code.

Ship fine-tuned models without changing code

If you fine-tune open source models for specific use cases, you can deploy them to SageMaker AI and call them through the same OpenAI-compatible interface that your applications already use. The only change is the endpoint URL. The rest of the application (SDK calls, streaming logic, prompt format) stays the same.

Solution overview

In this post we'll cover:

How bearer token authentication works with SageMaker AI endpoints.
Deploying and invoking endpoints for a single model.
Deploying and invoking inference components for multi-model deployment.
Integration with the Strands Agents framework.

Prerequisites

To follow this tutorial you need:

An AWS account with permissions to create SageMaker AI endpoints.
The SageMaker Python SDK (pip install sagemaker).
The OpenAI Python SDK (pip install openai).
Models stored in Amazon Simple Storage Service (Amazon S3), for example Qwen3-4B downloaded from Hugging Face.
An AWS Identity and Access Management (IAM) execution role with the AmazonSageMakerFullAccess policy to create the endpoint.
An IAM role with the sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint permissions to call the endpoint.

Authentication with bearer token

SageMaker AI OpenAI-compatible endpoints use bearer token authentication. The SageMaker Python SDK includes a token generator that creates time-limited tokens (valid for up to 12 hours) from your existing AWS credentials. No additional secrets or API keys are required.

The token carries the credentials of the role or user that generated it, and that identity needs permission for the following actions: sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint.

Generate a token

Generate a token using the following Python script.

from sagemaker.core.token_generator import generate_token
from datetime import timedelta

# Create a bearer token that is valid for five minutes
token = generate_token(region="us-west-2", expiry=timedelta(minutes=5))

The token generator uses the AWS credentials available in your environment: IAM user credentials, an instance profile on Amazon Elastic Compute Cloud (Amazon EC2), or an AWS IAM Identity Center (SSO) session.

The generate_token function generates a time-limited bearer token for authenticating with the SageMaker API. By default, tokens are valid for 12 hours, but you can override this with the expiry parameter, using a timedelta value between 1 second and 12 hours. The function accepts an optional region, aws_credentials_provider, and expiry. If no AWS Region is specified, it falls back to the AWS_REGION environment variable. If no credentials provider is specified, the default AWS credential chain resolves the credentials. The chain searches multiple sources, including environment variables, ~/.aws/credentials, ~/.aws/config, container credentials, and instance profiles. See the Boto3 credentials documentation for the complete resolution order.
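The expiry bounds and Region fallback described above can be sketched in plain Python. The helper names resolve_region and validate_expiry are hypothetical and are not part of the SageMaker Python SDK; they only mirror the documented behavior so that you can check inputs before requesting a token.

```python
import os
from datetime import timedelta
from typing import Optional

# Bounds taken from the text: 1 second up to 12 hours.
MIN_EXPIRY = timedelta(seconds=1)
MAX_EXPIRY = timedelta(hours=12)


def resolve_region(region: Optional[str] = None) -> str:
    """Return the explicit Region, else fall back to AWS_REGION (hypothetical helper)."""
    resolved = region or os.environ.get("AWS_REGION")
    if not resolved:
        raise ValueError("No AWS Region given and AWS_REGION is not set")
    return resolved


def validate_expiry(expiry: timedelta = MAX_EXPIRY) -> timedelta:
    """Reject expiry values outside the 1 second to 12 hour range (hypothetical helper)."""
    if not MIN_EXPIRY <= expiry <= MAX_EXPIRY:
        raise ValueError(f"expiry must be between {MIN_EXPIRY} and {MAX_EXPIRY}")
    return expiry
```

Validating up front gives a clear local error instead of a rejected request later in the call path.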

Auto-refresh tokens for long-running applications

For applications that run continuously, you can implement an automatic refresh pattern with an httpx.Auth subclass, which ensures that a new token is generated for each request.

import httpx
from sagemaker.core.token_generator import generate_token

class SageMakerAuth(httpx.Auth):
    def __init__(self, region: str):
        self.region = region

    def auth_flow(self, request):
        # Sign a fresh token locally for every outgoing request
        request.headers["Authorization"] = f"Bearer {generate_token(region=self.region)}"
        yield request

http_client = httpx.Client(auth=SageMakerAuth(region="us-west-2"))

IAM permissions

The IAM role or user that calls the endpoint must have the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:endpoint/<ENDPOINT_NAME>"
        },
        {
            "Effect": "Allow",
            "Action": "sagemaker:CallWithBearerToken",
            "Resource": "*"
        }
    ]
}

As a best practice, always restrict the Resource for InvokeEndpoint to a specific endpoint ARN rather than using wildcards. Bearer tokens generated from this role carry the same level of access, so a narrowly scoped policy limits the blast radius if a token is accidentally exposed. Note that CallWithBearerToken requires a wildcard ("*") in the Resource field. Resource-level restrictions aren't supported for this action.
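If you manage several endpoints, a small helper can generate the scoped policy document instead of hand-editing JSON. This is a minimal sketch: build_invoke_policy is a hypothetical name, the account ID in the usage line is a placeholder, and the sketch only covers endpoint ARNs.

```python
import json
from typing import List


def build_invoke_policy(region: str, account_id: str, endpoint_names: List[str]) -> dict:
    """Build a policy that scopes InvokeEndpoint to the named endpoints (hypothetical helper)."""
    endpoint_arns = [
        f"arn:aws:sagemaker:{region}:{account_id}:endpoint/{name}"
        for name in endpoint_names
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:InvokeEndpoint",
                "Resource": endpoint_arns,
            },
            {
                # CallWithBearerToken does not support resource-level restrictions.
                "Effect": "Allow",
                "Action": "sagemaker:CallWithBearerToken",
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and endpoint name, for illustration only.
    print(json.dumps(build_invoke_policy("us-west-2", "111122223333", ["my-endpoint"]), indent=4))
```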

How tokens work

The bearer token is a base64-encoded SigV4-signed URL. When you call generate_token, the SageMaker AI SDK constructs a request to the SageMaker AI service for the CallWithBearerToken action, signs it locally with your AWS credentials, and encodes the resulting signed URL as a portable token string. No network calls are made during token generation. Signing is done entirely on the client side. When you present this token to the SageMaker AI endpoint, the service decodes it, validates the SigV4 signature, verifies that the token has not expired, and verifies that the original IAM identity has the required permissions. The token lifetime is the lesser of the expiry value and the remaining lifetime of the AWS credentials used to sign the token.
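The lifetime rule in the last sentence is easy to get wrong with short-lived role sessions, so here is a minimal sketch of the arithmetic. The function effective_token_lifetime is hypothetical and illustrates the rule only; it is not how the SDK or the service computes expiry internally.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def effective_token_lifetime(
    expiry: timedelta,
    credentials_expire_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> timedelta:
    """Return the lesser of expiry and the remaining credential lifetime, never negative."""
    now = now or datetime.now(timezone.utc)
    if credentials_expire_at is None:
        # Assumption: credentials without an expiration do not cap the token.
        return expiry
    remaining = max(credentials_expire_at - now, timedelta(0))
    return min(expiry, remaining)
```

For example, a token requested with a 12 hour expiry from a role session that has one hour left is usable for about one hour.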

Security best practices: The bearer token carries the same authorization as the underlying AWS credentials used to generate it. Treat tokens with the same care as credentials. Limit the IAM role used for token generation to the minimum necessary privileges: sagemaker:InvokeEndpoint and sagemaker:CallWithBearerToken, with InvokeEndpoint scoped to only the endpoint ARNs that the caller needs to access. Don't generate tokens from roles with broad privileges, such as those granted by the AdministratorAccess or AmazonSageMakerFullAccess managed policies.

Don't store tokens on disk, in environment variables, in configuration files, in databases, or in distributed caches. Don't log tokens, and only send them over encrypted transport protocols such as HTTPS. Generating a token is a local operation with no network overhead, so we recommend that you generate a new token at the time of use or use the auto-refresh httpx.Auth pattern shown in the earlier example. This reduces the risk of token leakage and lets you use tokens with the maximum expiry time remaining. As a best practice, set the token expiry to the shortest duration required by your workload.
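One way to follow the advice about logging is to mask Authorization values before a record is emitted. The sketch below uses only the Python standard library. The names redact_bearer and RedactBearerFilter are hypothetical, and the character set in the pattern is an assumption about what a token may contain.

```python
import logging
import re

# Matches "Bearer <token>"; the allowed token characters are an assumption.
_BEARER_RE = re.compile(r"(Bearer\s+)[A-Za-z0-9+/=_\-.~]+", re.IGNORECASE)


def redact_bearer(text: str) -> str:
    """Replace anything that looks like a bearer token with a fixed marker."""
    return _BEARER_RE.sub(r"\1[REDACTED]", text)


class RedactBearerFilter(logging.Filter):
    """Logging filter that masks bearer tokens in the formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format once, redact, then clear args so the message is not re-formatted.
        record.msg = redact_bearer(record.getMessage())
        record.args = ()
        return True
```

Attach the filter with logger.addFilter(RedactBearerFilter()). The pattern also masks the word that follows "Bearer" in ordinary prose, which is usually an acceptable trade-off for log output.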

Deploy a single model endpoint

A single model endpoint hosts one model and handles requests directly. The following example deploys Qwen3-4B using the SageMaker AI vLLM Deep Learning Container on an ml.g6.2xlarge instance.

Note: SageMaker AI endpoints incur charges while in service, regardless of traffic. For more information, see the Amazon SageMaker AI pricing page.

import boto3
import sagemaker
import time
from sagemaker.core.helper.session_helper import Session
from sagemaker.core.helper.session_helper import get_execution_role

# AWS configuration
REGION = "us-west-2"

# Automatically resolve the account ID and default SageMaker execution role
session = Session(boto_session=boto3.Session(region_name=REGION))
ACCOUNT_ID = boto3.client("sts", region_name=REGION).get_caller_identity()["Account"]
EXECUTION_ROLE = get_execution_role(sagemaker_session=session)

# Hugging Face model ID
MODEL_HF_ID = "Qwen/Qwen3-4B"

# SageMaker vLLM Deep Learning Container
VLLM_IMAGE = f"763104351884.dkr.ecr.{REGION}.amazonaws.com/vllm:0.20.2-gpu-py312-cu130-ubuntu22.04-sagemaker"

# Instance type (1x NVIDIA L4 GPU)
INSTANCE_TYPE = "ml.g6.2xlarge"

sagemaker_client = boto3.client("sagemaker", region_name=REGION)

print(f"Region: {REGION}")
print(f"Account ID: {ACCOUNT_ID}")
print(f"Execution role: {EXECUTION_ROLE}")
print(f"Model HF ID: {MODEL_HF_ID}")

import time

TIMESTAMP = str(int(time.time()))
SME_MODEL_NAME = f"openai-compat-sme-model-{TIMESTAMP}"
SME_ENDPOINT_CONFIG_NAME = f"openai-compat-sme-epc-{TIMESTAMP}"
SME_ENDPOINT_NAME = f"openai-compat-sme-ep-{TIMESTAMP}"

print(f"Timestamp suffix: {TIMESTAMP}")
print(f"Model: {SME_MODEL_NAME}")
print(f"Endpoint config: {SME_ENDPOINT_CONFIG_NAME}")
print(f"Endpoint: {SME_ENDPOINT_NAME}")

sagemaker_client.create_model(
    ModelName=SME_MODEL_NAME,
    ExecutionRoleArn=EXECUTION_ROLE,
    PrimaryContainer={
        "Image": VLLM_IMAGE,
        "Environment": {
            "HF_MODEL_ID": MODEL_HF_ID,
            "SM_VLLM_TENSOR_PARALLEL_SIZE": "1",
            "SM_VLLM_MAX_NUM_SEQS": "4",
            "SM_VLLM_ENABLE_AUTO_TOOL_CHOICE": "true",
            "SM_VLLM_TOOL_CALL_PARSER": "hermes",
            "SAGEMAKER_ENABLE_LOAD_AWARE": "1",
        },
    },
)
print(f"Model created: {SME_MODEL_NAME}")

sagemaker_client.create_endpoint_config(
    EndpointConfigName=SME_ENDPOINT_CONFIG_NAME,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": SME_MODEL_NAME,
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
        }
    ],
)
print(f"Endpoint configuration created: {SME_ENDPOINT_CONFIG_NAME}")

sagemaker_client.create_endpoint(
    EndpointName=SME_ENDPOINT_NAME,
    EndpointConfigName=SME_ENDPOINT_CONFIG_NAME,
)
print(f"Endpoint creation initiated: {SME_ENDPOINT_NAME}")

print("Waiting for endpoint to reach InService status (this takes 5-10 minutes)...")
waiter = sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(
    EndpointName=SME_ENDPOINT_NAME,
    WaiterConfig={"Delay": 30, "MaxAttempts": 40},
)
print(f"Endpoint is InService: {SME_ENDPOINT_NAME}")

The endpoint transitions to InService status within a few minutes. Once it's ready, it serves both the standard SageMaker AI /invocations path and the OpenAI-compatible path /openai/v1/chat/completions.

Call the single model endpoint

Once the endpoint is in service, call it using the OpenAI Python SDK. The base URL follows this format:

https://runtime.sagemaker.<REGION>.amazonaws.com/endpoints/<ENDPOINT_NAME>/openai/v1

from openai import OpenAI
from sagemaker.core.token_generator import generate_token

REGION = "us-west-2"

sme_base_url = f"https://runtime.sagemaker.{REGION}.amazonaws.com/endpoints/{SME_ENDPOINT_NAME}/openai/v1"

# The bearer token takes the place of the OpenAI API key
client = OpenAI(
    base_url=sme_base_url,
    api_key=generate_token(region=REGION)
)

print(f"Base URL: {sme_base_url}")

stream = client.chat.completions.create(
    model="",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how transformers work in machine learning, in three sentences."},
    ],
    stream=True,
)

# Print each content delta as it arrives; skip chunks that carry no choices
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()

The model field is passed through to the container. SageMaker AI routes requests based on the endpoint name in the URL, so you can leave this field empty or set it to match the model name the container expects.

Deploy an inference component endpoint

Inference components allow a single endpoint to host multiple models, each with dedicated compute resources. With inference components, the model is associated with the component rather than with the endpoint configuration.

IC_MODEL_NAME = f"openai-compat-ic-model-{TIMESTAMP}"
IC_ENDPOINT_CONFIG_NAME = f"openai-compat-ic-epc-{TIMESTAMP}"
IC_ENDPOINT_NAME = f"openai-compat-ic-ep-{TIMESTAMP}"
IC_NAME = f"openai-compat-ic-qwen3-4b-{TIMESTAMP}"

print(f"Model: {IC_MODEL_NAME}")
print(f"Endpoint config: {IC_ENDPOINT_CONFIG_NAME}")
print(f"Endpoint: {IC_ENDPOINT_NAME}")
print(f"Inference comp: {IC_NAME}")

sagemaker_client.create_model(
    ModelName=IC_MODEL_NAME,
    ExecutionRoleArn=EXECUTION_ROLE,
    PrimaryContainer={
        "Image": VLLM_IMAGE,
        "Environment": {
            "HF_MODEL_ID": MODEL_HF_ID,
            "SM_VLLM_TENSOR_PARALLEL_SIZE": "1",
            "SM_VLLM_MAX_NUM_SEQS": "4",
            "SM_VLLM_ENABLE_AUTO_TOOL_CHOICE": "true",
            "SM_VLLM_TOOL_CALL_PARSER": "hermes",
            "SAGEMAKER_ENABLE_LOAD_AWARE": "1",
        },
    },
)
print(f"Model created: {IC_MODEL_NAME}")

sagemaker_client.create_endpoint_config(
    EndpointConfigName=IC_ENDPOINT_CONFIG_NAME,
    ExecutionRoleArn=EXECUTION_ROLE,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
        }
    ],
)
print(f"Endpoint configuration created: {IC_ENDPOINT_CONFIG_NAME}")

sagemaker_client.create_endpoint(
    EndpointName=IC_ENDPOINT_NAME,
    EndpointConfigName=IC_ENDPOINT_CONFIG_NAME,
)
print(f"Endpoint creation initiated: {IC_ENDPOINT_NAME}")

print("Waiting for endpoint to reach InService status (this takes 5-10 minutes)...")
waiter = sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(
    EndpointName=IC_ENDPOINT_NAME,
    WaiterConfig={"Delay": 30, "MaxAttempts": 40},
)
print(f"Endpoint is InService: {IC_ENDPOINT_NAME}")

sagemaker_client.create_inference_component(
    InferenceComponentName=IC_NAME,
    EndpointName=IC_ENDPOINT_NAME,
    VariantName="variant1",
    Specification={
        "ModelName": IC_MODEL_NAME,
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": 1024,
            "NumberOfCpuCoresRequired": 2,
            "NumberOfAcceleratorDevicesRequired": 1,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
print(f"Inference component creation initiated: {IC_NAME}")

print("Waiting for inference component to reach InService status...")
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=IC_NAME)
    status = desc["InferenceComponentStatus"]
    if status == "InService":
        print(f"Inference component is InService: {IC_NAME}")
        break
    elif status == "Failed":
        raise RuntimeError(f"Inference component failed: {desc.get('FailureReason', 'unknown')}")
    time.sleep(30)
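The polling loop above has no upper bound, so it waits indefinitely if the component never leaves a transitional status. A bounded variant is sketched below. The function wait_for_status is a hypothetical helper, and its defaults mirror the delay and attempt count used by the endpoint waiter earlier in this post.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "InService",
    failed: str = "Failed",
    delay_seconds: float = 30,
    max_attempts: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll get_status until it returns target; raise on failure or after max_attempts."""
    for _ in range(max_attempts):
        status = get_status()
        if status == target:
            return
        if status == failed:
            raise RuntimeError(f"Resource entered {failed} status")
        sleep(delay_seconds)
    raise TimeoutError(f"Status was not {target} after {max_attempts} attempts")
```

To use it with the inference component above, pass a callable that returns the InferenceComponentStatus field of describe_inference_component. The sleep parameter is injectable so that the helper can be tested without real delays.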

You can create more inference components on the same endpoint to host multiple models with independent scaling and resource allocation.

Call the inference component

To call a specific inference component, include its name in the URL path.

https://runtime.sagemaker.<REGION>.amazonaws.com/endpoints/<ENDPOINT>/inference-components/<IC_NAME>/openai/v1
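Both URL formats in this post share the same prefix, so a single helper can build either one. This is a minimal sketch: openai_base_url is a hypothetical name, and percent-encoding the path segments is a defensive assumption rather than a documented requirement.

```python
from typing import Optional
from urllib.parse import quote


def openai_base_url(region: str, endpoint_name: str, inference_component: Optional[str] = None) -> str:
    """Build the OpenAI-compatible base URL for an endpoint or one of its inference components."""
    path = f"/endpoints/{quote(endpoint_name, safe='')}"
    if inference_component:
        path += f"/inference-components/{quote(inference_component, safe='')}"
    return f"https://runtime.sagemaker.{region}.amazonaws.com{path}/openai/v1"
```

Pass the result as base_url when you construct the OpenAI client.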

The following example targets an inference component on a shared endpoint through an OpenAI client that is backed by a shared httpx.Client. To target a second component, create another OpenAI client with that component's name in the URL and pass it the same httpx.Client.

import httpx
from openai import OpenAI
from sagemaker.core.token_generator import generate_token

# One HTTP client, reused by every OpenAI client that targets this endpoint
shared_http = httpx.Client()

client_a = OpenAI(
    base_url=(
        f"https://runtime.sagemaker.{REGION}.amazonaws.com"
        f"/endpoints/{IC_ENDPOINT_NAME}/inference-components/{IC_NAME}/openai/v1"
    ),
    api_key=generate_token(region=REGION),
    http_client=shared_http,
)

response = client_a.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": "What is 42 * 3? Reply with the number."}],
)
print(f"Response: {response.choices[0].message.content}")
print("Connection pool active: shared_http is reusable across multiple IC clients")

The shared httpx.Client allows the OpenAI client instances to reuse the same TLS session and connection pool.

Integration with Strands Agents

Strands Agents is an open source SDK for building AI agents. Strands Agents supports OpenAI-compatible model providers, so you can now run multi-agent workflows entirely on your own SageMaker AI infrastructure. This gives you the flexibility of an agentic application with the control of dedicated endpoints. No data leaves your account, and you can choose exactly which model versions your agents run.

from openai import AsyncOpenAI
from strands import Agent, tool
from strands.models.openai import OpenAIModel
from sagemaker.core.token_generator import generate_token

@tool
def calculator(expression: str) -> str:
    """Evaluate a math expression."""
    # eval() runs arbitrary Python; use it only for demos with trusted input
    return str(eval(expression))

strands_client = AsyncOpenAI(
    base_url=f"https://runtime.sagemaker.{REGION}.amazonaws.com/endpoints/{SME_ENDPOINT_NAME}/openai/v1",
    api_key=generate_token(region=REGION),
)

model = OpenAIModel(client=strands_client, model_id="", params={"temperature": 0.7})

coder = Agent(
    model=model,
    system_prompt=(
        "You are an expert Python developer. Write clean, well-documented "
        "Python code with type hints. Output ONLY the code, no explanation."
    ),
    tools=[calculator],
)

reviewer = Agent(
    model=model,
    system_prompt=(
        "You are a senior code reviewer. Review Python code for correctness, "
        "performance, and PEP 8 style. Give a concise review with specific suggestions."
    ),
    tools=[calculator],
)
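The calculator tool above passes model-generated text to eval, which executes arbitrary Python. Outside of a demo, you may prefer an evaluator that accepts only arithmetic. The following is a minimal sketch using the standard library ast module. The function safe_eval is hypothetical, and the set of allowed operators is an assumption about what the tool needs.

```python
import ast
import operator

# Allowed operators; exponentiation is left out to avoid very large results.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str):
    """Evaluate a numeric expression without executing arbitrary code."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


print(safe_eval("42 * 3"))  # prints 126
```

Inside the tool, return str(safe_eval(expression)) in place of the eval call. Names, function calls, and attribute access all raise ValueError.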

Clean up

To avoid ongoing charges, delete the endpoint and associated resources when you're done. SageMaker AI endpoints incur costs while in service regardless of whether they're receiving traffic.

import boto3
sagemaker_client = boto3.client("sagemaker", region_name="us-west-2")

sagemaker_client.delete_inference_component(InferenceComponentName="<IC_NAME>")
sagemaker_client.delete_endpoint(EndpointName="<ENDPOINT_NAME>")
sagemaker_client.delete_endpoint_config(EndpointConfigName="<ENDPOINT_CONFIG_NAME>")
sagemaker_client.delete_model(ModelName="<MODEL_NAME>")
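If one of the delete calls above raises, the calls after it never run and the remaining resources stay in place. The sketch below attempts every deletion in the same order and reports failures at the end. The function delete_resources is a hypothetical helper; it uses only the four client methods shown above and does not wait for asynchronous deletions to finish.

```python
from typing import Any, List, Tuple


def delete_resources(
    client: Any,
    inference_component: str,
    endpoint: str,
    endpoint_config: str,
    model: str,
) -> List[Tuple[str, Exception]]:
    """Attempt each deletion in order; collect failures instead of stopping early."""
    steps = [
        ("inference component", lambda: client.delete_inference_component(InferenceComponentName=inference_component)),
        ("endpoint", lambda: client.delete_endpoint(EndpointName=endpoint)),
        ("endpoint config", lambda: client.delete_endpoint_config(EndpointConfigName=endpoint_config)),
        ("model", lambda: client.delete_model(ModelName=model)),
    ]
    failures: List[Tuple[str, Exception]] = []
    for label, call in steps:
        try:
            call()
        except Exception as exc:  # keep going so later resources are still attempted
            failures.append((label, exc))
    return failures
```

Check the returned list and retry or investigate any entries, because a resource that failed to delete may continue to incur charges.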

Conclusion

With OpenAI-compatible API support, Amazon SageMaker AI removes the integration barrier between where most AI applications live today and the infrastructure they need to scale. You can keep your existing code, use OpenAI-compatible frameworks, and run inference on dedicated endpoints with the GPU, scaling, and data residency controls that you require. To get started, deploy your model to a SageMaker AI real-time endpoint using a supported container and the SageMaker Python SDK, and point your OpenAI client at the endpoint URL. For more information, see Use SageMaker AI with OpenAI-compatible APIs in the Amazon SageMaker AI Developer Guide, or open the Amazon SageMaker AI console and create your first endpoint.
