In the present day, we’re excited to announce the supply of Llama 4 Scout and Maverick fashions in Amazon SageMaker JumpStart and coming quickly in Amazon Bedrock. Llama 4 represents Meta’s most superior multimodal fashions so far, that includes a combination of consultants (MoE) structure and context window assist as much as 10 million tokens. With native multimodality and early fusion expertise, Meta states that these new fashions reveal unprecedented efficiency throughout textual content and imaginative and prescient duties whereas sustaining environment friendly compute necessities. With a dramatic enhance on supported context size from 128K in Llama 3, Llama 4 is now appropriate for multi-document summarization, parsing intensive person exercise for customized duties, and reasoning over intensive codebases. Now you can deploy the Llama-4-Scout-17B-16E-Instruct, Llama-4-Maverick-17B-128E-Instruct, and Llama-4-Maverick-17B-128E-Instruct-FP8 fashions utilizing SageMaker JumpStart within the US East (N. Virginia) AWS Area.
On this weblog publish, we stroll you thru how one can deploy and immediate a Llama-4-Scout-17B-16E-Instruct mannequin utilizing SageMaker JumpStart.
Llama 4 overview
Meta introduced Llama 4 right this moment, introducing three distinct mannequin variants: Scout, which provides superior multimodal capabilities and a 10M token context window; Maverick, an economical resolution with a 128K context window; and Behemoth, in preview. These fashions are optimized for multimodal reasoning, multilingual duties, coding, tool-calling, and powering agentic programs.
Llama 4 Maverick is a robust general-purpose mannequin with 17 billion energetic parameters, 128 consultants, and 400 billion complete parameters, and optimized for high-quality normal assistant and chat use circumstances. Moreover, Llama 4 Maverick is offered with base and instruct fashions in each a quantized model (FP8) for environment friendly deployment on the Instruct mannequin and a non-quantized (BF16) model for optimum accuracy.
Llama 4 Scout, the extra compact and smaller mannequin, has 17 billion energetic parameters, 16 consultants, and 109 billion complete parameters, and options an industry-leading 10M token context window. These fashions are designed for industry-leading efficiency in picture and textual content understanding with assist for 12 languages, enabling the creation of AI functions that bridge language boundaries.
See Meta’s community license agreement for utilization phrases and extra particulars.
SageMaker JumpStart overview
SageMaker JumpStart provides entry to a broad collection of publicly out there basis fashions (FMs). These pre-trained fashions function highly effective beginning factors that may be deeply personalized to deal with particular use circumstances. You should utilize state-of-the-art mannequin architectures—equivalent to language fashions, laptop imaginative and prescient fashions, and extra—with out having to construct them from scratch.
With SageMaker JumpStart, you possibly can deploy fashions in a safe atmosphere. The fashions could be provisioned on devoted SageMaker inference situations could be remoted inside your digital personal cloud (VPC). After deploying an FM, you possibly can additional customise and fine-tune it utilizing the intensive capabilities of Amazon SageMaker AI, together with SageMaker inference for deploying fashions and container logs for improved observability. With SageMaker AI, you possibly can streamline all the mannequin deployment course of.
Stipulations
To attempt the Llama 4 fashions in SageMaker JumpStart, you want the next stipulations:
Uncover Llama 4 fashions in SageMaker JumpStart
SageMaker JumpStart offers FMs by way of two major interfaces: SageMaker Studio and the Amazon SageMaker Python SDK. This offers a number of choices to find and use a whole bunch of fashions to your particular use case.
SageMaker Studio is a complete built-in improvement atmosphere (IDE) that provides a unified, web-based interface for performing all elements of the AI improvement lifecycle. From making ready knowledge to constructing, coaching, and deploying fashions, SageMaker Studio offers purpose-built instruments to streamline all the course of.
In SageMaker Studio, you possibly can entry SageMaker JumpStart to find and discover the intensive catalog of FMs out there for deployment to inference capabilities on SageMaker Inference. You may entry SageMaker JumpStart by selecting JumpStart within the navigation pane or by selecting JumpStart from the Residence web page in SageMaker Studio, as proven within the following determine.
Alternatively, you need to use the SageMaker Python SDK to programmatically entry and use SageMaker JumpStart fashions. This method permits for better flexibility and integration with current AI and machine studying (AI/ML) workflows and pipelines.
By offering a number of entry factors, SageMaker JumpStart helps you seamlessly incorporate pre-trained fashions into your AI/ML improvement efforts, no matter your most well-liked interface or workflow.
Deploy Llama 4 fashions for inference by way of the SageMaker JumpStart UI
On the SageMaker JumpStart touchdown web page, yow will discover all the general public pre-trained fashions provided by SageMaker AI. You may then select the Meta mannequin supplier tab to find all of the out there Meta fashions.
Should you’re utilizing SageMaker Basic Studio and don’t see the Llama 4 fashions, replace your SageMaker Studio model by shutting down and restarting. For extra details about model updates, see Shut down and Replace Studio Basic Apps.
- Seek for Meta to view the Meta mannequin card. Every mannequin card exhibits key info, together with:
- Mannequin title
- Supplier title
- Activity class (for instance, Textual content Era)
- Choose the mannequin card to view the mannequin particulars web page.

The mannequin particulars web page contains the next info:
- The mannequin title and supplier info
- Deploy button to deploy the mannequin
- About and Notebooks tabs with detailed info
The About tab contains essential particulars, equivalent to:
- Mannequin description
- License info
- Technical specs
- Utilization pointers
Earlier than you deploy the mannequin, we beneficial you evaluation the mannequin particulars and license phrases to verify compatibility together with your use case.
- Select Deploy to proceed with deployment.

- For Endpoint title, use the routinely generated title or enter a customized one.
- For Occasion sort, use the default: p5.48xlarge.
- For Preliminary occasion depend, enter the variety of situations (default: 1).
Choosing acceptable occasion sorts and counts is essential for value and efficiency optimization. Monitor your deployment to regulate these settings as wanted. - Below Inference sort, Actual-time inference is chosen by default. That is optimized for sustained visitors and low latency.
- Evaluate all configurations for accuracy. For this mannequin, we strongly advocate adhering to SageMaker JumpStart default settings and ensuring that community isolation stays in place.
- Select Deploy. The deployment course of can take a number of minutes to finish.

When deployment is full, your endpoint standing will change to InService. At this level, the mannequin is able to settle for inference requests by way of the endpoint. You may monitor the deployment progress on the SageMaker console Endpoints web page, which can show related metrics and standing info. When the deployment is full, you possibly can invoke the mannequin utilizing a SageMaker runtime shopper and combine it together with your functions.
Deploy Llama 4 fashions for inference utilizing the SageMaker Python SDK
Whenever you select Deploy and settle for the phrases, mannequin deployment will begin. Alternatively, you possibly can deploy by way of the instance pocket book by selecting Open Pocket book. The pocket book offers end-to-end steering on how one can deploy the mannequin for inference and clear up sources.
To deploy utilizing a pocket book, begin by deciding on an acceptable mannequin, specified by the model_id. You may deploy any of the chosen fashions on SageMaker AI.
You may deploy the Llama 4 Scout mannequin utilizing SageMaker JumpStart with the next SageMaker Python SDK code:
This deploys the mannequin on SageMaker AI with default configurations, together with default occasion sort and default VPC configurations. You may change these configurations by specifying non-default values in JumpStartModel. To efficiently deploy the mannequin, you need to manually set accept_eula=True as a deploy methodology argument. After it’s deployed, you possibly can run inference in opposition to the deployed endpoint by way of the SageMaker predictor:
Advisable situations and benchmark
The next desk lists all of the Llama 4 fashions out there in SageMaker JumpStart together with the model_id, default occasion sorts, and the utmost variety of complete tokens (sum of variety of enter tokens and variety of generated tokens) supported for every of those fashions. For elevated context size, you possibly can modify the default occasion sort within the SageMaker JumpStart UI.
| Mannequin title | Mannequin ID | Default occasion sort | Supported occasion sorts |
| Llama-4-Scout-17B-16E-Instruct | meta-vlm-llama-4-scout-17b-16e-instruct |
ml.p5.48xlarge | ml.g6e.48xlarge, ml.p5.48xlarge, ml.p5en.48xlarge |
| Llama-4-Maverick-17B-128E-Instruct | meta-vlm-llama-4-maverick-17b-128e-instruct |
ml.p5.48xlarge | ml.p5.48xlarge, ml.p5en.48xlarge |
| Llama 4-Maverick-17B-128E-Instruct-FP8 | meta-vlm-llama-4-maverick-17b-128-instruct-fp8 |
ml.p5.48xlarge | ml.p5.48xlarge, ml.p5en.48xlarge |
Inference and instance prompts for Llama 4 Scout 17B 16 Specialists mannequin
You should utilize the Llama 4 Scout mannequin for textual content and picture or imaginative and prescient reasoning use circumstances. With that mannequin, you possibly can carry out a wide range of duties, equivalent to picture captioning, picture textual content retrieval, visible query answering and reasoning, doc visible query answering, and extra.
Within the following sections we present instance payloads, invocations, and responses for Llama 4 Scout that you need to use in opposition to your Llama 4 mannequin deployments utilizing Sagemaker JumpStart.
Textual content-only enter
Enter:
Response:
Single-image enter
On this part, let’s take a look at Llama 4’s multimodal capabilities. By merging textual content and imaginative and prescient tokens right into a unified processing spine, Llama 4 can seamlessly perceive and reply to queries about a picture. The next is an instance of how one can immediate Llama 4 to reply questions on a picture such because the one within the instance:
Picture:

Enter:
Response:
The Llama 4 mannequin on JumpStart can take within the picture offered through a URL, underlining its highly effective potential for real-time multimodal functions.
Multi-image enter
Constructing on its superior multimodal performance, Llama 4 can effortlessly course of a number of pictures on the similar time. On this demonstration, the mannequin is prompted with two picture URLs and tasked with describing every picture and explaining their relationship, showcasing its capability to synthesize info throughout a number of visible inputs. Let’s take a look at this beneath by passing within the URLs of the next pictures within the payload.
Picture 1:

Picture 2:

Enter:
Response:
As you possibly can see, Llama 4 excels in dealing with a number of pictures concurrently, offering detailed and contextually related insights that emphasize its sturdy multimodal processing talents.
Codebase evaluation with Llama 4
Utilizing Llama 4 Scout’s industry-leading context window, this part showcases its means to deeply analyze expansive codebases. The instance extracts and contextualizes the buildspec-1-10-2.yml file from the AWS Deep Learning Containers GitHub repository, illustrating how the mannequin synthesizes info throughout a whole repository. We used a software to ingest the entire repository into plaintext that we offered to the mannequin as context:
Enter:
Output:
Multi-document processing
Harnessing the identical intensive token context window, Llama 4 Scout excels in multi-document processing. On this instance, the mannequin extracts key monetary metrics from Amazon 10-Okay studies (2017-2024), demonstrating its functionality to combine and analyze knowledge spanning a number of years—all with out the necessity for added processing instruments.
Enter:
Output:
Clear up
To keep away from incurring pointless prices, while you’re carried out, delete the SageMaker endpoints utilizing the next code snippets:
Alternatively, utilizing the SageMaker console, full the next steps:
- On the SageMaker console, below Inference within the navigation pane, select Endpoints.
- Seek for the embedding and textual content era endpoints.
- On the endpoint particulars web page, select Delete.
- Select Delete once more to verify.
Conclusion
On this publish, we explored how SageMaker JumpStart empowers knowledge scientists and ML engineers to find, entry, and deploy a variety of pre-trained FMs for inference, together with Meta’s most superior and succesful fashions so far. Get began with SageMaker JumpStart and Llama 4 fashions right this moment.
For extra details about SageMaker JumpStart, see Prepare, deploy, and consider pretrained fashions with SageMaker JumpStart and Getting began with Amazon SageMaker JumpStart.
In regards to the authors
Marco Punio is a Sr. Specialist Options Architect targeted on generative AI technique, utilized AI options, and conducting analysis to assist clients hyper-scale on AWS. As a member of the Third-party Mannequin Supplier Utilized Sciences Options Structure group at AWS, he’s a worldwide lead for the Meta–AWS Partnership and technical technique. Primarily based in Seattle, Washington, Marco enjoys writing, studying, exercising, and constructing functions in his free time.
Chakravarthy Nagarajan is a Principal Options Architect specializing in machine studying, large knowledge, and excessive efficiency computing. In his present function, he helps clients remedy real-world, advanced enterprise issues utilizing machine studying and generative AI options.
Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the SageMaker machine studying and generative AI hub. She is captivated with constructing options that assist clients speed up their AI journey and unlock enterprise worth.
Malav Shastri is a Software program Growth Engineer at AWS, the place he works on the Amazon SageMaker JumpStart and Amazon Bedrock groups. His function focuses on enabling clients to reap the benefits of state-of-the-art open supply and proprietary basis fashions and conventional machine studying algorithms. Malav holds a Grasp’s diploma in Laptop Science.
Niithiyn Vijeaswaran is a Generative AI Specialist Options Architect with the Third-party Mannequin Science group at AWS. His space of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s diploma in Laptop Science and Bioinformatics.
Baladithya Balamurugan is a Options Architect at AWS targeted on ML deployments for inference and utilizing AWS Neuron to speed up coaching and inference. He works with clients to allow and speed up their ML deployments on companies equivalent to Amazon Sagemaker and Amazon EC2. Primarily based in San Francisco, Baladithya enjoys tinkering, creating functions, and his residence lab in his free time.
John Liu has 14 years of expertise as a product govt and 10 years of expertise as a portfolio supervisor. At AWS, John is a Principal Product Supervisor for Amazon Bedrock. Beforehand, he was the Head of Product for AWS Web3 and Blockchain. Previous to AWS, John held numerous product management roles at public blockchain protocols and fintech corporations, and likewise spent 9 years as a portfolio supervisor at numerous hedge funds.

