Amazon SageMaker multi-model endpoints (MMEs) are a fully managed capability of SageMaker inference that allows you to deploy thousands of models on a single endpoint. Previously, MMEs statically allocated CPU computing power to models regardless of the model traffic load, using Multi Model Server (MMS) as the model server. In this post, we discuss a solution in which an MME can dynamically adjust the compute power assigned to each model based on the model's traffic pattern. This solution enables you to use the underlying compute of MMEs more efficiently and save costs.
MMEs dynamically load and unload models based on incoming traffic to the endpoint. When using MMS as the model server, MMEs allocate a fixed number of model workers for each model. For more information, refer to Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints.
However, this can lead to a few issues when your traffic pattern is variable. Let's say you have a single model or a few models receiving a large amount of traffic. You can configure MMS to allocate a high number of workers for these models, but this gets assigned to all the models behind the MME because it's a static configuration. This leads to a large number of workers using hardware compute, even for the idle models. The opposite problem can happen if you set a small value for the number of workers. The popular models won't have enough workers at the model server level to properly allocate enough hardware behind the endpoint for these models. The main issue is that it's difficult to remain traffic pattern agnostic if you can't dynamically scale your workers at the model server level to allocate the necessary amount of compute.
The solution we discuss in this post uses DJLServing as the model server, which can help mitigate some of the issues that we discussed, enable per-model scaling, and enable MMEs to be traffic pattern agnostic.
MME architecture
SageMaker MMEs enable you to deploy multiple models behind a single inference endpoint that may contain one or more instances. Each instance is designed to load and serve multiple models up to its memory and CPU/GPU capacity. With this architecture, a software as a service (SaaS) business can break the linearly increasing cost of hosting multiple models and achieve reuse of infrastructure consistent with the multi-tenancy model applied elsewhere in the application stack. The following diagram illustrates this architecture.
A SageMaker MME dynamically loads models from Amazon Simple Storage Service (Amazon S3) when invoked, instead of downloading all the models when the endpoint is first created. As a result, an initial invocation to a model might see higher inference latency than the subsequent inferences, which are completed with low latency. If the model is already loaded on the container when invoked, then the download step is skipped and the model returns the inferences with low latency. For example, assume you have a model that is only used a few times a day. It's automatically loaded on demand, whereas frequently accessed models are retained in memory and invoked with consistently low latency.
Behind each MME are model hosting instances, as depicted in the following diagram. These instances load and evict multiple models to and from memory based on the traffic patterns to the models.

SageMaker continues to route inference requests for a model to the instance where the model is already loaded such that the requests are served from a cached model copy (see the following diagram, which shows the request path for the first prediction request vs. the cached prediction request path). However, if the model receives many invocation requests, and there are additional instances for the MME, SageMaker routes some requests to another instance to accommodate the increase. To take advantage of automated model scaling in SageMaker, make sure you have instance auto scaling set up to provision additional instance capacity. Set up your endpoint-level scaling policy with either custom parameters or invocations per minute (recommended) to add more instances to the endpoint fleet.
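As an illustration of the endpoint-level scaling policy described above, the following is a minimal sketch that builds the two Application Auto Scaling requests for a target-tracking policy on invocations per instance. The endpoint name, variant name, capacity bounds, target value, and cooldowns are placeholder assumptions, not values from this post.

```python
def build_scaling_requests(endpoint_name, variant_name="AllTraffic",
                           min_instances=1, max_instances=20,
                           invocations_per_instance=1000.0):
    """Build request payloads for register_scalable_target and put_scaling_policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,   # assumed lower bound
        "MaxCapacity": max_instances,   # assumed upper bound
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Average invocations per minute per instance to hold steady (assumed).
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds, assumed
            "ScaleOutCooldown": 60,   # seconds, assumed
        },
    }
    return target, policy


def apply_scaling(endpoint_name):
    """Send both requests; needs AWS credentials and an existing endpoint."""
    import boto3  # imported lazily so the builder above runs without AWS access

    client = boto3.client("application-autoscaling")
    target, policy = build_scaling_requests(endpoint_name)
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

Keeping the payload construction separate from the API calls makes it possible to inspect the policy before applying it to a live endpoint.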

Model server overview
A model server is a software component that provides a runtime environment for deploying and serving machine learning (ML) models. It acts as an interface between the trained models and client applications that want to make predictions using those models.
The primary purpose of a model server is to allow effortless integration and efficient deployment of ML models into production systems. Instead of embedding the model directly into an application or a specific framework, the model server provides a centralized platform where multiple models can be deployed, managed, and served.
Model servers typically offer the following functionalities:
- Model loading – The server loads the trained ML models into memory, making them ready for serving predictions.
- Inference API – The server exposes an API that allows client applications to send input data and receive predictions from the deployed models.
- Scaling – Model servers are designed to handle concurrent requests from multiple clients. They provide mechanisms for parallel processing and managing resources efficiently to ensure high throughput and low latency.
- Integration with backend engines – Model servers have integrations with backend frameworks like DeepSpeed and FasterTransformer to partition large models and run highly optimized inference.
DJL architecture
DJL Serving is an open source, high performance, universal model server. DJL Serving is built on top of DJL, a deep learning library written in the Java programming language. It can take a deep learning model, several models, or workflows and make them available through an HTTP endpoint. DJL Serving supports deploying models from multiple frameworks like PyTorch, TensorFlow, Apache MXNet, ONNX, TensorRT, Hugging Face Transformers, DeepSpeed, FasterTransformer, and more.
DJL Serving offers many features that allow you to deploy your models with high performance:
- Ease of use – DJL Serving can serve most models out of the box. Just bring the model artifacts, and DJL Serving can host them.
- Multiple device and accelerator support – DJL Serving supports deploying models on CPU, GPU, and AWS Inferentia.
- Performance – DJL Serving runs multithreaded inference in a single JVM to boost throughput.
- Dynamic batching – DJL Serving supports dynamic batching to increase throughput.
- Auto scaling – DJL Serving automatically scales workers up and down based on the traffic load.
- Multi-engine support – DJL Serving can simultaneously host models using different frameworks (such as PyTorch and TensorFlow).
- Ensemble and workflow models – DJL Serving supports deploying complex workflows composed of multiple models, and runs parts of the workflow on CPU and parts on GPU. Models within a workflow can use different frameworks.
In particular, the auto scaling feature of DJL Serving makes it straightforward to ensure the models are scaled appropriately for the incoming traffic. By default, DJL Serving determines the maximum number of workers for a model that can be supported based on the hardware available (CPU cores, GPU devices). You can set lower and upper bounds for each model to make sure that a minimum traffic level can always be served, and that a single model doesn't consume all available resources.
DJL Serving uses a Netty frontend on top of backend worker thread pools. The frontend uses a single Netty setup with multiple HttpRequestHandlers. Different request handlers provide support for the Inference API, Management API, or other APIs available from various plugins.
The backend is based around the WorkLoadManager (WLM) module. The WLM takes care of multiple worker threads for each model along with the batching and request routing to them. When multiple models are served, WLM first checks the inference request queue size of each model. If the queue size is greater than two times a model's batch size, WLM scales up the number of workers assigned to that model.
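The scale-up rule in the preceding paragraph can be illustrated with a small Python sketch. This is not DJL Serving's implementation (which is written in Java); the `ModelState` class, the one-worker-per-check step, and the default bounds are assumptions made only to show the queue-size comparison.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    """Hypothetical per-model bookkeeping used only for this illustration."""
    name: str
    batch_size: int
    queue_size: int
    workers: int
    min_workers: int = 1
    max_workers: int = 4


def scale_up_if_backlogged(state: ModelState) -> ModelState:
    """Add one worker when the queue exceeds 2 x batch size, within max_workers."""
    backlogged = state.queue_size > 2 * state.batch_size
    if backlogged and state.workers < state.max_workers:
        state.workers += 1  # assumed step size of one worker per check
    return state


hot = scale_up_if_backlogged(ModelState("hot", batch_size=1, queue_size=5, workers=1))
idle = scale_up_if_backlogged(ModelState("idle", batch_size=1, queue_size=0, workers=1))
print(hot.workers, idle.workers)  # 2 1
```

The model with a backlog gains a worker while the idle model keeps its single worker, which is the per-model behavior that a static worker count cannot provide.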
Solution overview
The implementation of DJL with an MME differs from the default MMS setup. For DJL Serving with an MME, we compress the following files in the model.tar.gz format that SageMaker Inference is expecting:
- model.joblib – For this implementation, we directly push the model metadata into the tarball. In this case, we are working with a .joblib file, so we provide that file in our tarball for our inference script to read. If the artifact is too large, you can also push it to Amazon S3 and point towards that in the serving configuration you define for DJL.
- serving.properties – Here you can configure any model server-related environment variables. The power of DJL here is that you can configure minWorkers and maxWorkers for each model tarball. This allows each model to scale up and down at the model server level. For instance, if a single model is receiving the majority of the traffic for an MME, the model server scales the workers up dynamically. In this example, we don't configure these variables and let DJL determine the necessary number of workers depending on our traffic pattern.
- model.py – This is the inference script for any custom preprocessing or postprocessing you would like to implement. The model.py expects your logic to be encapsulated in a handle method by default.
- requirements.txt (optional) – By default, DJL comes installed with PyTorch, but any additional dependencies you need can be pushed here.
For this example, we showcase the power of DJL with an MME by taking a sample SKLearn model. We run a training job with this model and then create 1,000 copies of this model artifact to back our MME. We then showcase how DJL can dynamically scale to handle any type of traffic pattern that your MME may receive. This can include an even distribution of traffic across all models or even a few popular models receiving the majority of the traffic. You can find all the code in the following GitHub repo.
Prerequisites
For this example, we use a SageMaker notebook instance with a conda_python3 kernel and ml.c5.xlarge instance. To perform the load tests, you can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or a larger SageMaker notebook instance. In this example, we scale to over a thousand transactions per second (TPS), so we suggest testing on a heavier EC2 instance such as an ml.c5.18xlarge so that you have more compute to work with.
Create a model artifact
We first need to create our model artifact and data that we use in this example. For this case, we generate some artificial data with NumPy and train using an SKLearn linear regression model with the following code snippet:
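A minimal sketch of this step, assuming NumPy, scikit-learn, and joblib are installed. The data shape, coefficients, noise level, and random seed are illustrative assumptions and may differ from the code in the GitHub repo.

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate artificial regression data (shape and coefficients are assumed).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 4))
true_coef = np.array([1.5, -2.0, 0.5, 3.0])
y = X @ true_coef + rng.normal(scale=0.1, size=1000)

# Fit the linear regression model and serialize it for the inference script.
model = LinearRegression().fit(X, y)
joblib.dump(model, "model.joblib")
```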
After you run the preceding code, you should have a model.joblib file created in your local environment.
Pull the DJL Docker image
The Docker image djl-inference:0.23.0-cpu-full-v1.0 is our DJL serving container used in this example. You can adjust the following URL depending on your Region:
inference_image_uri = "474422712127.dkr.ecr.us-east-1.amazonaws.com/djl-serving-cpu:latest"
Optionally, you can also use this image as a base image and extend it to build your own Docker image on Amazon Elastic Container Registry (Amazon ECR) with any other dependencies you need.
Create the model file
First, we create a file called serving.properties. This instructs DJLServing to use the Python engine. We also define the max_idle_time of a worker to be 600 seconds, which means idle workers are kept longer before the number of workers per model is scaled down. We don't set minWorkers and maxWorkers, although we could, and instead let DJL dynamically compute the number of workers needed depending on the traffic each model is receiving. The serving.properties is shown as follows. To see the complete list of configuration options, refer to Engine Configuration.
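A minimal sketch that writes a serving.properties file with the two settings named above. The keys `engine` and `max_idle_time` and their values come from the text; writing the file from Python rather than by hand is only a convenience assumed here.

```python
from pathlib import Path

# minWorkers and maxWorkers are intentionally left unset so that DJL Serving
# decides the number of workers per model from the traffic it observes.
serving_properties = "\n".join([
    "engine=Python",
    "max_idle_time=600",
]) + "\n"

Path("serving.properties").write_text(serving_properties)
```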
Next, we create our model.py file, which defines the model loading and inference logic. For MMEs, each model.py file is specific to a model. Models are stored in their own paths under the model store (usually /opt/ml/model/). When loading models, they are loaded under the model store path in their own directory. The full model.py example in this demo can be seen in the GitHub repo.
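The following sketch writes a possible model.py to disk. It is not the script from the GitHub repo: the JSON payload format (a list of feature rows), the response shape, and the lazy-loading pattern are assumptions, and the script itself only runs inside the DJL Serving container, where the `djl_python` package is available.

```python
from pathlib import Path

# Contents of the inference script; it is executed by DJL Serving's Python
# engine, not by this notebook, so djl_python is not needed locally.
MODEL_PY = '''\
import os

import joblib
import numpy as np
from djl_python import Input, Output

model = None  # loaded once per worker, on the first request


def load_model(properties):
    # The "model_dir" property points at this model's own directory.
    model_dir = properties.get("model_dir")
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def handle(inputs: Input):
    global model
    if model is None:
        model = load_model(inputs.get_properties())
    if inputs.is_empty():
        # Empty request sent when the model is first loaded; nothing to predict.
        return None
    data = np.asarray(inputs.get_as_json())  # assumed payload: list of rows
    prediction = model.predict(data).tolist()
    return Output().add_as_json({"prediction": prediction})
'''

Path("model.py").write_text(MODEL_PY)
```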
We create a model.tar.gz file that includes our model (model.joblib), model.py, and serving.properties:
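One way to build the archive is with Python's standard `tarfile` module, as in the sketch below. The helper name `build_model_tarball` is hypothetical, and placing the files at the archive root is an assumption about the layout expected by the container.

```python
import tarfile
from pathlib import Path


def build_model_tarball(files, out_path="model.tar.gz"):
    """Pack the given files at the archive root and return the member names."""
    with tarfile.open(out_path, "w:gz") as tar:
        for f in files:
            path = Path(f)
            # arcname drops any directory prefix so files sit at the root.
            tar.add(str(path), arcname=path.name)
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()


# Usage, once the three files exist in the working directory:
# build_model_tarball(["model.joblib", "model.py", "serving.properties"])
```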
For demonstration purposes, we make 1,000 copies of the same model.tar.gz file to represent the large number of models to be hosted. In production, you need to create a model.tar.gz file for each of your models.
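A minimal sketch of the copy step using only the standard library. The helper name and the `sklearn-<index>.tar.gz` naming scheme are assumptions chosen for illustration.

```python
import shutil
from pathlib import Path


def replicate_artifact(src, dest_dir, copies=1000, prefix="sklearn"):
    """Copy one model archive `copies` times and return the new file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    names = []
    for i in range(copies):
        name = f"{prefix}-{i}.tar.gz"  # hypothetical naming scheme
        shutil.copy(src, dest / name)
        names.append(name)
    return names


# Usage: replicate_artifact("model.tar.gz", "models", copies=1000)
```

After upload, each file name relative to the S3 prefix serves as the TargetModel value when invoking the endpoint.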
Finally, we upload these models to Amazon S3.
Create a SageMaker model
We now create a SageMaker model. We use the ECR image defined earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure Mode as MultiModel. This tells DJLServing that we're creating an MME.
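As a sketch of the model setup, the following builds the request for the SageMaker `create_model` API with Mode set to MultiModel. The model name, role ARN, image URI, and S3 prefix are placeholders to be supplied by you; the helper names are hypothetical.

```python
def build_create_model_request(model_name, role_arn, image_uri, s3_prefix):
    """Build a create_model payload for a multi-model endpoint."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,
            "Mode": "MultiModel",       # marks the container as hosting many models
            "ModelDataUrl": s3_prefix,  # S3 prefix holding all model.tar.gz files
        },
    }


def create_model(request):
    """Send the request; needs AWS credentials and permissions."""
    import boto3  # imported lazily so the builder above runs without AWS access

    return boto3.client("sagemaker").create_model(**request)
```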
Create a SageMaker endpoint
In this demo, we use 20 ml.c5d.18xlarge instances to scale to a TPS in the thousands range. Make sure to get a limit increase on your instance type, if necessary, to achieve the TPS you are targeting.
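The endpoint can be created with an endpoint configuration followed by the endpoint itself. The sketch below builds both requests with the instance type and count from this demo; the endpoint name, configuration name, and variant name are placeholder assumptions.

```python
def build_endpoint_requests(endpoint_name, model_name,
                            instance_type="ml.c5d.18xlarge", instance_count=20):
    """Build payloads for create_endpoint_config and create_endpoint."""
    config_name = f"{endpoint_name}-config"  # hypothetical naming
    config = {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",  # assumed variant name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    }
    endpoint = {"EndpointName": endpoint_name, "EndpointConfigName": config_name}
    return config, endpoint


def create_endpoint(config, endpoint):
    """Create the endpoint and wait until it is in service; needs AWS access."""
    import boto3  # imported lazily so the builder above runs without AWS access

    sm = boto3.client("sagemaker")
    sm.create_endpoint_config(**config)
    sm.create_endpoint(**endpoint)
    sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint["EndpointName"])
```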
Load testing
At the time of writing, the SageMaker in-house load testing tool Amazon SageMaker Inference Recommender doesn't natively support testing for MMEs. Therefore, we use the open source Python tool Locust. Locust is straightforward to set up and can track metrics such as TPS and end-to-end latency. For a full understanding of how to set it up with SageMaker, see Best practices for load testing Amazon SageMaker real-time inference endpoints.
In this use case, we have three different traffic patterns we want to simulate with MMEs, so we have the following three Python scripts that align with each pattern. Our goal here is to prove that, regardless of what our traffic pattern is, we can achieve the same target TPS and scale appropriately.
We can specify a weight in our Locust script to assign traffic across different portions of our models. For instance, with our single hot model, we implement two methods as follows:
We can then assign a certain weight to each method, so that each method receives a specific percentage of the traffic:
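The weighting idea can be sketched in plain Python, independent of Locust (where weights are passed to the `@task` decorator). The 9:1 split, the model count, and the `sklearn-<index>.tar.gz` names are assumptions for illustration, not the values used in the Locust scripts.

```python
import random
from collections import Counter


def pick_target_model(rng, hot_weight=9, rest_weight=1, total_models=1000):
    """Choose the hot model or one of the remaining models by weight."""
    kind = rng.choices(["hot", "rest"], weights=[hot_weight, rest_weight])[0]
    if kind == "hot":
        return "sklearn-0.tar.gz"  # the single popular model (assumed name)
    # Any other model, chosen uniformly from indices 1..total_models-1.
    return f"sklearn-{rng.randint(1, total_models - 1)}.tar.gz"


rng = random.Random(42)  # fixed seed so the simulation is repeatable
requests = [pick_target_model(rng) for _ in range(10_000)]
counts = Counter(name == "sklearn-0.tar.gz" for name in requests)
hot_share = counts[True] / len(requests)
print(f"share of traffic sent to the hot model: {hot_share:.2%}")
```

With these weights, roughly nine in ten simulated requests target the hot model, while the remainder spreads across the other models.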
For 20 ml.c5d.18xlarge instances, we see the following invocation metrics on the Amazon CloudWatch console. These values remain fairly consistent across all three traffic patterns. To understand CloudWatch metrics for SageMaker real-time inference and MMEs better, refer to SageMaker Endpoint Invocation Metrics.

You can find the rest of the Locust scripts in the locust-utils directory in the GitHub repository.
Summary
In this post, we discussed how an MME can dynamically adjust the compute power assigned to each model based on the model's traffic pattern. This newly launched feature is available in all AWS Regions where SageMaker is available. Note that at the time of announcement, only CPU instances are supported. To learn more, refer to Supported algorithms, frameworks, and instances.
About the Authors
Ram Vegiraju is a ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor's research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James's work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.
Xu Deng is a Software Engineer Manager with the SageMaker team. He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker. In his spare time, he loves traveling and snowboarding.
Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focuses on building solutions for large model inference. Prior to AWS he worked in the Amazon Grocery org building new payment features for customers world-wide. Outside of work, he enjoys snowboarding, the outdoors, and watching sports.
Rohith Nallamaddi is a Software Development Engineer at AWS. He works on optimizing deep learning workloads on GPUs, building high performance ML inference and serving solutions. Prior to this, he worked on building microservices based on AWS for Amazon F3 business. Outside of work he enjoys playing and watching sports.

