Enterprises are seeking to quickly unlock the potential of generative AI by providing access to foundation models (FMs) to different lines of business (LOBs). IT teams are responsible for helping the LOBs innovate with speed and agility while providing centralized governance and observability. For example, they may need to track the usage of FMs across teams, charge back costs, and provide visibility to the relevant cost center in the LOB. Additionally, they may need to regulate access to different models per team, for example, if only specific FMs are approved for use.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Because Amazon Bedrock is serverless, you don’t have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.
A software as a service (SaaS) layer for foundation models can provide a simple and consistent interface for end-users, while maintaining centralized governance of access and consumption. API gateways can provide loose coupling between model consumers and the model endpoint service, and flexibility to adapt to changing models, architectures, and invocation methods.
In this post, we show you how to build an internal SaaS layer to access foundation models with Amazon Bedrock in a multi-tenant (team) architecture. We specifically focus on usage and cost tracking per tenant and also on controls such as usage throttling per tenant. We describe how the solution and Amazon Bedrock consumption plans map to the general SaaS journey framework. The code for the solution and an AWS Cloud Development Kit (AWS CDK) template is available in the GitHub repository.
Challenges
An AI platform administrator needs to provide standardized and easy access to FMs to multiple development teams.
The following are some of the challenges in providing governed access to foundation models:
- Cost and usage tracking – Track and audit individual tenant costs and usage of foundation models, and provide chargeback costs to specific cost centers
- Budget and usage controls – Manage API quota, budget, and usage limits for the permitted use of foundation models over a defined frequency per tenant
- Access control and model governance – Define access controls for specific allow-listed models per tenant
- Multi-tenant standardized API – Provide consistent access to foundation models with OpenAPI standards
- Centralized management of API – Provide a single layer to manage API keys for accessing models
- Model versions and updates – Handle new and updated model version rollouts
Solution overview
In this solution, we refer to a multi-tenant approach. A tenant here can range from an individual user, a specific project, or a team to an entire department. As we discuss the approach, we use the term team, because it’s the most common. We use API keys to restrict and monitor API access for teams. Each team is assigned an API key for access to the FMs. There can be different user authentication and authorization mechanisms deployed in an organization. For simplicity, we do not include these in this solution. You may also integrate existing identity providers with this solution.
The following diagram summarizes the solution architecture and key components. Teams (tenants) assigned to separate cost centers consume Amazon Bedrock FMs via an API service. To track consumption and cost per team, the solution logs data for each individual invocation, including the model invoked, the number of tokens for text generation models, and image dimensions for multi-modal models. In addition, it aggregates the invocations per model and the costs by each team.
You can deploy the solution in your own account using the AWS CDK. AWS CDK is an open source software development framework to model and provision your cloud application resources using familiar programming languages. The AWS CDK code is available in the GitHub repository.
In the following sections, we discuss the key components of the solution in more detail.
Capturing foundation model usage per team
The workflow to capture FM usage per team consists of the following steps (as numbered in the preceding diagram):
- A team’s application sends a POST request to Amazon API Gateway with the model to be invoked in the model_id query parameter and the user prompt in the request body.
- API Gateway routes the request to an AWS Lambda function (bedrock_invoke_model) that’s responsible for logging team usage information in Amazon CloudWatch and invoking the Amazon Bedrock model.
- Amazon Bedrock provides a VPC endpoint powered by AWS PrivateLink. In this solution, the Lambda function sends the request to Amazon Bedrock using PrivateLink to establish a private connection between the VPC in your account and the Amazon Bedrock service account. To learn more about PrivateLink, see Use AWS PrivateLink to set up private access to Amazon Bedrock.
- After the Amazon Bedrock invocation, Amazon CloudTrail generates a CloudTrail event.
- If the Amazon Bedrock call is successful, the Lambda function logs the following information depending on the type of invoked model and returns the generated response to the application:
  - team_id – The unique identifier for the team issuing the request.
  - requestId – The unique identifier of the request.
  - model_id – The ID of the model to be invoked.
  - inputTokens – The number of tokens sent to the model as part of the prompt (for text generation and embeddings models).
  - outputTokens – The maximum number of tokens to be generated by the model (for text generation models).
  - height – The height of the requested image (for multi-modal models and multi-modal embeddings models).
  - width – The width of the requested image (for multi-modal models only).
  - steps – The steps requested (for Stability AI models).
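To make the logged fields concrete, the following is a minimal sketch of how one usage record could be assembled as a JSON line. The field names come from the list above; the function name build_usage_record and the sample values are hypothetical, and the actual Lambda function in the repository may structure its log entries differently. Inside a Lambda function, anything written to standard output ends up in CloudWatch Logs, so printing the line is enough for this illustration.

```python
import json


def build_usage_record(team_id, request_id, model_id, input_tokens=None,
                       output_tokens=None, height=None, width=None, steps=None):
    """Return one usage log entry as a JSON string.

    Fields that do not apply to the invoked model type (for example,
    height and width for a text model) are left out of the record.
    """
    record = {
        "team_id": team_id,
        "requestId": request_id,
        "model_id": model_id,
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "height": height,
        "width": width,
        "steps": steps,
    }
    return json.dumps({k: v for k, v in record.items() if v is not None})


# Hypothetical text generation invocation by team-1.
line = build_usage_record(
    "team-1", "req-123", "anthropic.claude-v2",
    input_tokens=12, output_tokens=200,
)
print(line)
```

Keeping one self-describing JSON object per invocation makes it straightforward to filter and aggregate the records later by team_id and model_id.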
Tracking costs per team
A different flow aggregates the usage information, then calculates and saves the on-demand costs per team on a daily basis. By having a separate flow, we ensure that cost tracking doesn’t impact the latency and throughput of the model invocation flow. The workflow steps are as follows:
- An Amazon EventBridge rule triggers a Lambda function (bedrock_cost_tracking) daily.
- The Lambda function gets the usage information from CloudWatch for the previous day, calculates the associated costs, and stores the data aggregated by team_id and model_id in Amazon Simple Storage Service (Amazon S3) in CSV format.
To query and visualize the data stored in Amazon S3, you have different options, including S3 Select, Amazon Athena, and Amazon QuickSight.
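The aggregation step can be sketched in a few lines of Python. This is an illustration only: it assumes the usage records have already been read from CloudWatch into a list of dictionaries with the field names used earlier in this post, and the helper names aggregate_usage and to_csv are hypothetical. Reading from CloudWatch and writing to Amazon S3 are left out.

```python
import csv
import io
from collections import defaultdict


def aggregate_usage(records):
    """Sum tokens and count invocations per (team_id, model_id) pair."""
    totals = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "invocations": 0}
    )
    for record in records:
        key = (record["team_id"], record["model_id"])
        totals[key]["input_tokens"] += record.get("inputTokens", 0)
        totals[key]["output_tokens"] += record.get("outputTokens", 0)
        totals[key]["invocations"] += 1
    return totals


def to_csv(totals):
    """Render the aggregated totals as CSV text, one row per team and model."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(
        ["team_id", "model_id", "input_tokens", "output_tokens", "invocations"]
    )
    for (team_id, model_id), t in sorted(totals.items()):
        writer.writerow(
            [team_id, model_id, t["input_tokens"], t["output_tokens"], t["invocations"]]
        )
    return buffer.getvalue()


# Made-up records for one day.
records = [
    {"team_id": "team-1", "model_id": "anthropic.claude-v2", "inputTokens": 100, "outputTokens": 200},
    {"team_id": "team-1", "model_id": "anthropic.claude-v2", "inputTokens": 50, "outputTokens": 80},
    {"team_id": "team-2", "model_id": "amazon.titan-tg1-large", "inputTokens": 30, "outputTokens": 40},
]
totals = aggregate_usage(records)
csv_text = to_csv(totals)
print(csv_text)
```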
Controlling usage per team
A usage plan specifies who can access one or more deployed APIs and optionally sets the target request rate to start throttling requests. The plan uses API keys to identify API clients who can access the associated API for each key. You can use API Gateway usage plans to throttle requests that exceed predefined thresholds. You can also use API keys and quota limits, which enable you to set the maximum number of requests per API key that each team is permitted to issue within a specified time interval. This is in addition to Amazon Bedrock service quotas, which are assigned only at the account level.
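API Gateway’s documentation describes its throttling in terms of a token bucket with a steady-state rate and a burst capacity. The following sketch is a simplified model of that idea, meant only to show how rate and burst interact for a single API key; it is not how API Gateway is implemented, and the class name TokenBucket is hypothetical. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Simplified per-key throttle: `rate` tokens refill per second,
    and at most `burst` tokens can be stored."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous request

    def allow(self, now):
        """Return True if a request arriving at time `now` is admitted."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One request per second steady state, bursts of up to two requests.
bucket = TokenBucket(rate=1, burst=2)
results = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 1.0)]
print(results)  # [True, True, False, True]
```

The first two requests use up the burst capacity, the third is rejected, and one second later a single token has been refilled. A quota limit works differently: it caps the total number of requests in a longer window (for example, per day), regardless of how evenly they are spread.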
Prerequisites
Before you deploy the solution, make sure you have the following:
Deploy the AWS CDK stack
Follow the instructions in the README file of the GitHub repository to configure and deploy the AWS CDK stack.
The stack deploys the following resources:
- Private networking environment (VPC, private subnets, security group)
- IAM role for controlling model access
- Lambda layers for the necessary Python modules
- Lambda function invoke_model
- Lambda function list_foundation_models
- Lambda function cost_tracking
- REST API (API Gateway)
- API Gateway usage plan
- API key associated with the usage plan
Onboard a new team
To provide access to new teams, you can either share the same API key across different teams and track model consumption by providing a different team_id for each API invocation, or create dedicated API keys for accessing Amazon Bedrock resources by following the instructions provided in the README.
The stack deploys the following resources:
- API Gateway usage plan associated with the previously created REST API
- API key associated with the usage plan for the new team, with reserved throttling and burst configurations for the API
For more information about API Gateway throttling and burst configurations, refer to Throttle API requests for better throughput.
After you deploy the stack, you can see that the new API key for team-2 is created as well.

Configure model access control
The platform administrator can allow access to specific foundation models by editing the IAM policy associated with the Lambda function invoke_model. The IAM permissions are defined in the file setup/stack_constructs/iam.py. See the following code:
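The repository file remains the authoritative definition of these permissions. As an illustration only of the kind of allow list such a policy expresses, the following sketch builds an IAM policy document that permits bedrock:InvokeModel on a fixed set of foundation model ARNs. The function name build_invoke_policy, the Region, and the model IDs are assumptions made for this example and are not taken from the repository.

```python
import json


def build_invoke_policy(region, allowed_model_ids):
    """Return an IAM policy document that allows invoking only the listed models."""
    resources = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for model_id in allowed_model_ids
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                "Resource": resources,
            }
        ],
    }


# Hypothetical allow list with two models in us-east-1.
policy = build_invoke_policy(
    "us-east-1", ["anthropic.claude-v2", "amazon.titan-embed-text-v1"]
)
print(json.dumps(policy, indent=2))
```

With a policy of this shape attached to the Lambda execution role, a request for any model outside the list is rejected by Amazon Bedrock with an access denied error.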
Invoke the service
After you have deployed the solution, you can invoke the service directly from your code. The following is an example in Python for consuming the invoke_model API for text generation through a POST request:
Output: Amazon Bedrock is an internal technology platform developed by Amazon to run and operate many of their services and products. Some key things about Bedrock …
The following is another example in Python for consuming the invoke_model API for embeddings generation through a POST request:
import requests

# api_url, api_key, team_id, and model_kwargs are assumed to be defined earlier
model_id = "amazon.titan-embed-text-v1"  # the model ID for the Amazon Titan Embeddings Text model
prompt = "What is Amazon Bedrock?"
response = requests.post(
    f"{api_url}/invoke_model?model_id={model_id}",
    json={"inputs": prompt, "parameters": model_kwargs},
    headers={
        "x-api-key": api_key,  # key for querying the API
        "team_id": team_id,  # unique tenant identifier
        "embeddings": "true",  # boolean value for the embeddings model
    },
)
text = response.json()[0]["embedding"]
Output: 0.91796875, 0.45117188, 0.52734375, -0.18652344, 0.06982422, 0.65234375, -0.13085938, 0.056884766, 0.092285156, 0.06982422, 1.03125, 0.8515625, 0.16308594, 0.079589844, -0.033935547, 0.796875, -0.15429688, -0.29882812, -0.25585938, 0.45703125, 0.044921875, 0.34570312 …
Access denied to foundation models
The following is an example in Python for consuming the invoke_model API for text generation through a POST request with an access denied response:
<Response [500]> "Traceback (most recent call last):\n File \"/var/task/index.py\", line 213, in lambda_handler\n response = _invoke_text(bedrock_client, model_id, body, model_kwargs)\n File \"/var/task/index.py\", line 146, in _invoke_text\n raise e\n File \"/var/task/index.py\", line 131, in _invoke_text\n response = bedrock_client.invoke_model(\n File \"/opt/python/botocore/client.py\", line 535, in _api_call\n return self._make_api_call(operation_name, kwargs)\n File \"/opt/python/botocore/client.py\", line 980, in _make_api_call\n raise error_class(parsed_response, operation_name)\nbotocore.errorfactory.AccessDeniedException: An error occurred (AccessDeniedException) when calling the InvokeModel operation: Your account is not authorized to invoke this API operation.\n"
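A client can tell this case apart from other failures by inspecting the response. The following sketch assumes, based on the output above, that a denied model surfaces as a 500 response whose body mentions AccessDeniedException. It also assumes the usual API Gateway behavior of returning 403 for a missing or invalid API key and 429 when a usage plan throttle or quota is exceeded. The function name classify_response and the labels it returns are hypothetical.

```python
def classify_response(status_code, body_text):
    """Map an HTTP status code and response body to a coarse outcome label."""
    if status_code == 200:
        return "ok"
    if status_code == 403:
        # Assumed API Gateway response for a missing or invalid API key.
        return "invalid_api_key"
    if status_code == 429:
        # Assumed API Gateway response when throttle or quota limits are hit.
        return "throttled"
    if "AccessDeniedException" in body_text:
        # The team's IAM policy does not allow the requested model.
        return "model_access_denied"
    return "error"


denied_body = (
    "botocore.errorfactory.AccessDeniedException: An error occurred "
    "(AccessDeniedException) when calling the InvokeModel operation"
)
print(classify_response(500, denied_body))
```

In application code, the status code and body would come from the response object of the POST request, for example response.status_code and response.text when using the requests library.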
Cost estimation example
When invoking Amazon Bedrock models with on-demand pricing, the total cost is calculated as the sum of the input and output costs. Input costs are based on the number of input tokens sent to the model, and output costs are based on the tokens generated. The prices are per 1,000 input tokens and per 1,000 output tokens. For more details and specific model prices, refer to Amazon Bedrock Pricing.
Let’s look at an example where two teams, Team1 and Team2, access Amazon Bedrock through the solution in this post. The usage and cost data saved in Amazon S3 in a single day is shown in the following table.
The columns input_tokens and output_tokens store the total input and output tokens across model invocations per model and per team, respectively, for a given day.
The columns input_cost and output_cost store the respective costs per model and per team. These are calculated using the following formulas:
input_cost = input_token_count * model_pricing["input_cost"] / 1000
output_cost = output_token_count * model_pricing["output_cost"] / 1000
| team_id | model_id | input_tokens | output_tokens | invocations | input_cost | output_cost |
| Team1 | amazon.titan-tg1-large | 24000 | 2473 | 1000 | 0.0072 | 0.00099 |
| Team1 | anthropic.claude-v2 | 2448 | 4800 | 24 | 0.02698 | 0.15686 |
| Team2 | amazon.titan-tg1-large | 35000 | 52500 | 350 | 0.0105 | 0.021 |
| Team2 | ai21.j2-grande-instruct | 4590 | 9000 | 45 | 0.05738 | 0.1125 |
| Team2 | anthropic.claude-v2 | 1080 | 4400 | 20 | 0.0119 | 0.14379 |
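As a worked example, the following sketch applies the two formulas to the Team2 row for amazon.titan-tg1-large. The per-1,000-token prices used here are not official figures: they are assumptions chosen so that the result matches the table, and current prices should always be taken from Amazon Bedrock Pricing. The function name on_demand_cost is hypothetical.

```python
def on_demand_cost(input_tokens, output_tokens, model_pricing):
    """Apply the input and output cost formulas for one team and model."""
    input_cost = input_tokens * model_pricing["input_cost"] / 1000
    output_cost = output_tokens * model_pricing["output_cost"] / 1000
    return input_cost, output_cost


# Assumed prices per 1,000 tokens, inferred from the table (not official).
pricing = {"input_cost": 0.0003, "output_cost": 0.0004}

# Team2, amazon.titan-tg1-large: 35,000 input tokens and 52,500 output tokens.
input_cost, output_cost = on_demand_cost(35000, 52500, pricing)
print(round(input_cost, 5), round(output_cost, 5))
```

The rounded values correspond to the 0.0105 and 0.021 shown in the input_cost and output_cost columns for that row; the total on-demand cost for the row is their sum.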
End-to-end view of a functional multi-tenant serverless SaaS environment
Let’s understand what an end-to-end functional multi-tenant serverless SaaS environment might look like. The following is a reference architecture diagram.

This architecture diagram is a zoomed-out version of the architecture diagram explained earlier in the post, which details one of the microservices shown here (the foundation model service). It shows that, apart from the foundation model service, you need other components in your multi-tenant SaaS platform to implement a functional and scalable platform.
Let’s go through the details of the architecture.
Tenant applications
The tenant applications are the front-end applications that interact with the environment. Here, we show multiple tenants accessing from different local or AWS environments. The front-end applications can be extended to include a registration page for new tenants to register themselves and an admin console for administrators of the SaaS service layer. If the tenant applications require custom logic that needs to interact with the SaaS environment, they can implement the specifications of the application adaptor microservice. An example scenario is adding custom authorization logic while respecting the authorization specifications of the SaaS environment.
Shared services
The following are shared services:
- Tenant and user management services – These services are responsible for registering and managing the tenants. They provide the cross-cutting functionality that’s separate from application services and shared across all of the tenants.
- Foundation model service – The solution architecture diagram explained at the beginning of this post represents this microservice, where the interaction from API Gateway to Lambda functions happens within the scope of this microservice. All tenants use this microservice to invoke the foundation models from Anthropic, AI21, Cohere, Stability, Meta, and Amazon, as well as fine-tuned models. It also captures the information needed for usage tracking in CloudWatch logs.
- Cost tracking service – This service tracks the cost and usage for each tenant. This microservice runs on a schedule to query the CloudWatch logs and output the aggregated usage tracking and inferred cost to the data storage. The cost tracking service can be extended to build further reports and visualizations.
Application adaptor service
This service presents a set of specifications and APIs that a tenant may implement in order to integrate their custom logic with the SaaS environment. Depending on how much custom integration is needed, this component can be optional for tenants.
Multi-tenant data store
The shared services store their data in a data store that can be a single shared Amazon DynamoDB table with a tenant partitioning key that associates DynamoDB items with individual tenants. The cost tracking shared service outputs the aggregated usage and cost tracking data to Amazon S3. Depending on the use case, there can be an application-specific data store as well.
A multi-tenant SaaS environment can have a lot more components. For more information, refer to Building a Multi-Tenant SaaS Solution Using AWS Serverless Services.
Support for multiple deployment models
SaaS frameworks typically outline two deployment models: pool and silo. In the pool model, all tenants access FMs from a shared environment with common storage and compute infrastructure. In the silo model, each tenant has its own set of dedicated resources. You can read about isolation models in the SaaS Tenant Isolation Strategies whitepaper.
The proposed solution can be adopted for both SaaS deployment models. In the pool approach, a centralized AWS environment hosts the API, storage, and compute resources. In silo mode, each team accesses APIs, storage, and compute resources in a dedicated AWS environment.
The solution also fits with the available consumption plans provided by Amazon Bedrock. AWS provides a choice of two consumption plans for inference:
- On-Demand – This mode allows you to use foundation models on a pay-as-you-go basis without having to make any time-based term commitments
- Provisioned Throughput – This mode allows you to provision sufficient throughput to meet your application’s performance requirements in exchange for a time-based term commitment
For more information about these options, refer to Amazon Bedrock Pricing.
The serverless SaaS reference solution described in this post can apply the Amazon Bedrock consumption plans to provide basic and premium tiering options to end-users. The basic tier could include On-Demand or Provisioned Throughput consumption of Amazon Bedrock and could include specific usage and budget limits. Tenant limits could be enforced by throttling requests based on request counts, token sizes, or budget allocation. Premium tier tenants could have their own dedicated resources with Provisioned Throughput consumption of Amazon Bedrock. These tenants would typically be associated with production workloads that require high-throughput and low-latency access to Amazon Bedrock FMs.
Conclusion
In this post, we discussed how to build an internal SaaS platform to access foundation models with Amazon Bedrock in a multi-tenant setup, with a focus on tracking costs and usage and on throttling limits for each tenant. Additional topics to explore include integrating existing authentication and authorization solutions in the organization, enhancing the API layer to include web sockets for bi-directional client-server interactions, adding content filtering and other governance guardrails, designing multiple deployment tiers, integrating other microservices in the SaaS architecture, and many more.
The entire code for this solution is available in the GitHub repository.
For more information about SaaS-based frameworks, refer to SaaS Journey Framework: Building a New SaaS Solution on AWS.
About the Authors
Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS, working with Healthcare and Life Sciences customers. Hasan helps design, deploy, and scale generative AI and machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development, and data science on the cloud. In his spare time, Hasan likes to explore nature and spend time with friends and family.
Anastasia Tzeveleka is a Senior AI/ML Specialist Solutions Architect at AWS. As part of her work, she helps customers across EMEA build foundation models and create scalable generative AI and machine learning solutions using AWS services.
Bruno Pistone is a Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them to deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.
Vikesh Pandey is a Generative AI/ML Solutions Architect specialising in financial services, where he helps financial customers build and scale generative AI/ML platforms and solutions that scale to hundreds or even thousands of users. In his spare time, Vikesh likes to write on various blog forums and build Legos with his kid.

