
You can use Amazon Bedrock Custom Model Import to seamlessly integrate your customized models—such as Llama, Mistral, and Qwen—that you’ve fine-tuned elsewhere into Amazon Bedrock. The experience is fully serverless, minimizing infrastructure management while providing your imported models the same unified API access as native Amazon Bedrock models. Your custom models benefit from automatic scaling, enterprise-grade security, and native integration with Amazon Bedrock features such as Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases.

Understanding how confident a model is in its predictions is critical for building reliable AI applications, particularly when working with specialized custom models that may encounter domain-specific queries.

With log probability support now added to Custom Model Import, you can access information about your models’ confidence in their predictions at the token level. This enhancement provides greater visibility into model behavior and enables new capabilities for model evaluation, confidence scoring, and advanced filtering techniques.

In this post, we explore how log probabilities work with imported models in Amazon Bedrock. You’ll learn what log probabilities are, how to enable them in your API calls, and how to interpret the returned data. We also highlight practical applications—from detecting potential hallucinations to optimizing RAG systems and evaluating fine-tuned models—that demonstrate how these insights can improve your AI applications, helping you build more trustworthy solutions with your custom models.

Understanding log probabilities

In language models, a log probability represents the logarithm of the probability that the model assigns to a token in a sequence. These values indicate how confident the model is about each token it generates or processes. Log probabilities are expressed as negative numbers, with values closer to zero indicating higher confidence. For example, a log probability of -0.1 corresponds to roughly 90% confidence, while a value of -3.0 corresponds to about 5% confidence. By inspecting these values, you can identify when a model is highly certain versus when it is making less confident predictions. Log probabilities provide a quantitative measure of how likely the model considered each generated token, offering valuable insight into the confidence of its output. By analyzing them, you can:

  • Gauge confidence across a response: Assess how confident the model was in different sections of its output, helping you identify where it was certain versus uncertain.
  • Score and compare outputs: Compare overall sequence likelihood (by summing or averaging log probabilities) to rank or filter multiple model outputs.
  • Detect potential hallucinations: Identify sudden drops in token-level confidence, which can flag segments that might require verification or review.
  • Reduce RAG costs with early pruning: Run short, low-cost draft generations based on retrieved contexts, compute log probabilities for those drafts, and discard low-scoring candidates early, avoiding unnecessary full-length generations or expensive reranking while keeping only the most promising contexts in the pipeline.
  • Build confidence-aware applications: Adapt system behavior based on certainty levels—for example, trigger clarifying prompts, provide fallback responses, or flag outputs for human review.

Overall, log probabilities are a powerful tool for interpreting and debugging model responses with measurable certainty—particularly valuable for applications where understanding why a model responded a certain way can be as important as the response itself.
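
To make these numbers concrete, here is a minimal Python sketch (plain math, independent of the Bedrock API) that converts log probabilities into probabilities using the exponential function:

import math

# exp() inverts the logarithm, recovering the underlying probability
for logprob in (-0.1, -1.0, -3.0):
    print(f"log probability {logprob:5.1f} -> probability {math.exp(logprob):.3f}")

# log probability  -0.1 -> probability 0.905   (~90% confidence)
# log probability  -1.0 -> probability 0.368
# log probability  -3.0 -> probability 0.050   (~5% confidence)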

Prerequisites

To use log probability support with Custom Model Import in Amazon Bedrock, you need:

  • An active AWS account with access to Amazon Bedrock
  • A custom model created in Amazon Bedrock using the Custom Model Import feature after July 31, 2025, when log probabilities support was launched
  • Appropriate AWS Identity and Access Management (IAM) permissions to invoke models through the Amazon Bedrock Runtime

Introducing log probabilities support in Amazon Bedrock

With this launch, Amazon Bedrock now allows models imported using the Custom Model Import feature to return token-level log probabilities as part of the inference response.

When invoking a model through the Amazon Bedrock InvokeModel API, you can access token log probabilities by setting "return_logprobs": true in the JSON request body. With this flag enabled, the model’s response will include additional fields providing log probabilities for both the prompt tokens and the generated tokens, so that customers can analyze the model’s confidence in its predictions. These log probabilities let you quantitatively assess how confident your custom models are when processing inputs and generating responses. The granular metrics allow for better evaluation of response quality, troubleshooting of unexpected outputs, and optimization of prompts or model configurations.

Let’s walk through an example of invoking a custom model on Amazon Bedrock with log probabilities enabled and examine the output format. Suppose you’ve already imported a custom model (for instance, a fine-tuned Llama 3.2 1B model) into Amazon Bedrock and have its model Amazon Resource Name (ARN). You can invoke this model using the Amazon Bedrock Runtime SDK (Boto3 for Python in this example) as shown in the following example:

import boto3, json

bedrock_runtime = boto3.client('bedrock-runtime')
model_arn = "arn:aws:bedrock:<<aws-region>>:<<account-id>>:imported-model/your-model-id"

# Define the request payload with log probabilities enabled
request_payload = {
    "prompt": "The quick brown fox jumps",
    "max_gen_len": 50,
    "temperature": 0.5,
    "stop": [".", "\n"],
    "return_logprobs": True   # Request log probabilities
}

response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps(request_payload),
    contentType="application/json",
    accept="application/json"
)

# Parse the JSON response
result = json.loads(response["body"].read())
print(json.dumps(result, indent=2))

In the preceding code, we send a prompt—"The quick brown fox jumps"—to our custom imported model. We configure standard inference parameters: a maximum generation length of 50 tokens, a temperature of 0.5 for moderate randomness, and a stop condition (either a period or a newline). The "return_logprobs": True parameter tells Amazon Bedrock to return log probabilities in the response.

The InvokeModel API returns a JSON response containing three main components: the standard generated text output, metadata about the generation process, and now log probabilities for both prompt and generated tokens. These values reveal the model’s internal confidence for each token prediction, so you can understand not just what text was produced, but how certain the model was at each step of the process. The following is an example response from the "quick brown fox jumps" prompt, showing log probabilities (appearing as negative numbers):

{
  'prompt_logprobs': [
    None,
    {'791': -3.6223082542419434, '14924': -1.184808373451233},
    {'4062': -9.256651878356934, '220': -3.6941518783569336},
    {'14198': -4.840845108032227, '323': -1.7158453464508057},
    {'39935': -0.049946799874305725},
    {'35308': -0.2087990790605545}
  ],
  'generation': ' over the lazy dog',
  'prompt_token_count': 6,
  'generation_token_count': 5,
  'stop_reason': 'stop',
  'logprobs': [
    {'927': -0.04093993827700615},
    {'279': -0.0728893131017685},
    {'16053': -0.02005653828382492},
    {'5679': -0.03769925609230995},
    {'627': -1.194122076034546}
  ]
}

The raw API response provides token IDs paired with their log probabilities. To make this data interpretable, we need to first decode the token IDs using the appropriate tokenizer (in this case, the Llama 3.2 1B tokenizer), which maps each ID back to its actual text token. Then we convert log probabilities to probabilities by applying the exponential function, translating these values into more intuitive probabilities between 0 and 1. We have implemented these transformations with custom code (a sketch follows) to produce a human-readable format where each token appears alongside its probability, making the model’s confidence in its predictions immediately clear.
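
The following is a minimal sketch of that post-processing step, assuming the matching Hugging Face tokenizer for the base model is available locally; the model ID and the helper function below are illustrative, not part of the Bedrock response:

import math
from transformers import AutoTokenizer

# Assumes access to the tokenizer matching the imported base model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

def decode_logprobs(entries):
    """Map {token_id: logprob} dicts to readable tokens with probabilities."""
    decoded = []
    for entry in entries:
        if entry is None:  # the first prompt position has no preceding context
            decoded.append(None)
            continue
        decoded.append({
            token_id: f"'{tokenizer.decode([int(token_id)])}' (p={math.exp(lp):.4f})"
            for token_id, lp in entry.items()
        })
    return decoded

readable = dict(result,
                prompt_logprobs=decode_logprobs(result["prompt_logprobs"]),
                logprobs=decode_logprobs(result["logprobs"]))
print(readable)

Applied to the raw response above, this produces the readable form shown next: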

{'prompt_logprobs': [None,
  {'791': "'The' (p=0.0267)", '14924': "'Question' (p=0.3058)"},
  {'4062': "' quick' (p=0.0001)", '220': "' ' (p=0.0249)"},
  {'14198': "' brown' (p=0.0079)", '323': "' and' (p=0.1798)"},
  {'39935': "' fox' (p=0.9513)"},
  {'35308': "' jumps' (p=0.8116)"}],
 'generation': ' over the lazy dog',
 'prompt_token_count': 6,
 'generation_token_count': 5,
 'stop_reason': 'stop',
 'logprobs': [{'927': "' over' (p=0.9599)"},
  {'279': "' the' (p=0.9297)"},
  {'16053': "' lazy' (p=0.9801)"},
  {'5679': "' dog' (p=0.9630)"},
  {'627': "'.n' (p=0.3030)"}]}

Let’s break down what this tells us about the model’s internal processing:

  • generation: This is the actual text generated by the model (in our example, a continuation of the prompt we sent). This is the same field you would normally get from any model invocation.
  • prompt_token_count and generation_token_count: These indicate the number of tokens in the input prompt and in the output, respectively. In our example, the prompt was tokenized into six tokens, and the model generated five tokens in its completion.
  • stop_reason: The reason the generation stopped ("stop" means the model naturally stopped at a stop sequence or end-of-text, "length" means it hit the max token limit, and so on). In our case it shows "stop", indicating the model stopped on its own or because of the stop condition we provided.
  • prompt_logprobs: This array provides log probabilities for each token in the prompt. As the model processes your input, it continuously predicts what should come next based on what it has seen so far. These values measure which tokens in your prompt were expected or surprising to the model.
    • The first entry is None because the very first token has no preceding context; the model cannot predict anything without prior input. Each subsequent entry contains token IDs mapped to their log probabilities. We have converted these IDs to readable text and transformed the log probabilities into percentages for easier understanding.
    • You can observe the model’s growing confidence as it processes familiar sequences. For example, after seeing The quick brown, the model predicted fox with 95.1% confidence. After seeing the full context up to fox, it predicted jumps with 81.1% confidence.
    • Many positions show multiple tokens with their probabilities, revealing alternatives the model considered. For instance, at the second position, the model evaluated both The (2.7%) and Question (30.6%), which suggests the model considered both tokens viable at that position. This added visibility helps you understand where the model weighed alternatives and can reveal when it was more uncertain or had difficulty choosing among several options.
    • Notably low probabilities appear for some tokens—quick received just 0.01%—indicating the model found those words unexpected in their context.
    • The overall pattern tells a clear story: individual words initially received low probabilities, but as the complete quick brown fox jumps phrase emerged, the model’s confidence increased dramatically, showing it recognized the phrase as a familiar expression.
    • When multiple tokens in your prompt consistently receive low probabilities, your phrasing might be unusual for the model. This uncertainty can affect the quality of completions. Using these insights, you can reformulate prompts to better align with patterns the model encountered in its training data.
  • logprobs: This array contains log probabilities for each token in the model’s generated output. The format is similar: a dictionary mapping token IDs to their corresponding log probabilities.
    • After decoding these values, we can see that the tokens over, the, lazy, and dog all have high probabilities. This demonstrates that the model recognized it was completing the well-known phrase the quick brown fox jumps over the lazy dog—a common pangram it appears to know well.
    • In contrast, the final '.\n' token has a much lower probability (30.3%), revealing the model’s uncertainty about how to conclude the sentence. This makes sense because the model had several valid options: ending the sentence with a period, continuing with additional content, or choosing another punctuation mark altogether.

Practical use cases of log probabilities

Token-level log probabilities from the Custom Model Import feature provide valuable insights into your model’s decision-making process. These metrics transform how you interact with your custom models by revealing their confidence levels for each generated token. Here are some impactful ways to use these insights:

Ranking multiple completions

You can use log probabilities to quantitatively rank multiple generated outputs for the same prompt. When your application needs to choose between different possible completions—whether for summarization, translation, or creative writing—you can calculate each completion’s overall likelihood by averaging or summing the log probabilities across all its tokens.

Example:

Prompt: Translate the phrase "Battre le fer pendant qu'il est chaud"

  • Completion A: "Strike while the iron is hot" (average log probability: -0.39)
  • Completion B: "Beat the iron while it is hot." (average log probability: -0.46)

In this example, Completion A receives a higher log probability score (closer to zero), indicating the model found this idiomatic translation more natural than the more literal Completion B. This numerical approach enables your application to automatically select the most probable output or present multiple candidates ranked by the model’s confidence level.

This ranking capability extends beyond translation to many scenarios where multiple valid outputs exist—including content generation, code completion, and creative writing—providing an objective quality metric based on the model’s confidence rather than relying solely on subjective human judgment.
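
A minimal sketch of this scoring pattern follows; the helper name, candidate count, and the reuse of bedrock_runtime and model_arn from the earlier example are assumptions rather than a prescribed API:

def rank_completions(prompt, n_candidates=3):
    """Generate several candidates and rank them by mean token log probability."""
    scored = []
    for _ in range(n_candidates):
        response = bedrock_runtime.invoke_model(
            modelId=model_arn,
            body=json.dumps({
                "prompt": prompt,
                "max_gen_len": 50,
                "temperature": 0.8,   # some diversity between candidates
                "return_logprobs": True,
            }),
            contentType="application/json",
            accept="application/json",
        )
        out = json.loads(response["body"].read())
        logprobs = [lp for entry in out["logprobs"] for lp in entry.values()]
        scored.append((sum(logprobs) / len(logprobs), out["generation"]))
    return sorted(scored, reverse=True)  # averages closest to zero first

candidates = rank_completions('Translate the phrase "Battre le fer pendant qu\'il est chaud"')
best_avg_logprob, best_text = candidates[0]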

Detecting hallucinations and low-confidence answers

Models can produce hallucinations—plausible-sounding but factually incorrect statements—when handling ambiguous prompts, complex queries, or topics outside their expertise. Log probabilities provide a practical way to detect these situations by revealing the model’s internal uncertainty, helping you identify potentially inaccurate information even when the output appears confident.

By analyzing token-level log probabilities, you can identify which parts of a response the model was likely uncertain about, even when the text appears confident on the surface. This capability is especially valuable in retrieval-augmented generation (RAG) systems, where responses need to be grounded in retrieved context. When a model has relevant information available, it typically generates answers with higher confidence. Conversely, low confidence across multiple tokens suggests the model might be generating content without sufficient supporting information.

Example:

  • Prompt:
    "Explain how the Portfolio Synergy Quotient (PSQ) is used in multi-asset investment
     strategies?"

  • Model output:
    "The PSQ is a measure of the diversification benefits of combining different asset
     classes in a portfolio."

In this example, we deliberately asked about a fictional metric—the Portfolio Synergy Quotient (PSQ)—to demonstrate how log probabilities reveal uncertainty in model responses. Despite generating a professional-sounding definition for this non-existent financial concept, the token-level confidence scores tell a revealing story. The confidence scores shown below are derived by applying the exponential function to the log probabilities returned by the model.

  • PSQ shows medium confidence (63.8%), indicating that the model recognized the acronym format but wasn’t highly certain about this specific term.
  • Common finance terminology like classes (98.2%) and portfolio (92.8%) exhibits high confidence, likely because these are standard concepts widely used in financial contexts.
  • Essential connecting concepts show notably low confidence: measure (14.0%) and diversification (31.8%) reveal the model’s uncertainty when attempting to explain what PSQ means or does.
  • Functional words like is (45.9%) and of (56.6%) hover in the medium confidence range, suggesting uncertainty about the overall structure of the explanation.

By identifying these low-confidence segments, you can implement targeted safeguards in your applications—such as flagging content for verification, retrieving additional context, generating clarifying questions, or applying confidence thresholds for sensitive information. This approach helps create more reliable AI systems that can distinguish between high-confidence facts and uncertain responses.
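
As a sketch of such a safeguard, the following filter surfaces generated tokens whose probability falls below a threshold; the threshold value is illustrative, and the code reuses the tokenizer and raw response from the earlier examples:

import math

CONFIDENCE_THRESHOLD = 0.35  # illustrative; tune per model and use case

def flag_low_confidence_tokens(out, tokenizer, threshold=CONFIDENCE_THRESHOLD):
    """Return generated tokens whose probability falls below the threshold."""
    flagged = []
    for entry in out["logprobs"]:
        for token_id, logprob in entry.items():
            probability = math.exp(logprob)
            if probability < threshold:
                flagged.append((tokenizer.decode([int(token_id)]), probability))
    return flagged

for token, probability in flag_low_confidence_tokens(result, tokenizer):
    print(f"verify: {token!r} (p={probability:.3f})")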

Monitoring prompt quality

When engineering prompts for your application, log probabilities reveal how well the model understands your instructions. If the first few generated tokens show unusually low probabilities, it often signals that the model struggled to interpret what you are asking.

By monitoring the average log probability of the initial tokens—typically the first 5–10 generated tokens—you can quantitatively measure prompt clarity. Well-structured prompts with clear context typically produce higher probabilities because the model immediately knows what to do. Vague or underspecified prompts often yield lower initial token likelihoods as the model hesitates or searches for direction.

Example:

Prompt comparison for customer service responses:

  • Basic prompt:
    "Write a response to this customer complaint: I ordered a laptop 2 weeks ago and it
     still hasn't arrived."

    • Average log probability of first 5 tokens: -1.215 (lower confidence)
  • Optimized prompt:
    "You are a senior customer service manager with expertise in conflict resolution and
     customer retention. You work for a reputable electronics retailer that values
     customer satisfaction above all else. Your task is to respond to the following
     customer complaint with professionalism and empathy.
     Customer Complaint: I ordered a laptop 2 weeks ago and it still hasn't arrived."

    • Average log probability of first 5 tokens: -0.333 (higher confidence)

The optimized prompt generates higher log probabilities, demonstrating that precise instructions and clear context reduce the model’s uncertainty. Rather than making absolute judgments about prompt quality, this approach lets you measure relative improvement between versions. You can directly observe how specific elements—role definitions, contextual details, and explicit expectations—increase model confidence. By systematically measuring these confidence scores across different prompt iterations, you build a quantitative framework for prompt engineering that reveals exactly when and how your instructions become unclear to the model, enabling continuous data-driven refinement.
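
This comparison can be scripted; the sketch below (the helper name and the five-token window are illustrative choices) scores a prompt by the mean log probability of its first few generated tokens:

def initial_token_confidence(prompt, n_tokens=5):
    """Average log probability of the first n generated tokens."""
    response = bedrock_runtime.invoke_model(
        modelId=model_arn,
        body=json.dumps({
            "prompt": prompt,
            "max_gen_len": n_tokens,
            "temperature": 0.0,   # deterministic output for a fair comparison
            "return_logprobs": True,
        }),
        contentType="application/json",
        accept="application/json",
    )
    out = json.loads(response["body"].read())
    logprobs = [lp for entry in out["logprobs"] for lp in entry.values()]
    return sum(logprobs) / len(logprobs)

score = initial_token_confidence(
    "Write a response to this customer complaint: I ordered a laptop "
    "2 weeks ago and it still hasn't arrived.")
print(f"Average log probability of first 5 tokens: {score:.3f}")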

Reducing RAG costs with early pruning

In traditional RAG implementations, systems retrieve 5–20 documents and generate full responses using those retrieved contexts. This approach drives up inference costs because every retrieved context consumes tokens regardless of its actual usefulness.

Log probabilities enable a more cost-effective alternative through early pruning. Instead of immediately processing the retrieved documents in full:

  1. Generate draft responses based on each retrieved context
  2. Calculate the average log probability across these short drafts
  3. Rank contexts by their average log probability scores
  4. Discard low-scoring contexts that fall below a confidence threshold
  5. Generate the complete response using only the highest-confidence contexts

This approach works because contexts that contain relevant information produce higher log probabilities in the draft generation phase. When the model encounters helpful context, it generates text with greater confidence, reflected in log probabilities closer to zero. Conversely, irrelevant or tangential contexts produce more uncertain outputs with lower log probabilities.

By filtering contexts before full generation, you can reduce token consumption while maintaining or even improving answer quality. This shifts the process from a brute-force approach to a targeted pipeline that directs full generation only toward contexts where the model demonstrates genuine confidence in the source material.
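
A condensed sketch of this pipeline is shown below; the prompt template, draft length, and number of contexts to keep are illustrative assumptions:

def prune_contexts(question, contexts, draft_len=30, keep_top=3):
    """Score each retrieved context with a cheap draft; keep the best ones."""
    scored = []
    for context in contexts:
        response = bedrock_runtime.invoke_model(
            modelId=model_arn,
            body=json.dumps({
                "prompt": f"Context: {context}\n\nQuestion: {question}\n\nAnswer:",
                "max_gen_len": draft_len,   # short draft, not a full response
                "temperature": 0.0,
                "return_logprobs": True,
            }),
            contentType="application/json",
            accept="application/json",
        )
        out = json.loads(response["body"].read())
        logprobs = [lp for entry in out["logprobs"] for lp in entry.values()]
        scored.append((sum(logprobs) / len(logprobs), context))
    # Full generation then runs only on the highest-confidence contexts
    return [context for _, context in sorted(scored, reverse=True)[:keep_top]]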

Fine-tuning evaluation

Once you’ve fine-tuned a model for your specific domain, log probabilities offer a quantitative way to assess the effectiveness of your training. By analyzing confidence patterns in responses, you can determine whether your model has developed proper calibration—showing high confidence for correct domain-specific answers and appropriate uncertainty elsewhere.

A well-calibrated fine-tuned model should assign higher probabilities to accurate information within its specialized area while maintaining lower confidence when operating outside its training domain. Calibration problems appear in two main forms. Overconfidence occurs when the model assigns high probabilities to incorrect responses, suggesting it hasn’t properly learned the boundaries of its knowledge. Underconfidence manifests as consistently low probabilities despite producing accurate answers, indicating that training might not have sufficiently reinforced correct patterns.

By systematically testing your model across diverse scenarios and analyzing the log probabilities, you can identify areas that need additional training or detect potential biases in your current approach. This creates a data-driven feedback loop for iterative improvement, ensuring your model performs reliably within its intended scope while maintaining appropriate boundaries around its expertise.
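
One lightweight way to operationalize this check, sketched below with a hypothetical two-item evaluation set, is to compare mean token probability on in-domain versus out-of-domain prompts:

import math

# Hypothetical evaluation set: (prompt, is_in_domain) pairs
eval_set = [
    ("Explain how duration measures a bond's interest-rate sensitivity.", True),
    ("Describe the rules of cricket.", False),
]

def mean_token_probability(prompt):
    """Mean probability across the tokens of a short generation."""
    response = bedrock_runtime.invoke_model(
        modelId=model_arn,
        body=json.dumps({"prompt": prompt, "max_gen_len": 30,
                         "temperature": 0.0, "return_logprobs": True}),
        contentType="application/json",
        accept="application/json",
    )
    out = json.loads(response["body"].read())
    probs = [math.exp(lp) for entry in out["logprobs"] for lp in entry.values()]
    return sum(probs) / len(probs)

in_domain = [mean_token_probability(p) for p, in_dom in eval_set if in_dom]
out_of_domain = [mean_token_probability(p) for p, in_dom in eval_set if not in_dom]
# A well-calibrated fine-tune should score noticeably higher on the in-domain set
print(sum(in_domain) / len(in_domain), sum(out_of_domain) / len(out_of_domain))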

Getting started

Here’s how to start using log probabilities with models imported through the Amazon Bedrock Custom Model Import feature:

  • Enable log probabilities in your API calls: Add "return_logprobs": true to your request payload when invoking your custom imported model. This parameter works with both the InvokeModel and InvokeModelWithResponseStream APIs. Begin with familiar prompts to observe which tokens your model predicts with high confidence compared to which it finds surprising.
  • Analyze confidence patterns in your custom models: Examine how your fine-tuned or domain-adapted models respond to different inputs. The log probabilities reveal whether your model is appropriately calibrated for your specific domain—showing high confidence where it should be certain.
  • Develop confidence-aware applications: Implement practical use cases such as hallucination detection, response ranking, and content verification to make your applications more robust. For example, you can flag low-confidence sections of responses for human review or select the highest-confidence response from multiple generations.

Conclusion

Log probability support for Amazon Bedrock Custom Model Import offers enhanced visibility into model decision-making. This feature transforms previously opaque model behavior into quantifiable confidence metrics that developers can analyze and use.

Throughout this post, we’ve demonstrated how to enable log probabilities in your API calls, interpret the returned data, and use these insights for practical applications. From detecting potential hallucinations and ranking multiple completions to optimizing RAG systems and evaluating fine-tuning quality, log probabilities offer tangible benefits across diverse use cases.

For customers working with customized foundation models like Llama, Mistral, or Qwen, these insights address a fundamental challenge: understanding not just what a model generates, but how confident it is in its output. This distinction becomes critical when deploying AI in domains requiring high reliability—such as finance, healthcare, or enterprise applications—where incorrect outputs can have significant consequences.

By revealing confidence patterns across different types of queries, log probabilities let you assess how your model customizations have affected calibration, highlighting where your model excels and where it might need refinement. Whether you are evaluating fine-tuning effectiveness, debugging unexpected responses, or building systems that adapt to varying confidence levels, this capability represents an important advancement in bringing greater transparency and control to generative AI development on Amazon Bedrock.

We look forward to seeing how you use log probabilities to build more intelligent and trustworthy applications with your custom imported models. This capability reflects Amazon Bedrock’s commitment to providing developers with tools that enable confident innovation while delivering the scalability, security, and ease of a fully managed service.


About the authors

Manoj Selvakumar is a Generative AI Specialist Solutions Architect at AWS, where he helps organizations design, prototype, and scale AI-powered solutions in the cloud. With expertise in deep learning, scalable cloud-native systems, and multi-agent orchestration, he focuses on turning emerging innovations into production-ready architectures that drive measurable business value. He is passionate about making complex AI concepts practical and enabling customers to innovate responsibly at scale—from early experimentation to enterprise deployment. Before joining AWS, Manoj worked in consulting, delivering data science and AI solutions for enterprise clients, building end-to-end machine learning systems supported by strong MLOps practices for training, deployment, and monitoring in production.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on improving efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Revendra Kumar is a Senior Software Development Engineer at Amazon Web Services. In his current role, he focuses on model hosting and inference MLOps on Amazon Bedrock. Prior to this, he worked as an engineer on hosting quantum computers on the cloud and developing infrastructure solutions for on-premises cloud environments. Outside of his professional pursuits, Revendra enjoys staying active by playing tennis and hiking.
