Optimizing a model for video semantic search requires balancing accuracy, cost, and latency. Faster, smaller models lack routing intelligence, while larger, more precise models add significant latency overhead. In Part 1 of this series, you learned how to build a multimodal video semantic search system with intelligent intent routing on AWS using Anthropic's Claude Haiku model on Amazon Bedrock. The Haiku model provides high accuracy for user search intent, but it increases end-to-end search time to 2–4 seconds, contributing about 75% of the overall delay.
Figure 1: Example end-to-end query delay breakdown
Now consider what happens when the routing logic becomes more complex. Enterprise metadata can be far more complex than the five attributes in this example (title, caption, person, genre, and timestamp). Customers may care about camera angles, mood and sentiment, license and rights windows, and other domain-specific classifications. More subtle logic means more demanding prompts, and more demanding prompts lead to more expensive and slower responses. This is where model customization comes into play. Rather than choosing between a model that is fast but too simple, or accurate but expensive and slow, you can achieve all three by training smaller models to perform the task accurately with much lower latency and cost.
This post shows you how to use Amazon Bedrock's model customization technique, Model Distillation, to transfer routing intelligence from a large teacher model (Amazon Nova Premier) to a much smaller student model (Amazon Nova Micro). This approach reduces inference cost by more than 95% and latency by 50% while maintaining the nuanced routing quality the task requires.
Solution overview
We walk through a complete distillation pipeline end-to-end in a Jupyter notebook. Broadly speaking, the notebook contains the following steps:
- Prepare training data — Create 10,000 synthetic labeled samples using Nova Premier and upload the dataset to Amazon Simple Storage Service (Amazon S3) in the Bedrock distillation format
- Run a distillation training job — Configure the job with teacher and student model identifiers and submit it through Amazon Bedrock
- Deploy the distilled model — Deploy the custom model using on-demand inference for flexible pay-as-you-go access
- Evaluate the distilled model — Compare routing quality to the base Nova Micro and original Claude Haiku baselines using Amazon Bedrock Model Evaluation
The complete notebooks, training data generation scripts, and evaluation utilities are available in the GitHub repository.
Prepare training data
One of the primary reasons we chose model distillation over other customization techniques such as supervised fine-tuning (SFT) is that it does not require a fully labeled dataset. SFT requires every training sample to have a human-generated response as ground truth. For distillation, all you need is a prompt: Amazon Bedrock automatically calls the teacher model to generate high-quality responses, and it applies data synthesis and augmentation techniques behind the scenes to produce a diverse training dataset of up to 15,000 prompt-response pairs.
However, if you want more control over the training signal, you can optionally provide a labeled dataset. Each record in the JSONL file follows the bedrock-conversation-2024 schema. In this schema, the user role (input prompt) is required and the assistant role (desired response) is optional. See the following example. For more information, see Prepare the training dataset for distillation.
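To make the schema concrete, here is a minimal sketch of one labeled record in the bedrock-conversation-2024 format. The system text and the weight values are illustrative, not taken from the actual training set:

```python
import json

# One labeled training record in the bedrock-conversation-2024 schema.
# The assistant turn is optional: omit it and Amazon Bedrock generates the
# response with the teacher model instead.
record = {
    "schemaVersion": "bedrock-conversation-2024",
    "system": [{"text": "You are an intent router for video semantic search. "
                        "Return JSON weights for visual, audio, transcription, and metadata."}],
    "messages": [
        {"role": "user", "content": [{"text": "sunset over mountains"}]},
        {"role": "assistant", "content": [{"text": json.dumps({
            "visual": 0.8, "audio": 0.0, "transcription": 0.0, "metadata": 0.2,
            "reasoning": "Purely visual scene; metadata may add location tags."})}]},
    ],
}
print(json.dumps(record))  # one line of the JSONL training file
```

Each line of the uploaded JSONL file is one such record; the assistant content embeds the routing JSON as a string so the student learns to emit the exact output format.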
In this post, we prepared 10,000 synthetic labeled samples using Nova Premier, the largest and most capable model in the Nova family. The data was generated with a balanced distribution across visual, audio, transcription, and metadata signal queries. The examples cover the full range of expected search inputs, represent different levels of difficulty, include edge cases and variations, and prevent overfitting to narrow query patterns. The following graph shows the distribution of weights across the four modality channels.
Figure 2: Weight distribution over 10,000 training examples
If you need additional examples or want to adapt the query distribution to your own content domain, the provided generate_training_data.py script lets you synthetically generate more training data using Nova Premier.
Run a distillation training job
After you upload your training data to Amazon S3, the next step is to submit a distillation job. Model distillation works by first generating responses to your prompts with the teacher model, then using those prompt-response pairs to fine-tune the student model. The teacher for this project is Amazon Nova Premier and the student is Amazon Nova Micro, a fast and cost-effective model optimized for high-throughput inference. The teacher's routing decisions become the training signals that shape the student's behavior.
Amazon Bedrock automatically manages the entire training orchestration and infrastructure. There is no need to provision clusters, tune hyperparameters, or set up teacher-to-student model pipelines. Specify the teacher model, student model, S3 path to the training data, and an AWS Identity and Access Management (IAM) role with the required permissions; Bedrock takes care of the rest. Below is an example code snippet that triggers a distillation training job.
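As a sketch of such a snippet: the ARNs, bucket paths, and names below are placeholders for your own resources, and the parameter shape follows the Bedrock CreateModelCustomizationJob API for distillation jobs at the time of writing, so verify it against the current SDK documentation:

```python
# Placeholder ARNs, bucket paths, and names: substitute your own resources.
job_config = {
    "jobName": "video-router-distillation",
    "customModelName": "video-router-nova-micro",
    "roleArn": "arn:aws:iam::111122223333:role/BedrockDistillationRole",
    "baseModelIdentifier": "amazon.nova-micro-v1:0",  # student model
    "customizationType": "DISTILLATION",
    "trainingDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/distillation/train.jsonl"},
    "outputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/distillation/output/"},
    "customizationConfig": {
        "distillationConfig": {
            "teacherModelConfig": {  # teacher model and its generation budget
                "teacherModelIdentifier": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-premier-v1:0",
                "maxResponseLengthForInference": 1000,
            }
        }
    },
}

def submit_distillation_job() -> str:
    """Submit the distillation job and return its ARN (requires AWS credentials)."""
    import boto3
    bedrock = boto3.client("bedrock", region_name="us-east-1")
    return bedrock.create_model_customization_job(**job_config)["jobArn"]
```

The returned job ARN is what you poll for status in the monitoring snippet that follows.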
Jobs run asynchronously. You can monitor progress in the Amazon Bedrock console under Foundation models > Custom models, or programmatically:
```python
status = bedrock_client.get_model_customization_job(
    jobIdentifier=job_arn)["status"]
print(f"Job status: {status}")  # InProgress, Completed, or Failed
```
Training time depends on the size of your dataset and the student model you choose. For 10,000 labeled samples using Nova Micro, expect the job to complete within a few hours.
Deploy the distilled model
Once the distillation job is complete, your custom model is available in your Amazon Bedrock account and ready for deployment. Amazon Bedrock offers two deployment options for custom models: provisioned throughput for large, predictable workloads, and on-demand inference for flexible pay-as-you-go access with no upfront commitments.
For most teams just starting out, on-demand inference is the recommended path. There are no endpoints to provision, no hourly commitments, and no minimum usage requirements. The deployment code is as follows:
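The following is a minimal sketch: the deployment name is a placeholder, and CreateCustomModelDeployment is the on-demand deployment API for custom models at the time of writing, so check the current Bedrock SDK for exact parameter names:

```python
def deploy_on_demand(custom_model_arn: str) -> str:
    """Create an on-demand deployment for the distilled model and return its ARN."""
    import boto3
    bedrock = boto3.client("bedrock", region_name="us-east-1")
    response = bedrock.create_custom_model_deployment(
        modelDeploymentName="video-router-micro",  # placeholder name
        modelArn=custom_model_arn,
        description="Distilled Nova Micro intent router",
    )
    return response["customModelDeploymentArn"]
```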
When the status shows InService, you can invoke the distilled model just like any other base model using the standard InvokeModel or Converse API. You pay only for the tokens you consume, at Nova Micro's inference price of $0.000035 per 1,000 input tokens and $0.000140 per 1,000 output tokens.
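For example, a sketch of invoking the router through the Converse API and validating its output (the deployment ARN and inference settings are placeholders, and the helper reflects the four-weight schema this solution expects):

```python
import json

def route_query(deployment_arn: str, query: str) -> dict:
    """Send a search query to the distilled router and parse its JSON weights."""
    import boto3
    runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = runtime.converse(
        modelId=deployment_arn,  # on-demand custom models are invoked by their deployment ARN
        messages=[{"role": "user", "content": [{"text": query}]}],
        inferenceConfig={"maxTokens": 300, "temperature": 0.0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

def weights_are_valid(routing: dict) -> bool:
    """Check that the four modality weights are present, numeric, and sum to 1.0."""
    keys = ("visual", "audio", "transcription", "metadata")
    try:
        return abs(sum(float(routing[k]) for k in keys) - 1.0) < 1e-6
    except (KeyError, TypeError, ValueError):
        return False
```

Validating the weight sum on every response is a cheap guard against the malformed outputs the base model produces, as shown in the evaluation below.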
Evaluate the distilled model
Before comparing against the original router, it is worth validating that distillation improves the base model's ability to follow the routing task. The following table shows the same prompts run side by side on the base Nova Micro and the distilled Nova Micro.
| Query | Distilled Nova Micro | Base Nova Micro |
| --- | --- | --- |
| "CEO discussing quarterly earnings" | `{"visual": 0.2, "audio": 0.3, "transcription": 0.4, "metadata": 0.1, "reasoning": "The query focuses on spoken content (transcription) about earnings, but visual cues (the CEO's appearance) and audio (tone/clarity) are also important..."}` | Free text, no parseable JSON: "Below is a JSON representation of the information requested in a video search query about CEOs discussing quarterly earnings." |
| "sunset over mountains" | `{"visual": 0.8, "audio": 0.0, "transcription": 0.0, "metadata": 0.2, "reasoning": "The query focuses on a visual scene (sunset over mountains), with no audio or transcription elements. Metadata might include location or time-related tags."}` | Free text, no parseable JSON: "Below is a JSON representation of the video search query 'sunset over mountains' including visual, audio, transcription, and metadata weights (sum=1.0), and reasoning." |
The base model struggles with both instructions and output format consistency. It produces free-text responses, incomplete JSON, and non-numeric weight values. The distilled model consistently returns well-formed JSON with four numeric weights that sum to 1.0, matching the schema required by the routing pipeline.
To compare against the original Claude Haiku router, we evaluate both models on a held-out set of 100 labeled examples generated by Nova Premier. We use Amazon Bedrock Model Evaluation to run the comparison in a structured, managed workflow. To assess routing quality beyond standard metrics, we defined a custom OverallQuality rubric (see the following code block) that instructs Claude Sonnet to score each prediction on two dimensions: weight accuracy against ground truth and reasoning quality. Each dimension maps to a concrete 5-point threshold, so the rubric penalizes both numerical drift and generic boilerplate reasoning.
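The exact rubric is in the accompanying repository; the following is an illustrative sketch of the shape a custom LLM-as-a-judge metric definition can take in Bedrock Model Evaluation. The field names and score thresholds here are assumptions to check against the current documentation:

```json
{
  "customMetricDefinition": {
    "name": "OverallQuality",
    "instructions": "Score the routing prediction from 1 to 5. First compare the four predicted modality weights to the ground truth, penalizing numerical drift. Then assess the reasoning, penalizing generic boilerplate that does not reference the query's actual modality cues.",
    "ratingScale": [
      {"definition": "Weights within 0.1 of ground truth and reasoning is query-specific", "value": {"floatValue": 5}},
      {"definition": "Weights within 0.2 and reasoning is mostly specific", "value": {"floatValue": 4}},
      {"definition": "Weights within 0.3 or reasoning is partially generic", "value": {"floatValue": 3}},
      {"definition": "Large weight drift or boilerplate reasoning", "value": {"floatValue": 2}},
      {"definition": "Malformed output or no meaningful reasoning", "value": {"floatValue": 1}}
    ]
  }
}
```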
The distilled Nova Micro model achieved an LLM-as-a-judge score of 4.0 out of 5, matching Claude 4.5 Haiku's routing quality but with about half the latency (833 ms vs. 1,741 ms). The cost benefits are equally important: switching to the distilled Nova Micro model reduces inference costs by 95% or more on both input and output tokens, with on-demand pricing and no upfront commitment. Note that LLM-as-a-judge evaluation is non-deterministic; scores may vary slightly from run to run.
Figure 3: Model performance comparison (Distilled Nova Micro vs. Claude 4.5 Haiku)
The following table summarizes the results side by side.
| Metric | Distilled Nova Micro | Claude 4.5 Haiku |
| --- | --- | --- |
| LLM-as-a-judge score | 4.0/5 | 4.0/5 |
| Average latency | 833 ms | 1,741 ms |
| Input token cost | $0.000035 / 1K | $0.0008–$0.001 / 1K |
| Output token cost | $0.000140 / 1K | $0.004–$0.005 / 1K |
| Output format | Consistent JSON | Inconsistent |
Cleanup
To avoid recurring charges, delete provisioned resources such as deployed model endpoints and data stored in Amazon S3.
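A minimal cleanup sketch, assuming an on-demand deployment and the S3 training artifacts from earlier steps; DeleteCustomModelDeployment is the deletion API at the time of writing, so adapt it to the resources you actually created:

```python
def cleanup(deployment_arn: str, bucket: str, prefix: str) -> None:
    """Delete the on-demand deployment and the training artifacts in S3."""
    import boto3
    bedrock = boto3.client("bedrock", region_name="us-east-1")
    bedrock.delete_custom_model_deployment(customModelDeploymentIdentifier=deployment_arn)
    s3 = boto3.resource("s3")
    s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()
```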
Conclusion
This post is the second part of a two-part series. Building on Part 1, it focuses on applying model distillation to optimize the intent routing layer built into video semantic search solutions. The techniques described here help address real-world operational tradeoffs, such as balancing routing intelligence with latency and cost at scale while maintaining search accuracy. By using Amazon Bedrock Model Distillation to distill Amazon Nova Premier's routing behavior into Amazon Nova Micro, we reduced inference costs by more than 95% and cut routing latency in half while maintaining the nuanced routing quality the task requires. If you are running multimodal video search at scale, model distillation is a practical way to achieve production-grade cost efficiency without sacrificing search accuracy. To explore the complete implementation, visit the GitHub repository and try the solution yourself.
About the author