How the Amazon.com Catalog Crew constructed self-learning generative AI at scale with Amazon Bedrock

The Amazon.com Catalog is the inspiration of each buyer’s procuring expertise—the definitive supply of product data with attributes that energy search, suggestions, and discovery. When a vendor lists a brand new product, the catalog system should extract structured attributes—dimensions, supplies, compatibility, and technical specs—whereas producing content material reminiscent of titles that match how clients search. A title isn’t a easy enumeration like colour or dimension; it should steadiness vendor intent, buyer search habits, and discoverability. This complexity, multiplied by tens of millions of day by day submissions, makes catalog enrichment a super proving floor for self-learning AI.

On this put up, we show how the Amazon Catalog Crew constructed a self-learning system that constantly improves accuracy whereas decreasing prices at scale utilizing Amazon Bedrock.

The problem

In generative AI deployment environments, enhancing mannequin efficiency requires fixed consideration. As a result of fashions course of tens of millions of merchandise, they inevitably encounter edge circumstances, evolving terminology, and domain-specific patterns the place accuracy might degrade. The normal method—utilized scientists analyzing failures, updating prompts, testing adjustments, and redeploying—works however is resource-intensive and struggles to maintain tempo with real-world quantity and selection. The problem isn’t whether or not we are able to enhance these programs, however the way to make enchancment scalable and computerized reasonably than depending on handbook intervention. At Amazon Catalog, we confronted this problem head-on. The tradeoffs appeared unattainable: giant fashions would ship accuracy however wouldn’t scale effectively to our quantity, whereas smaller fashions struggled with the advanced, ambiguous circumstances the place sellers wanted essentially the most assist.

Answer overview

Our breakthrough got here from an unconventional experiment. As an alternative of selecting a single mannequin, we deployed a number of smaller fashions to course of the identical merchandise. When these fashions agreed on an attribute extraction, we may belief the consequence. However after they disagreed—whether or not from real ambiguity, lacking context, or one mannequin making an error—we found one thing profound. These disagreements weren’t all the time errors, however they had been virtually all the time indicators of complexity. This led us to design a self-learning system that reimagines how generative AI scales. A number of smaller fashions course of routine circumstances by consensus, invoking bigger fashions solely when disagreements happen. The bigger mannequin is carried out as a supervisor agent with entry to specialised instruments for deeper investigation and evaluation. However the supervisor doesn’t simply resolve disputes; it generates reusable learnings saved in a dynamic information base that helps stop total courses of future disagreements. We invoke extra highly effective fashions solely when the system detects excessive studying worth at inference time, whereas correcting the output. The result’s a self-learning system the place prices lower and high quality will increase—as a result of the system learns to deal with edge circumstances that beforehand triggered supervisor calls. Error charges fell constantly, not by retraining however by gathered learnings from resolved disagreements injected into smaller mannequin prompts. The next determine reveals the structure of this self-learning system.

Within the self-learning structure, product information flows by generator-evaluator employees, with disagreements routed to a supervisor for investigation. Submit-inference, the system additionally captures suggestions indicators from sellers (reminiscent of itemizing updates and appeals) and clients (reminiscent of returns and unfavourable opinions). Learnings from the sources are saved in a hierarchical information base and injected again into employee prompts, making a steady enchancment loop.

The next describes a simplified reference structure that demonstrates how this self-learning sample may be carried out utilizing AWS providers. Whereas our manufacturing system has extra complexity, this instance illustrates the core parts and information flows.

This method may be constructed with Amazon Bedrock, which offers the important infrastructure for multi-model architectures. The power of Amazon Bedrock to entry various basis fashions permits groups to deploy smaller, environment friendly fashions like Amazon Nova Lite as employees and extra succesful fashions like Anthropic Claude Sonnet as supervisors—optimizing each value and efficiency. For even larger value effectivity at scale, groups can even deploy open supply small fashions on Amazon Elastic Compute Cloud (Amazon EC2) GPU cases, offering full management over employee mannequin choice and batch throughput optimization. For productionizing a supervisor agent with its specialised instruments and dynamic information base, Bedrock AgentCore offers the runtime scalability, reminiscence administration, and observability wanted to deploy self-learning programs reliably at scale.

Our supervisor agent integrates with Amazon’s in depth Choice and Catalog Techniques. The above diagram is a simplified view displaying the important thing options of the agent and a few of the AWS providers that make it doable. Product information flows by generator-evaluator employees (Amazon EC2 and Amazon Bedrock Runtime), with agreements saved instantly and disagreements routed to a supervisor agent (Bedrock AgentCore). The educational aggregator and reminiscence supervisor make the most of Amazon DynamoDB for the information base, with learnings injected again into employee prompts. Human assessment (Amazon Easy Queue Service (Amazon SQS)) and observability (Amazon CloudWatch) full the structure. Manufacturing implementations will doubtless require extra parts for scale, reliability, and integration with current programs.

However how did we arrive at this structure? The important thing perception got here from an surprising place.

The perception: Turning disagreements into alternatives

Our perspective shifted throughout a debugging session. When a number of smaller fashions (reminiscent of Nova Lite) disagreed on product attributes—deciphering the identical specification in a different way based mostly on how they understood technical terminology—we initially noticed this as a failure. However the information advised a distinct story: merchandise the place our smaller fashions disagreed correlated with circumstances requiring extra handbook assessment and clarification. When fashions disagreed, these had been exactly the merchandise that wanted extra investigation. The disagreements had been surfacing studying alternatives, however we couldn’t have engineers and scientists deep-dive on each case. The supervisor agent does this routinely at scale. And crucially, the aim isn’t simply to find out which mannequin was proper—it’s to extract learnings that assist stop related disagreements sooner or later. That is the important thing to environment friendly scaling. Disagreements don’t simply come from AI employees at inference time. Submit-inference, sellers categorical disagreement by itemizing updates and appeals—indicators that our authentic extraction may need missed vital context. Prospects disagree by returns and unfavourable opinions, usually indicating that product data didn’t match expectations. These post-inference human indicators feed into the identical studying pipeline, with the supervisor investigating patterns and producing learnings that assist stop related points throughout future merchandise. We discovered a candy spot: attributes with average AI employee disagreement charges yielded the richest learnings—excessive sufficient to floor significant patterns, low sufficient to point solvable ambiguity. When disagreement charges are too low, they sometimes replicate noise or elementary mannequin limitations reasonably than learnable patterns—for these, we think about using extra succesful employees. When disagreement charges are too excessive, it indicators that employee fashions or prompts aren’t but mature sufficient, triggering extreme supervisor calls that undermine the effectivity beneficial properties of the structure. These thresholds will differ by job and area; the secret’s figuring out your personal candy spot the place disagreements symbolize real complexity value investigating, reasonably than elementary gaps in employee functionality or random noise.

Deep dive: The way it works

On the coronary heart of our system are a number of light-weight employee fashions working in parallel—some as turbines extracting attributes, others as evaluators assessing these extractions. These employees may be carried out in a non-agentic approach with fastened inputs, making them batch-friendly and scalable. The generator-evaluator sample creates productive stress, conceptually much like the productive stress in generative adversarial networks (GANs), although our method operates at inference time by prompting reasonably than coaching. We explicitly immediate evaluators to be vital, instructing them to scrutinize extractions for ambiguities, lacking context, or potential misinterpretations. This adversarial dynamic surfaces disagreements that symbolize real complexity reasonably than letting ambiguous circumstances move by undetected. When the generator and evaluator agree, now we have excessive confidence within the consequence and course of it at minimal computational value. This consensus path handles most product attributes. After they disagree, we’ve recognized a case value investigating—triggering the supervisor to resolve the dispute and extract reusable learnings.

Our structure treats disagreement as a common studying sign. At inference time, worker-to-worker disagreements catch ambiguity. Submit-inference, vendor suggestions catches misalignments with intent and buyer suggestions catches misalignments with expectations. The three channels feed the supervisor, which extracts learnings that enhance accuracy throughout the board. When employees disagree, we invoke a supervisor agent—a extra succesful mannequin that resolves the dispute and investigates why it occurred. The supervisor determines what context or reasoning the employees lacked, and these insights change into reusable learnings for future circumstances. For instance, when employees disagreed about utilization classification for a product based mostly on sure technical phrases, the supervisor investigated and clarified that these phrases alone had been inadequate—visible context and different indicators wanted to be thought of collectively. The supervisor generated a studying about the way to correctly weight completely different indicators for that product class. This studying instantly up to date our information base, and when injected into employee prompts for related merchandise, helped stop future disagreements throughout 1000’s of things. Whereas the employees may theoretically be the identical mannequin because the supervisor, utilizing smaller fashions is essential for effectivity at scale. The architectural benefit emerges from this asymmetry: light-weight employees deal with routine circumstances by consensus, whereas the extra succesful supervisor is invoked solely when disagreements floor high-value studying alternatives. Because the system accumulates learnings and disagreement charges drop, supervisor calls naturally decline—effectivity beneficial properties are baked instantly into the structure. This worker-supervisor heterogeneity additionally permits richer investigation. As a result of supervisors are invoked selectively, they will afford to tug in extra indicators—buyer opinions, return causes, vendor historical past—that will be impractical to retrieve for each product however present essential context when resolving advanced disagreements. When these indicators yield generalizable insights about how clients need product data introduced—which attributes to spotlight, what terminology resonates, the way to body specs—the ensuing learnings profit future inferences throughout related merchandise with out retrieving these resource-intensive indicators once more. Over time, this creates a suggestions loop: higher product data results in fewer returns and unfavourable opinions, which in flip displays improved buyer satisfaction.

The information base: Making learnings scalable

The supervisor investigates disagreements on the particular person product stage. With tens of millions of things to course of, we’d like a scalable option to remodel these product-specific insights into reusable learnings. Our aggregation technique adapts to context: high-volume patterns get synthesized into broader learnings, whereas distinctive or vital circumstances are preserved individually. We use a hierarchical construction the place a big language mannequin (LLM)-based reminiscence supervisor navigates the information tree to put every studying. Ranging from the basis, it traverses classes and subcategories, deciding at every stage whether or not to proceed down an current path, create a brand new department, merge with current information, or exchange outdated data. This dynamic group permits the information base to evolve with rising patterns whereas sustaining logical construction. Throughout inference, employees obtain related learnings of their prompts based mostly on product class, routinely incorporating area information from previous disagreements. The information base additionally introduces traceability—when an extraction appears incorrect, we are able to pinpoint precisely which studying influenced it. This shifts auditing from an unscalable job to a sensible one: as a substitute of reviewing a pattern of tens of millions of outputs—the place human effort grows proportionally with scale—groups can audit the information base itself, which stays comparatively fastened in dimension no matter inference quantity. Area specialists can instantly contribute by including or refining entries, no retraining required. A single well-crafted studying can instantly enhance accuracy throughout 1000’s of merchandise. The information base bridges human experience and AI functionality, the place automated learnings and human insights work collectively.

Classes discovered and finest practices

When this self-learning structure works finest:

Excessive-volume inference the place enter variety drives compounded studying
High quality-critical purposes the place consensus offers pure high quality assurance
Evolving domains with new patterns and terminology continuously rising

It’s much less appropriate for low-volume situations (inadequate disagreements for studying) or use circumstances with fastened, unchanging guidelines.

Crucial success elements:

Defining disagreements: With a generator-evaluator pair, disagreement happens when the evaluator flags the extraction as needing enchancment. With a number of employees, scale thresholds accordingly. The bottom line is sustaining productive stress between employees. If disagreement charges fall outdoors the productive vary (too low or too excessive), think about extra succesful employees or refined prompts.
Monitoring studying effectiveness: Disagreement charges should lower over time—that is your major well being metric. If charges keep flat, test information retrieval, immediate injection, or evaluator criticality.
Information group: Construction learnings hierarchically and maintain them actionable. Summary steerage doesn’t assist; particular, concrete learnings instantly enhance future inferences.

Widespread pitfalls

Specializing in value over intelligence: Price discount is a byproduct, not the aim
Rubber-stamp evaluators: Evaluators that merely approve generator outputs gained’t floor significant disagreements—immediate them to actively problem and critique extractions
Poor studying extraction: Supervisors should determine generalizable patterns, not simply repair particular person circumstances
Information rot: With out group, learnings change into unsearchable and unusable

The important thing perception: deal with declining disagreement charges as your north star metric—they present the system is really studying.

Deployment methods: Two approaches

Study-then-deploy: Begin with primary prompts and let the system study aggressively in a pre-production setting. Area specialists then audit the information base—not particular person outputs—to ensure discovered patterns align with desired outcomes. When accredited, deploy with validated learnings. That is preferrred for brand new use circumstances the place you don’t but know what good appears to be like like—disagreements assist uncover the appropriate patterns, and information base auditing allows you to form them earlier than manufacturing.
Deploy-and-learn: Begin with refined prompts and good preliminary high quality, then constantly enhance by ongoing studying in manufacturing. This works finest for well-understood use circumstances the place you may outline high quality upfront however nonetheless need to seize domain-specific nuances over time.

Each approaches use the identical structure—the selection relies on whether or not you’re exploring new territory or optimizing acquainted floor.

Conclusion

What began as an experiment in catalog enrichment revealed a elementary reality: AI programs don’t must be frozen in time. By embracing disagreements as studying indicators reasonably than failures, we’ve constructed an structure that accumulates area information by precise utilization. We watched the system evolve from generic understanding to domain-specific experience. It discovered industry-specific terminology. It found contextual guidelines that modify throughout classes. It tailored to necessities no pre-trained mannequin would encounter—all with out retraining, by learnings saved in a information base and injected again into employee prompts. For groups operationalizing related architectures, Amazon Bedrock AgentCore provides purpose-built capabilities:

AgentCore Runtime handles fast consensus selections for routine circumstances whereas supporting prolonged reasoning when supervisors examine advanced disagreements
AgentCore Observability offers visibility into which learnings drive impression, serving to groups refine information propagation and keep reliability at scale

The implications lengthen past catalog administration. Excessive-volume AI purposes may gain advantage from this course of—and the flexibility of Amazon Bedrock to entry various fashions makes this structure simple to implement. The important thing perception is that this: we’ve shifted from asking “which mannequin ought to we use?” to “how can we construct programs that study our particular patterns? “Whether or not you learn-then-deploy for brand new use circumstances or deploy-and-learn for established ones, the implementation is easy: begin with employees suited to your job, select a supervisor, and let disagreements drive studying. With the appropriate structure, each inference can change into a possibility to seize area information. That’s not simply scaling—that’s constructing institutional information into your AI programs.

Acknowledgement

This work wouldn’t have been doable with out the contributions and assist from Ankur Datta (Senior Principal Utilized Scientist – chief of science in On a regular basis Necessities Shops), Zhu Cheng (Utilized Scientist), Xuan Tang (Software program Engineer), Mohammad Ghasemi (Utilized Scientist). We sincerely admire the contributions in designs, implementations, quite a few fruitful brain-storming periods, and all of the insightful concepts and strategies.

In regards to the authors

Tarik Arici is a Principal Scientist at Amazon Choice and Catalog Techniques (ASCS), the place he pioneers self-learning generative AI programs design for catalog high quality enhancement at scale. His work focuses on constructing AI programs that routinely accumulate area information by manufacturing utilization—studying from buyer opinions and returns, vendor suggestions, and mannequin disagreements to enhance high quality whereas decreasing prices. Tarik holds a PhD in Electrical and Pc Engineering from Georgia Institute of Know-how.

Sameer Thombare is a Senior Product Supervisor at Amazon with over a decade of expertise in Product Administration, Class/P&L Administration throughout various industries, together with heavy engineering, telecommunications, finance, and eCommerce. Sameer is obsessed with growing constantly enhancing closed-loop programs and leads strategic initiatives inside Amazon Choice and Catalog Techniques (ASCS) to construct a classy self-learning closed-loop system that synthesize indicators from clients, sellers, and provide chain operations to optimize outcomes. Sameer holds an MBA from the Indian Institute of Administration Bangalore and an engineering diploma from Mumbai College.

Amin Banitalebi acquired his PhD within the Digital Media on the College of British Columbia (UBC), Canada, in 2014. Since then, he has taken numerous utilized science roles spanning over areas in pc imaginative and prescient, pure language processing, suggestion programs, classical machine studying, and generative AI. Amin has co-authored over 90 publications and patents. He’s presently an Utilized Science Supervisor in Amazon On a regular basis Necessities.

Puneet Sahni is a Senior Principal Engineer at Amazon Choice and Catalog Techniques (ASCS), the place he has spent over 8 years enhancing the completeness, consistency, and correctness of catalog information. He focuses on catalog information modeling and its software to enhancing Promoting Accomplice and buyer experiences, whereas utilizing ML/DL and LLM-based enrichment to drive enhancements in catalog information high quality.

Erdinc Basci joined Amazon in 2015 and brings over 23 years of expertise {industry} expertise. At Amazon, he has led the evolution of Catalog system architectures—together with ingestion pipelines, prioritized processing, and site visitors shaping—in addition to catalog information structure enhancements reminiscent of segmented provides, product specs for manufacture-on-demand merchandise, and catalog information experimentation. Erdinc has championed a hands-on efficiency engineering tradition throughout Amazon providers unlocking $1B+ annualized value financial savings and 20%+ latency wins throughout core Shops providers. He’s presently targeted on enhancing generative AI software efficiency and GPU effectivity throughout Amazon. Erdinc holds a BS in Pc Science from Bilkent College, Turkey, and an MBA from Seattle College, US.

Mey Meenakshisundaram is a Director in Amazon Choice and Catalog Techniques, the place he leads modern GenAI options to ascertain Amazon’s worldwide catalog because the best-in-class supply for product data. His staff pioneers superior machine studying strategies, together with multi-agent programs and huge language fashions, to routinely enrich product attributes and enhance catalog high quality at scale. Excessive-quality product data within the catalog is vital for delighting clients to find the appropriate merchandise, empowering promoting companions to checklist their merchandise successfully, and enabling Amazon operations to scale back handbook effort.

How the Amazon.com Catalog Crew constructed self-learning generative AI at scale with Amazon Bedrock

The problem

Answer overview

The perception: Turning disagreements into alternatives

Deep dive: The way it works

The information base: Making learnings scalable

Classes discovered and finest practices

Conclusion

In regards to the authors

Stablecoins will rise in Africa as remittances outpace support, former UN official says

Individuals with shade blindness are 52% extra prone to die from bladder most cancers

Converter

Editors Pick

Newsletter

Categories

Related Posts

Leave a Comment Cancel Reply