At Amazon.ae, we serve roughly 10 million customers monthly across five countries in the Middle East and North Africa region: United Arab Emirates (UAE), Saudi Arabia, Egypt, Türkiye, and South Africa. Our AMET (Africa, Middle East, and Türkiye) Payments team manages payment options, transactions, experiences, and affordability features across these diverse countries, publishing on average five new features monthly. Each feature requires comprehensive test case generation, which traditionally consumed 1 week of manual effort per project. Our quality assurance (QA) engineers spent this time analyzing business requirement documents (BRDs), design documents, UI mocks, and historical test preparations, a process that required one full-time engineer yearly merely for test case creation.
To improve this manual process, we developed SAARAM (QA Lifecycle App), a multi-agent AI solution that reduces test case generation from 1 week to hours. Using Amazon Bedrock with Claude Sonnet by Anthropic and the Strands Agents SDK, we cut the time needed to generate test cases from 1 week to mere hours while also improving test coverage quality. Our solution demonstrates how studying human cognitive patterns, rather than optimizing AI algorithms alone, can create production-ready systems that enhance rather than replace human expertise.
In this post, we explain how we overcame the limitations of single-agent AI systems through a human-centric approach, implemented structured outputs to significantly reduce hallucinations, and built a scalable solution now positioned for expansion across the AMET QA team and later across other QA teams in the International Emerging Stores and Payments (IESP) organization.
Solution overview
The AMET Payments QA team validates code deployments affecting payment functionality for millions of customers across diverse regulatory environments and payment methods. Our manual test case generation process added turnaround time (TAT) to the product cycle, consuming valuable engineering resources on repetitive test preparation and documentation tasks rather than strategic testing initiatives. We needed an automated solution that could maintain our quality standards while reducing the time investment.
Our objectives included reducing test case creation time from 1 week to a few hours, capturing institutional knowledge from experienced testers, standardizing testing approaches across teams, and minimizing the hallucination issues common in AI systems. The solution needed to handle complex business requirements spanning multiple payment methods, regional regulations, and customer segments while producing specific, actionable test cases aligned with our existing test management systems.
The architecture employs a sophisticated multi-agent workflow. To get there, we went through three different iterations, and we continue to improve the system as new techniques are developed and new models are deployed.
The challenge with traditional AI approaches
Our initial attempts followed conventional AI approaches, feeding complete BRDs to a single AI agent for test case generation. This method frequently produced generic outputs like "verify payment works correctly" instead of the specific, actionable test cases our QA team requires. For example, we need test cases as specific as "verify that when a UAE customer selects cash on delivery (COD) for an order above 1,000 AED with a saved credit card, the system displays the COD fee of 11 AED and processes the payment through the COD gateway with the order state transitioning to 'pending delivery.'"
The single-agent approach presented several critical limitations. Context length restrictions prevented processing large documents effectively, and the lack of specialized processing stages meant the AI couldn't understand testing priorities or risk-based approaches. Moreover, hallucination issues created irrelevant test scenarios that could mislead QA efforts. The root cause was clear: the AI attempted to compress complex business logic without the iterative thinking process that experienced testers employ when analyzing requirements.
The following flow chart illustrates the issues we encountered when attempting to use a single agent with one comprehensive prompt.
The human-centric breakthrough
Our breakthrough came from a fundamental shift in approach. Instead of asking, "How should AI think about testing?", we asked, "How do experienced humans think about testing?" so we could follow a specific step-by-step process instead of relying on the large language model (LLM) to work this out on its own. This philosophy change led us to conduct research interviews with senior QA professionals, studying their cognitive workflows in detail.
We discovered that experienced testers don't process documents holistically; they work through specialized mental stages. First, they analyze documents by extracting acceptance criteria, identifying customer journeys, understanding UX requirements, mapping product requirements, analyzing user data, and assessing workstream capabilities. Then they develop tests through a systematic process: journey analysis, scenario identification, data flow mapping, test case development, and finally, organization and prioritization.
We then decomposed our original agent into sequential thinking actions that served as individual steps. We built and tested each step using Amazon Q Developer for CLI to confirm the basic ideas were sound and included both primary and secondary inputs.
This insight led us to design SAARAM with specialized agents that mirror these expert testing approaches. Each agent focuses on a specific aspect of the testing process, similar to how human experts mentally compartmentalize different analysis stages.
Multi-agent architecture with Strands Agents
Based on our understanding of human QA workflows, we initially tried to build our own agents from scratch. We had to create our own looping, serial, and parallel execution. We also created our own orchestration and workflow graphs, which demanded considerable manual effort. To address these challenges, we migrated to the Strands Agents SDK. This provided the multi-agent orchestration capabilities essential for coordinating complex, interdependent tasks while maintaining clear execution paths, helping improve our performance and reduce our development time.
Workflow iteration 1: End-to-end test generation
Our first iteration of SAARAM consisted of a single input and created our first specialized agents. It involved processing a Word document through five specialized agents to generate comprehensive test coverage.
Agent 1 is called the Customer Segment Creator, and it focuses on customer segmentation analysis, using four subagents:
- Customer Segment Discovery identifies product user segments
- Decision Matrix Generator creates parameter-based matrices
- E2E Scenario Creation develops end-to-end (E2E) scenarios per segment
- Test Steps Generation handles detailed test case development
Agent 2 is called the User Journey Mapper, and it employs four subagents to map product journeys comprehensively:
- The Flow Diagram and Sequence Diagram creators use Mermaid syntax.
- The E2E Scenarios generator builds upon these diagrams.
- The Test Steps Generator produces detailed test documentation.
Agent 3 is called Customer Segment x Journey Coverage, and it combines inputs from agents 1 and 2 to create detailed segment-specific analyses using four subagents.
Agent 4 is called the State Transition Agent. It analyzes various product state points in customer journey flows. Its subagents create Mermaid state diagrams representing different journey states and segment-specific state scenario diagrams, and generate the related test scenarios and steps.
The workflow, shown in the following diagram, concludes with a basic extract, transform, and load (ETL) process that consolidates and deduplicates the data from the agents, saving the final output as a text file.
This systematic approach provides comprehensive coverage of customer journeys, segments, and various diagram types, enabling thorough test coverage generation through iterative processing by agents and subagents.
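As an aside, the consolidation and deduplication step can be sketched in plain Python. This is a simplified stand-in, not SAARAM's actual ETL code: the record shape and the title-normalization rule are illustrative assumptions.

```python
# Minimal sketch of the consolidation/deduplication step, assuming test
# cases arrive from each agent as {"title": ..., "steps": [...]} records.
# The normalization rule (case-folding plus whitespace collapse) is an
# illustrative assumption, not SAARAM's implementation.
def consolidate(agent_outputs):
    """Merge test cases from all agents, dropping near-duplicate titles."""
    seen = set()
    merged = []
    for cases in agent_outputs:          # one list of cases per agent
        for case in cases:
            key = " ".join(case["title"].lower().split())
            if key not in seen:          # keep the first occurrence only
                seen.add(key)
                merged.append(case)
    return merged

agent_1 = [{"title": "Verify COD fee display", "steps": ["..."]}]
agent_2 = [{"title": "verify  COD fee display", "steps": ["..."]},
           {"title": "Verify card decline retry", "steps": ["..."]}]
final_suite = consolidate([agent_1, agent_2])
print(len(final_suite))  # 2 (the near-duplicate title is dropped)
```

In the real workflow this output is written to a text file; here the merged list stands in for that final artifact.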
Addressing limitations and enhancing capabilities
In our journey to develop a more robust and efficient tool using Strands Agents, we identified five critical limitations in our initial approach:
- Context and hallucination challenges – Our first workflow faced limitations from segregated agent operations, where individual agents independently collected data and created visual representations. This isolation led to limited contextual understanding, resulting in reduced accuracy and increased hallucinations in the outputs.
- Data generation inefficiencies – The limited context available to agents caused another critical issue: the generation of excessive irrelevant data. Without proper contextual awareness, agents produced less focused outputs, leading to noise that obscured useful insights.
- Limited parsing capabilities – The initial system's data parsing scope proved too narrow, limited to only customer segments, journey mapping, and basic requirements. This restriction prevented agents from accessing the full spectrum of information needed for comprehensive analysis.
- Single-source input constraint – The workflow could only process Word documents, creating a significant bottleneck. Modern development environments require data from multiple sources, and this limitation prevented holistic data collection.
- Rigid architecture concerns – Importantly, the first workflow employed a tightly coupled system with rigid orchestration. This architecture made it difficult to modify, extend, or reuse components, limiting the system's adaptability to changing requirements.
In our second iteration, we implemented strategic solutions to address these issues.
Workflow iteration 2: Comprehensive analysis workflow
Our second iteration represents a complete reimagining of the agentic workflow architecture. Rather than patching individual problems, we rebuilt from the ground up with modularity, context-awareness, and extensibility as core principles:
Agent 1 is the intelligent gateway. The file type decision agent serves as the system's entry point and router. Processing documentation files, Figma designs, and code repositories, it categorizes and directs data to the appropriate downstream agents. This intelligent routing is critical for maintaining both efficiency and accuracy throughout the workflow.
Agent 2 handles specialized data extraction. The Data Extractor agent employs six specialized subagents, each focused on a specific extraction domain. This parallel processing approach facilitates thorough coverage while maintaining practical speed. Each subagent operates with domain-specific knowledge, extracting nuanced information that generalized approaches might overlook.
Agent 3 is the Visualizer agent, and it transforms extracted data into six distinct Mermaid diagram types, each serving specific analytical purposes. Entity relation diagrams map data relationships and structures, and flow diagrams visualize processes and workflows. Requirement diagrams clarify product specifications, and UX requirement visualizations illustrate user experience flows. Process flow diagrams detail system operations, and mind maps reveal feature relationships and hierarchies. These visualizations provide multiple perspectives on the same information, helping both human reviewers and downstream agents understand patterns and connections within complex datasets.
Agent 4 is the Data Condenser agent, and it performs critical synthesis through intelligent context distillation, making sure each downstream agent receives exactly the information needed for its specialized task. This agent, powered by its condensed information generator, merges outputs from both the Data Extractor and Visualizer agents while performing sophisticated analysis.
The agent extracts critical elements from the full text context (acceptance criteria, business rules, customer segments, and edge cases), creating structured summaries that preserve essential details while reducing token usage. It compares each text file with its corresponding Mermaid diagram, capturing information that might be missed in visual representations alone. This careful processing maintains information integrity across agent handoffs, making sure important data is not lost as it flows through the system. The result is a set of condensed addendums that enrich the Mermaid diagrams with comprehensive context. This synthesis makes sure that when information moves to test generation, it arrives complete, structured, and optimized for processing.
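The core condensation idea can be illustrated with a small sketch: keep only lines carrying critical elements and trim them to a budget. The tag names and the character cap below are assumptions for illustration, not SAARAM's actual rules.

```python
# Illustrative sketch of context condensation: retain only lines tagged
# with critical elements, capped to a size budget. CRITICAL_TAGS and
# max_chars are assumed values, not SAARAM's real configuration.
CRITICAL_TAGS = ("acceptance criteria", "business rule",
                 "customer segment", "edge case")

def condense(extracted_lines, max_chars=200):
    """Keep lines that carry critical elements, trimmed to a budget."""
    kept = [ln.strip() for ln in extracted_lines
            if ln.lower().startswith(CRITICAL_TAGS)]
    addendum = []
    used = 0
    for ln in kept:
        if used + len(ln) > max_chars:
            break                         # stop once the budget is spent
        addendum.append(ln)
        used += len(ln)
    return addendum

lines = [
    "Acceptance criteria: COD fee of 11 AED shown above 1,000 AED",
    "Background narrative that adds no testable detail",
    "Edge case: saved card plus COD on the same order",
]
print(condense(lines))  # keeps the two tagged lines, drops the narrative
```

The real Data Condenser does substantially more (diagram comparison, cross-agent merging), but the preserve-what-matters-within-a-budget principle is the same.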
Agent 5, the Test Generator agent, brings together the collected, visualized, and condensed information to produce comprehensive test suites. Working with the six Mermaid diagrams plus the condensed information from Agent 4, this agent employs a pipeline of five subagents. The Journey Analysis Mapper, Scenario Identification Agent, and Data Flow Mapping subagents generate comprehensive test cases based on their view of the input data flowing from Agent 4. With test cases generated across these three critical perspectives, the Test Cases Generator evaluates them, reformatting them according to internal guidelines for consistency. Finally, the Test Suite Organizer performs deduplication and optimization, delivering a final test suite that balances comprehensiveness with efficiency.
The system now handles far more than the basic requirements and journey mapping of Workflow 1; it processes product requirements, UX specifications, acceptance criteria, and workstream extraction while accepting inputs from Figma designs, code repositories, and multiple document types. Most importantly, the shift to a modular architecture fundamentally changed how the system operates and evolves. Unlike our rigid first workflow, this design allows for reusing outputs from earlier agents, integrating new testing-type agents, and intelligently selecting test case generators based on user requirements, positioning the system for continuous adaptation.
The following figure shows our second iteration of SAARAM, with five critical agents and multiple subagents plus context engineering and compression.
Additional Strands Agents features
Strands Agents provided the foundation for our multi-agent system, offering a model-driven approach that simplified complex agent development. Because the SDK can connect models with tools through advanced reasoning capabilities, we built sophisticated workflows with only a few lines of code. Beyond its core functionality, two key features proved essential for our production deployment: reducing hallucinations with structured outputs, and workflow orchestration.
Reducing hallucinations with structured outputs
The structured output feature of Strands Agents uses Pydantic models to transform traditionally unpredictable LLM outputs into reliable, type-safe responses. This approach addresses a fundamental challenge in generative AI: although LLMs excel at producing humanlike text, they can struggle with the consistently formatted outputs needed for production systems. By enforcing schemas through Pydantic validation, we ensure that responses conform to predefined structures, enabling seamless integration with existing test management systems.
The following sample implementation demonstrates how structured outputs work in practice:
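The original snippet is not reproduced here, so the sketch below uses a stdlib dataclass to mimic the validation Pydantic performs in the real system: the field names and types are illustrative, not SAARAM's actual schema, and in production the schema would be a Pydantic `BaseModel` passed to the Strands Agents structured output call.

```python
# A stdlib stand-in for Pydantic-style schema enforcement. The TestCase
# fields are assumed for illustration; SAARAM's real schema may differ.
from dataclasses import dataclass, fields

@dataclass
class TestCase:
    test_id: str
    scenario: str
    steps: list
    expected_results: list

    def __post_init__(self):
        # Reject responses whose fields don't match the declared types,
        # mirroring the validation Pydantic runs automatically.
        for f in fields(self):
            if not isinstance(getattr(self, f.name), f.type):
                raise TypeError(f"{f.name} must be {f.type.__name__}")

# A well-formed LLM response parses cleanly...
ok = TestCase("TC006", "Credit card payment success",
              ["Add items to cart"], ["Checkout form displayed"])

# ...while a malformed one (steps returned as a string) is rejected.
try:
    TestCase("TC007", "Payment failure", "not a list", [])
except TypeError as err:
    print(err)  # prints "steps must be list"
```

With Pydantic itself, the validation error additionally pinpoints every offending field, which is the feedback loop the next paragraph describes.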
Pydantic automatically validates LLM responses against the defined schemas, enforcing type correctness and the presence of required fields. When responses don't match the expected structure, validation errors provide clear feedback about what needs correction, helping prevent malformed data from propagating through the system. In our environment, this approach delivered consistent, predictable outputs across the agents regardless of prompt variations or model updates, eliminating a whole class of data formatting errors. Consequently, our development team worked more efficiently with full IDE support.
Workflow orchestration benefits
The Strands Agents workflow architecture provided the sophisticated coordination capabilities our multi-agent system required. The framework enabled structured coordination with explicit task definitions, automatic parallel execution for independent tasks, and sequential processing for dependent operations. This meant we could build complex agent-to-agent communication patterns that would have been difficult to implement manually.
The following sample snippet shows how to create a workflow in the Strands Agents SDK:
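The original snippet is not reproduced here, so as a stand-in, the sketch below shows the dependency-driven pattern in plain Python rather than the Strands Agents API itself: the task names echo the subagents described earlier, and the `run` stub is a placeholder for an agent invocation.

```python
# Hypothetical sketch of dependency-driven workflow execution: independent
# tasks run in parallel, dependent tasks wait for their inputs. This is
# not the Strands Agents API; names and run() are illustrative.
from concurrent.futures import ThreadPoolExecutor

# Each task lists the tasks whose output it needs.
TASKS = {
    "journey_analysis": [],
    "scenario_identification": [],
    "data_flow_mapping": [],
    "test_case_development": ["journey_analysis",
                              "scenario_identification",
                              "data_flow_mapping"],
    "organization": ["test_case_development"],
}

def run(name, inputs):
    # Placeholder for invoking the corresponding agent.
    return f"{name} done with {len(inputs)} inputs"

def execute(tasks):
    results = {}
    pending = dict(tasks)
    with ThreadPoolExecutor() as pool:
        while pending:
            # Tasks whose dependencies are all satisfied run concurrently.
            ready = [t for t, deps in pending.items()
                     if all(d in results for d in deps)]
            futures = {t: pool.submit(run, t,
                                      [results[d] for d in pending[t]])
                       for t in ready}
            for t, fut in futures.items():
                results[t] = fut.result()
                del pending[t]
    return results

order = execute(TASKS)
print(len(order))  # 5 tasks completed
```

The first three tasks run in the same wave, and test case development only starts once all three finish, which is exactly the dependency behavior described next.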
The workflow system delivered three critical capabilities for our use case. First, parallel processing optimization allowed journey analysis, scenario identification, and coverage analysis to run concurrently, with independent agents processing different aspects without blocking one another. The system automatically allocated resources based on availability, maximizing throughput.
Second, intelligent dependency management made sure that test development waited for scenario identification to complete, and that organization tasks depended on the test cases being generated. Context was preserved and passed efficiently between dependent stages, maintaining information integrity throughout the workflow.
Finally, the built-in reliability features provided the resilience our system required. Automatic retry mechanisms handled transient failures gracefully, state persistence enabled pause and resume capabilities for long-running workflows, and comprehensive audit logging supported both debugging and performance optimization efforts.
The following table shows an example input into the workflow and the potential outputs.
| Input: Business requirement document | Output: Test cases generated |
| --- | --- |
| Functional requirements: | TC006: Credit card payment success<br>Scenario: Customer completes purchase using a valid credit card<br>Steps:<br>1. Add items to cart and proceed to checkout. Expected result: Checkout form displayed.<br>2. Enter shipping information. Expected result: Shipping details saved.<br>3. Select credit card payment method. Expected result: Card form shown.<br>4. Enter valid card details. Expected result: Card validated.<br>5. Submit payment. Expected result: Payment processed, order confirmed.<br><br>TC008: Payment failure handling<br>Scenario: Payment fails due to insufficient funds or card decline<br>Steps:<br>1. Enter card with insufficient funds. Expected result: Payment declined message.<br>2. System offers retry option. Expected result: Payment form redisplayed.<br>3. Try alternative payment method. Expected result: Alternative payment successful.<br><br>TC009: Payment gateway timeout<br>TC010: Refund processing |
Integration with Amazon Bedrock
Amazon Bedrock served as the foundation for our AI capabilities, providing seamless access to Claude Sonnet by Anthropic through the Strands Agents built-in AWS service integration. We selected Claude Sonnet by Anthropic for its exceptional reasoning capabilities and its ability to understand complex payment domain requirements. The Strands Agents flexible LLM API integration made this implementation straightforward. The following snippet shows how to create an agent in Strands Agents:
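The original snippet is not reproduced here; the sketch below is based on our reading of the Strands Agents SDK (`Agent` and `BedrockModel`). The model ID, region, and system prompt are placeholders, and the construction is guarded so the sketch degrades gracefully where the SDK or AWS configuration is unavailable.

```python
# Hedged sketch of creating a Bedrock-backed agent with Strands Agents.
# Model ID, region, and prompt are placeholder values, not SAARAM's
# production configuration.
def build_test_generation_agent():
    from strands import Agent
    from strands.models import BedrockModel

    model = BedrockModel(
        model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",  # placeholder
        region_name="us-east-1",                               # placeholder
    )
    return Agent(
        model=model,
        system_prompt="You generate payment test cases from BRD extracts.",
    )

try:
    agent = build_test_generation_agent()
except Exception:
    # SDK not installed or AWS not configured in this environment.
    agent = None
```

In production, the returned agent is what the workflow stages invoke; the few lines above are the entirety of the model wiring, which is the "model-driven" simplicity the section describes.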
The managed service architecture of Amazon Bedrock removed infrastructure complexity from our deployment. The service provided automatic scaling that adjusted to our workload demands, delivering consistent performance across the agents regardless of traffic patterns. Built-in retry logic and error handling improved system reliability significantly, reducing the operational overhead typically associated with managing AI infrastructure at scale. The combination of the sophisticated orchestration capabilities of Strands Agents and the robust infrastructure of Amazon Bedrock created a production-ready system that could handle complex test generation workflows while maintaining high reliability and performance standards.
The following diagram shows the deployment of the SAARAM agent with Amazon Bedrock AgentCore and Amazon Bedrock.
Results and business impact
The implementation of SAARAM has improved our QA processes with measurable enhancements across multiple dimensions. Before SAARAM, our QA engineers spent 3–5 days manually analyzing BRD documents and UI mocks to create comprehensive test cases. This manual process is now reduced to hours, with the system achieving:
- Test case generation time: Reduced from 1 week to hours
- Resource optimization: QA effort decreased from 1.0 full-time employee (FTE) to 0.2 FTE for validation
- Coverage improvement: 40% more edge cases identified compared to the manual process
- Consistency: 100% adherence to test case standards and formats
The accelerated test case generation has driven improvements in our core business metrics:
- Payment success rate: Increased through comprehensive edge case testing and risk-based test prioritization
- Payment experience: Enhanced customer satisfaction because teams can now iterate on test coverage during the design phase
- Developer velocity: Product and development teams generate preliminary test cases during design, enabling early quality feedback
SAARAM captures and preserves institutional knowledge that was previously dependent on individual QA engineers:
- Testing patterns from experienced professionals are now codified
- Historical test case learnings are automatically applied to new features
- Testing approaches are consistent across different payment methods and industries
- Onboarding time for new QA team members is reduced
This iterative improvement means the system becomes more useful over time.
Lessons learned
Our journey developing SAARAM provided critical insights for building production-ready AI systems. Our breakthrough came from studying how domain experts think rather than optimizing how AI processes information. Understanding the cognitive patterns of testers and QA professionals led to an architecture that naturally aligns with human reasoning. This approach produced better results than purely technical optimizations. Organizations building similar systems should invest time observing and interviewing domain experts before designing their AI architecture; the insights gained translate directly into more effective agent design.
Breaking complex tasks into specialized agents dramatically improved both accuracy and reliability. Our multi-agent architecture, enabled by the orchestration capabilities of Strands Agents, handles nuances that monolithic approaches consistently miss. Each agent's focused responsibility allows deeper domain expertise while providing better error isolation and debugging capabilities.
A key discovery was that the Strands Agents workflow and graph-based orchestration patterns significantly outperformed traditional supervisor agent approaches. Whereas supervisor agents make dynamic routing decisions that can introduce variability, workflows provide "agents on rails": a structured path that facilitates consistent, reproducible results. Strands Agents offers multiple patterns, including supervisor-based routing, workflow orchestration for sequential processing with dependencies, and graph-based coordination for complex scenarios. For test generation, where consistency is paramount, the workflow pattern with its explicit task dependencies and parallel execution capabilities delivered the optimal balance of flexibility and control. This structured approach suits production environments, where reliability matters more than theoretical flexibility.
Implementing Pydantic models through the Strands Agents structured output feature effectively eliminated type-related hallucinations in our system. By requiring AI responses to conform to strict schemas, we get reliable, programmatically usable outputs. This approach has proven essential when consistency and reliability are nonnegotiable. The type-safe responses and automatic validation have become foundational to our system's reliability.
Our condensed information generator pattern demonstrates how intelligent context management maintains quality throughout multistage processing. This discipline of knowing what to preserve, condense, and pass between agents helps prevent the context degradation that typically occurs in token-limited environments. The pattern is broadly applicable to multistage AI systems facing similar constraints.
What's next
The modular architecture we've built with Strands Agents enables straightforward adaptation to other domains within Amazon. The same patterns that generate payment test cases can be applied to retail systems testing, customer service scenario generation for support workflows, and mobile application UI and UX test case generation. Each adaptation requires only domain-specific prompts and schemas while reusing the core orchestration logic. Throughout the development of SAARAM, the team successfully addressed many challenges in test case generation, from reducing hallucinations through structured outputs to implementing sophisticated multi-agent workflows. However, one critical gap remains: the system hasn't yet been provided with examples of what high-quality test cases actually look like in practice.
To bridge this gap, integrating Amazon Bedrock Knowledge Bases with a curated repository of historical test cases would provide SAARAM with concrete, real-world examples during the generation process. By using the integration capabilities of Strands Agents with Amazon Bedrock Knowledge Bases, the system could search past successful test cases for similar scenarios before generating new ones. When processing a BRD for a new payment feature, SAARAM would first query the knowledge base for comparable test cases, whether for similar payment methods, customer segments, or transaction flows, and use these as contextual examples to guide its output.
Future deployment will use Amazon Bedrock AgentCore for comprehensive agent lifecycle management. Amazon Bedrock AgentCore Runtime provides the production execution environment with ephemeral, session-specific state management that maintains conversational context across active sessions while enforcing isolation between different user interactions. The observability capabilities of Amazon Bedrock AgentCore help deliver detailed visualizations of each step in SAARAM's multi-agent workflow, which the team can use to trace execution paths through the five agents, audit intermediate outputs from the Data Condenser and Test Generator agents, and identify performance bottlenecks through real-time dashboards powered by Amazon CloudWatch with standardized OpenTelemetry-compatible telemetry.
The service enables several advanced capabilities essential for production deployment: centralized agent management and versioning through the Amazon Bedrock AgentCore control plane, A/B testing of different workflow strategies and prompt variations across the five subagents within the Test Generator, performance monitoring with metrics tracking token usage and latency across the parallel execution stages, automatic agent updates without disrupting active test generation workflows, and session persistence for maintaining context when QA engineers iteratively refine test suite outputs. This integration positions SAARAM for enterprise-scale deployment while providing the operational visibility and reliability controls that transform it from a proof of concept into a production system capable of supporting the AMET team's ambitious goal of expanding beyond Payments QA to serve the broader organization.
Conclusion
SAARAM demonstrates how AI can transform traditional QA processes when designed with human expertise at its core. By reducing test case creation from 1 week to hours while improving quality and coverage, we've enabled faster feature deployment and enhanced payment experiences for millions of customers across the MENA region. The key to our success wasn't merely advanced AI technology; it was the combination of human expertise, thoughtful architecture design, and robust engineering practices. Through careful study of how experienced QA professionals think, implementation of multi-agent systems that mirror those cognitive patterns, and mitigation of AI limitations through structured outputs and context engineering, we've created a system that enhances rather than replaces human expertise.
For teams considering similar initiatives, our experience highlights three critical success factors: invest time understanding the cognitive processes of domain experts, implement structured outputs to minimize hallucinations, and design multi-agent architectures that mirror human problem-solving approaches. These QA tools aren't meant to replace human testers; they amplify their expertise through intelligent automation. If you're interested in starting your agent journey on AWS, check out our sample Strands Agents implementations repo or our recent launch, Amazon Bedrock AgentCore, and the end-to-end examples with deployment in our Amazon Bedrock AgentCore samples repo.
About the authors
Jayashree is a Quality Assurance Engineer at Amazon Music Tech, where she combines rigorous manual testing expertise with an emerging passion for GenAI-powered automation. Her work focuses on maintaining high system quality standards while exploring innovative approaches to make testing more intelligent and efficient. Committed to reducing testing monotony and improving product quality across Amazon's ecosystem, Jayashree is at the forefront of integrating artificial intelligence into quality assurance practices.
Harsha Pradha G is a Senior Quality Assurance Engineer in MENA Payments at Amazon. With a strong foundation in building comprehensive quality systems, she brings a unique perspective to the intersection of QA and AI as an emerging QA-AI integrator. Her work focuses on bridging the gap between traditional testing methodologies and cutting-edge AI innovations, while also serving as an AI content strategist and AI creator.
Fahim Surani is a Senior Solutions Architect at AWS, helping customers across Financial Services, Energy, and Telecommunications design and build cloud and generative AI solutions. His focus since 2022 has been driving enterprise cloud adoption, spanning cloud migrations, cost optimization, and event-driven architectures, including leading implementations recognized as early adopters of Amazon's latest AI capabilities. Fahim's work covers a wide range of use cases, with a primary interest in generative AI and agentic architectures. He is a regular speaker at AWS summits and industry events across the region.




