Production-grade AI agents for financial compliance: Lessons from Stripe

This post is co-written by Christopher Phillippi and Chrissie Cui from Stripe.

Stripe processes $1.4 trillion in annual payment volume across 50 countries, requiring compliance teams to review thousands of transactions daily. This post explores how Stripe built a production-grade AI agent system on AWS using Amazon Bedrock that reduced review handling time by 26 percent while maintaining human oversight. The post covers the technical architecture, infrastructure decisions, and lessons learned from deploying agentic AI that achieved over 96 percent helpfulness ratings, with human experts firmly in control of final decisions.

In this post, you learn how Stripe built a production-grade AI agent system for financial compliance. We cover the technical architecture of Stripe’s ReAct agent framework and the infrastructure decisions behind a dedicated agent service. We also discuss the role of human oversight in maintaining accountability, and key lessons about task decomposition, orchestration patterns, and cost optimization through prompt caching. By the end, you’ll understand how to design agentic systems that scale compliance operations without compromising quality or auditability.

Stripe’s scale and compliance challenge

The foundational mission of Stripe is to grow the gross domestic product (GDP) of the internet. That pursuit requires programmable financial infrastructure designed to support smooth transactions and operational management for businesses of all sizes. As of early 2026, Stripe has grown beyond its origins as a developer-centric payment API to become a systemic pillar of the global economy. The company supports millions of companies across 50 countries, from early-stage startups to 62 percent of the Fortune 500, and processes approximately $1.4 trillion in annual payment volume. This scale represents approximately 1.3 percent of total global GDP, positioning Stripe at the critical nexus of technological innovation and strong regulatory frameworks.

The compliance scaling problem

As Stripe’s global footprint expanded across 50 countries, the team faced a critical challenge: how to scale compliance operations without proportional headcount increases while maintaining regulatory quality standards. Every day, compliance teams conduct detailed reviews to identify and mitigate financial crime risks. However, skilled analysts were spending up to 80% of their time navigating fragmented systems to gather documentation rather than performing high-value risk assessments. Stripe’s solution integrates AI agents with automated orchestration, transforming compliance from a resource-intensive process into a scalable engine. This approach addresses the $206 billion global compliance burden by helping organizations identify 95% of card-testing attacks in real time and reduce unnecessary customer friction by 20%. The approach also maintains the auditability and precision required by regulators.

Why agentic AI for compliance?

Traditional automation falls short on complex, judgment-based compliance work, so AI agents are needed to handle assisted investigations with scale, consistent quality, and full auditability while keeping humans in control.

Three pillars

Oversight and accountability – Human-centered validation with configurable approval workflows and multi-layered decision checkpoints. Humans stay in the driver’s seat, supported by agents.
Transparency – Full audit trail with immutable documentation of every action, decision, and rationale.
Efficiency – Pre-investigation and dynamic analysis allow deeper reviews at a faster pace.

Technical architecture

The technical implementation of Stripe’s agentic compliance system consists of three key components: task decomposition and orchestration, the ReAct agent framework, and supporting infrastructure services. Each component plays a critical role in achieving scalable, auditable compliance automation.

Task decomposition and review orchestration

Assigning a single agent to handle this long, complicated review in one pass wouldn’t have worked. A single, unconstrained agent would have focused too much on the wrong things and not enough on what was actually needed. Instead, Stripe made the solution tractable by breaking the complicated review into composable, bite-sized sub-tasks. Each sub-task can depend on the results of other sub-tasks, so together they form a directed acyclic graph (DAG). These rails help verify that each agentic process runs only on vetted questions where quality has been measured through testing. They also help confirm that the investigation covers the required bases, and they give the agent sufficient context and focus to deliver quality results.

Despite rigorous quality testing of the agent responses in each sub-task, Stripe’s implementation doesn’t rely outright on the response of an agent. Instead, the responses are provided as supplementary information to the human reviewer, who must ultimately answer each sub-task of the review. This solves for oversight and accountability while still capturing the efficiency benefits. The high-level review flow is shown in the following diagram.

Reviewers interact with the review tooling, which is aware of the current question and of which subsequent questions require that answer as context. The tooling functions as the orchestrator, piping human-reviewed answers as context for further questions.
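As a rough illustration of this orchestration pattern, the following Python sketch orders sub-tasks so that each one runs only after the sub-tasks it depends on, then passes the human-reviewed answers forward as context. It uses the standard library’s graphlib.TopologicalSorter. The sub-task names and the agent_research and human_review functions are hypothetical placeholders, not Stripe’s actual tooling.

```python
from graphlib import TopologicalSorter

# Hypothetical sub-tasks: each key maps to the sub-tasks whose answers it needs.
DEPENDENCIES = {
    "business_model": set(),
    "ownership": set(),
    "risk_summary": {"business_model", "ownership"},
}


def agent_research(question, context):
    """Placeholder for an agent call; returns supplementary research text."""
    used = ", ".join(sorted(context)) or "no prior answers"
    return f"research for {question} (context: {used})"


def human_review(question, research):
    """Placeholder for the reviewer, who always provides the final answer."""
    return f"reviewed answer to {question}"


def run_review(dependencies):
    answers = {}
    # static_order() yields each sub-task after all of its dependencies.
    for question in TopologicalSorter(dependencies).static_order():
        # Only human-reviewed answers are piped forward as context.
        context = {dep: answers[dep] for dep in dependencies[question]}
        research = agent_research(question, context)
        answers[question] = human_review(question, research)
    return answers


answers = run_review(DEPENDENCIES)
```

In this sketch the agent output never becomes an answer on its own: it is only an input to human_review, which mirrors the oversight model described above.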

ReAct agent framework implementation

To fetch research for each sub-question, Stripe built a compliance agent using a form of the ReAct (reasoning and acting) agent framework. Beyond using a large language model (LLM), a type of foundation model (FM), on Amazon Bedrock for reasoning, the agentic aspect dynamically gathers relevant signals through tool calls. Stripe chose this agent framework to solve the problem of a near-infinite number of signals that may or may not be relevant for a given subject. Agents determine which signals are relevant and propose follow-ups until they’re sufficiently confident to provide a final answer. The high-level agent logic is shown in the following diagram.

Diagram illustrating the ReAct agent framework cycle showing the iterative process of Thought, Action (tool calls), and Observation steps until reaching a final answer

To walk through this flow, imagine being asked the question: “what is the answer to 10 divided by the number π?”

If you were a ReAct agent, your first thought would be to consider whether you already have the answer. You don’t, so you would propose an action of taking out a calculator and inputting 10/π. The calculator would then return an observation. Your next thought would be to determine whether you have an answer, and you would provide that calculation as your final answer. You can imagine something harder, such as “produce an analysis forecasting next year’s company revenue”, taking many cycles of database querying (Tool) and interpretation (Thought).

In the ReAct cycle, each time a tool is requested in the Thought block, the agent framework stops the LLM execution and instead programmatically runs that tool. It then forces that output as an observation back to the agent before allowing it to continue. This injection pattern implements a closed-loop control mechanism that:

Grounds agent reasoning in actual data – By mandating that every tool output must be processed as an observation, this prevents the agent from hallucinating or fabricating tool results.
Maintains context coherence – Forces the agent to explicitly acknowledge and reason about each piece of retrieved information before proceeding.
Prevents reasoning drift – The observation step acts as a checkpoint, helping verify that the agent’s thought process remains anchored to factual tool outputs rather than speculative reasoning.
Supports auditability – Creates an explicit trace of tool invocation → observation → reasoning that can be logged for compliance review.

This is analogous to a feedback control system in engineering. The agent can’t proceed to the next action without first processing the feedback (observation) from its previous action, preventing open-loop behavior that could lead to hallucinations or off-track reasoning.
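The loop described above can be sketched in a few lines of Python using the 10/π example. This is a minimal illustration under stated assumptions, not Stripe’s implementation: scripted_model is a hypothetical stand-in for an LLM call, and the calculator tool and transcript format are invented for the example.

```python
import math


def calculator(expression):
    """Toy tool: evaluates 'a / b' where each operand is a number or 'pi'."""
    def to_number(token):
        return math.pi if token == "pi" else float(token)

    left, right = (part.strip() for part in expression.split("/"))
    return to_number(left) / to_number(right)


TOOLS = {"calculator": calculator}


def scripted_model(transcript):
    """Stand-in for an LLM: requests the tool once, then answers from the observation."""
    observations = [step for step in transcript if step["type"] == "observation"]
    if not observations:
        return {"type": "action", "tool": "calculator", "input": "10 / pi"}
    return {"type": "final", "answer": observations[-1]["content"]}


def react_loop(question, model, tools, max_turns=5):
    transcript = [{"type": "question", "content": question}]
    for _ in range(max_turns):
        step = model(transcript)
        transcript.append(step)
        if step["type"] == "final":
            return step["answer"], transcript
        # The framework, not the model, runs the tool and injects the result,
        # so the model never gets the chance to invent a tool output.
        result = tools[step["tool"]](step["input"])
        transcript.append({"type": "observation", "content": result})
    raise RuntimeError("agent did not reach a final answer")


answer, trace = react_loop("What is 10 divided by pi?", scripted_model, TOOLS)
```

The returned trace holds the question, the requested action, the injected observation, and the final answer in order, which is the kind of explicit record that can be logged for later review.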

A challenge with this approach is that when a task is so complicated that it needs many turns and observations, the prompt can get very long in the later turns, particularly with verbose observations. The sub-task decomposition limits the scope of each question to keep the number of turns smaller. Prompt caching also helps with the cost of input tokens, which is the primary cost driver here. With prompt caching, you pay the full input price only for the new observations and thoughts appended to the previous messages at each turn, while the cached prefix is billed at a reduced rate. Amazon Bedrock provides this capability.
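To see why the growing prompt matters, the following Python sketch counts how many input tokens are billed at the full input rate over a multi-turn run, with and without caching. It is an idealized model: it assumes every turn appends the same number of tokens and that the whole previous prompt is a cache hit. The token counts are illustrative, not Stripe’s figures, and actual cache read and write pricing varies by model.

```python
def prompt_lengths(base_tokens, tokens_added_per_turn, turns):
    """Prompt length at each turn when every turn appends a thought and an observation."""
    return [base_tokens + turn * tokens_added_per_turn for turn in range(turns)]


def full_price_tokens(base_tokens, tokens_added_per_turn, turns, caching):
    """Input tokens billed at the full rate across the whole run."""
    if not caching:
        # Every turn reprocesses the entire conversation so far.
        return sum(prompt_lengths(base_tokens, tokens_added_per_turn, turns))
    # Idealized caching: the first turn is all new, later turns add only a suffix.
    return base_tokens + (turns - 1) * tokens_added_per_turn


# Illustrative numbers: a 2,000-token prompt, 500 new tokens per turn, 8 turns.
without_cache = full_price_tokens(2000, 500, 8, caching=False)  # 30000
with_cache = full_price_tokens(2000, 500, 8, caching=True)      # 5500
```

Without caching the total grows quadratically with the number of turns, while with caching it grows linearly, which is why both shorter sub-tasks and caching help.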

Full agentic review architecture and infrastructure

Stripe relied on a significant amount of infrastructure to support the actual agentic execution. The following diagram shows the full architecture.

Architecture diagram showing the full agentic review system including the review interface, orchestrator, agent service, LLM Proxy service, and connections to internal signals through agent tool

The full architecture includes the review interface and orchestrator covered earlier and an agent service that hosts the agent logic and facilitates execution. The agent service is supported by Stripe’s LLM Proxy service and connected to internal signals through available agent tools.

Building a dedicated agent service

Stripe’s agent service didn’t exist before this project; the project is what prompted the request to build it. Initially, Stripe tried to fit an agent into a traditional ML inference engine. This approach was rejected quickly for the following reasons:

Compute profiles – Traditional ML is compute bound, requiring expensive hardware such as GPUs, fast multi-threaded CPUs, or large memory allocations. In contrast, agentic applications are largely network bound, waiting on foundation models to finish or tool calls to run.
Latency – As the ReAct flow described previously shows, an agent can take an indeterminate amount of time to finish, depending on how many rounds of tool calls it needs. A long agent query or a database tool call could cause a thread to sit idle for minutes, compared to an XGBoost model that would finish in milliseconds.
Different API – In contrast to traditional ML, which tends to output basic types (floats, Booleans, and others), agents need more flexibility in their schema to annotate their results. Some agents also need to maintain conversation state.

As a result, Stripe stood up its own agent service, initially resembling a stateless, synchronous inference endpoint. Today it also handles stateful, multi-turn conversational agents. It has grown from a few agents at launch to well over 100 agents in less than a year.
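The network-bound profile is what makes asynchronous execution a good fit. The following Python sketch, which uses only the standard asyncio module, runs several simulated agents on a single thread; each one spends its time awaiting a fake model or tool call rather than computing. The agent names, round counts, and delays are hypothetical, and asyncio.sleep stands in for real network calls.

```python
import asyncio
import time


async def call_llm(delay):
    """Stand-in for a network-bound model or tool call."""
    await asyncio.sleep(delay)
    return "observation"


async def run_agent(name, rounds, delay):
    """Each round waits on the network; the thread is free while it waits."""
    for _ in range(rounds):
        await call_llm(delay)
    return f"{name} finished after {rounds} rounds"


async def serve(agent_specs):
    # gather() interleaves the agents on one thread and preserves input order.
    return await asyncio.gather(*(run_agent(*spec) for spec in agent_specs))


start = time.perf_counter()
results = asyncio.run(
    serve([("agent-a", 3, 0.05), ("agent-b", 2, 0.05), ("agent-c", 1, 0.05)])
)
elapsed = time.perf_counter() - start
```

Run sequentially, the six simulated calls would take about 0.3 seconds; run concurrently, the total is governed by the slowest agent (about 0.15 seconds), because no thread sits blocked while a call is in flight.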

LLM proxy architecture

Stripe’s ReAct agent doesn’t call Amazon Bedrock directly. Instead, Stripe uses an LLM Proxy microservice as its standard method for LLM access. The following diagram shows the LLM Proxy architecture.

Diagram showing the LLM Proxy microservice architecture that provides a single API endpoint for accessing multiple foundation models with features like noisy neighbor protection, model fallbacks, and monitoring

Stripe uses an LLM Proxy service for the following reasons:

Noisy neighbors – Stripe has many teams using LLMs for various applications. The LLM Proxy provides safeguards against other teams hogging the LLM bandwidth for a particular model, preventing resource contention.
One API, many models – The single endpoint simplifies specifying capabilities such as prompt caching or tool calling across foundation models from Amazon and leading AI companies. Changing models requires only changing the model type as an argument, instead of each use case managing many different clients.
Model fallbacks – The proxy can automatically fall back to specified default models in the case of resource constraints or outright failure.
Monitoring – By requiring authentication, the service can track model usage to help forecast future resource demand and make sure the appropriate models are being used given the privacy requirements of the application.
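The following Python sketch shows how a proxy of this kind might combine three of these ideas: a single entry point, per-team quotas, and model fallbacks, with usage tracked per team and model. It is a simplified illustration, not Stripe’s LLM Proxy. The class, method, model, and team names are hypothetical, the backends are plain functions instead of real model clients, and the quota is a simple call counter where a production service would more likely use time-windowed rate limits.

```python
class ModelUnavailable(Exception):
    """Raised by a backend that is throttled or failing."""


class LLMProxy:
    def __init__(self, backends, fallbacks, max_calls_per_team):
        self.backends = backends      # model name -> callable(prompt) -> str
        self.fallbacks = fallbacks    # model name -> ordered fallback model names
        self.max_calls_per_team = max_calls_per_team
        self.usage = {}               # (team, model) -> successful call count

    def complete(self, team, model, prompt):
        # Noisy-neighbor guard: one team cannot exceed its quota.
        team_calls = sum(n for (t, _), n in self.usage.items() if t == team)
        if team_calls >= self.max_calls_per_team:
            raise PermissionError(f"{team} exceeded its quota")
        # Try the requested model first, then each fallback in order.
        for candidate in [model, *self.fallbacks.get(model, [])]:
            try:
                text = self.backends[candidate](prompt)
            except ModelUnavailable:
                continue
            key = (team, candidate)
            self.usage[key] = self.usage.get(key, 0) + 1  # monitoring
            return candidate, text
        raise ModelUnavailable(f"no backend available for {model}")


def primary(prompt):
    raise ModelUnavailable("throttled")


def backup(prompt):
    return f"echo: {prompt}"


proxy = LLMProxy(
    backends={"primary-model": primary, "backup-model": backup},
    fallbacks={"primary-model": ["backup-model"]},
    max_calls_per_team=2,
)
used_model, text = proxy.complete("compliance", "primary-model", "hello")
```

Here the caller asks for primary-model, the simulated backend is throttled, and the proxy transparently serves the request from backup-model while recording which model actually handled it.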

How architectural components work together

Human reviewers drive the review, using agentic responses as pre-fetched research. As they answer, those responses can be used in the prompts for deeper questions within the same review, orchestrating review questions as a directed acyclic graph (DAG).

For a given question, the agent can call tools to dynamically access internal data or services as needed. This approach is used because the set of potentially relevant signals is often much larger than what can be included in a prompt. Because the agent fetches data through tool calls, the thought log includes only the data relevant to the current question, without extra irrelevant information, which keeps the agent focused.

The agent itself is driven by foundation models from Amazon and leading AI companies, which are responsible for thinking and determining which tool calls are needed. The agent application accesses the LLM through the LLM Client, which abstracts away features such as prompt caching and model fallbacks.

Amazon Bedrock integration benefits

Stripe uses Amazon Bedrock within its LLM Proxy. Amazon Bedrock provides the following additional benefits:

Standardized privacy and security – As a payment processor, Stripe must be extra careful around privacy and security. Amazon Bedrock helps verify that foundation models from Amazon and leading AI companies fit within existing security and privacy constraints, without additional review overhead for each model.
Feature rich – As described earlier, Amazon Bedrock allows for prompt caching on supported models. Additionally, Amazon Bedrock allows for fine-tuning and serving custom models, which Stripe expects to focus on in the coming year.
One API, many models – Integration is straightforward because models fall within the same API. Changing models requires using a different model name. Amazon Bedrock also supports many different foundation models from Amazon and leading AI companies, providing industry-standard performance for Stripe.

Audit trail implementation for regulatory compliance

Though Stripe ultimately uses human reviewers to make judgments and decisions, the system must still stand up to regulatory scrutiny. As a result, Stripe implemented logging so that the complete agent log for every past run is retrievable. Every agent action, decision, and rationale is documented.

Results and impact: 26 percent faster reviews with over 96 percent helpfulness

Stripe achieved a 26 percent reduction in median review handling time through agentic automation, with reviewers maintaining helpfulness ratings of over 96 percent and human reviewers remaining in control of decisions. This was achieved while providing full audit trails that meet examination standards.

As Stripe continues to grow, the team will be able to keep up with the proportional demand for risk management. Human reviewers can focus their time on harder problems or new investigation opportunities, leading to an improved compliance program.

Key lessons learned from production deployment

Through the process of building and deploying this production agentic AI system, Stripe distilled several insights that shaped the project’s success and can inform similar implementations.

Bite-sized tasks – Keep agent tasks small enough for working memory. Test quality incrementally rather than diving straight into full automation.

Orchestration – Async workflow architecture with DAG support is essential for complex agent interactions while maintaining auditability and human oversight at scale.

Infrastructure – Dedicated microservice architecture matters because agents have fundamentally different resource profiles than traditional ML models. Traditional inference systems are compute-bound and optimized for millisecond responses on expensive GPU hardware. Agents are network-bound, spending minutes waiting on LLM calls and tool executions with unpredictable latency patterns. A dedicated agent service handles these long-running, stateful interactions through async execution patterns. This allows threads to efficiently manage multiple concurrent agent sessions without blocking on external calls. Token caching reduces costs by 60% by reusing common prompt prefixes across agent turns rather than reprocessing the entire conversation history on each step. Cost instrumentation tracks token usage per agent invocation, helping teams forecast spend as workloads scale and identify optimization opportunities before they impact budgets. This infrastructure-first approach transformed agents from an experimental prototype into a production service supporting more than 100 agents across Stripe.

Keep humans in control – Agents assist, but expert reviewers retain final decision authority. Constrain agents with rails to bound context.

What’s next

Initially, Stripe focused on questions that can be answered before the review even starts. The remaining questions likely require upstream context that is identified and validated during the review. This will lead to more complex, multi-step investigations that orchestrate real-time answers as context during the review, supporting deeper efficiency improvements. The current 26 percent reduction represents early progress.

Because Stripe isn’t willing to accept an increase in risk tolerance by using this technology, the team tests the agentic investigation component against human quality standards. The team validates with actual humans before allowing the component to inform reviewers in production. The team is also exploring ways to use LLMs to quickly identify and eliminate subpar approaches.

Amazon Bedrock provides customization capabilities that Stripe is exploring to further enhance its compliance system. Currently, Stripe uses Retrieval Augmented Generation (RAG) for dynamic knowledge injection through tool calls, which gives its agents access to real-time compliance data. Looking ahead, Stripe is considering using the fine-tuning capabilities of Amazon Bedrock to adapt model behavior specifically for financial compliance tasks. This could help lock in model quality and reduce re-evaluation overhead as models evolve. Additionally, Amazon Bedrock provides continued pre-training options for incorporating domain-specific knowledge, which could help build more specialized compliance expertise into agent reasoning. The model versioning and 6-month deprecation notice window in Amazon Bedrock help Stripe plan these customization efforts strategically, allowing model upgrades only when they meaningfully improve investigative capabilities. These complementary techniques work together to balance performance, stability, and adaptability as compliance operations scale.

Conclusion

Stripe has demonstrated that agents can speed up manual review processes, achieving a 26 percent reduction in review handling time while maintaining over 96 percent helpfulness ratings, even with humans retaining decision authority rather than moving to full automation. Instead of relying on the power of agents alone, Stripe achieved this by building rails that constrain agents to the bite-sized review areas where they can be successful. To do so, Stripe needed new agentic serving infrastructure, inspired by but distinct from the machine learning inference systems that have historically existed.

This became possible with Amazon Bedrock, which provided Stripe with the privacy protections and model selection that supported this jump in review efficiency. These capabilities are expected to extend into many other domains.

To learn more about how to build similar agentic systems on Amazon Bedrock, see the Amazon Bedrock User Guide and the Amazon Bedrock prompt caching documentation. To get started, visit the Amazon Bedrock console.
