Friday, March 13, 2026

This post is co-written with Rich Young from Arize AI.

Agentic AI applications built on agentic workflows differ from traditional workloads in one important way: they are nondeterministic. That is, they can produce different results from the same input, because the large language models (LLMs) they are based on use probabilities when generating each token. This inherent unpredictability leads AI application designers to ask whether the agent took the correct course of action, followed the optimal path, and called the right set of tools with the right parameters. Organizations that want to deploy such agentic workloads need an observability system that can make sure they are producing results that are correct and can be trusted.
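As a toy illustration of that token-level behavior (a stdlib-only sketch, not how any particular LLM is implemented), repeatedly sampling from the same next-token probability distribution can produce different outputs for the same input:

```python
import random

# Toy next-token sampler: the input distribution is identical every time,
# yet repeated sampling can pick different tokens. This is the root of
# agent nondeterminism. The tokens and probabilities are made up.
def sample_next_token(probs, rng):
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

next_token_probs = {"book": 0.6, "cancel": 0.3, "list": 0.1}
rng = random.Random()  # unseeded, like production sampling
samples = {sample_next_token(next_token_probs, rng) for _ in range(200)}
# Over many draws, more than one distinct token shows up.
```

The same mechanism, compounded across every token of every planning step, is why two runs of the same agent on the same request can take different paths.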

In this post, we present how the Arize AX service can trace and evaluate AI agent tasks initiated through Strands Agents, helping validate the correctness and trustworthiness of agentic workflows.

Challenges with generative AI applications

The path from a promising AI demo to a reliable production system is fraught with challenges that many organizations underestimate. Based on industry research and real-world deployments, teams face several critical hurdles:

  • Unpredictable behavior at scale – Agents that perform well in testing might fail on unexpected inputs in production, such as new language variations or domain-specific jargon that cause irrelevant or misunderstood responses.
  • Hidden failure modes – Agents can produce plausible but wrong outputs or skip steps unnoticed, such as miscalculating financial metrics in a way that looks correct but misleads decision-making.
  • Nondeterministic paths – Agents might choose inefficient or incorrect decision paths, such as taking 10 steps to route a query that should take only 5, leading to poor user experiences.
  • Tool integration complexity – Agents can break when calling APIs incorrectly, for example, passing the wrong order ID format so that a refund silently fails despite a successful inventory update.
  • Cost and performance variability – Loops or verbose outputs can cause runaway token costs and latency spikes, such as an agent making more than 20 LLM calls and delaying a response from 3 to 45 seconds.
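The cost-and-latency point is simple back-of-the-envelope arithmetic. The per-call dollar and latency figures below are illustrative assumptions, not measured values:

```python
# Rough cost/latency model for an agent run. usd_per_1k_tokens and
# sec_per_call are placeholder assumptions for illustration only.
def run_cost(n_llm_calls, tokens_per_call, usd_per_1k_tokens=0.003, sec_per_call=1.5):
    total_tokens = n_llm_calls * tokens_per_call
    return {
        "tokens": total_tokens,
        "usd": round(total_tokens / 1000 * usd_per_1k_tokens, 4),
        "seconds": n_llm_calls * sec_per_call,
    }

direct_answer = run_cost(2, 800)    # a well-behaved two-call run
looping_agent = run_cost(20, 800)   # an agent stuck re-planning in a loop
```

Under these assumptions, the looping run burns 10x the tokens and stretches the response from 3 seconds to 30, which is exactly the kind of spike the monitoring discussed later is meant to catch.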

These challenges mean that traditional testing and monitoring approaches are insufficient for AI systems. Success requires a more thoughtful, more comprehensive strategy.

Arize AX delivers a comprehensive observability, evaluation, and experimentation framework

Arize AX is an enterprise-grade AI engineering service that helps teams monitor, evaluate, and debug AI applications from development through the production lifecycle. Built on Arize's Phoenix foundation, AX adds enterprise requirements such as the "Alyx" AI assistant, online evaluations, automated prompt optimization, role-based access control (RBAC), and enterprise scale and support. AX offers organizations a comprehensive solution that caters to both technical and nontechnical personas so they can manage and improve AI agents from development through production at scale. Arize AX capabilities include:

  • Tracing – Full visibility into LLM operations using OpenTelemetry to capture model calls, retrieval steps, and metadata such as tokens and latency for detailed analysis.
  • Evaluation – Automated quality monitoring with LLM-as-a-judge evaluations on production samples, supporting custom evaluators and clear success metrics.
  • Datasets – Maintain versioned, representative datasets for edge cases, regression tests, and A/B testing, refreshed with real production examples.
  • Experiments – Run controlled tests to measure the impact of changes to prompts or models, validating improvements with statistical rigor.
  • Playground – Interactive environment to replay traces, test prompt variations, and compare model responses for effective debugging and optimization.
  • Prompt management – Version, test, and deploy prompts like code, with performance monitoring and gradual rollouts to catch regressions early.
  • Monitoring and alerting – Real-time dashboards and alerts for latency, errors, token usage, and drift, with escalation for critical issues.
  • Agent visualization – Analyze and optimize agent decision paths to reduce loops and inefficiencies, refining planning strategies.

These components form a comprehensive observability strategy that treats LLM applications as mission-critical production systems requiring continuous monitoring, evaluation, and improvement.
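To make the tracing capability concrete, an LLM-call span boils down to a structured record like the one below. This is a minimal sketch: the attribute names follow OpenInference semantic conventions, but the exact schema Arize stores is richer than this.

```python
import uuid

# Minimal LLM-call span record in the spirit of OpenInference/OpenTelemetry.
# Real exporters also carry timestamps, status codes, and nested events.
def llm_span(model_name, prompt_tokens, completion_tokens, latency_ms, parent_id=None):
    return {
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": "llm.completion",
        "attributes": {
            "llm.model_name": model_name,
            "llm.token_count.prompt": prompt_tokens,
            "llm.token_count.completion": completion_tokens,
            "llm.token_count.total": prompt_tokens + completion_tokens,
            "latency_ms": latency_ms,
        },
    }

span = llm_span("claude-3-7-sonnet", prompt_tokens=412,
                completion_tokens=128, latency_ms=950)
```

Every capability in the list above (evaluation, monitoring, dashboards) operates on records of roughly this shape.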

Arize AX and Strands Agents: A powerful combination

Strands Agents is an open source SDK, a powerful low-code framework for building and running AI agents with minimal overhead. Designed to simplify the development of sophisticated agent workflows, Strands unifies prompts, tools, LLM interactions, and integration protocols into a single streamlined experience. It supports both Amazon Bedrock hosted and external models, with built-in capabilities for Retrieval Augmented Generation (RAG), Model Context Protocol (MCP), and Agent2Agent (A2A) communication. In this section, we walk through building an agent with the Strands Agents SDK, instrumenting it with Arize AX for trace-based evaluation, and optimizing its behavior.

The following workflow shows how a Strands agent handles a user task end to end, invoking tools, retrieving context, and generating a response, while sending traces to Arize AX for evaluation and optimization.

The solution follows these high-level steps:

  1. Install and configure the dependencies
  2. Instrument the agent for observability
  3. Build the agent with the Strands SDK
  4. Test the agent and generate traces
  5. Analyze traces in Arize AI
  6. Evaluate the agent's behavior
  7. Optimize the agent
  8. Continually monitor the agent

Prerequisites

You'll need:

  • An AWS account with access to Amazon Bedrock
  • An Arize account with your Space ID and API Key (sign up at no additional cost at arize.com)

Install the dependencies:

pip install strands opentelemetry-sdk arize-otel

Solution walkthrough: Using Arize AX with Strands Agents

The integration between the Strands Agents SDK and Arize AI's observability system provides deep, structured visibility into the behavior and decisions of AI agents. This setup enables end-to-end tracing of agent workflows, from user input through planning, tool invocation, and final output.

Full implementation details are available in the accompanying notebook and resources in the Openinference-Arize repository on GitHub.

Install and configure the dependencies

To install and configure the dependencies, use the following code:

from opentelemetry import trace
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from strands_to_openinference_mapping import StrandsToOpenInferenceProcessor
from arize.otel import register
import grpc

Instrument the agent for observability

To instrument the agent for observability, use the following code:

  •  The StrandsToOpenInferenceProcessor converts native spans to OpenInference format.
  •  trace_attributes adds session and user context for richer trace filtering.

Use Arize's OpenTelemetry integration to enable tracing:

register(
    space_id="your-arize-space-id",
    api_key="your-arize-api-key",
    project_name="strands-project",
    processor=StrandsToOpenInferenceProcessor()
)
agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=[
        retrieve, current_time, get_booking_details,
        create_booking, delete_booking
    ],
    trace_attributes={
        "session.id": "abc-1234",
        "user.id": "user-email@example.com",
        "arize.tags": [
            "Agent-SDK",
            "Arize-Project",
            "OpenInference-Integration"
        ]
    }
)

Build the agent with the Strands SDK

Create the Restaurant Assistant agent using Strands. This agent will help customers with restaurant information and reservations using several tools:

  1. retrieve – Searches the knowledge base for restaurant information
  2. current_time – Gets the current time for reservation scheduling
  3. create_booking – Creates a new restaurant reservation
  4. get_booking_details – Retrieves details of an existing reservation
  5. delete_booking – Cancels an existing reservation
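Under the hood, an agent runtime resolves the model's tool-call request to one of these Python functions. The sketch below is a hypothetical dispatch table; the real Strands SDK wires tools up for you via the Agent constructor, and the create_booking here is a stand-in stub, not the sample repository's implementation:

```python
# Hypothetical stand-in for the sample's create_booking tool.
def create_booking(restaurant, time, party_size, name):
    return {"booking_id": "bk-0001", "restaurant": restaurant,
            "time": time, "party_size": party_size, "name": name}

TOOLS = {"create_booking": create_booking}

# Route a parsed tool call (name + arguments) to the matching function.
# A wrong tool name or malformed arguments fails here, which is exactly
# the failure class the tool-calling evaluations below are meant to catch.
def dispatch(tool_call):
    fn = TOOLS.get(tool_call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {tool_call['name']}")
    return fn(**tool_call["arguments"])

booking = dispatch({
    "name": "create_booking",
    "arguments": {"restaurant": "Rice & Spice", "time": "8pm",
                  "party_size": 2, "name": "Anna"},
})
```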

The agent uses Anthropic's Claude 3.7 Sonnet model in Amazon Bedrock for natural language understanding and generation. Import the required tools and define the agent:

import os

import boto3
from strands import Agent, tool
from strands.models.bedrock import BedrockModel
from strands_tools import retrieve, current_time
import get_booking_details, delete_booking, create_booking

system_prompt = """You are "Restaurant Helper", a restaurant assistant helping customers reserve tables in different restaurants. You can talk about the menus, create new bookings, get the details of an existing booking or delete an existing reservation. You always reply politely and mention your name in the reply (Restaurant Helper)..........."""

model = BedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
)

kb_name = "restaurant-assistant"
smm_client = boto3.client('ssm')
kb_id = smm_client.get_parameter(
    Name=f'{kb_name}-kb-id',
    WithDecryption=False
)
os.environ["KNOWLEDGE_BASE_ID"] = kb_id["Parameter"]["Value"]

agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=[
        retrieve, current_time, get_booking_details,
        create_booking, delete_booking
    ],
    trace_attributes={
        "session.id": "abc-1234",
        "user.id": "user-email-example@domain.com",
        "arize.tags": [
            "Agent-SDK",
            "Arize-Project",
            "OpenInference-Integration",
        ]
    }
)

Test the agent and generate traces

Test the agent with a few queries to generate traces for Arize. Each interaction will create spans in OpenTelemetry that will be processed by the custom processor and sent to Arize AI. The first test case is a restaurant information query. Ask about restaurants in New York. This will trigger the knowledge base retrieval tool:

# Test with a question about restaurants
results = agent("Hi, where can I eat in New York?")
print(results)

The second test case is a restaurant reservation. Test the booking functionality by making a reservation. This will trigger the create_booking tool:

# Test with a reservation request
results = agent("Make a reservation for tonight at Rice & Spice. At 8pm, for 2 people in the name of Anna")
print(results)

Analyze traces in Arize AI

After running the agent, you can view and analyze the traces in the Arize AI dashboard, shown in the following screenshot. Trace-level visualization shows the path that the agent took during execution. In the Arize dashboard, you can review the traces generated by the agent. By selecting the strands-project you defined in the notebook, you can view your traces on the LLM Tracing tab. Arize provides powerful filtering capabilities to help you focus on specific traces. You can filter by OTel attributes and metadata, for example, to analyze performance across different models.

You can also use the Alyx AI assistant to analyze your agent's behavior through natural language queries and uncover insights. In the example below, we use Alyx to reason about why a tool was invoked incorrectly by the agent in one of the traces, helping us identify the root cause of the misstep.

Choosing a specific trace provides detailed information about the agent's runtime performance and decision-making process, as shown in the following screenshot.

The graph view, shown in the following screenshot, reveals the hierarchical structure of your agent's execution. By selecting nodes in the graph, users can inspect specific execution paths to understand how the agent made decisions.

You can also view session-level insights on the Sessions tab next to LLM Tracing. By tagging spans with session.id and user.id, you can group related interactions, identify where conversations break down, monitor user frustration, and evaluate multiturn performance across sessions.
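That session grouping amounts to bucketing trace rows by their session.id attribute. A stdlib sketch under that assumption, using the same attribute names set in trace_attributes earlier:

```python
from collections import defaultdict

# Group exported trace rows by the session.id tag so multi-turn
# conversations can be inspected together, as the Sessions tab does.
def group_by_session(rows):
    sessions = defaultdict(list)
    for row in rows:
        sessions[row["session.id"]].append(row)
    return dict(sessions)

rows = [
    {"session.id": "abc-1234", "user.id": "a@example.com", "input": "Hi, where can I eat?"},
    {"session.id": "abc-1234", "user.id": "a@example.com", "input": "Book a table for 2"},
    {"session.id": "zzz-9999", "user.id": "b@example.com", "input": "Show me the menu"},
]
sessions = group_by_session(rows)  # two sessions, one with a 2-turn conversation
```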

Evaluate the agent's behavior

Arize's system traces the agent's decision-making process, capturing details such as routing decisions, tool calls, and parameters. You can evaluate performance by analyzing these traces to verify that the agent selects optimal paths and provides accurate responses. For example, if the agent misinterprets a customer's request and chooses the wrong tool or uses incorrect parameters, Arize evaluators will identify when these failures occur. Arize has pre-built evaluation templates for every step of your agent process.

Create a new task under Evals and Tasks and choose LLM as a judge as the task type. You can use a pre-built prompt template (tool calling is used in the example shown in the following screenshot) or you can ask the Alyx AI assistant to build one for you. Evals will now automatically run on your traces as they flow into Arize. This uses AI to automatically label your data and identify failures at scale without human intervention.

Now every time the agent is invoked, trace data is collected in Arize, and the tool calling evaluation automatically runs and labels the data as correct or incorrect, along with an explanation from the LLM-as-a-judge for its labeling decision. Here is an example of an evaluation label and explanation.
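The labeling loop can be sketched in plain Python. Here judge stands in for a real LLM call (in Arize, the judge runs from the template you configured); the template text and the stub are assumptions for illustration, not Arize's actual prompt:

```python
# Hedged sketch of LLM-as-a-judge tool-calling evaluation plumbing.
TEMPLATE = (
    "You are evaluating an agent's tool choice.\n"
    "Question: {question}\nTool called: {tool}\nExpected tool: {expected}\n"
    "Answer with one word: correct or incorrect."
)

def evaluate_tool_call(record, judge):
    prompt = TEMPLATE.format(**record)
    label = judge(prompt).strip().lower()
    if label not in ("correct", "incorrect"):
        label = "unparseable"  # guard against a judge that ignores the rails
    return {"label": label, "prompt": prompt}

# Stub judge so the sketch runs without a model endpoint.
def stub_judge(prompt):
    return "correct" if "Tool called: create_booking" in prompt else "incorrect"

result = evaluate_tool_call(
    {"question": "Book a table for 2 at 8pm", "tool": "create_booking",
     "expected": "create_booking"},
    judge=stub_judge,
)
```

The real service attaches the label and the judge's free-text explanation back onto the trace, which is what makes the regression-dataset workflow in the next step possible.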

Optimize the agent

The LLM-as-a-judge evaluations automatically identify and label failure cases where the agent didn't call the right tool. In the screenshot below, these failure cases are automatically captured and added to a regression dataset, which can drive agent improvement workflows. This production data can now fuel development cycles for improving the agent.

Now, you can connect directly with Arize's prompt playground, an integrated development environment (IDE) where you can experiment with various prompt changes and model choices, compare side-by-side results, and test against the regression dataset from the previous step. When you have an optimal prompt and model combination, you can save this version to the prompt hub for future version tracking and retrieval, as shown in the following screenshot.

Experiments from the prompt testing are automatically saved, with online evaluations run and results stored for fast analysis and comparison to facilitate data-driven decisions on which improvements to deploy. Additionally, experiments can be incorporated into continuous integration and continuous delivery (CI/CD) workflows for automated regression testing and validation whenever new prompt or tool changes are pushed to systems such as GitHub. The screenshot below shows hallucination metrics for prompt experiments.

Continually monitor the agent

To maintain reliability and performance in production, it's essential to continually monitor your AI agents. Arize AI provides out-of-the-box monitoring capabilities that help teams detect issues early, optimize cost, and deliver high-quality user experiences. Setting up monitors in Arize AI offers:

  • Early issue detection – Identify problems before they impact users
  • Performance tracking – Monitor trends and maintain consistent agent behavior
  • Cost management – Track token usage to avoid unnecessary expenses
  • Quality assurance – Validate that your agent is delivering accurate, helpful responses

You can access and configure monitors on the Monitors tab in your Arize project. For details, refer to the Arize documentation on monitoring.

When monitoring your Strands agent in production, pay close attention to these key metrics:

  • Latency – Time taken for the agent to respond to user inputs
  • Token usage – Number of tokens consumed, which directly impacts cost
  • Error rate – Frequency of failed responses or tool invocations
  • Tool utilization – Effectiveness and frequency of tool calls
  • User satisfaction indicators – Proxy metrics such as tool call correctness, conversation length, or resolution rates

By regularly monitoring these metrics, teams can proactively improve agent performance, catch regressions early, and make sure the system scales reliably in real-world use. In Arize, you can create custom metrics directly from OTel trace attributes or metadata, and even from evaluation labels and metrics, such as the tool calling correctness evaluation you created previously. The screenshot below visualizes the tool call correctness ratio across agent traces, helping identify patterns in correct versus incorrect tool usage.
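The correctness ratio behind that chart is straightforward to compute from the evaluation labels. This is a sketch of the metric itself, not Arize's internal implementation:

```python
# Tool-call correctness ratio over LLM-as-a-judge labels.
# Labels other than correct/incorrect (e.g. unparseable) are excluded.
def correctness_ratio(labels):
    scored = [label for label in labels if label in ("correct", "incorrect")]
    if not scored:
        return None  # nothing evaluated yet
    return sum(label == "correct" for label in scored) / len(scored)

ratio = correctness_ratio(["correct", "correct", "incorrect", "correct"])
```

A monitor on this value, with an alert threshold, turns the one-off evaluation into the continuous quality signal described above.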

The screenshot below illustrates how Arize provides customizable dashboards that enable deep observability into LLM agent performance, showcasing a custom monitoring dashboard tracking core metrics such as latency, token usage, and the percentage of correct tool calls.

The screenshot below demonstrates prebuilt templates designed to accelerate setup and provide immediate visibility into key agent behaviors.

Clean up

When you're done experimenting, you can clean up the AWS resources created by this notebook by running the cleanup script: !sh cleanup.sh

Conclusion

The key lesson is clear: observability, automated evaluations, experimentation and feedback loops, and proactive alerting aren't optional for production AI; they're the difference between innovation and liability. Organizations that invest in proper AI operations infrastructure can harness the transformative power of AI agents while avoiding the pitfalls that have plagued early adopters. The combination of Amazon Strands Agents and Arize AI provides a comprehensive solution that addresses these challenges:

  • Strands Agents offers a model-driven approach for building and running AI agents
  • Arize AI adds the critical observability layer with tracing, evaluation, and monitoring capabilities

The partnership between AWS and Arize AI offers a powerful solution for building and deploying generative AI agents. The fully managed framework of Strands Agents simplifies agent development, and Arize's observability tools provide critical insights into agent performance. By addressing challenges such as nondeterminism, verifying correctness, and enabling continual monitoring, this integration helps organizations create reliable and effective AI applications. As businesses increasingly adopt agentic workflows, the combination of Amazon Bedrock and Arize AI sets a new standard for trustworthy AI deployment.

Get started

Now that you've learned how to integrate Strands Agents with the Arize observability service, you can start exploring different types of agents using the example provided in this sample. As a next step, try expanding this integration to include automated evaluations using Arize's evaluation framework to score agent performance and decision quality.

Ready to build better agents? Get started with an account at arize.com at no additional cost and begin transforming your AI agents from unpredictable experiments into reliable, production-ready solutions. The tools and knowledge are here; the only question is: what will you build?

About the Authors

Rich Young is the Director of Partner Solutions Architecture at Arize AI, focused on AI agent observability and evaluation tooling. Prior to joining Arize, Rich led technical pre-sales at WhyLabs AI. In his pre-AI life, Rich held leadership and IC roles at enterprise technology companies such as Splunk and Akamai.

Karan Singh is an Agentic AI leader at AWS, where he works with top-tier third-party foundation model and agentic framework providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise agentic AI challenges. Karan holds a BS in Electrical Engineering from Manipal University, an MS in Electrical Engineering from Northwestern University, and an MBA from the Haas School of Business at the University of California, Berkeley.

Nolan Chen is a Partner Solutions Architect at AWS, where he helps startup companies build innovative solutions using the cloud. Prior to AWS, Nolan specialized in data security and helping customers deploy high-performing wide area networks. Nolan holds a bachelor's degree in mechanical engineering from Princeton University.

Venu Kanamatareddy is an AI/ML Solutions Architect at AWS, supporting AI-driven startups in building and scaling innovative solutions. He provides strategic and technical guidance across the AI lifecycle, from model development to MLOps and generative AI. With experience across startups and large enterprises, he brings deep expertise in cloud architecture and AI solutions. Venu holds a degree in computer science and a master's in artificial intelligence from Liverpool John Moores University.
