Evaluating single-turn agent interactions follows a sample that almost all groups perceive effectively. You present an enter, accumulate the output, and choose the consequence. Frameworks like Strands Evaluation SDK make this course of systematic by way of evaluators that assess helpfulness, faithfulness, and tool usage. In a earlier weblog submit, we lined easy methods to construct complete analysis suites for AI brokers utilizing these capabilities. Nonetheless, manufacturing conversations not often cease at one flip.
Actual customers have interaction in exchanges that unfold over a number of turns. They ask follow-up questions when solutions are incomplete, change course when new data surfaces, and categorical frustration when their wants go unmet. A journey assistant that handles “E book me a flight to Paris” effectively in isolation would possibly wrestle when the identical consumer follows up with “Truly, can we have a look at trains as a substitute?” or “What about motels close to the Eiffel Tower?” Testing these dynamic patterns requires greater than static check circumstances with fastened inputs and anticipated outputs.
The core problem is scale as a result of you’ll be able to’t manually conduct a whole bunch of multi-turn conversations each time your agent adjustments, and writing scripted dialog flows locks you into predetermined paths that miss how actual customers behave. What analysis groups want is a strategy to generate life like, goal-driven customers programmatically and allow them to converse naturally with an agent throughout a number of turns. On this submit, we discover how ActorSimulator in Strands Evaluations SDK addresses this problem with structured consumer simulation that integrates into your analysis pipeline.
Why multi-turn analysis is essentially tougher
Single-turn analysis has an easy construction. The enter is thought forward of time, the output is self-contained, and the analysis context is proscribed to that single change. Multi-turn conversations break each one among these assumptions.
In a multi-turn interplay, every message is dependent upon all the pieces that got here earlier than it. The consumer’s second query is formed by how the agent answered the primary. A partial reply attracts a follow-up about no matter was disregarded, a misunderstanding leads the consumer to restate their unique request, and a shocking suggestion can ship the dialog in a brand new course.
These adaptive behaviors create dialog paths that may’t be predicted at test-design time. A static dataset of I/O pairs, irrespective of how massive, can’t seize this dynamic high quality as a result of the “appropriate” subsequent consumer message is dependent upon what the agent simply mentioned.
Handbook testing covers this hole in concept however fails in observe. Testers can conduct life like multi-turn conversations, however doing so for each situation, throughout each persona kind, after each agent change will not be sustainable. Because the agent’s capabilities develop, the variety of dialog paths grows combinatorially, effectively past what groups can discover manually.
Some groups flip to immediate engineering as a shortcut, asking a big language mannequin (LLM) to “act like a consumer” throughout testing. With out structured persona definitions and express aim monitoring, these approaches produce inconsistent outcomes. The simulated consumer’s habits drifts between runs, making it troublesome to match evaluations over time or determine real regressions versus random variation. A structured strategy to consumer simulation can bridge this hole by combining the realism of human dialog with the repeatability and scale of automated testing.
What makes simulated consumer
Simulation-based testing is effectively established in different engineering disciplines. Flight simulators check pilot responses to situations that might be harmful or unimaginable to breed in the actual world. Sport engines use AI-driven brokers to discover thousands and thousands of participant habits paths earlier than launch. The identical precept applies to conversational AI. You create a managed surroundings the place life like actors work together together with your system beneath circumstances you outline, then measure the outcomes.
For AI agent analysis, a helpful simulated consumer begins with a constant persona. One which behaves like a technical knowledgeable in a single flip and a confused novice within the subsequent produces unreliable analysis knowledge. Consistency means to keep up the identical communication model, experience stage, and persona traits by way of each change, simply as an actual individual would.
Equally essential is goal-driven habits. Actual customers come to an agent with one thing they need to accomplish. They persist till they obtain it, modify their strategy when one thing will not be working, and acknowledge when their aim has been met. With out express targets, a simulated consumer tends to both finish conversations too early or proceed asking questions indefinitely, neither of which displays actual utilization.
The simulated consumer should additionally reply adaptively to what the agent says, not comply with a predetermined script. When the agent asks a clarifying query, the actor ought to reply it in character. If the response is incomplete, the actor follows up on no matter was disregarded reasonably than transferring on. If the dialog drifts off subject, the actor steers it again towards the unique aim. These adaptive behaviors make simulated conversations beneficial as analysis knowledge as a result of they train the identical dialog dynamics your agent faces in manufacturing.
Constructing persona consistency, aim monitoring, and adaptive habits right into a simulation framework is what differentiates structured consumer simulation from ad-hoc prompting. ActorSimulator in Strands Evals is designed round precisely these ideas.
How ActorSimulator works

ActorSimulator implements these simulation qualities by way of a system that wraps a Strands Agent configured to behave as a practical consumer persona. The method begins with profile technology. Given a check case containing an enter question and an non-compulsory activity description, ActorSimulator makes use of an LLM to create a whole actor profile. A check case with enter “I need assistance reserving a flight to Paris” and activity description “Full flight reserving beneath price range” would possibly produce a budget-conscious traveler with beginner-level expertise and an informal communication model. Profile technology provides every simulated dialog a definite, constant character.
With the profile established, the simulator manages the dialog flip by flip. It maintains the complete dialog historical past and generates every response in context, conserving the simulated consumer’s habits aligned with their profile and targets all through. When your agent addresses solely a part of the request, the simulated consumer naturally follows up on the gaps. A clarifying query out of your agent will get a response that stays according to the persona. The dialog feels natural as a result of each response displays each the actor’s persona and all the pieces mentioned thus far.
Objective monitoring runs alongside the dialog. ActorSimulator features a built-in aim completion evaluation device that the simulated consumer can invoke to guage whether or not their unique goal has been met. When the aim is happy or the simulated consumer determines that the agent can not full their request, the simulator emits a cease sign and the dialog ends. If the utmost flip rely is reached earlier than the aim is met, the dialog additionally stops. This provides you a sign that the agent won’t be resolving consumer wants effectively. This mechanism makes positive conversations have a pure endpoint reasonably than operating indefinitely or slicing off arbitrarily.
Every response from the simulated consumer additionally contains structured reasoning alongside the message textual content. You’ll be able to examine why the simulated consumer selected to say what they mentioned, whether or not they had been following up on lacking data, expressing confusion, or redirecting the dialog. This transparency is effective throughout analysis improvement as a result of you’ll be able to see the reasoning behind every flip, making it extra easy to hint the place conversations succeed or go off observe.
Getting began with ActorSimulator
To get began, you will want to put in the Strands Analysis SDK utilizing: pip set up strands-agents-evals. For a step-by-step setup, you’ll be able to discuss with our documentation or our earlier weblog for extra particulars. Placing these ideas into observe requires minimal code. You outline a check case with an enter question and a activity description that captures the consumer’s aim. ActorSimulator handles profile technology, dialog administration, and aim monitoring mechanically.
The next instance evaluates a journey assistant agent by way of a multi-turn simulated dialog.
from strands import Agent
from strands_evals import ActorSimulator, Case, Experiment
# Outline your check case
case = Case(
enter="I need to plan a visit to Tokyo with resort and actions",
metadata={"task_description": "Full journey bundle organized"}
)
# Create the agent you need to consider
agent = Agent(
system_prompt="You're a useful journey assistant.",
callback_handler=None
)
# Create consumer simulator from check case
user_sim = ActorSimulator.from_case_for_user_simulator(
case=case,
max_turns=5
)
# Run the multi-turn dialog
user_message = case.enter
conversation_history = []
whereas user_sim.has_next():
# Agent responds to consumer
agent_response = agent(user_message)
agent_message = str(agent_response)
conversation_history.append({
"function": "assistant",
"content material": agent_message
})
# Simulator generates subsequent consumer message
user_result = user_sim.act(agent_message)
user_message = str(user_result.structured_output.message)
conversation_history.append({
"function": "consumer",
"content material": user_message
})
print(f"Dialog accomplished in {len(conversation_history) // 2} turns")
The dialog loop continues till has_next() returns False, which occurs when the simulated consumer’s targets are met or simulated consumer determines that the agent can not full the request or the utmost flip restrict is reached. The ensuing conversation_history accommodates the complete multi-turn transcript, prepared for analysis.
Integration with analysis pipelines

A standalone dialog loop is helpful for fast experiments, however manufacturing analysis requires capturing traces and feeding them into your evaluator pipeline. The subsequent instance combines ActorSimulator with OpenTelemetry telemetry collection and Strands Evals session mapping. The duty perform runs a simulated dialog and collects spans from every flip, then maps them right into a structured session for analysis.
from opentelemetry.sdk.hint.export import BatchSpanProcessor
from opentelemetry.sdk.hint.export.in_memory_span_exporter import InMemorySpanExporter
from strands import Agent
from strands_evals import ActorSimulator, Case, Experiment
from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.telemetry import StrandsEvalsTelemetry
from strands_evals.mappers import StrandsInMemorySessionMapper
# Setup telemetry for capturing agent traces
telemetry = StrandsEvalsTelemetry()
memory_exporter = InMemorySpanExporter()
span_processor = BatchSpanProcessor(memory_exporter)
telemetry.tracer_provider.add_span_processor(span_processor)
def evaluation_task(case: Case) -> dict:
# Create simulator
user_sim = ActorSimulator.from_case_for_user_simulator(
case=case,
max_turns=3
)
# Create agent
agent = Agent(
system_prompt="You're a useful journey assistant.",
callback_handler=None
)
# Accumulate spans throughout dialog
all_target_spans = []
user_message = case.enter
whereas user_sim.has_next():
memory_exporter.clear()
agent_response = agent(user_message)
agent_message = str(agent_response)
# Seize telemetry
turn_spans = listing(memory_exporter.get_finished_spans())
all_target_spans.lengthen(turn_spans)
# Generate subsequent consumer message
user_result = user_sim.act(agent_message)
user_message = str(user_result.structured_output.message)
# Map to session for analysis
mapper = StrandsInMemorySessionMapper()
session = mapper.map_to_session(
all_target_spans,
session_id="test-session"
)
return {"output": agent_message, "trajectory": session}
# Create analysis dataset
test_cases = [
Case(
name="booking-simple",
input="I need to book a flight to Paris next week",
metadata={
"category": "booking",
"task_description": "Flight booking confirmed"
}
)
]
evaluator = HelpfulnessEvaluator()
dataset = Experiment(circumstances=test_cases, evaluator=evaluator)
# Run evaluations
report = Experiment.run_evaluations(evaluation_task)
report.run_display()
This strategy captures full traces of your agent’s habits throughout dialog turns. The spans embody device calls, mannequin invocations, and timing data for each flip within the simulated dialog. By mapping these spans right into a structured session, you make the complete multi-turn interplay obtainable to evaluators like GoalSuccessRateEvaluator and HelpfulnessEvaluator, which might then assess the dialog as a complete, reasonably than remoted turns.
Customized actor profiles for focused testing
Computerized profile technology covers most analysis situations effectively, however some testing targets require particular personas. You would possibly need to confirm that your agent handles an impatient knowledgeable consumer in another way from a affected person newbie, or that it responds appropriately to a consumer with domain-specific wants. For these circumstances, ActorSimulator accepts a completely outlined actor profile that you simply management.
from strands_evals.varieties.simulation import ActorProfile
from strands_evals import ActorSimulator
from strands_evals.simulation.prompt_templates.actor_system_prompt import (
DEFAULT_USER_SIMULATOR_PROMPT_TEMPLATE
)
# Outline a customized actor profile
actor_profile = ActorProfile(
traits={
"persona": "analytical and detail-oriented",
"communication_style": "direct and technical",
"expertise_level": "knowledgeable",
"patience_level": "low"
},
context="Skilled enterprise traveler with elite standing who values effectivity",
actor_goal="E book enterprise class flight with particular seat preferences and lounge entry"
)
# Initialize simulator with customized profile
user_sim = ActorSimulator(
actor_profile=actor_profile,
initial_query="I have to ebook a enterprise class flight to London subsequent Tuesday",
system_prompt_template=DEFAULT_USER_SIMULATOR_PROMPT_TEMPLATE,
max_turns=10
)
By defining traits like endurance stage, communication model, and experience, you’ll be able to systematically check how your agent performs throughout completely different consumer segments. An agent that scores effectively with affected person, non-technical customers however poorly with impatient consultants reveals a particular high quality hole you could tackle. Operating the identical aim throughout a number of persona configurations turns consumer simulation right into a device for understanding your agent’s strengths and weaknesses by consumer kind.
Greatest practices for simulation-based analysis
These finest practices aid you get probably the most out of simulation-based analysis:
- Set
max_turnsprimarily based on activity complexity, utilizing 3-5 for centered duties and 8-10 for multi-step workflows. If most conversations attain the restrict with out finishing the aim, improve it. - Write particular activity descriptions that the simulator can consider towards. “Assist the consumer ebook a flight” is just too obscure to evaluate completion reliably, whereas “flight reserving confirmed with dates, vacation spot, and value” provides a concrete goal.
- Use auto-generated profiles for broad protection throughout consumer varieties and customized profiles to breed particular patterns out of your manufacturing logs, equivalent to an impatient knowledgeable or a first-time consumer.
- Concentrate on patterns throughout your check suite reasonably than particular person transcripts. Constant redirects from the simulated consumer means that the agent is drifting off subject, and declining aim completion charges after an agent change factors to a regression.
- Begin with a small set of check circumstances overlaying your most typical situations and develop to edge circumstances and extra personas as your analysis observe matures.
Conclusion
We confirmed how ActorSimulator in Strands Evals permits systematic, multi-turn analysis of conversational AI brokers by way of life like consumer simulation. Somewhat than counting on static check circumstances that seize solely single exchanges, you’ll be able to outline targets and personas and let simulated customers work together together with your agent throughout pure, adaptive conversations. The ensuing transcripts feed straight into the identical analysis pipeline that you simply use for single-turn testing, providing you with helpfulness scores, aim success charges, and detailed traces throughout each dialog flip.
To get began, discover the working examples within the Strands Agents samples repository. For groups evaluating brokers deployed by way of Amazon Bedrock AgentCore, the next AgentCore evaluations sample exhibit easy methods to simulate interactions with deployed brokers. Begin with a handful of check circumstances representing your most typical consumer situations, run them by way of ActorSimulator, and consider the outcomes. As your analysis observe matures, develop to cowl extra personas, edge circumstances, and dialog patterns.
In regards to the authors

