Actual-time voice brokers with Stream Imaginative and prescient Brokers and Amazon Nova 2 Sonic

This publish was co-authored with Neevash Ramdial, Technical Advertising chief at Stream

Constructing production-grade voice brokers that really feel pure and responsive is a posh engineering problem. You should orchestrate speech-to-speech fashions, handle low-latency audio streaming, and deal with connection lifecycle. You additionally must ship constant experiences throughout internet, cell, and desktop purposes.

On this publish, you learn to mix Stream’s Imaginative and prescient Brokers open-source framework with Amazon Bedrock and Amazon Nova 2 Sonic to construct real-time voice brokers that may be production-ready in minutes. You’ll learn the way the mixing works underneath the hood, stroll by way of code examples, and discover superior capabilities like operate calling, automated reconnection, and multilingual voice assist.

The problem

Constructing voice-enabled AI purposes requires orchestrating a number of complicated techniques that should work collectively reliably. You face the problem of managing real-time audio streaming infrastructure whereas concurrently integrating speech recognition, language fashions, and text-to-speech providers. Every of those has its personal latency traits and failure modes. A typical voice interplay entails capturing audio from the person’s microphone, streaming it to a speech-to-text service, processing the transcript by way of a language mannequin, producing a response, changing that response again to speech, and delivering it to the person. All of this should occur inside a window of some hundred milliseconds to really feel pure. Delays on this pipeline can break the conversational stream and frustrate customers.Past the core AI pipeline, manufacturing voice purposes should deal with the messy realities of real-world deployment: unreliable community connections, browser compatibility points, session timeouts, and sleek degradation when providers turn into unavailable. You usually spend extra time constructing reconnection logic, managing WebRTC connections, and dealing with edge instances than on the precise AI capabilities. This infrastructure burden means groups both make investments months constructing customized options or accept restricted off-the-shelf merchandise that don’t meet their particular wants. Imaginative and prescient Brokers abstracts the infrastructure complexity whereas offering the pliability to customise the AI expertise.

Resolution overview

The answer brings collectively three key parts:

Amazon Nova 2 Sonic a speech-to-speech basis mannequin accessible by way of Amazon Bedrock that gives real-time bidirectional audio streaming, native flip detection, and performance calling capabilities. Nova 2 Sonic handles the complete speech-to-speech pipeline, accepting audio enter and producing audio output. This avoids the necessity for separate STT and TTS providers.
Stream’s Vision Agents an open-source Python framework for constructing real-time voice and video AI brokers. It gives a plugin-based structure with 25+ integrations, manufacturing deployment tooling, and consumer SDKs for React, iOS, Android, Flutter, and React Native. The system is designed with flexibility at its core. You should use Stream’s world edge community for environment friendly efficiency or combine your most well-liked real-time communication (RTC) supplier. Imaginative and prescient Brokers handles provider-specific specs by way of a clear decorator-based interface, enabling use instances like buyer assist brokers, workflow automation, and API-driven actions with minimal boilerplate code. With Imaginative and prescient Brokers, you’ll be able to construct AI purposes utilizing an open-source framework, third-party mannequin suppliers, and telephony providers.
Stream’s Edge Community a globally distributed edge community that usually delivers sub-500ms be a part of occasions and underneath 30ms audio latency, offering the real-time transport layer between purchasers and your agent backend.

Collectively, these parts create an entire stack: Stream handles the real-time media transport and client-side expertise, Amazon Nova 2 Sonic gives the AI intelligence, and Imaginative and prescient Brokers gives the glue code that ties them collectively.

Structure overview

The system is designed round a clear separation of issues: Stream’s infrastructure handles the real-time media transport and consumer connectivity, whereas Amazon Nova Sonic runs within the buyer’s personal AWS account and gives the AI intelligence. This separation helps maintain delicate information and enterprise logic stay inside the buyer’s management, whereas Stream’s globally distributed edge community delivers the low-latency media expertise customers anticipate.Stream’s edge community acts because the media dealer between end-user units and the Imaginative and prescient Agent employee processes. When a person speaks, audio is captured, encrypted, and transmitted as RTP over UDP to the closest Stream SFU (Selective Forwarding Unit). The SFU terminates the WebRTC connection, handles NAT traversal and bandwidth estimation, and forwards audio tracks to the Imaginative and prescient Agent employee as if it have been one other name participant. This implies the agent integrates naturally into the decision mannequin. The agent is one other peer, receiving and sending audio by way of the identical infrastructure utilized by human contributors.

Audio information flows bidirectionally by way of the system: incoming speech from the person is decoded to uncooked PCM by the Imaginative and prescient Agent employee, streamed to Amazon Nova Sonic through the Bedrock real-time API, and response audio frames from Nova Sonic are re-encoded, packetized as RTP, and delivered again by way of the SFU to the consumer system. Finish-to-end latency is usually underneath 500 milliseconds. Voice exercise detection (VAD) runs within the employee to detect speech boundaries and barge-in occasions, whereas echo cancellation within the browser helps stop the agent’s personal output from re-triggering the VAD loop.

Account boundaries

Buyer AWS account
- Enterprise logic and orchestration (agent insurance policies, instruments, information entry).
- Amazon Bedrock integration to entry Amazon Nova fashions.
Stream AWS account
- World WebRTC/SFU media airplane, TURN/STUN, and signaling.
- Imaginative and prescient Agent runtime (employee processes) that terminate WebRTC as robotic friends and bridge the client’s Amazon Bedrock integration.

Finish-to-end media stream

Person joins from internet or cell.
- The app embeds Stream’s audio consumer SDK, requests mic (and optionally digital camera), and joins a name kind configured for AI participation.
- Media is distributed as RTP over UDP for predictable low latency and head‑of‑line–free supply. 2. Regional SFU termination
Regional SFU termination
- A Stream SFU node within the closest area terminates the person’s WebRTC connection, dealing with bandwidth estimation, simulcast, and NAT traversal.
- The SFU forwards the related audio tracks to the Imaginative and prescient Agent employee as if it have been one other participant.
Imaginative and prescient Agent employee
- A devoted Imaginative and prescient Agent employee course of holds the PeerConnection state for that session.
- It decodes audio to uncooked PCM and the employee streams PCM frames to Amazon Bedrock service forwarding to Amazon Nova 2 Sonic as a real-time session within the buyer’s AWS account.
Amazon Nova 2 Sonic integration with Imaginative and prescient brokers by way of Amazon Bedrock
- Amazon Nova 2 Sonic detects speech boundaries and performs speech-to-speech modeling (understanding, reasoning, and TTS) with elective instrument calls into buyer techniques (RDS, APIs, data bases).
- It gracefully handles barge-in and maintains full conversational context in order that the dialog stays pure and coherent.
Streaming response again to the person
- As Amazon Nova Sonic produces response audio frames, the Imaginative and prescient Agent employee:
  1. Slices and wraps them in RTP with monotonically rising timestamps to keep away from gaps/drifts
  2. Sends RTP packets by way of the identical WebRTC session through the SFU. The browser’s WebRTC stack decodes and performs audio with sub-500 ms latency.
Barge-in, transcripts, and facet information
- Echo cancellation within the browser helps stop the agent’s personal output from retriggering VAD.
- When the person interrupts, new speech triggers an interrupt sign over an RTCDataChannel, inflicting the employee to cease forwarding Amazon Nova Sonic output and reset its native buffer.

This structure may appear complicated, however Imaginative and prescient Brokers abstracts a lot of this complexity. Let’s see what the precise code seems to be like:

Stipulations

Earlier than getting began, be sure to have the next:

AWS credentials configured through surroundings variables, IAM function, or AWS Command Line Interface (AWS CLI) profile. For manufacturing environments, use IAM roles hooked up to your compute sources as a substitute of long-term credentials. For native growth, use AWS CLI profiles (aws configure) or AWS SSO. Don’t commit .env recordsdata containing credentials to model management.
Stream account with an Audio API key and secret (you might be anticipated to obtain 333,000 participant minutes per thirty days at no further price).
Python 3.12 or later put in.
uv package supervisor put in (pip set up uv).
Imaginative and prescient Brokers put in (uv add vision-agents)

Getting began

Step 1: Create a brand new challenge listing and set up Imaginative and prescient Brokers with the AWS plugin

mkdir voice-agent
cd voice-agentuv inituv add "vision-agents[getstream,aws]"
python-dotenv

The vision-agents[aws] further installs the Amazon Bedrock plugin together with its dependencies, together with boto3, aws-sdk-bedrock-runtime, and Silero VAD for voice exercise detection.

Step 2: Configure surroundings variables

Create a “.env” file in your challenge root to handle your configuration. For AWS credentials, we suggest pointing to your AWS_PROFILE on this file so the appliance can entry your credentials when interacting with AWS sources. We don’t suggest storing your AWS entry keys instantly on this file.

For Stream API credentials, you should use a third-party library like HashiCorp Vault or AWS Secrets and techniques Supervisor, however safety issues usually are not within the scope of this publish.

# Stream API credentials
STREAM_API_KEY=check/geststream/api_key
STREAM_API_SECRET=check/getstream/api_secret
# AWS credentials
AWS_PROFILE=your_aws_profile_name
AWS_REGION=us-east-1

Imaginative and prescient Brokers mechanically discovers these surroundings variables at startup, so that you don’t must move them explicitly to every consumer.

Step 3: Construct your first voice agent

Create a important.py file with the next code:

import asyncio
from dotenv import load_dotenv

from vision_agents.core import Agent, Person, Runner
from vision_agents.core.brokers import AgentLauncher
from vision_agents.plugins import aws, getstream

load_dotenv()

async def create_agent(**kwargs) -> Agent:
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=Person(title="Useful Assistant", id="agent"),
        directions="You're a useful voice assistant. Be concise and pleasant.",
        llm=aws.Realtime(
            mannequin="amazon.nova-2-sonic-v1:0",
            region_name="us-east-1",
            voice_id="matthew",
        ),
    )
    return agent

async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    name = await agent.create_call(call_type, call_id)

    async with agent.be a part of(name):
        await asyncio.sleep(2)
        await agent.llm.simple_response(
            textual content="Greet the person warmly and ask how one can assist."
        )
        await agent.end()  # Run till the decision ends

if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

Step 4: Run the voice agent

Run the agent:

uv run important.py run

In fewer than 30 strains of code, you will have a totally useful, real-time voice agent powered by Amazon Nova Sonic, accessible from a Stream consumer SDK.

Understanding the Amazon Bedrock integration

Let’s take a more in-depth have a look at how the aws.Realtime plugin works underneath the hood.

Bidirectional streaming with Amazon Nova 2 Sonic

Amazon Nova 2 Sonic makes use of an event-driven bidirectional streaming API. As an alternative of utilizing a request-response sample, this method permits almost steady audio to stream in each instructions concurrently. The Imaginative and prescient Brokers AWS plugin manages this complexity by way of a structured occasion sequence:

Session initialization – A sessionStart occasion is distributed with inference configuration (temperature, max tokens, top-p).
Immediate setup – A promptStart occasion configures the audio output format (24kHz PCM), voice choice, and power definitions.
System directions – System directions are despatched as a textual content content material block with the SYSTEM function.
Audio streaming – Microphone audio frames (~32ms every) are streamed as audioInput occasions.
Response streaming – Nova Sonic streams again audioOutput occasions with the generated speech.
Session teardown – promptEnd and sessionEnd occasions cleanly shut the connection.

Every content material block follows a three-part sample: contentStart → content material payload → contentEnd. This hierarchical construction permits the mannequin to keep up correct context all through the interplay.

Here’s what the session begin occasion seems to be like within the plugin:

def _create_session_start_event(self) -> Dict[str, Any]:
    return {
        "occasion": {
            "sessionStart": {
                "inferenceConfiguration": {
                    "maxTokens": 1024,
                    "topP": 0.9,
                    "temperature": 0.7,
                }
            },
            "turnDetectionConfiguration": {
                "endpointingSensitivity": "MEDIUM"
            },
        }
    }

Including operate calling

One of many key capabilities of Amazon Nova 2 Sonic is native operate calling throughout real-time conversations. This enables your voice agent to carry out actions like querying databases, calling APIs, and triggering workflows whereas sustaining a pure spoken dialog.Use the @llm.register_function decorator to outline features the mannequin can name:

import asyncio
from dotenv import load_dotenv
from typing import Dict, Any

from vision_agents.core import Agent, Person, Runner
from vision_agents.core.brokers import AgentLauncher
from vision_agents.plugins import aws, getstream

load_dotenv()

async def create_agent(**kwargs) -> Agent:
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=Person(title="Climate Assistant", id="agent"),
        directions="""You're a useful climate assistant. When customers ask
        about climate, use the get_weather operate to fetch present circumstances.
        You can too assist with easy calculations.""",
        llm=aws.Realtime(
            mannequin="amazon.nova-2-sonic-v1:0",
            region_name="us-east-1",
        ),
    )

    @agent.llm.register_function(
        title="get_weather",
        description="Get the present climate for a given metropolis"
    )
    async def get_weather(location: str) -> Dict[str, Any]:
        # In manufacturing, name an actual climate API
        return {
            "metropolis": location,
            "temperature": 72,
            "situation": "Sunny",
            "humidity": "45%"
        }

    @agent.llm.register_function(
        title="calculate",
        description="Carry out a mathematical calculation"
    )
    def calculate(operation: str, a: float, b: float) -> dict:
        operations = {
            "add": lambda x, y: x + y,
            "subtract": lambda x, y: x - y,
            "multiply": lambda x, y: x * y,
            "divide": lambda x, y: x / y if y != 0 else None,
        }
        outcome = operations.get(operation, lambda x, y: None)(a, b)
        return {"operation": operation, "a": a, "b": b, "outcome": outcome}

    return agent

async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    await agent.create_user()
    name = await agent.create_call(call_type, call_id)

    async with agent.be a part of(name):
        await asyncio.sleep(2)
        await agent.llm.simple_response(
            textual content="Greet the person and allow them to know you'll be able to examine the climate."
        )
        await agent.end()

if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

How operate calling works with Amazon Nova 2 Sonic

When the mannequin decides to invoke a operate, the next sequence happens:

Nova 2 Sonic emits a toolUse occasion containing the operate title and arguments.
The Imaginative and prescient Brokers plugin intercepts this occasion, deserializes the arguments, and runs the registered Python operate.
The result’s despatched again to Nova through a toolResult occasion, wrapped in the usual contentStart → toolResult → contentEnd sample.
Nova 2 Sonic incorporates the operate outcome into its response and continues the spoken dialog naturally.

You possibly can construct complicated, multi-step workflows with this method. For instance, a voice agent might lookup a buyer document, examine stock, and place an order, all inside a single pure dialog.

Utilizing the usual LLM with Amazon Bedrock

Past real-time speech-to-speech, the AWS plugin additionally gives a normal LLM integration through aws.LLM. That is helpful for customized pipeline architectures the place you need to pair an Amazon Bedrock mannequin with separate STT and TTS suppliers:

from vision_agents.core import Agent, Person
from vision_agents.plugins import aws, getstream, cartesia, deepgram, smart_turn

agent = Agent(
    edge=getstream.Edge(),
    agent_user=Person(title="Customized Pipeline Agent"),
    directions="Be useful and concise.",
    llm=aws.LLM(
        mannequin="anthropic.claude-3-haiku-20240307-v1:0",
        region_name="us-east-1"
    ),
    tts=cartesia.TTS(),
    stt=deepgram.STT(),
    turn_detection=smart_turn.TurnDetection(
        buffer_duration=2.0,
        confidence_threshold=0.5
    ),
)

The usual LLM helps streaming responses through converse_stream(), full dialog historical past administration, imaginative and prescient inputs for fashions like Claude, and multi-round instrument calling with as much as 3 rounds of operate execution per request.

Textual content-to-speech with Amazon Polly

For customized pipeline architectures, the AWS plugin additionally consists of an Amazon Polly TTS integration. That is helpful while you’re utilizing a non-realtime LLM (like Claude on Amazon Bedrock or one other supplier) and wish high-quality voice synthesis:

from vision_agents.plugins import aws

tts = aws.TTS(
    region_name="us-east-1",
    voice_id="Joanna",
    engine="neural",        # 'normal' or 'neural'
    language_code="en-US"
)

Amazon Polly TTS helps each normal and neural engines, SSML enter for fine-grained speech management, and a number of languages and voices. The neural engine produces extra natural-sounding speech, making it a powerful selection while you’re constructing a customized STT → LLM → TTS pipeline on AWS infrastructure.

Clear up sources

To delete the Stream name and terminate working Imaginative and prescient Agent processes:

uv run important.py cease

Vital: Amazon Bedrock expenses apply for all API calls to Amazon Nova 2 Sonic. You possibly can run the cleanup command to terminate classes and keep away from ongoing expenses. Lively classes could proceed to incur prices till explicitly terminated.

Use instances

With the technical basis in place, it’s price exploring the place these capabilities translate into significant real-world impression. The mixture of low-latency voice, dialog administration, and power integration opens up a variety of purposes throughout industries the place pure spoken interplay can exchange or increase conventional interfaces.

Use case 1: Voice interfaces for no-screen and low-attention environments

Imaginative and prescient Brokers mixed with Amazon Nova 2 Sonic is effectively suited to environments the place customers can not reliably work together with a display screen, equivalent to driving, area service, logistics, healthcare, or on-site operations.In these contexts, voice turns into the first interface, not a comfort characteristic.

With Amazon Nova 2 Sonic, you get real-time, speech-to-speech interactions with low latency and pure turn-taking, permitting customers to talk freely, interrupt responses, and proper themselves with out breaking the stream.
Imaginative and prescient Brokers manages dialog state and job logic throughout turns, translating spoken enter into structured actions like retrieving the subsequent job task, updating job standing, logging notes, or requesting human help.

As a result of the agent maintains context all through the interplay, customers can concern follow-up instructions or clarifications with out repeating data.For instance, a supply driver can ask, “What’s my subsequent cease?” obtain spoken instructions, say “Mark the final supply as full,” after which comply with up with “Name dispatch,” all with out touching a display screen, whereas the agent updates backend techniques in actual time.

Use case 2: Excessive-volume inbound cellphone assist at scale

Imaginative and prescient Brokers mixed with Amazon Nova 2 Sonic is designed for dealing with giant volumes of inbound assist calls the place human brokers turn into a bottleneck. This use case is basically about scale: decreasing queue occasions, deflecting repetitive requests, and reserving human brokers for instances that require their involvement.

With Amazon Nova 2 Sonic, callers can have low-latency, real-time speech-to-speech conversations that enable callers to elucidate points naturally as a substitute of navigating scripted IVR bushes.
Imaginative and prescient Brokers orchestrates intent detection, dialog state, and backend integrations, equivalent to order techniques, account data, or ticketing providers, so widespread requests may be resolved mechanically inside the name.

When a problem exceeds predefined confidence thresholds or requires guide intervention, the agent escalates to a human with structured context hooked up, assuaging the necessity for callers to repeat themselves.Throughout peak hours, lots of of shoppers may name asking about supply delays. As an alternative of ready in a queue, callers are instantly answered by a voice agent that checks order standing, explains the delay, gives subsequent steps, and solely routes to a reside agent if an exception is detected.This turns the cellphone system from a queue-based price heart right into a steady, first-line decision layer.

Conclusion

This publish walked by way of how you can construct real-time voice brokers utilizing Stream’s Imaginative and prescient Brokers framework and Amazon Bedrock with Amazon Nova 2 Sonic. We lined the structure, the bidirectional streaming protocol, automated reconnection dealing with, operate calling, multilingual assist, and manufacturing deployment.The mixture of Stream’s low-latency edge community and Amazon Nova Sonic’s native speech-to-speech capabilities gives a strong basis for constructing voice AI purposes. The Imaginative and prescient Brokers framework abstracts the complicated orchestration of connection lifecycle administration, audio encoding, VAD-aware reconnection, and power execution, so you’ll be able to focus in your agent’s logic and person expertise.

In case you’re able to discover additional, we encourage you to strive extending your agent with customized features on your particular use case or discover the multilingual capabilities for world purposes. The Imaginative and prescient Brokers repository at https://github.com/GetStream/Vision-Agents is an efficient place to begin. You’ll discover further examples, plugin documentation, and group discussions. For deeper integration particulars, the AWS plugin documentation is accessible at https://visionagents.ai/integrations/aws-bedrock, and the Amazon Nova 2 Sonic documentation within the AWS Nova Person Information gives a complete reference for the bidirectional streaming API. You possibly can join a Stream developer account at https://getstream.io/ and begin constructing for in the present day at no further price.

Actual-time voice brokers with Stream Imaginative and prescient Brokers and Amazon Nova 2 Sonic

The problem

Resolution overview

Structure overview

Account boundaries

Finish-to-end media stream

Stipulations

Getting began

Step 1: Create a brand new challenge listing and set up Imaginative and prescient Brokers with the AWS plugin

Step 2: Configure surroundings variables

Step 3: Construct your first voice agent

Step 4: Run the voice agent

Understanding the Amazon Bedrock integration

Bidirectional streaming with Amazon Nova 2 Sonic

Including operate calling

How operate calling works with Amazon Nova 2 Sonic

Utilizing the usual LLM with Amazon Bedrock

Textual content-to-speech with Amazon Polly

Clear up sources

Use instances

Use case 1: Voice interfaces for no-screen and low-attention environments

Use case 2: Excessive-volume inbound cellphone assist at scale

Conclusion

In regards to the authors

China, US, UAE workforce up in uncommon Dubai cryptocurrency rip-off raid

Himalayan wolf-dog and wolf-dog hybrids emerge as a risk to wolves and people

Converter

Editors Pick