Agentic AI from First Ideas: Reflection

says that “any sufficiently superior know-how is indistinguishable from magic”. That’s precisely how a whole lot of immediately’s AI frameworks really feel. Instruments like GitHub Copilot, Claude Desktop, OpenAI Operator, and Perplexity Comet are automating on a regular basis duties that may’ve appeared inconceivable to automate simply 5 years in the past. What’s much more outstanding is that with only a few strains of code, we are able to construct our personal subtle AI instruments: ones that search via information, browse the net, click on hyperlinks, and even make purchases. It actually does really feel like magic.

Regardless that I genuinely imagine in information wizards, I don’t imagine in magic. I discover it thrilling (and sometimes useful) to know how issues are literally constructed and what’s taking place beneath the hood. That’s why I’ve determined to share a collection of posts on agentic AI design ideas that’ll assist you perceive how all these magical instruments really work.

To realize a deep understanding, we’ll construct a multi-AI agent system from scratch. We’ll keep away from utilizing frameworks like CrewAI or smolagents and as a substitute work straight with the inspiration mannequin API. Alongside the best way, we’ll discover the elemental agentic design patterns: reflection, instrument use, planning, and multi-agent setups. Then, we’ll mix all this data to construct a multi-AI agent system that may reply complicated data-related questions.

As Richard Feynman put it, “What I can not create, I don’t perceive.” So let’s begin constructing! On this article, we’ll concentrate on the reflection design sample. However first, let’s determine what precisely reflection is.

What reflection is

Let’s mirror on how we (people) often work on duties. Think about I have to share the outcomes of a latest function launch with my PM. I’ll seemingly put collectively a fast draft after which learn it a few times from starting to finish, making certain that every one elements are constant, there’s sufficient data, and there are not any typos.

Or let’s take one other instance: writing a SQL question. I’ll both write it step-by-step, checking the intermediate outcomes alongside the best way, or (if it’s easy sufficient) I’ll draft it suddenly, execute it, have a look at the end result (checking for errors or whether or not the end result matches my expectations), after which tweak the question primarily based on that suggestions. I’d rerun it, verify the end result, and iterate till it’s proper.

So we hardly ever write lengthy texts from high to backside in a single go. We often circle again, evaluate, and tweak as we go. These suggestions loops are what assist us enhance the standard of our work.

Picture by creator

LLMs use a distinct method. In the event you ask an LLM a query, by default, it can generate a solution token by token, and the LLM gained’t have the ability to evaluate its end result and repair any points. However in an agentic AI setup, we are able to create suggestions loops for LLMs too, both by asking the LLM to evaluate and enhance its personal reply or by sharing exterior suggestions with it (just like the outcomes of a SQL execution). And that’s the entire level of reflection. It sounds fairly simple, however it could yield considerably higher outcomes.

There’s a considerable physique of analysis exhibiting the advantages of reflection:

Picture from “Self-Refine: Iterative Refinement with Self-Feedback,” Madaan et al.

In “Reflexion: Language Agents with Verbal Reinforcement Learning” Shinn et al. (2023), the authors achieved a 91% move@1 accuracy on the HumanEval coding benchmark, surpassing the earlier state-of-the-art GPT-4, which scored simply 80%. In addition they discovered that Reflexion considerably outperforms all baseline approaches on the HotPotQA benchmark (a Wikipedia-based Q&A dataset that challenges brokers to parse content material and purpose over a number of supporting paperwork).

Picture from “Reflexion: Language Agents with Verbal Reinforcement Learning,” Shinn et al.

Reflection is very impactful in agentic techniques as a result of it may be used to course-correct at many steps of the method:

When a person asks a query, the LLM can use reflection to judge whether or not the request is possible.
When the LLM places collectively an preliminary plan, it could use reflection to double-check whether or not the plan is sensible and can assist obtain the objective.
After every execution step or instrument name, the agent can consider whether or not it’s on observe and whether or not it’s price adjusting the plan.
When the plan is totally executed, the agent can mirror to see whether or not it has really achieved the objective and solved the duty.

It’s clear that reflection can considerably enhance accuracy. Nonetheless, there are trade-offs price discussing. Reflection would possibly require a number of further calls to the LLM and probably different techniques, which might result in elevated latency and prices. So in enterprise instances, it’s price contemplating whether or not the standard enhancements justify the bills and delays within the person circulation.

Reflection in frameworks

Since there’s little doubt that reflection brings worth to AI brokers, it’s broadly utilized in fashionable frameworks. Let’s have a look at some examples.

The thought of reflection was first proposed within the paper “ReAct: Synergizing Reasoning and Acting in Language Models” by Yao et al. (2022). ReAct is a framework that mixes interleaving phases of Reasoning (reflection via express thought traces) and Appearing (task-relevant actions in an setting). On this framework, reasoning guides the selection of actions, and actions produce new observations that inform additional reasoning. The reasoning stage itself is a mixture of reflection and planning.

This framework grew to become fairly fashionable, so there at the moment are a number of off-the-shelf implementations, reminiscent of:

The DSPy framework by Databricks has a ReAct class,
In LangGraph, you should utilize the create_react_agent operate,
Code brokers within the smolagents library by HuggingFace are additionally primarily based on the ReAct structure.

Reflection from scratch

Now that we’ve realized the speculation and explored present implementations, it’s time to get our fingers soiled and construct one thing ourselves. Within the ReAct method, brokers use reflection at every step, combining planning with reflection. Nonetheless, to know the affect of reflection extra clearly, we’ll have a look at it in isolation.

For instance, we’ll use text-to-SQL: we’ll give an LLM a query and count on it to return a legitimate SQL question. We’ll be working with a flight delay dataset and the ClickHouse SQL dialect.

We’ll begin by utilizing direct technology with none reflection as our baseline. Then, we’ll attempt utilizing reflection by asking the mannequin to critique and enhance the SQL, or by offering it with further suggestions. After that, we’ll measure the standard of our solutions to see whether or not reflection really results in higher outcomes.

Direct technology

We’ll start with essentially the most simple method, direct technology, the place we ask the LLM to generate SQL that solutions a person question.

pip set up anthropic

We have to specify the API Key for the Anthropic API.

import os
os.environ['ANTHROPIC_API_KEY'] = config['ANTHROPIC_API_KEY']

The subsequent step is to initialise the shopper, and we’re all set.

import anthropic
shopper = anthropic.Anthropic()

Now we are able to use this shopper to ship messages to the LLM. Let’s put collectively a operate to generate SQL primarily based on a person question. I’ve specified the system immediate with primary directions and detailed details about the info schema. I’ve additionally created a operate to ship the system immediate and person question to the LLM.

base_sql_system_prompt = '''
You're a senior SQL developer and your activity is to assist generate a SQL question primarily based on person necessities. 
You're working with ClickHouse database. Specify the format (Tab Separated With Names) within the SQL question output to make sure that column names are included within the output.
Don't use rely(*) in your queries since it is a unhealthy apply with columnar databases, desire utilizing rely().
Be certain that the question is syntactically right and optimized for efficiency, making an allowance for ClickHouse particular options (i.e. that ClickHouse is a columnar database and helps capabilities like ARRAY JOIN, SAMPLE, and so forth.).
Return solely the SQL question with none further explanations or feedback.

You'll be working with flight_data desk which has the next schema:

Column Title | Information Kind | Null % | Instance Worth | Description
--- | --- | --- | --- | ---
12 months | Int64 | 0.0 | 2024 | 12 months of flight
month | Int64 | 0.0 | 1 | Month of flight (1–12)
day_of_month | Int64 | 0.0 | 1 | Day of the month
day_of_week | Int64 | 0.0 | 1 | Day of week (1=Monday … 7=Sunday)
fl_date | datetime64[ns] | 0.0 | 2024-01-01 00:00:00 | Flight date (YYYY-MM-DD)
op_unique_carrier | object | 0.0 | 9E | Distinctive provider code
op_carrier_fl_num | float64 | 0.0 | 4814.0 | Flight quantity for reporting airline
origin | object | 0.0 | JFK | Origin airport code
origin_city_name | object | 0.0 | "New York, NY" | Origin metropolis identify
origin_state_nm | object | 0.0 | New York | Origin state identify
dest | object | 0.0 | DTW | Vacation spot airport code
dest_city_name | object | 0.0 | "Detroit, MI" | Vacation spot metropolis identify
dest_state_nm | object | 0.0 | Michigan | Vacation spot state identify
crs_dep_time | Int64 | 0.0 | 1252 | Scheduled departure time (native, hhmm)
dep_time | float64 | 1.31 | 1247.0 | Precise departure time (native, hhmm)
dep_delay | float64 | 1.31 | -5.0 | Departure delay in minutes (unfavourable if early)
taxi_out | float64 | 1.35 | 31.0 | Taxi out time in minutes
wheels_off | float64 | 1.35 | 1318.0 | Wheels-off time (native, hhmm)
wheels_on | float64 | 1.38 | 1442.0 | Wheels-on time (native, hhmm)
taxi_in | float64 | 1.38 | 7.0 | Taxi in time in minutes
crs_arr_time | Int64 | 0.0 | 1508 | Scheduled arrival time (native, hhmm)
arr_time | float64 | 1.38 | 1449.0 | Precise arrival time (native, hhmm)
arr_delay | float64 | 1.61 | -19.0 | Arrival delay in minutes (unfavourable if early)
cancelled | int64 | 0.0 | 0 | Cancelled flight indicator (0=No, 1=Sure)
cancellation_code | object | 98.64 | B | Motive for cancellation (if cancelled)
diverted | int64 | 0.0 | 0 | Diverted flight indicator (0=No, 1=Sure)
crs_elapsed_time | float64 | 0.0 | 136.0 | Scheduled elapsed time in minutes
actual_elapsed_time | float64 | 1.61 | 122.0 | Precise elapsed time in minutes
air_time | float64 | 1.61 | 84.0 | Flight time in minutes
distance | float64 | 0.0 | 509.0 | Distance between origin and vacation spot (miles)
carrier_delay | int64 | 0.0 | 0 | Service-related delay in minutes
weather_delay | int64 | 0.0 | 0 | Climate-related delay in minutes
nas_delay | int64 | 0.0 | 0 | Nationwide Air System delay in minutes
security_delay | int64 | 0.0 | 0 | Safety delay in minutes
late_aircraft_delay | int64 | 0.0 | 0 | Late plane delay in minutes
'''

def generate_direct_sql(rec):
  # making an LLM name
  message = shopper.messages.create(
    mannequin = "claude-3-5-haiku-latest",
    # I selected smaller mannequin in order that it is simpler for us to see the affect 
    max_tokens = 8192,
    system=base_sql_system_prompt,
    messages = [
        {'role': 'user', 'content': rec['question']}
    ]
  )

  sql  = message.content material[0].textual content
  
  # cleansing the output
  if sql.endswith('```'):
    sql = sql[:-3]
  if sql.startswith('```sql'):
    sql = sql[6:]
  return sql

That’s it. Now let’s take a look at our text-to-SQL answer. I’ve created a small evaluation set of 20 question-and-answer pairs that we are able to use to verify whether or not our system is working properly. Right here’s one instance:

{
'query': 'What was the best velocity in mph?',
'reply': '''
    choose max(distance / (air_time / 60)) as max_speed 
    from flight_data 
    the place air_time > 0 
    format TabSeparatedWithNames'''
}

Let’s use our text-to-SQL operate to generate SQL for all person queries within the take a look at set.

# load analysis set
with open('./information/flight_data_qa_pairs.json', 'r') as f:
    qa_pairs = json.load(f)
qa_pairs_df = pd.DataFrame(qa_pairs)

tmp = []
# executing LLM for every query in our eval set
for rec in tqdm.tqdm(qa_pairs_df.to_dict('data')):
    llm_sql = generate_direct_sql(rec)
    tmp.append(
        {
            'id': rec['id'],
            'llm_direct_sql': llm_sql
        }
    )

llm_direct_df = pd.DataFrame(tmp)
direct_result_df = qa_pairs_df.merge(llm_direct_df, on = 'id')

Now we’ve got our solutions, and the following step is to measure the standard.

Measuring high quality

Sadly, there’s no single right reply on this state of affairs, so we are able to’t simply evaluate the SQL generated by the LLM to a reference reply. We have to provide you with a technique to measure high quality.

There are some features of high quality that we are able to verify with goal standards, however to verify whether or not the LLM returned the proper reply, we’ll want to make use of an LLM. So I’ll use a mixture of approaches:

First, we’ll use goal standards to verify whether or not the right format was specified within the SQL (we instructed the LLM to make use of TabSeparatedWithNames).
Second, we are able to execute the generated question and see whether or not ClickHouse returns an execution error.
Lastly, we are able to create an LLM choose that compares the output from the generated question to our reference reply and checks whether or not they differ.

Let’s begin by executing the SQL. It’s price noting that our get_clickhouse_data operate doesn’t throw an exception. As a substitute, it returns textual content explaining the error, which may be dealt with by the LLM later.

CH_HOST = 'http://localhost:8123' # default deal with 
import requests
import pandas as pd
import tqdm

# operate to execute SQL question
def get_clickhouse_data(question, host = CH_HOST, connection_timeout = 1500):
  r = requests.put up(host, params = {'question': question}, 
    timeout = connection_timeout)
  if r.status_code == 200:
      return r.textual content
  else: 
      return 'Database returned the next error:n' + r.textual content

# getting the outcomes of SQL execution
direct_result_df['llm_direct_output'] = direct_result_df['llm_direct_sql'].apply(get_clickhouse_data)
direct_result_df['answer_output'] = direct_result_df['answer'].apply(get_clickhouse_data)

The subsequent step is to create an LLM choose. For this, I’m utilizing a series‑of‑thought method that prompts the LLM to supply its reasoning earlier than giving the ultimate reply. This offers the mannequin time to suppose via the issue, which improves response high quality.

llm_judge_system_prompt = '''
You're a senior analyst and your activity is to check two SQL question outcomes and decide if they're equal. 
Focus solely on the info returned by the queries, ignoring any formatting variations. 
Consider the preliminary person question and knowledge wanted to reply it. For instance, if person requested for the typical distance, and each queries return the identical common worth however in one in all them there's additionally a rely of data, you need to contemplate them equal, since each present the identical requested data.

Reply with a JSON of the next construction:
false>

Be certain that ONLY JSON is within the output. 

You'll be working with flight_data desk which has the next schema:
Column Title | Information Kind | Null % | Instance Worth | Description
--- | --- | --- | --- | ---
12 months | Int64 | 0.0 | 2024 | 12 months of flight
month | Int64 | 0.0 | 1 | Month of flight (1–12)
day_of_month | Int64 | 0.0 | 1 | Day of the month
day_of_week | Int64 | 0.0 | 1 | Day of week (1=Monday … 7=Sunday)
fl_date | datetime64[ns] | 0.0 | 2024-01-01 00:00:00 | Flight date (YYYY-MM-DD)
op_unique_carrier | object | 0.0 | 9E | Distinctive provider code
op_carrier_fl_num | float64 | 0.0 | 4814.0 | Flight quantity for reporting airline
origin | object | 0.0 | JFK | Origin airport code
origin_city_name | object | 0.0 | "New York, NY" | Origin metropolis identify
origin_state_nm | object | 0.0 | New York | Origin state identify
dest | object | 0.0 | DTW | Vacation spot airport code
dest_city_name | object | 0.0 | "Detroit, MI" | Vacation spot metropolis identify
dest_state_nm | object | 0.0 | Michigan | Vacation spot state identify
crs_dep_time | Int64 | 0.0 | 1252 | Scheduled departure time (native, hhmm)
dep_time | float64 | 1.31 | 1247.0 | Precise departure time (native, hhmm)
dep_delay | float64 | 1.31 | -5.0 | Departure delay in minutes (unfavourable if early)
taxi_out | float64 | 1.35 | 31.0 | Taxi out time in minutes
wheels_off | float64 | 1.35 | 1318.0 | Wheels-off time (native, hhmm)
wheels_on | float64 | 1.38 | 1442.0 | Wheels-on time (native, hhmm)
taxi_in | float64 | 1.38 | 7.0 | Taxi in time in minutes
crs_arr_time | Int64 | 0.0 | 1508 | Scheduled arrival time (native, hhmm)
arr_time | float64 | 1.38 | 1449.0 | Precise arrival time (native, hhmm)
arr_delay | float64 | 1.61 | -19.0 | Arrival delay in minutes (unfavourable if early)
cancelled | int64 | 0.0 | 0 | Cancelled flight indicator (0=No, 1=Sure)
cancellation_code | object | 98.64 | B | Motive for cancellation (if cancelled)
diverted | int64 | 0.0 | 0 | Diverted flight indicator (0=No, 1=Sure)
crs_elapsed_time | float64 | 0.0 | 136.0 | Scheduled elapsed time in minutes
actual_elapsed_time | float64 | 1.61 | 122.0 | Precise elapsed time in minutes
air_time | float64 | 1.61 | 84.0 | Flight time in minutes
distance | float64 | 0.0 | 509.0 | Distance between origin and vacation spot (miles)
carrier_delay | int64 | 0.0 | 0 | Service-related delay in minutes
weather_delay | int64 | 0.0 | 0 | Climate-related delay in minutes
nas_delay | int64 | 0.0 | 0 | Nationwide Air System delay in minutes
security_delay | int64 | 0.0 | 0 | Safety delay in minutes
late_aircraft_delay | int64 | 0.0 | 0 | Late plane delay in minutes
'''

llm_judge_user_prompt_template = '''
Right here is the preliminary person question:
{user_query}

Right here is the SQL question generated by the primary analyst: 
SQL: 
{sql1} 

Database output: 
{result1}

Right here is the SQL question generated by the second analyst:
SQL:
{sql2}

Database output:
{result2}
'''

def llm_judge(rec, field_to_check):
  # assemble the person immediate 
  user_prompt = llm_judge_user_prompt_template.format(
    user_query = rec['question'],
    sql1 = rec['answer'],
    result1 = rec['answer_output'],
    sql2 = rec[field_to_check + '_sql'],
    result2 = rec[field_to_check + '_output']
  )
  
  # make an LLM name
  message = shopper.messages.create(
      mannequin = "claude-sonnet-4-5",
      max_tokens = 8192,
      temperature = 0.1,
      system = llm_judge_system_prompt,
      messages=[
          {'role': 'user', 'content': user_prompt}
      ]
  )
  information = message.content material[0].textual content
  
  # Strip markdown code blocks
  information = information.strip()
  if information.startswith('```json'):
      information = information[7:]
  elif information.startswith('```'):
      information = information[3:]
  if information.endswith('```'):
      information = information[:-3]
  
  information = information.strip()
  return json.hundreds(information)

Now, let’s run the LLM choose to get the outcomes.

tmp = []

for rec in tqdm.tqdm(direct_result_df.to_dict('data')):
  attempt:
    judgment = llm_judge(rec, 'llm_direct')
  besides Exception as e:
    print(f"Error processing file {rec['id']}: {e}")
    proceed
  tmp.append(
    {
      'id': rec['id'],
      'llm_judge_reasoning': judgment['reasoning'],
      'llm_judge_equivalence': judgment['equivalence']
    }
  )

judge_df = pd.DataFrame(tmp)
direct_result_df = direct_result_df.merge(judge_df, on = 'id')

Let’s have a look at one instance to see how the LLM choose works.

# person question 
In 2024, what share of time all airplanes spent within the air?

# right reply 
choose (sum(air_time) / sum(actual_elapsed_time)) * 100 as percentage_in_air 
the place 12 months = 2024
from flight_data 
format TabSeparatedWithNames

percentage_in_air
81.43582596894757

# generated by LLM reply 
SELECT 
    spherical(sum(air_time) / (sum(air_time) + sum(taxi_out) + sum(taxi_in)) * 100, 2) as air_time_percentage
FROM flight_data
WHERE 12 months = 2024
FORMAT TabSeparatedWithNames

air_time_percentage
81.39

# LLM choose response
{
 'reasoning': 'Each queries calculate the proportion of time airplanes 
    spent within the air, however use completely different denominators. The primary question 
    makes use of actual_elapsed_time (which incorporates air_time + taxi_out + taxi_in 
    + any floor delays), whereas the second makes use of solely (air_time + taxi_out 
    + taxi_in). The second question is method is extra correct for answering 
    "time airplanes spent within the air" because it excludes floor delays. 
    Nonetheless, the outcomes are very shut (81.44% vs 81.39%), suggesting minimal 
    affect. These are materially completely different approaches that occur to yield 
    related outcomes',
 'equivalence': FALSE
}

The reasoning is sensible, so we are able to belief our choose. Now, let’s verify all LLM-generated queries.

def get_llm_accuracy(sql, output, equivalence): 
    issues = []
    if 'format tabseparatedwithnames' not in sql.decrease():
        issues.append('No format laid out in SQL')
    if 'Database returned the next error' in output:
        issues.append('SQL execution error')
    if not equivalence and ('SQL execution error' not in issues):
        issues.append('Mistaken reply supplied')
    if len(issues) == 0:
        return 'No issues detected'
    else:
        return ' + '.be a part of(issues)

direct_result_df['llm_direct_sql_quality_heuristics'] = direct_result_df.apply(
    lambda row: get_llm_accuracy(row['llm_direct_sql'], row['llm_direct_output'], row['llm_judge_equivalence']), axis=1)

The LLM returned the right reply in 70% of instances, which isn’t unhealthy. However there’s positively room for enchancment, because it usually both gives the flawed reply or fails to specify the format accurately (generally inflicting SQL execution errors).

Including a mirrored image step

To enhance the standard of our answer, let’s attempt including a mirrored image step the place we ask the mannequin to evaluate and refine its reply.

For a mirrored image name, I’ll maintain the identical system immediate because it comprises all the mandatory details about SQL and the info schema. However I’ll tweak the person message to share the preliminary person question and the generated SQL, asking the LLM to critique and enhance it.

simple_reflection_user_prompt_template = '''
Your activity is to evaluate the SQL question generated by one other analyst and suggest enhancements if essential.
Test whether or not the question is syntactically right and optimized for efficiency. 
Take note of nuances in information (particularly time stamps sorts, whether or not to make use of whole elapsed time or time within the air, and so forth).
Be certain that the question solutions the preliminary person query precisely. 
Because the end result return the next JSON: 
{{
  'reasoning': '<your reasoning right here, 2-4 sentences on why you made modifications or not>', 
  'refined_sql': '<the improved SQL question right here>'
}}
Be certain that ONLY JSON is within the output and nothing else. Be certain that the output JSON is legitimate. 

Right here is the preliminary person question:
{user_query}

Right here is the SQL question generated by one other analyst: 
{sql} 
'''

def simple_reflection(rec) -> str:
  # setting up a person immediate
  user_prompt = simple_reflection_user_prompt_template.format(
    user_query=rec['question'],
    sql=rec['llm_direct_sql']
  )
  
  # making an LLM name
  message = shopper.messages.create(
    mannequin="claude-3-5-haiku-latest",
    max_tokens = 8192,
    system=base_sql_system_prompt,
    messages=[
        {'role': 'user', 'content': user_prompt}
    ]
  )

  information  = message.content material[0].textual content

  # strip markdown code blocks
  information = information.strip()
  if information.startswith('```json'):
    information = information[7:]
  elif information.startswith('```'):
    information = information[3:]
  if information.endswith('```'):
    information = information[:-3]
  
  information = information.strip()
  return json.hundreds(information.change('n', ' '))

Let’s refine the queries with reflection and measure the accuracy. We don’t see a lot enchancment within the ultimate high quality. We’re nonetheless at 70% right solutions.

Let’s have a look at particular examples to know what occurred. First, there are a few instances the place the LLM managed to repair the issue, both by correcting the format or by including lacking logic to deal with zero values.

Nonetheless, there are additionally instances the place the LLM overcomplicated the reply. The preliminary SQL was right (matching the golden set reply), however then the LLM determined to ‘enhance’ it. A few of these enhancements are cheap (e.g., accounting for nulls or excluding cancelled flights). Nonetheless, for some purpose, it determined to make use of ClickHouse sampling, although we don’t have a lot information and our desk doesn’t help sampling. Because of this, the refined question returned an execution error: Database returned the next error: Code: 141. DB::Exception: Storage default.flight_data would not help sampling. (SAMPLING_NOT_SUPPORTED).

Reflection with exterior suggestions

Reflection didn’t enhance accuracy a lot. That is seemingly as a result of we didn’t present any further data that may assist the mannequin generate a greater end result. Let’s attempt sharing exterior suggestions with the mannequin:

The results of our verify on whether or not the format is specified accurately
The output from the database (both information or an error message)
Let’s put collectively a immediate for this and generate a brand new model of the SQL.

feedback_reflection_user_prompt_template = '''
Your activity is to evaluate the SQL question generated by one other analyst and suggest enhancements if essential.
Test whether or not the question is syntactically right and optimized for efficiency. 
Take note of nuances in information (particularly time stamps sorts, whether or not to make use of whole elapsed time or time within the air, and so forth).
Be certain that the question solutions the preliminary person query precisely. 

Because the end result return the next JSON: 
{{
  'reasoning': '<your reasoning right here, 2-4 sentences on why you made modifications or not>', 
  'refined_sql': '<the improved SQL question right here>'
}}
Be certain that ONLY JSON is within the output and nothing else. Be certain that the output JSON is legitimate. 

Right here is the preliminary person question:
{user_query}

Right here is the SQL question generated by one other analyst: 
{sql} 

Right here is the database output of this question: 
{output}

We run an computerized verify on the SQL question to verify whether or not it has fomatting points. Here is the output: 
{formatting}
'''

def feedback_reflection(rec) -> str:
  # outline message for formatting 
  if 'No format laid out in SQL' in rec['llm_direct_sql_quality_heuristics']:
    formatting = 'SQL lacking formatting. Specify "format TabSeparatedWithNames" to make sure that column names are additionally returned'
  else: 
    formatting = 'Formatting is right'

  # setting up a person immediate
  user_prompt = feedback_reflection_user_prompt_template.format(
    user_query = rec['question'],
    sql = rec['llm_direct_sql'],
    output = rec['llm_direct_output'],
    formatting = formatting
  )

  # making an LLM name 
  message = shopper.messages.create(
    mannequin = "claude-3-5-haiku-latest",
    max_tokens = 8192,
    system = base_sql_system_prompt,
    messages = [
        {'role': 'user', 'content': user_prompt}
    ]
  )
  information  = message.content material[0].textual content

  # strip markdown code blocks
  information = information.strip()
  if information.startswith('```json'):
    information = information[7:]
  elif information.startswith('```'):
    information = information[3:]
  if information.endswith('```'):
    information = information[:-3]
  
  information = information.strip()
  return json.hundreds(information.change('n', ' '))

After operating our accuracy measurements, we are able to see that accuracy has improved considerably: 17 right solutions (85% accuracy) in comparison with 14 (70% accuracy).

If we verify the instances the place the LLM mounted the problems, we are able to see that it was in a position to right the format, deal with SQL execution errors, and even revise the enterprise logic (e.g., utilizing air time for calculating velocity).

Let’s additionally do some error evaluation to look at the instances the place the LLM made errors. Within the desk under, we are able to see that the LLM struggled with defining sure timestamps, incorrectly calculating whole time, or utilizing whole time as a substitute of air time for velocity calculations. Nonetheless, a few of the discrepancies are a bit tough:

Within the final question, the time interval wasn’t explicitly outlined, so it’s cheap for the LLM to make use of 2010–2023. I wouldn’t contemplate this an error, and I’d modify the analysis as a substitute.
One other instance is learn how to outline airline velocity: avg(distance/time) or sum(distance)/sum(time). Each choices are legitimate since nothing was specified within the person question or system immediate (assuming we don’t have a predefined calculation methodology).

General, I feel we achieved a reasonably good end result. Our ultimate 85% accuracy represents a major 15% level enchancment. You can probably transcend one iteration and run 2–3 rounds of reflection, however it’s price assessing whenever you hit diminishing returns in your particular case, since every iteration goes with elevated value and latency.

You will discover the complete code on GitHub.

Abstract

It’s time to wrap issues up. On this article, we began our journey into understanding how the magic of agentic AI techniques works. To determine it out, we’ll implement a multi-agent text-to-data instrument utilizing solely API calls to basis fashions. Alongside the best way, we’ll stroll via the important thing design patterns step-by-step: beginning immediately with reflection, and transferring on to instrument use, planning, and multi-agent coordination.

On this article, we began with essentially the most basic sample — reflection. Reflection is on the core of any agentic circulation, because the LLM must mirror on its progress towards attaining the top objective.

Reflection is a comparatively simple sample. We merely ask the identical or a distinct mannequin to analyse the end result and try to enhance it. As we realized in apply, sharing exterior suggestions with the mannequin (like outcomes from static checks or database output) considerably improves accuracy. A number of analysis research and our personal expertise with the text-to-SQL agent show the advantages of reflection. Nonetheless, these accuracy beneficial properties come at a price: extra tokens spent and better latency attributable to a number of API calls.

Thanks for studying. I hope this text was insightful. Keep in mind Einstein’s recommendation: “The essential factor is to not cease questioning. Curiosity has its personal purpose for present.” Could your curiosity lead you to your subsequent nice perception.

Reference

This text is impressed by the “Agentic AI” course by Andrew Ng from DeepLearning.AI.

Agentic AI from First Ideas: Reflection

What reflection is

Reflection in frameworks

Reflection from scratch

Direct technology

Measuring high quality

Including a mirrored image step

Reflection with exterior suggestions

Abstract

Reference

Elon Musk’s SpaceX discovers $133 million value of Bitcoin moved

Understanding the genetics of fibromyalgia sheds new gentle on its causes

Converter

Editors Pick

Newsletter

Categories

Related Posts