Having worked in analytics for 10 years, I strongly believe that observability and evaluation are essential for LLM applications running in production environments. Monitoring and metrics are more than just nice to have. They ensure that the product works as expected and that new updates are actually a step in the right direction.
uv pip install arize-phoenix
uv pip install "nvidia-nat[phoenix]"
phoenix serve
http://localhost:6006/v1/traces
general:
  telemetry:
    tracing:
      phoenix:
        _type: phoenix
        endpoint: http://localhost:6006/v1/traces
        project: happiness_report
export ANTHROPIC_API_KEY=<your_key>
source .venv_nat_uv/bin/activate
cd happiness_v3
uv pip install -e .
cd ..
nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "How much happier in percentages are people in Finland compared to the UK?"
nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Are people overall getting happier over time?"
nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Is Switzerland in the first place?"
nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "What is the main contributor to the happiness in the UK?"
nat run \
  --config_file happiness_v3/src/happiness_v3/configs/config.yml \
  --input "Are people in France happier than in Germany?"
All traces from these runs will appear in Phoenix under the `happiness_report` project.


Observability means monitoring how your application behaves in production. While this information is helpful, it isn't enough to determine whether the answers are of sufficient quality or whether a new version has improved performance. To answer such questions, an evaluation is needed. Fortunately, NeMo Agent Toolkit can help with evaluation as well.
First, let's prepare some evaluation samples. There are only three fields you need to specify: id, question, and answer.
[
  {
    "id": "1",
    "question": "In what country was the happiness score highest in 2021?",
    "answer": "Finland"
  },
  {
    "id": "2",
    "question": "What contributed most to the happiness score in 2024?",
    "answer": "Social Support"
  },
  {
    "id": "3",
    "question": "How UK's rank changed from 2019 to 2024?",
    "answer": "The UK's rank dropped from 13th in 2019 to 23rd in 2024."
  },
  {
    "id": "4",
    "question": "Are people in France happier than in Germany based on the latest report?",
    "answer": "No, Germany is at 22nd place in 2024 while France is at 33rd place."
  },
  {
    "id": "5",
    "question": "How much in percents are people in Poland happier in 2024 compared to 2019?",
    "answer": "Happiness in Poland increased by 7.9% from 2019 to 2024. It was 6.1863 in 2019 and 6.6730 in 2024."
  }
]
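Before running an eval, it can help to sanity-check that every record carries the three required fields. Below is a minimal sketch of such a check; the validator function and the inline sample are my own illustration, not part of the toolkit:

```python
import json

REQUIRED_FIELDS = {"id", "question", "answer"}

def validate_eval_dataset(raw: str) -> list[str]:
    """Return a list of problems found in a JSON eval dataset string."""
    problems = []
    records = json.loads(raw)
    seen_ids = set()
    for i, rec in enumerate(records):
        # Every record must have id, question, and answer.
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            problems.append(f"record {i}: missing {sorted(missing)}")
        # Ids should be unique across the dataset.
        rec_id = rec.get("id")
        if rec_id in seen_ids:
            problems.append(f"record {i}: duplicate id {rec_id!r}")
        seen_ids.add(rec_id)
    return problems

# A deliberately broken sample: second record reuses id "1" and lacks "answer".
sample = '[{"id": "1", "question": "Q?", "answer": "A"}, {"id": "1", "question": "Q2?"}]'
print(validate_eval_dataset(sample))
```

Running this against the real `evals.json` instead of the inline string should print an empty list.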
Next, you need to update the YAML configuration to define where you want to save the evaluation results and where to find the evaluation dataset. We have also set up a dedicated eval_llm (we're using Sonnet 4.5 for evaluation purposes) to keep our solution modular.
# Evaluation configuration
eval:
  general:
    output:
      dir: ./tmp/nat/happiness_v3/eval/evals/
      cleanup: false
    dataset:
      _type: json
      file_path: src/happiness_v3/data/evals.json
  evaluators:
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: eval_llm
    groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: eval_llm
    trajectory_accuracy:
      _type: trajectory
      llm_name: eval_llm
Here we have defined several evaluators: answer accuracy and response groundedness from Ragas (an open-source framework for end-to-end evaluation of LLM workflows), as well as trajectory evaluation. Let's break them down.
Response groundedness evaluates whether the response is supported by the context from which it was obtained, that is, whether each claim can be found (fully or partially) in the retrieved data. It works similarly to answer accuracy, using two different LLM-as-a-Judge prompts with ratings of 0, 1, or 2, normalized to a [0, 1] scale:
- 0 → not grounded at all,
- 1 → partially grounded,
- 2 → fully grounded.
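As an illustration of that normalization only (this is a sketch of the idea, not the Ragas source code), averaging the judge ratings on the 0-2 scale and dividing by 2 yields a [0, 1] score:

```python
def groundedness_score(ratings: list[int]) -> float:
    """Illustrative normalization: average 0-2 judge ratings onto [0, 1].

    A sketch of the scoring idea, not the actual Ragas implementation.
    """
    if any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("each rating must be 0, 1, or 2")
    return sum(ratings) / (2 * len(ratings))

# Both judge prompts say "fully grounded":
print(groundedness_score([2, 2]))  # 1.0
# One says partially grounded, one says fully grounded:
print(groundedness_score([1, 2]))  # 0.75
```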
Trajectory evaluation helps you monitor the inference process by tracking the intermediate steps and tool calls performed by the LLM. A judge LLM evaluates the trajectories produced by the workflow, taking into account the tools used during execution. It returns a floating-point score between 0 and 1, where 1 represents a perfect trajectory.
Let's run an evaluation to see how it works in practice.
nat eval --config_file src/happiness_v3/configs/config.yml
As a result of running the evaluation, you'll get several files in the output directory you specified earlier. One of the most useful is workflow_output.json. This file contains the run results for each sample in the evaluation set, including the original question, the answer generated by the LLM, the expected answer, and a detailed breakdown of all intermediate steps. It helps you trace how the system behaved in each case.
Here is a shortened example for the first sample.
{
  "id": 1,
  "question": "In what country was the happiness score highest in 2021?",
  "answer": "Finland",
  "generated_answer": "Finland had the highest happiness score in 2021 with a score of 7.821.",
  "intermediate_steps": [...],
  "expected_intermediate_steps": []
}
For Answer Accuracy and Response Groundedness, we achieved the highest possible scores (1.0 out of 1.0 on average), which is always good to see. The returned file looks like this:
{
  "average_score": 1.0,
  "eval_output_items": [
    {
      "id": 1,
      "score": 1.0,
      "reasoning": {
        "user_input": "In what country was the happiness score highest in 2021?",
        "reference": "Finland",
        "response": "Finland had the highest happiness score in 2021 with a score of 7.821.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 2,
      "score": 1.0,
      "reasoning": {
        "user_input": "What contributed most to the happiness score in 2024?",
        "reference": "Social Support",
        "response": "In 2024, **Social Support** contributed most to the happiness score, with an average impact of 1.333 points. This was followed very closely by **GDP per capita** with an impact of 1.319 points. These two factors were significantly more influential than other contributors like life expectancy (0.551), freedom (0.749), generosity (0.113), and perceptions of corruption (0.147). This suggests that having strong social connections and relationships, together with economic prosperity, were the primary drivers of happiness across countries in 2024.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 3,
      "score": 1.0,
      "reasoning": {
        "user_input": "How UK's rank changed from 2019 to 2024?",
        "reference": "The UK's rank dropped from 13th in 2019 to 23rd in 2024.",
        "response": "The UK's rank in the World Happiness Report changed from 13th place in 2019 to 23rd place in 2024, representing a decline of 10 positions. The happiness score also decreased from 7.1645 in 2019 to 6.7280 in 2024.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 4,
      "score": 1.0,
      "reasoning": {
        "user_input": "Are people in France happier than in Germany based on the latest report?",
        "reference": "No, Germany is at 22nd place in 2024 while France is at 33rd place.",
        "response": "Based on the latest 2024 World Happiness Report, people in Germany are happier than people in France. Germany has a happiness score of 6.753 (ranked 22nd globally), while France has a happiness score of 6.593 (ranked 33rd globally). The difference is 0.16 points in Germany's favor.",
        "retrieved_contexts": [...]
      }
    },
    {
      "id": 5,
      "score": 1.0,
      "reasoning": {
        "user_input": "How much in percents are people in Poland happier in 2024 compared to 2019?",
        "reference": "Happiness in Poland increased by 7.9% from 2019 to 2024. It was 6.1863 in 2019 and 6.6730 in 2024.",
        "response": "People in Poland are approximately 7.87% happier in 2024 compared to 2019. The happiness score increased from 6.1863 in 2019 to 6.6730 in 2024, representing an increase of 0.4867 points or about 7.87%.",
        "retrieved_contexts": [...]
      }
    }
  ]
}
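Once you accumulate a few of these files, a short script can pull out the per-item scores and flag anything below a threshold. A sketch assuming only the structure shown above (`eval_output_items` with `id` and `score`); the helper itself is not part of the toolkit:

```python
import json

def summarize_eval_output(raw: str, threshold: float = 1.0) -> dict:
    """Summarize evaluator output: item count, average, and below-threshold ids."""
    data = json.loads(raw)
    items = data["eval_output_items"]
    scores = [item["score"] for item in items]
    return {
        "n_items": len(items),
        "average": sum(scores) / len(scores),
        "below_threshold": [item["id"] for item in items if item["score"] < threshold],
    }

# A tiny inline stand-in for a real evaluator output file:
sample = json.dumps({
    "average_score": 0.75,
    "eval_output_items": [
        {"id": 1, "score": 1.0, "reasoning": {}},
        {"id": 2, "score": 0.5, "reasoning": {}},
    ],
})
print(summarize_eval_output(sample))
```

Pointing this at the files in the output directory gives a quick regression check between runs.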
Let me evaluate this AI language model's performance step-by-step:

## Evaluation Criteria:

**i. Is the final answer helpful?**
Yes, the final answer is clear, accurate, and directly addresses the question.
It provides both the percentage increase (7.87%) and explains the underlying
data (happiness scores from 6.1863 to 6.6730). The answer is well-formatted
and easy to understand.

**ii. Does the AI language model use a logical sequence of tools to answer the question?**
Yes, the sequence is logical:
1. Query country statistics for Poland
2. Retrieve the data showing happiness scores for multiple years including
2019 and 2024
3. Use a calculator to compute the percentage increase
4. Formulate the final answer
This is a sensible approach to the problem.

**iii. Does the AI language model use the tools in a helpful way?**
Yes, the tools are used appropriately:
- The `country_stats` tool successfully retrieved the relevant happiness data
- The `calculator_agent` correctly computed the percentage increase using
the correct formula
- The Python evaluation tool performed the actual calculation accurately

**iv. Does the AI language model use too many steps to answer the question?**
This is where there is some inefficiency. The model uses 8 steps total, which
includes some redundancy:
- Steps 4-7 appear to involve multiple calls to calculate the same percentage
(the calculator_agent is invoked, which then calls Claude Opus, which calls
evaluate_python, and returns through the chain)
- Step 7 seems to repeat what was already done in steps 4-6
While the answer is correct, there is unnecessary duplication. The calculation
could have been done more efficiently in 4-5 steps instead of 8.

**v. Are the right tools used to answer the question?**
Yes, the tools chosen are appropriate:
- `country_stats` was the right tool to get happiness data for Poland
- `calculator_agent` was appropriate for computing the percentage change
- The underlying `evaluate_python` tool correctly performed the mathematical
calculation

## Summary:
The model successfully answered the question with accurate data and correct
calculations. The logical flow was sound, and appropriate tools were chosen.
However, there was some inefficiency in the execution with redundant steps
in the calculation phase.
Looking at this output, we see that it is a surprisingly comprehensive evaluation of the entire LLM workflow. What makes it especially valuable is that it works quickly and doesn't require ground-truth data. We highly recommend using this evaluation for your applications.
To compare the sonnet and haiku setups, let's also log evaluation results to Weights & Biases Weave.
export WANDB_API_KEY=<your key>
uv pip install wandb weave
uv pip install "nvidia-nat[weave]"
general:
  telemetry:
    tracing:
      phoenix:
        _type: phoenix
        endpoint: http://localhost:6006/v1/traces
        project: happiness_report
      weave: # specified Weave
        _type: weave
        project: "nat-simple"

eval:
  general:
    workflow_alias: "nat-simple-sonnet-4-5" # added alias
    output:
      dir: ./.tmp/nat/happiness_v3/eval/evals/
      cleanup: false
    dataset:
      _type: json
      file_path: src/happiness_v3/data/evals.json
  evaluators:
    answer_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: chat_llm
    groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: chat_llm
    trajectory_accuracy:
      _type: trajectory
      llm_name: chat_llm
In config_simple.yml, the difference is that chat_llm and calculator_llm use haiku instead of sonnet.
nat eval --config_file src/happiness_v3/configs/config.yml
nat eval --config_file src/happiness_v3/configs/config_simple.yml
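With both runs logged, you can compare them side by side in the Weave UI, or diff the averages locally. A minimal helper sketch (the `nat-simple-haiku` alias and the inline numbers are hypothetical; pass it the parsed evaluator output of each run):

```python
def compare_runs(runs: dict[str, dict]) -> str:
    """Render a small comparison table of average scores per run alias.

    `runs` maps a run alias to a parsed evaluator output dict
    (the JSON files produced by `nat eval`).
    """
    lines = ["alias                      average_score"]
    for alias, output in sorted(runs.items()):
        lines.append(f"{alias:<26} {output['average_score']:.3f}")
    return "\n".join(lines)

# Hypothetical aliases and scores, mirroring the workflow_alias style above:
print(compare_runs({
    "nat-simple-sonnet-4-5": {"average_score": 1.0},
    "nat-simple-haiku": {"average_score": 0.8},
}))
```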



