Saturday, May 2, 2026

In this tutorial, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agent-based models think, use tools, and generate responses across multi-turn conversations. First, we load and inspect the dataset, examining its structure, categories, and conversational style to get a clear picture of the data available. Next, we build a simple parser to extract key components such as reasoning traces, tool calls, and tool responses, allowing us to separate internal thinking from external actions. We then analyze patterns such as tool-usage frequency, conversation length, and error rates to better understand agent behavior, and create visualizations that make these trends easier to see. Finally, we prepare the dataset for training by converting it into a format suitable for your model, making it usable for tasks such as supervised fine-tuning.

!pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl


import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets


random.seed(0)


CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))


COMPARE_BOTH = False
if COMPARE_BOTH:
    ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
    ds_glm  = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
    ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
    ds_glm  = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
    ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
    print("Combined:", ds, "→ counts:", Counter(ds["source"]))


sample = ds[0]
print("\n=== Sample 0 ===")
print("id        :", sample["id"])
print("category  :", sample["category"], "/", sample["subcategory"])
print("task      :", sample["task"])
print("turns     :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")

Set up your environment by installing the required libraries and importing the necessary modules. Next, load the lambda/hermes-agent-reasoning-traces dataset and inspect its structure, fields, and categories. Optionally, combine multiple dataset configurations and explore individual samples to understand the conversational format.
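Before parsing anything, it helps to be explicit about the record layout the inspection code above relies on. The field names ("id", "category", "subcategory", "task", "conversations", and the "from"/"value" keys inside each turn) come from the dataset; the values in this synthetic record are invented purely for illustration:

```python
# A hand-written record mirroring the schema the inspection code assumes.
# All values here are made up; only the field names match the dataset.
record = {
    "id": "demo-0001",
    "category": "coding",
    "subcategory": "debugging",
    "task": "Fix a failing unit test",
    "conversations": [
        {"from": "system", "value": "You are a helpful agent."},
        {"from": "human",  "value": "The test suite fails, please investigate."},
        {"from": "gpt",    "value": "<think>Check the traceback first.</think>"},
    ],
}

# Each trajectory is just an ordered list of speaker turns.
roles = [t["from"] for t in record["conversations"]]
print(roles)                                   # speaker order in the trajectory
print(len(record["conversations"]), "turns")
```

Every downstream step in this tutorial — the parser, the statistics loop, and the training-format conversion — walks this same `conversations` list.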

THINK_RE     = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)


def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "<malformed>", "arguments": {}})
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}


def parse_tool(value: str):
    raw = TOOL_RESP_RE.search(value)
    if not raw: return {"raw": value}
    body = raw.group(1)
    try:    return json.loads(body)
    except: return {"raw": body}


first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls      :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])

Define a regular-expression-based parser to extract reasoning traces, tool calls, and tool responses from the dataset. We process assistant messages to separate thoughts, actions, and the final output in a structured way, then test the parser on a sample conversation to make sure the extraction works correctly.
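As a quick sanity check, the parser can be exercised on a hand-written message whose expected output we know in advance. The parser is restated here so the snippet runs standalone; the message content itself is invented:

```python
import json, re

# Same tag conventions as the dataset: <think>…</think> and <tool_call>{json}</tool_call>.
THINK_RE     = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)

def parse_assistant(value: str) -> dict:
    thoughts = [t.strip() for t in THINK_RE.findall(value)]
    calls = []
    for raw in TOOL_CALL_RE.findall(value):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            calls.append({"name": "<malformed>", "arguments": {}})
    # Whatever is left after stripping both tag types is the user-facing answer.
    final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
    return {"thoughts": thoughts, "tool_calls": calls, "final": final}

msg = (
    "<think>I should look up the weather first.</think>\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>\n'
    "Let me check that for you."
)
p = parse_assistant(msg)
print(p["thoughts"])               # the hidden reasoning
print(p["tool_calls"][0]["name"])  # the action taken
print(p["final"])                  # the visible answer
```

Because the regexes are non-greedy and the JSON decode is wrapped in a try/except, a malformed tool call degrades to a `<malformed>` placeholder instead of crashing the scan.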

N = 3000
sub = ds.select(range(min(N, len(ds))))


tool_calls         = Counter()
parallel_widths    = Counter()
thoughts_per_turn  = []
calls_per_traj     = []
errors_per_traj    = []
turns_per_traj     = []
cat_counts         = Counter()


for ex in sub:
    cat_counts[ex["category"]] += 1
    n_calls = n_err = 0
    turns_per_traj.append(len(ex["conversations"]))
    for t in ex["conversations"]:
        if t["from"] == "gpt":
            p = parse_assistant(t["value"])
            thoughts_per_turn.append(len(p["thoughts"]))
            if p["tool_calls"]:
                parallel_widths[len(p["tool_calls"])] += 1
                for c in p["tool_calls"]:
                    tool_calls[c.get("name", "<unknown>")] += 1
                n_calls += len(p["tool_calls"])
        elif t["from"] == "tool":
            r = parse_tool(t["value"])
            blob = json.dumps(r).lower()
            if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
                n_err += 1
    calls_per_traj.append(n_calls)
    errors_per_traj.append(n_err)


print(f"\nScanned {len(sub)} trajectories")
print(f"Avg turns/traj      : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with >=1 error    : {100*np.mean([e>0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns    : {100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f}%")
print("Top 10 tools        :", tool_calls.most_common(10))


fig, axes = plt.subplots(2, 2, figsize=(13, 9))


top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")


ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in one turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")


axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")


cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")


plt.tight_layout(); plt.show()

Run the analysis across the dataset to measure tool usage, conversation length, and error patterns. Aggregating statistics over many samples reveals the overall behavior of the agents, and the visualizations highlight trends such as tool frequency, parallel calls, and the distribution of categories.
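The per-trajectory lists built above also lend themselves to a compact pandas summary table, which is often easier to scan than individual print statements. The numbers below are toy stand-ins for `turns_per_traj`, `calls_per_traj`, and `errors_per_traj`, invented purely to show the pattern:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the per-trajectory lists computed during the scan.
turns_per_traj  = [8, 12, 10, 14, 9]
calls_per_traj  = [2, 5, 3, 6, 2]
errors_per_traj = [0, 1, 0, 2, 0]

# One column per metric, one row per aggregate.
summary = pd.DataFrame({
    "turns":  turns_per_traj,
    "calls":  calls_per_traj,
    "errors": errors_per_traj,
}).agg(["mean", "median", "max"])

print(summary)

# Fraction of trajectories that hit at least one tool error.
error_rate = np.mean([e > 0 for e in errors_per_traj])
print(f"trajectories with >=1 error: {100*error_rate:.0f}%")
```

Swapping the toy lists for the real ones from the scan loop gives a one-table overview of the whole subset.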

def render_trace(ex, max_chars=350):
    print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
    for t in ex["conversations"]:
        role = t["from"]
        if role == "system":
            continue
        if role == "human":
            print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
        elif role == "gpt":
            p = parse_assistant(t["value"])
            for th in p["thoughts"]:
                print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
            for c in p["tool_calls"]:
                args = json.dumps(c.get("arguments", {}))[:200]
                print(f"[CALL] {c.get('name')}({args})")
            if p["final"]:
                print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
        elif role == "tool":
            print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
    print("="*72)


idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])


def get_tool_schemas(ex):
    try:    return json.loads(ex["tools"])
    except: return []


schemas = get_tool_schemas(sample)
print(f"\nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
    fn = s.get("function", {})
    print(" -", fn.get("name"), "—", (fn.get("description") or "")[:80])


ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


def to_openai_messages(conv):
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]


example_msgs = to_openai_messages(sample["conversations"])
print("\nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
    print(" ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")

Build a utility that renders the entire conversation trace in a readable format for inspection. We also extract the tool schemas and convert the dataset into an OpenAI-style message format for compatibility with training pipelines. This helps you understand both the tool structure and how to standardize conversations.
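The ShareGPT-to-OpenAI role conversion is small enough to verify in isolation. Here it is restated on a synthetic three-turn conversation (the message contents are invented) so the mapping can be checked standalone:

```python
# ShareGPT-style "from"/"value" turns -> OpenAI-style "role"/"content" messages.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

def to_openai_messages(conv):
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]

conv = [
    {"from": "human", "value": "List the files in /tmp"},
    {"from": "gpt",   "value": '<tool_call>{"name": "ls", "arguments": {"path": "/tmp"}}</tool_call>'},
    {"from": "tool",  "value": '<tool_response>{"files": []}</tool_response>'},
]
msgs = to_openai_messages(conv)
print([m["role"] for m in msgs])  # -> ['user', 'assistant', 'tool']
```

Note that message bodies pass through untouched: the `<tool_call>` and `<tool_response>` tags survive the conversion, which is why the tokenization step later still has to decide how tool turns are handled.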

from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)


def build_masked(conv, tokenizer, max_len=2048):
    msgs = to_openai_messages(conv)
    for m in msgs:
        if m["role"] == "tool":
            m["role"] = "user"
            m["content"] = "[TOOL OUTPUT]\n" + m["content"]
    input_ids, labels = [], []
    for m in msgs:
        text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
    return input_ids[:max_len], labels[:max_len]


ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"\nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")


think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
    for t in ex["conversations"]:
        if t["from"] != "gpt": continue
        p = parse_assistant(t["value"])
        for th in p["thoughts"]: think_lens.append(len(th))
        for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
        if p["final"]: ans_lens.append(len(p["final"]))


plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
         label=["<think>", "<tool_call>", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()


class TraceReplayer:
    def __init__(self, ex):
        self.ex = ex
        self.steps = []
        pending = None
        for t in ex["conversations"]:
            if t["from"] == "gpt":
                if pending: self.steps.append(pending)
                pending = {"think": parse_assistant(t["value"]), "responses": []}
            elif t["from"] == "tool" and pending:
                pending["responses"].append(parse_tool(t["value"]))
        if pending: self.steps.append(pending)
    def __len__(self): return len(self.steps)
    def play(self, i):
        s = self.steps[i]
        print(f"\n── Step {i+1}/{len(self)} ──")
        for th in s["think"]["thoughts"]:
            print(f"   {textwrap.shorten(th, 280)}")
        for c in s["think"]["tool_calls"]:
            print(f"⚙  {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
        for r in s["responses"]:
            print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
        if s["think"]["final"]:
            print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")


rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
    rp.play(i)


TRAIN = False
if TRAIN:
    import torch
    from transformers import AutoModelForCausalLM
    from trl import SFTTrainer, SFTConfig

    train_subset = ds.select(range(200))

    def to_text(batch):
        msgs = to_openai_messages(batch["conversations"])
        for m in msgs:
            if m["role"] == "tool":
                m["role"] = "user"
                m["content"] = "[TOOL]\n" + m["content"]
        batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
        return batch

    train_subset = train_subset.map(to_text)
    model = AutoModelForCausalLM.from_pretrained(
        TOK_ID,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None,
    )
    cfg = SFTConfig(
        output_dir="hermes-sft-demo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=20,
        learning_rate=2e-5,
        logging_steps=2,
        max_seq_length=1024,
        dataset_text_field="text",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )
    SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
    print("Fine-tuning demo finished.")

print("\n✅ Tutorial complete. You now have a parser, analysis, plots, a replayer, tokenized and label-masked SFT samples, and an optional training hook.")

Tokenize the conversations and apply label masking so that only the assistant's responses contribute to training. For further insight, we analyze the length distributions of reasoning traces, tool calls, and final answers. We also implement a trace replayer to step through the agent's behavior and, optionally, run a small fine-tuning loop.
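The masking convention itself is easy to lose inside the tokenizer machinery, so here is the same logic with a toy whitespace "tokenizer" standing in for the real chat template. Everything here is invented for illustration; only the `-100` ignore index and the assistant-only labeling mirror the real `build_masked`:

```python
# Toy illustration of the -100 label-masking convention used in build_masked.
IGNORE = -100

def fake_encode(text):
    # Stand-in for tokenizer.encode: one "token" per whitespace-separated word.
    return text.split()

def build_masked_toy(msgs):
    input_ids, labels = [], []
    for m in msgs:
        ids = fake_encode(m["content"])
        input_ids.extend(ids)
        # Only assistant tokens carry real labels; all other roles are ignored
        # by the cross-entropy loss via the -100 sentinel.
        labels.extend(ids if m["role"] == "assistant" else [IGNORE] * len(ids))
    return input_ids, labels

msgs = [
    {"role": "user",      "content": "what is 2 plus 2"},
    {"role": "assistant", "content": "the answer is 4"},
]
ids, lbls = build_masked_toy(msgs)
trainable = sum(1 for x in lbls if x != IGNORE)
print(len(ids), "tokens,", trainable, "trainable")  # 9 tokens, 4 trainable
```

The real version differs only in that it uses the tokenizer's chat template and integer token IDs; the labels-vs-inputs alignment shown here is identical.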

In conclusion, we have developed a structured workflow to parse, analyze, and manipulate agent reasoning traces. We broke conversations down into meaningful components, traced step by step how the agent reasoned, and measured how it interacted with its tools while solving problems. Visualization and analysis gave us insight into common patterns and behaviors across the dataset. Finally, we transformed the data into a format suitable for training language models, including tokenization and label masking of the assistant's responses. This workflow provides a solid foundation for researching, evaluating, and improving tool-using AI systems in a practical, scalable way.



The post Coding Implementation for Parsing, Analyzing, Visualizing, and Fine-Tuning Agent Reasoning Traces Using the lambda/hermes-agent-reasoning-traces Dataset appeared first on MarkTechPost.
