Coding Implementation of a Comprehensive Enterprise AI Benchmarking Framework for Evaluating Rule-Based, LLM, and Hybrid Agentic AI Systems Across Real-World Tasks

by root November 2, 2025


In this tutorial, we develop a comprehensive benchmarking framework for evaluating different types of agentic AI systems on real-world enterprise software tasks. We design a diverse set of challenges, from data transformation and API integration to workflow automation and performance optimization, and evaluate how different agents, including rule-based, LLM-powered, and hybrid agents, perform across these domains. By running structured benchmarks and visualizing key performance metrics such as accuracy, execution time, and success rate, we can better understand the strengths and tradeoffs of each agent in an enterprise setting.

import json
import time
import random
from typing import Dict, List, Any, Callable
from dataclasses import dataclass, asdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


@dataclass
class Task:
   id: str
   name: str
   description: str
   category: str
   complexity: int
   expected_output: Any


@dataclass
class BenchmarkResult:
   task_id: str
   agent_name: str
   success: bool
   execution_time: float
   accuracy: float
   error_message: str = ""


class EnterpriseTaskSuite:
   def __init__(self):
       self.tasks = self._create_tasks()


   def _create_tasks(self) -> List[Task]:
       return [
           Task("data_transform", "CSV Data Transformation",
                "Transform customer data by aggregating sales", "data_processing", 3,
                {"total_sales": 15000, "avg_order": 750}),
           Task("api_integration", "REST API Integration",
                "Parse API response and extract key metrics", "integration", 2,
                {"status": "success", "active_users": 1250}),
           Task("workflow_automation", "Multi-Step Workflow",
                "Execute data validation -> processing -> reporting", "automation", 4,
                {"validated": True, "processed": 100, "report_generated": True}),
           Task("error_handling", "Error Recovery",
                "Handle malformed data gracefully", "reliability", 3,
                {"errors_caught": 5, "recovery_success": True}),
           Task("optimization", "Query Optimization",
                "Optimize database query performance", "performance", 5,
                {"execution_time_ms": 45, "rows_scanned": 1000}),
           Task("data_validation", "Schema Validation",
                "Validate data against business rules", "validation", 2,
                {"valid_records": 95, "invalid_records": 5}),
           Task("reporting", "Executive Dashboard",
                "Generate KPI summary report", "analytics", 3,
                {"revenue": 125000, "growth": 0.15, "customer_count": 450}),
           Task("integration_test", "System Integration",
                "Test end-to-end integration flow", "testing", 4,
                {"all_systems_connected": True, "latency_ms": 120}),
       ]


   def get_task(self, task_id: str) -> Task:
       # Returns None when no task has the given id.
       return next((t for t in self.tasks if t.id == task_id), None)

We define the core data structures of our benchmark system. We create the Task and BenchmarkResult dataclasses and initialize an EnterpriseTaskSuite that holds eight enterprise-relevant tasks, such as data transformation, reporting, and integration. This lays the foundation for consistently evaluating different types of agents across these tasks.
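As a minimal sketch of how such a suite could be queried, the snippet below groups tasks by category and looks one up by id. The Task class here mirrors the tutorial's dataclass so the example runs on its own; group_by_category and find_task are hypothetical helper names that are not part of the original code, and the two sample tasks are copied from the suite.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Task:
    id: str
    name: str
    description: str
    category: str
    complexity: int
    expected_output: Any


def group_by_category(tasks: List[Task]) -> Dict[str, List[str]]:
    """Hypothetical helper: map each category to the ids of its tasks."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for task in tasks:
        groups[task.category].append(task.id)
    return dict(groups)


def find_task(tasks: List[Task], task_id: str) -> Optional[Task]:
    """Return the first task with a matching id, or None if there is none."""
    return next((t for t in tasks if t.id == task_id), None)


sample_tasks = [
    Task("api_integration", "REST API Integration",
         "Parse API response and extract key metrics", "integration", 2,
         {"status": "success", "active_users": 1250}),
    Task("data_validation", "Schema Validation",
         "Validate data against business rules", "validation", 2,
         {"valid_records": 95, "invalid_records": 5}),
]

print(group_by_category(sample_tasks))
print(find_task(sample_tasks, "missing_id"))
```

A lookup that finds nothing returns None rather than raising, so callers should check the result before using it.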

class BaseAgent:
   def __init__(self, name: str):
       self.name = name


   def execute(self, task: Task) -> Dict[str, Any]:
       raise NotImplementedError


class RuleBasedAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       # Simulated latency of a fast, rule-driven system.
       time.sleep(random.uniform(0.1, 0.3))
       if task.category == "data_processing":
           return {"total_sales": 15000 + random.randint(-500, 500),
                   "avg_order": 750 + random.randint(-50, 50)}
       elif task.category == "integration":
           return {"status": "success", "active_users": 1250}
       elif task.category == "automation":
           return {"validated": True, "processed": 98, "report_generated": True}
       else:
           return task.expected_output

We introduce a basic agent structure with BaseAgent and implement a RuleBasedAgent that uses predefined rules to mimic traditional automation logic. We simulate how such agents handle tasks with fixed, category-specific responses while staying fast and reliable, providing a baseline for comparison with more advanced agents.
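The if/elif chain can also be written as a lookup table from category to handler, which may make it easier to add rules later. The following is a sketch under that assumption, not the tutorial's implementation: RULES, run_rules, and the handler functions are hypothetical names, the expected output is passed as a plain dictionary, and the simulated time.sleep delay is left out.

```python
import random
from typing import Any, Callable, Dict

Output = Dict[str, Any]
Handler = Callable[[Output, random.Random], Output]


# Hypothetical handlers, one per task category. The returned values are the
# same ones the rule-based agent in the tutorial produces.
def handle_data_processing(expected: Output, rng: random.Random) -> Output:
    return {"total_sales": 15000 + rng.randint(-500, 500),
            "avg_order": 750 + rng.randint(-50, 50)}


def handle_integration(expected: Output, rng: random.Random) -> Output:
    return {"status": "success", "active_users": 1250}


def handle_automation(expected: Output, rng: random.Random) -> Output:
    return {"validated": True, "processed": 98, "report_generated": True}


# Category name -> handler. Adding a rule means adding one entry here.
RULES: Dict[str, Handler] = {
    "data_processing": handle_data_processing,
    "integration": handle_integration,
    "automation": handle_automation,
}


def run_rules(category: str, expected: Output, rng: random.Random) -> Output:
    """Dispatch on category; fall back to the expected output, as the else branch does."""
    handler = RULES.get(category)
    if handler is None:
        return expected
    return handler(expected, rng)


rng = random.Random(0)  # seed chosen arbitrarily for the example
print(run_rules("integration", {"status": "success", "active_users": 1250}, rng))
print(run_rules("reliability", {"errors_caught": 5, "recovery_success": True}, rng))
```

Passing the random generator in explicitly is a design choice of this sketch; it keeps the handlers easy to test with a fixed seed.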

class LLMAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.2, 0.5))
       accuracy_boost = 0.95 if task.complexity >= 4 else 0.90
       result = {}
       for key, value in task.expected_output.items():
           # bool is a subclass of int, so exclude it from the numeric branch.
           if isinstance(value, (int, float)) and not isinstance(value, bool):
               variation = value * (1 - accuracy_boost)
               result[key] = value + random.uniform(-variation, variation)
           else:
               result[key] = value
       return result


class HybridAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.15, 0.35))
       if task.complexity <= 2:
           # Simple tasks take the exact, rule-like path.
           return task.expected_output
       else:
           result = {}
           for key, value in task.expected_output.items():
               if isinstance(value, (int, float)) and not isinstance(value, bool):
                   variation = value * 0.03
                   result[key] = value + random.uniform(-variation, variation)
               else:
                   result[key] = value
           return result

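We develop two more agent types: LLMAgent, which represents an inference-based AI system, and HybridAgent, which combines rule-based precision with LLM adaptability. Both are simulated here: each perturbs the numeric fields of the expected output with bounded random noise, up to 10% for the LLM agent (5% on tasks of complexity 4 or higher) and up to 3% for the hybrid agent, which returns exact results on tasks of complexity 2 or lower. These agents are meant to illustrate how such approaches might compare on task accuracy, especially in complex enterprise workflows.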
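To make the noise bounds concrete, here is a small sketch of the perturbation step applied to the workflow_automation task's expected output. The perturb function and its rel_noise parameter are hypothetical names introduced for illustration; the 0.05 and 0.03 values correspond to the LLM agent's 1 - 0.95 on a complexity-4 task and the hybrid agent's 0.03. This sketch leaves boolean fields untouched, since bool is a subclass of int in Python and would otherwise be treated as a number.

```python
import random
from typing import Any, Dict


def perturb(expected: Dict[str, Any], rel_noise: float, rng: random.Random) -> Dict[str, Any]:
    """Add uniform noise of at most +/- rel_noise * |value| to numeric fields."""
    result = {}
    for key, value in expected.items():
        # Exclude bool explicitly: isinstance(True, int) is True in Python.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            spread = abs(value) * rel_noise
            result[key] = value + rng.uniform(-spread, spread)
        else:
            result[key] = value
    return result


rng = random.Random(0)  # fixed seed, chosen arbitrarily
expected = {"validated": True, "processed": 100, "report_generated": True}

llm_like = perturb(expected, rel_noise=0.05, rng=rng)     # stays within 95..105
hybrid_like = perturb(expected, rel_noise=0.03, rng=rng)  # stays within 97..103

print(llm_like)
print(hybrid_like)
```

With an expected value of 100, the LLM-style output stays between 95 and 105 and the hybrid-style output between 97 and 103, while the two boolean flags pass through unchanged.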

class BenchmarkEngine:
   def __init__(self, task_suite: EnterpriseTaskSuite):
       self.task_suite = task_suite
       self.results: List[BenchmarkResult] = []


   def run_benchmark(self, agent: BaseAgent, iterations: int = 3):
       print(f"\n{'='*60}")
       print(f"Benchmarking Agent: {agent.name}")
       print(f"{'='*60}")
       for task in self.task_suite.tasks:
           print(f"\nTask: {task.name} (Complexity: {task.complexity}/5)")
           for i in range(iterations):
               result = self._execute_task(agent, task, i+1)
               self.results.append(result)
               status = "✓ PASS" if result.success else "✗ FAIL"
               print(f"  Run {i+1}: {status} | Time: {result.execution_time:.3f}s | Accuracy: {result.accuracy:.2%}")

Here we build the core of the benchmarking engine, which manages agent evaluation across the defined task suite. We implement a method that runs each agent several times on every task, logs the results, and reports key metrics such as execution time and accuracy. This gives us a systematic, repeatable benchmark loop.
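Because the simulated agents draw from Python's random module, two runs of the loop will generally produce different outputs unless the generator is seeded. The sketch below shows one way to seed it, assuming a stand-in noisy_output function (hypothetical, not from the tutorial) in place of a real agent; run_once is likewise an illustrative name.

```python
import random
from typing import List


def noisy_output(expected: float, rel_noise: float = 0.1) -> float:
    """Hypothetical stand-in for an agent: expected value plus uniform noise."""
    spread = expected * rel_noise
    return expected + random.uniform(-spread, spread)


def run_once(seed: int, iterations: int = 3) -> List[float]:
    """Seed the global generator, then collect one output per iteration."""
    random.seed(seed)
    return [noisy_output(15000) for _ in range(iterations)]


first = run_once(seed=42)
second = run_once(seed=42)

# The same seed yields the same sequence of simulated outputs.
print(first == second)
print(first)
```

Seeding fixes the simulated outputs and therefore the accuracy scores. Measured execution times can still vary slightly between runs, since they depend on the sleep calls and on system load.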

   def _execute_task(self, agent: BaseAgent, task: Task, run_num: int) -> BenchmarkResult:
       start_time = time.time()
       try:
           output = agent.execute(task)
           execution_time = time.time() - start_time
           accuracy = self._calculate_accuracy(output, task.expected_output)
           # A run passes when its mean field score reaches 0.85.
           success = accuracy >= 0.85
           return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=success,
                                  execution_time=execution_time, accuracy=accuracy)
       except Exception as e:
           execution_time = time.time() - start_time
           return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=False,
                                  execution_time=execution_time, accuracy=0.0, error_message=str(e))


   def _calculate_accuracy(self, output: Dict, expected: Dict) -> float:
       if not output:
           return 0.0
       scores = []
       for key, expected_val in expected.items():
           if key not in output:
               scores.append(0.0)
               continue
           actual_val = output[key]
           # Check bool before int/float, because bool is a subclass of int.
           if isinstance(expected_val, bool):
               scores.append(1.0 if actual_val == expected_val else 0.0)
           elif isinstance(expected_val, (int, float)):
               # Linear score: 1.0 at an exact match, 0.0 at a 10% error or more.
               diff = abs(actual_val - expected_val)
               tolerance = abs(expected_val * 0.1)
               score = max(0, 1 - (diff / (tolerance + 1e-9)))
               scores.append(score)
           else:
               scores.append(1.0 if actual_val == expected_val else 0.0)
       return float(np.mean(scores)) if scores else 0.0

We define the task execution logic and the accuracy calculation. Each output field is compared with the expected result: boolean and string fields must match exactly, while numeric fields are scored linearly within a 10% tolerance. A run counts as a success when its mean score is at least 0.85. This keeps the benchmarking process quantitative and consistent across agents, and shows how closely each agent matches enterprise expectations.
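A short worked example of the numeric scoring rule may help. The function below restates the formula from _calculate_accuracy on its own; numeric_score and rel_tolerance are names chosen for this sketch, and the expected value of 1000 is an arbitrary illustration.

```python
def numeric_score(actual: float, expected: float, rel_tolerance: float = 0.1) -> float:
    """Linear score: 1.0 at an exact match, 0.0 once the error reaches the tolerance."""
    diff = abs(actual - expected)
    tolerance = abs(expected * rel_tolerance)
    # The small constant avoids division by zero when expected is 0.
    return max(0.0, 1 - diff / (tolerance + 1e-9))


expected = 1000
for actual in (1000, 985, 950, 900, 800):
    # Print each candidate output next to its rounded score.
    print(actual, round(numeric_score(actual, expected), 3))
```

For these inputs the scores come out at roughly 1.0, 0.85, 0.5, 0.0, and 0.0. In other words, for a task with a single numeric field, an error of about 1.5% of the expected value sits right at the 0.85 pass threshold, and any error of 10% or more scores zero.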

   def generate_report(self):
       df = pd.DataFrame([asdict(r) for r in self.results])
       print(f"\n{'='*60}")
       print("BENCHMARK REPORT")
       print(f"{'='*60}\n")
       for agent_name in df['agent_name'].unique():
           agent_df = df[df['agent_name'] == agent_name]
           print(f"{agent_name}:")
           print(f"  Success Rate: {agent_df['success'].mean():.1%}")
           print(f"  Avg Execution Time: {agent_df['execution_time'].mean():.3f}s")
           print(f"  Avg Accuracy: {agent_df['accuracy'].mean():.2%}\n")
       return df


   def visualize_results(self, df: pd.DataFrame):
       fig, axes = plt.subplots(2, 2, figsize=(14, 10))
       fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight="bold")
       # Top left: success rate per agent.
       success_rate = df.groupby('agent_name')['success'].mean()
       axes[0, 0].bar(success_rate.index, success_rate.values, color=['#3498db', '#e74c3c', '#2ecc71'])
       axes[0, 0].set_title('Success Rate by Agent', fontweight="bold")
       axes[0, 0].set_ylabel('Success Rate')
       axes[0, 0].set_ylim(0, 1.1)
       for i, v in enumerate(success_rate.values):
           axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha="center", fontweight="bold")
       # Top right: mean execution time per agent.
       time_data = df.groupby('agent_name')['execution_time'].mean()
       axes[0, 1].bar(time_data.index, time_data.values, color=['#3498db', '#e74c3c', '#2ecc71'])
       axes[0, 1].set_title('Average Execution Time', fontweight="bold")
       axes[0, 1].set_ylabel('Time (seconds)')
       for i, v in enumerate(time_data.values):
           axes[0, 1].text(i, v + 0.01, f'{v:.3f}s', ha="center", fontweight="bold")
       # Bottom left: accuracy distribution per agent.
       df.boxplot(column='accuracy', by='agent_name', ax=axes[1, 0])
       axes[1, 0].set_title('Accuracy Distribution', fontweight="bold")
       axes[1, 0].set_xlabel('Agent')
       axes[1, 0].set_ylabel('Accuracy')
       plt.sca(axes[1, 0])
       plt.xticks(rotation=15)
       # Bottom right: mean accuracy per agent at each complexity level.
       task_complexity = {t.id: t.complexity for t in self.task_suite.tasks}
       df['complexity'] = df['task_id'].map(task_complexity)
       complexity_perf = df.groupby(['agent_name', 'complexity'])['accuracy'].mean().unstack()
       complexity_perf.plot(kind='line', ax=axes[1, 1], marker="o", linewidth=2)
       axes[1, 1].set_title('Accuracy by Task Complexity', fontweight="bold")
       axes[1, 1].set_xlabel('Task Complexity')
       axes[1, 1].set_ylabel('Accuracy')
       axes[1, 1].legend(title="Agent", loc="best")
       axes[1, 1].grid(True, alpha=0.3)
       plt.tight_layout()
       plt.show()


if __name__ == "__main__":
   print("Enterprise Software Benchmarking for Agentic Agents")
   print("="*60)
   task_suite = EnterpriseTaskSuite()
   benchmark = BenchmarkEngine(task_suite)
   agents = [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]
   for agent in agents:
       benchmark.run_benchmark(agent, iterations=3)
   results_df = benchmark.generate_report()
   benchmark.visualize_results(results_df)
   results_df.to_csv('agent_benchmark_results.csv', index=False)
   print("\nResults exported to: agent_benchmark_results.csv")

We generate detailed reports and create visual analytics to compare performance. We analyze metrics such as success rate, execution time, and accuracy across agents and task complexity levels. Finally, we export the results to a CSV file, completing the evaluation workflow.
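The per-agent figures in the report can also be computed in a single pandas call. The sketch below uses named aggregation on a tiny hand-written table; the rows and the agent labels "Agent A" and "Agent B" are made up purely for illustration and are not benchmark measurements, and the output column names are this example's own choice.

```python
import pandas as pd

# Illustrative rows only: these numbers are invented for the example and
# are not results from the benchmark.
rows = [
    {"agent_name": "Agent A", "success": True,  "execution_time": 0.20, "accuracy": 1.00},
    {"agent_name": "Agent A", "success": False, "execution_time": 0.30, "accuracy": 0.50},
    {"agent_name": "Agent B", "success": True,  "execution_time": 0.40, "accuracy": 0.90},
    {"agent_name": "Agent B", "success": True,  "execution_time": 0.60, "accuracy": 0.96},
]
df = pd.DataFrame(rows)

# One row per agent; the mean of a boolean column is the fraction of True values.
summary = df.groupby("agent_name").agg(
    success_rate=("success", "mean"),
    avg_time=("execution_time", "mean"),
    avg_accuracy=("accuracy", "mean"),
)
print(summary)
```

The resulting table has one row per agent and can be written out with to_csv in the same way as the raw results.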

In conclusion, we implemented a robust and extensible benchmarking system that lets us measure and compare the efficiency, adaptability, and accuracy of multiple agentic AI approaches. We saw how different architectures can perform better at different levels of task complexity and how visual analysis highlights performance trends. This process makes it possible to evaluate existing agents and provides a strong foundation for building the next generation of enterprise AI agents, optimized for reliability and intelligence.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform. It stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understood by a wide audience. The platform has over 2 million monthly views, which reflects its popularity among readers.
