Across both the Sales Forecast Agent and the Pricing Forecast & Intelligence Agent we built recently, every response, recommendation, and risk alert is backed by a three-tier eval system: Golden Dataset Evaluation, Pre-Deployment Evaluation, and Continuous Evaluation. Each layer serves a distinct purpose: establishing ground truth, gating deployment, and sustaining reliability in production.
Eval Service 1: Golden Dataset Evaluation
Before any agent sees real-world data, it’s tested against a precisely constructed synthetic dataset that covers the full range of business scenarios it will encounter. For the Sales Forecast Agent, this is a JSON dataset scoped to California, Texas, New York, and Florida containing open pipeline deals, historical closed-won data, quota targets, and activity signals. Every deal and forecast scenario has a pre-validated answer, designed to exercise everything from 30/60/90-day horizon forecasting to risk detection for deals inactive over 14 days.
For the Pricing Forecast & Intelligence Agent, the golden dataset includes 9,000 historical sales transactions across 25 synthetic SKUs, 6,750 competitor price feeds, and 16 months of macro-economic indicators — dense enough to stress-test every analytics module across all four customer segments and five competitor feeds.
Golden dataset eval checks core computational logic against known answers:
- Does the forecast engine correctly apply Forecast Revenue = deal_value × probability%?
- Does the risk engine flag deals with no activity beyond 14 days?
- Does the margin audit detect transactions where the selling price falls below COGS?
- Does the competitor monitor trigger alerts when a rival’s price is more than 5% below list?
- Are elasticity classifications accurate – coefficients ≥ 2.0 mapped to “Highly Elastic,” and < 1.0 mapped to “Inelastic”?
Golden dataset eval establishes the agent’s accuracy floor before any real-world complexity is introduced.
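Each of the checks above reduces to a small pure function, which is what makes it possible to compare the agent's output against pre-validated answers. The following is a minimal sketch rather than Fusefy's implementation: the function names are hypothetical, elasticity is assumed to be compared by magnitude, and the "Moderately Elastic" label for the band between 1.0 and 2.0 is an assumption because the text only names the two outer classes.

```python
from datetime import date

INACTIVITY_LIMIT_DAYS = 14   # risk rule: no activity beyond 14 days
COMPETITOR_GAP_LIMIT = 0.05  # alert when a rival is more than 5% below list


def forecast_revenue(deal_value: float, probability_pct: float) -> float:
    """Forecast Revenue = deal_value x probability%."""
    return deal_value * probability_pct / 100.0


def is_at_risk(last_activity: date, today: date) -> bool:
    """Flag a deal whose last activity is more than 14 days old."""
    return (today - last_activity).days > INACTIVITY_LIMIT_DAYS


def is_below_cogs(selling_price: float, cogs: float) -> bool:
    """Margin audit: the selling price falls below cost of goods sold."""
    return selling_price < cogs


def competitor_alert(list_price: float, rival_price: float) -> bool:
    """Alert when the rival's price is more than 5% below our list price."""
    return rival_price < list_price * (1 - COMPETITOR_GAP_LIMIT)


def classify_elasticity(coefficient: float) -> str:
    """Map an elasticity coefficient to a class (magnitude is an assumption)."""
    magnitude = abs(coefficient)
    if magnitude >= 2.0:
        return "Highly Elastic"
    if magnitude < 1.0:
        return "Inelastic"
    return "Moderately Elastic"  # assumed label; not named in the text
```

A golden dataset record then pairs inputs with the expected result of one of these functions, and the eval compares the agent's answer with that expected value.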
Benchmarking the Golden Dataset: How We Know the Dataset Itself Is Trustworthy
Passing accuracy checks against a golden dataset only means something if the dataset itself is built correctly. Fusefy applies a dedicated benchmarking layer to validate the quality of the dataset before it’s used to evaluate anything else. Three dataset failure modes are explicitly guarded against:
Overfitted datasets occur when training data is too narrow. A model can score 96%+ accuracy on a backtest while still failing on new seasonal patterns or new product lines it’s never seen. This happens when the dataset is built from a single year of data, contains no adversarial months (COVID-period anomalies, product launches, sudden price changes), and never surfaces the revenue shocks the model needs to learn from. Backtest MAE stays green and live forecast error climbs silently.
Underfitted datasets occur when data is too sparse: fewer than 12 months of history, a single region, no product-level breakdown, and no contested months with meaningful revenue peaks or troughs. The model can’t learn seasonality or year-over-year growth and defaults to flat predictions, missing Q4 spikes and regional growth curves entirely.
AI-generated benchmark labels are the most dangerous anti-pattern in ML evaluation. Using an LLM to set “expected” forecast values means the model inherits the LLM’s own blind spots. Seasonal outliers and pipeline shocks get normalized away, so the benchmark becomes one the model can pass precisely because the hard cases are gone.
To prevent all three, Fusefy enforces the following Dataset Quality Gates before any golden dataset is used for evaluation:
| Gate | Requirement | Why It Matters |
|---|---|---|
| History length | ≥ 18 months | Captures at least one full seasonal cycle and year-over-year signal |
| Anomaly density | 10–20% of months | Ensures the model learns from revenue shocks it cannot memorise |
| Regions covered | ≥ 3 regions | Prevents incorrect generalisation to new geographies and SKUs |
| Label source | Human-verified only | LLM-smoothed labels remove the anomalies the model needs to learn |
A dataset that passes all four gates is one worth benchmarking against. A dataset that fails any of them, however large, produces false confidence.
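The four gates can be checked mechanically once a dataset has been profiled. The sketch below assumes a hypothetical `DatasetProfile` summary (the field names are illustrative, not Fusefy's schema) and returns the names of any failed gates, so an empty list means the dataset is usable.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    # Hypothetical summary of a candidate golden dataset.
    history_months: int
    anomaly_months: int
    regions: int
    human_verified_labels: bool


def quality_gate_failures(p: DatasetProfile) -> list[str]:
    """Return the names of the gates the dataset fails (empty means usable)."""
    failures = []
    if p.history_months < 18:
        failures.append("History length")
    # Anomaly density is the share of months flagged as anomalous.
    density = p.anomaly_months / p.history_months if p.history_months else 0.0
    if not 0.10 <= density <= 0.20:
        failures.append("Anomaly density")
    if p.regions < 3:
        failures.append("Regions covered")
    if not p.human_verified_labels:
        failures.append("Label source")
    return failures
```

Note that dataset size does not appear anywhere in the check: a very large dataset with a single region or LLM-generated labels still fails.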
Eval Service 2: Pre-Deployment Evaluation
Passing the golden dataset eval moves the agent into Pre-Deployment Evaluation, the quality gate that determines whether it’s ready for real users.
The KPIs We Validate Before Any Agent Ships
Pre-deployment eval doesn’t operate on gut feel or “MAE looks acceptable.” Every agent must clear a binary, all-or-nothing gate: all 8 KPIs must pass before deployment proceeds. Even a single failure blocks the release.
The KPIs enforced at this stage are:
| KPI | Target | What It Catches |
|---|---|---|
| Forecast Accuracy | ≥ 90% | Overall prediction reliability |
| MAE % (Mean Absolute Error) | < 5% | Magnitude of forecast error |
| Anomaly Recall | ≥ 85% | Whether the model correctly identifies revenue shocks |
| Hallucination Rate | < 2% | LLM-generated narratives unsupported by real data |
| Query Response Accuracy | ≥ 95% | Assistant correctness on business queries |
| Query Resolution Rate | ≥ 85% | Percentage of queries answered without fallback |
| API Latency (all endpoints) | < 4 seconds | Operational readiness for real users |
| Parity Validation (Pricing Agent) | 100% match | Database ingestion integrity against source files |
Average accuracy is not sufficient. A model that achieves 91% overall accuracy but scores 42% on Q4 anomaly recall has passed the wrong metric and will miss the exact months that determine annual attainment. The gate is designed so that the hardest scenarios set the bar.
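A binary gate of this kind is straightforward to encode. The sketch below is illustrative only: the metric keys are hypothetical names for the eight KPIs in the table, the latency value is assumed to be the slowest endpoint, and all eight are treated as required because the text does not say how parity validation applies to the Sales Forecast Agent.

```python
import operator

# (metric key, comparison, target) taken from the KPI table.
KPI_TARGETS = [
    ("forecast_accuracy_pct", operator.ge, 90.0),
    ("mae_pct", operator.lt, 5.0),
    ("anomaly_recall_pct", operator.ge, 85.0),
    ("hallucination_rate_pct", operator.lt, 2.0),
    ("query_response_accuracy_pct", operator.ge, 95.0),
    ("query_resolution_rate_pct", operator.ge, 85.0),
    ("api_latency_seconds", operator.lt, 4.0),
    ("parity_match_pct", operator.ge, 100.0),
]


def deployment_gate(measured: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (ship, failed_kpis). A missing metric counts as a failure."""
    failed = [
        name
        for name, passes, target in KPI_TARGETS
        if name not in measured or not passes(measured[name], target)
    ]
    return (not failed, failed)
```

With this shape, a run that reports 91% forecast accuracy but 42% anomaly recall returns `(False, ["anomaly_recall_pct"])`: the strong average cannot offset the weak recall.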
AI assistant accuracy is validated first. Both agents include a conversational assistant (the Sales Forecast Agent uses Claude Sonnet, grounded via RAG on live forecast context). Each assistant is tested against realistic business queries, such as “How are we tracking against Q3 quota?” for the Sales Agent and “Which products are highly price elastic?” for the Pricing Agent, and each response is scored against expected answers derivable from the golden dataset.
API performance is the second dimension. Every endpoint across both agents must respond in under 4 seconds. The Sales Forecast Agent’s surface (/dashboard, /forecast, /deals/at-risk, /scenario, and /assistant/chat) is validated for correctness, JSON format compliance, and latency.
Parity validation is a Pricing Agent-specific check that verifies database loading integrity against source files. It confirms that what was ingested into SQLite exactly matches the golden dataset or uploaded CSV, catching silent truncations, type mismatches, or missing records before they contaminate downstream queries.
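The latency and format parts of this check can be sketched as a small timing harness. Everything here is an assumption for illustration: `fetch` is a hypothetical zero-argument callable that performs the request against one endpoint and returns the raw response body, and checking the answer's correctness is left to a separate comparison against the golden dataset.

```python
import json
import time
from typing import Callable

MAX_LATENCY_SECONDS = 4.0  # every endpoint must respond in under 4 seconds


def check_endpoint(fetch: Callable[[], str]) -> dict:
    """Time one call and confirm that the body parses as JSON."""
    start = time.perf_counter()
    body = fetch()
    elapsed = time.perf_counter() - start
    try:
        json.loads(body)
        valid_json = True
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        valid_json = False
    return {
        "latency_ok": elapsed < MAX_LATENCY_SECONDS,
        "valid_json": valid_json,
        "elapsed_seconds": elapsed,
    }
```

In practice a single timing is noisy, so a harness like this would usually be run several times per endpoint and the slowest result compared with the threshold.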
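A parity check of this kind can be sketched with the standard library alone. The version below is an assumption about how such a check might look, not the Pricing Agent's code: the function name is hypothetical, values are compared by their string form (a production check would normalise types per column), and the table name is assumed to come from trusted test configuration.

```python
import csv
import io
import sqlite3


def parity_mismatches(csv_text: str, conn: sqlite3.Connection, table: str) -> list[str]:
    """Compare source CSV rows with ingested rows; return readable issues."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    if not columns:
        return ["source file has no header row"]
    source = sorted(tuple(row[c] for c in columns) for row in reader)

    # Identifiers are quoted; they come from test configuration, not user input.
    col_sql = ", ".join(f'"{c}"' for c in columns)
    cursor = conn.execute(f'SELECT {col_sql} FROM "{table}"')
    ingested = sorted(tuple(str(v) for v in row) for row in cursor)

    issues = []
    if len(source) != len(ingested):
        issues.append(f"row count: source={len(source)} ingested={len(ingested)}")
    missing = set(source) - set(ingested)
    if missing:
        issues.append(f"{len(missing)} source row(s) not found after ingestion")
    return issues
```

An empty list corresponds to the 100% match the KPI table requires; any entry blocks the gate.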
Pre-deployment eval is the deployment gate. It’s the formal assertion that the agent is accurate enough, fast enough, and grounded enough to be trusted by sales leaders and pricing managers.
Eval Service 3: Continuous Evaluation
Shipping an agent is not the finish line. Business conditions shift, data quality degrades, and model calibration drifts as real-world distributions diverge from what the agent was tested against. Continuous Evaluation monitors agents in production so these changes surface as actionable signals.
Runtime KPIs: What We Monitor and When We Act
Post-deployment monitoring is governed by the same KPI discipline as the pre-deployment gate, but now measured against live traffic, with tiered alert levels that drive specific responses:
| Alert Level | Signal | Action |
|---|---|---|
| OK | MAE < 5% and Hallucination Rate < 2% | Weekly KPI review — no intervention needed |
| MEDIUM | MAE 5–7% or Hallucination Rate 2–5% | Investigate data or prompt shift |
| HIGH | MAE > 7% or Hallucination Rate > 5% | Re-run full Stage 2 pre-deployment gate |
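
The alert tiers translate directly into a classifier over the two live metrics. This is a minimal sketch with a hypothetical function name; it checks the HIGH condition first so that the worse of the two signals decides the level.

```python
def alert_level(mae_pct: float, hallucination_rate_pct: float) -> str:
    """Map live MAE % and hallucination rate % to an alert level."""
    if mae_pct > 7.0 or hallucination_rate_pct > 5.0:
        return "HIGH"    # re-run the full pre-deployment gate
    if mae_pct >= 5.0 or hallucination_rate_pct >= 2.0:
        return "MEDIUM"  # investigate data or prompt shift
    return "OK"          # weekly KPI review only
```

For example, an MAE of 3% with a hallucination rate of 6% is HIGH even though the forecast error alone looks healthy.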
For the Sales Forecast Agent, the primary signal is forecast accuracy drift. Two metrics define the health envelope: Forecast Accuracy % must stay ≥ 90% within 30-day horizons, and the Bias Score must stay within ±3%. A sustained positive drift means the agent is systematically optimistic, which is a different failure mode than random error and requires a different response. Pipeline health metrics like Coverage Ratio, Churn Risk Score, and Stage Conversion Rate are monitored alongside forecast accuracy.
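To see why bias is tracked separately from accuracy, it helps to compute both on the same data. The formulas below are assumptions chosen for illustration, since the exact definitions are not given here: accuracy is taken as 100 × (1 − Σ|F − A| / ΣA) and bias as 100 × (ΣF − ΣA) / ΣA, where F is the forecast and A the actual.

```python
def forecast_accuracy_pct(forecast: list[float], actual: list[float]) -> float:
    """Assumed definition: 100 * (1 - sum|F - A| / sum A)."""
    abs_error = sum(abs(f - a) for f, a in zip(forecast, actual))
    return 100.0 * (1.0 - abs_error / sum(actual))


def bias_score_pct(forecast: list[float], actual: list[float]) -> float:
    """Assumed definition: 100 * (sum F - sum A) / sum A; positive = optimistic."""
    return 100.0 * (sum(forecast) - sum(actual)) / sum(actual)


def within_health_envelope(forecast: list[float], actual: list[float]) -> bool:
    """Accuracy >= 90% and bias within +/-3%, per the envelope above."""
    return (forecast_accuracy_pct(forecast, actual) >= 90.0
            and abs(bias_score_pct(forecast, actual)) <= 3.0)
```

Under these definitions, forecasts of `[105, 95, 105, 95]` and `[105, 105, 105, 105]` against actuals of `[100, 100, 100, 100]` have the same accuracy, but only the second is biased, and only the second falls outside the envelope.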
For the Pricing Forecast & Intelligence Agent, continuous eval tracks business impact over time: Revenue Uplift (target ≥ +3%), Gross Margin Improvement (target ≥ +1.5 percentage points), Price Realization Rate (alert below 85%), and Discount Leakage (alert above 12%). A negative Revenue Uplift means the agent is actively costing money.
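These four thresholds mix two kinds of signal: targets the agent should meet and alert lines it should not cross. A sketch of how they might be reported together follows; the function name and message wording are hypothetical, and the metric values are assumed to be computed upstream.

```python
def pricing_kpi_findings(revenue_uplift_pct: float,
                         margin_improvement_pp: float,
                         price_realization_pct: float,
                         discount_leakage_pct: float) -> list[str]:
    """Return findings for the Pricing Agent's business-impact KPIs."""
    findings = []
    if revenue_uplift_pct < 0:
        findings.append("ALERT: negative revenue uplift")
    elif revenue_uplift_pct < 3.0:
        findings.append("BELOW TARGET: revenue uplift")
    if margin_improvement_pp < 1.5:
        findings.append("BELOW TARGET: gross margin improvement")
    if price_realization_pct < 85.0:
        findings.append("ALERT: price realization rate")
    if discount_leakage_pct > 12.0:
        findings.append("ALERT: discount leakage")
    return findings
```

Separating a negative uplift from a merely below-target one keeps the case where the agent is costing money from being lost among softer misses.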
Both agents continuously track assistant performance against the same thresholds set in pre-deployment eval, now measured against real user interactions. The Sales Forecast Agent adds Context Retention Score (≥ 90% in multi-turn conversations), a metric that matters more in production, where context windows fill up and session data changes between queries, than in testing. User satisfaction, targeted at ≥ 4.2 / 5.0, provides a qualitative layer that catches degradation automated metrics can miss.
The weekly canary provides the earliest warning signal: five fixed historical months are re-run every week and alerts fire if predictions deviate more than 5% from baseline backtests. This catches model drift before it compounds into board-day surprises.
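The canary comparison itself is a percentage deviation per month. In the sketch below the function name and the month keys are hypothetical; `baseline` holds the stored backtest predictions for the fixed canary months and `rerun` holds this week's predictions for the same months.

```python
def canary_alerts(baseline: dict[str, float], rerun: dict[str, float],
                  tolerance_pct: float = 5.0) -> list[str]:
    """Return the canary months whose re-run prediction drifted past tolerance."""
    drifted = []
    for month, expected in baseline.items():
        deviation_pct = abs(rerun[month] - expected) / abs(expected) * 100.0
        if deviation_pct > tolerance_pct:
            drifted.append(month)
    return drifted
```

Because the inputs are fixed historical months, any deviation reflects a change in the model or its data pipeline rather than a change in the business.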
The revalidation loop closes the system: a KPI breach triggers model retraining, which re-runs the Stage 1 golden dataset eval, which in turn feeds the binary Stage 2 deploy decision across all 8 KPIs before anything goes back to production. The system never stays static: every drift triggers revalidation.
Continuous eval is what separates a demo-ready agent from a production-grade system — the difference between proving something works once and proving it keeps working.
One Framework, Two Agents
The three-tier eval framework applies consistently across both agents, despite the fact that they operate in different domains with different data models, business rules, and risk profiles. That consistency is deliberate. Fusefy treats eval as an architectural principle: every capability is designed from the start with the question, how will we know this is working correctly?
The result is agents that produce outputs verifiable at build time, validated before deployment, and monitored continuously in production, which is exactly what business-critical AI needs to be.
AUTHOR
Sindhiya Selvaraj
With over a decade of experience, Sindhiya Selvaraj is the Chief Architect at Fusefy, leading the design of secure, scalable AI systems grounded in governance, ethics, and regulatory compliance.
