March 2026 | 14 min read

Why Our AI Agent's Quality Dropped 31% — And What the Data Revealed

Night Shift is an autonomous AI development system that runs 24/7, dispatching tasks to Claude, Gemini, and other LLMs during off-hours. Over its first 13 operational days, average quality scores declined from 7.37/10 to 5.04/10, a 31% drop. This is the root-cause analysis, driven by production data from 287 task executions across 5 models, and the 16 systemic fixes we deployed.

The findings challenge several intuitions: the biggest quality killer wasn't model capability — it was output truncation (costing 2.7 quality points per task). The apparent “decline” was partly an assessor calibration shift as our LLM-as-Judge activated. And our mentoring feedback loop, which injected human review into every task prompt, had 0% measurable effectiveness despite 88 tracked interventions.

287 Tasks Analyzed
-31% Quality Decline
5 Root Causes Found
16 Fixes Deployed

The System

Night Shift operates as an autonomous development agent: it dispatches tasks to multiple LLMs during off-hours, assesses every output for quality, and auto-generates follow-up tasks from the results.

The Data

287 assessed task executions over 10 days (Feb 24–Mar 5, 2026):

| Period | Tasks | Avg Quality | Good (≥7) | Bad (≤3) | Truncated |
|---|---|---|---|---|---|
| Week 1 (Feb 24–28) | 169 | 7.1 | 106 (63%) | 15 (9%) | 49 (29%) |
| Week 2 (Mar 1–5) | 118 | 5.7 | 56 (47%) | 22 (19%) | 33 (28%) |

Finding 1: Truncation Is the #1 Quality Killer

The single strongest predictor of quality is whether the output was truncated:

| Output | Tasks | Avg Quality | Delta |
|---|---|---|---|
| Full | 198 | 7.3 | |
| Truncated | 89 | 4.6 | -2.7 |

The categories with the highest truncation rates also had the lowest quality.

Root Cause

Fixed output token limits didn't account for task complexity. Research tasks naturally produce longer outputs but were given the same token budget as simple code fixes. The continuation mechanism (requesting the model to continue) didn't help — tasks with 1 continuation averaged q=5.1, worse than tasks with 0 continuations (q=7.4).

Insight: It's better to scope a task smaller than to truncate a larger one. Truncated output is almost always worse than complete-but-shorter output.

What the Literature Says About Truncation

Two recent papers validate our finding. SelfBudgeter (arXiv 2505.11274) shows that letting models self-estimate their token budget achieves 61% response compression with maintained accuracy. TALE (arXiv 2412.18547) demonstrates 68.64% token reduction with <5% accuracy loss through task-complexity-aware budgets. The core insight from both: “the optimal token budget is not fixed but varies depending on the complexity of the problem.”

Devin (Cognition) solves this differently: time-bounded rather than token-bounded execution, letting the agent decide when to terminate. CrewAI explicitly acknowledges truncated outputs as an unsolved problem. LangGraph accumulates growing state history, reaching 15K+ tokens in complex tasks — the problem is universal.

Our Fix (Deployed)

Disabled continuations entirely (MAX_CONTINUATIONS=0), added task-aware output budget estimation that adjusts tokens based on output type, category, and prompt complexity (4K–20K range), and added category-specific token overrides in the dispatcher. When a task would need continuation, the system now scopes it smaller upfront.
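The task-aware budget estimation described above can be sketched as a small lookup-plus-heuristic function. This is a minimal illustration, not the dispatcher's actual code: the function name, base budgets, and the prompt-length complexity heuristic are assumptions; only the 4K–20K range comes from the text.

```python
# Hypothetical sketch of task-aware output budget estimation.
# Base budgets per output type and the complexity heuristic are assumptions.
BASE_BUDGET = {"code": 6000, "report": 10000, "research": 16000}

def estimate_output_budget(output_type: str, prompt: str,
                           floor: int = 4000, ceiling: int = 20000) -> int:
    """Pick an output token budget from output type and prompt complexity."""
    budget = BASE_BUDGET.get(output_type, 8000)
    # Longer, multi-part prompts tend to need longer outputs.
    complexity_bonus = min(len(prompt) // 500, 8) * 500
    return max(floor, min(ceiling, budget + complexity_bonus))
```

The key property is the hard ceiling: a task that would blow past it should be scoped smaller upfront rather than truncated or continued.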

Finding 2: The Assessor Changed, Not Just the Quality

Week 1 had an extreme score distribution: 48% of all tasks scored exactly 8/10. Week 2 had a more normal distribution across 4–8.

The cause: our LLM-as-Judge (a secondary scorer using Haiku) was introduced mid-Week 1 with a 0.6/0.4 blend (60% heuristic, 40% LLM). The heuristic scorer had a “happy path” that defaulted to 8 for any non-truncated output with basic structure. The LLM judge corrected these inflated scores downward.

Key Insight

When measuring quality trends in autonomous systems, the measurement instrument itself can shift. Without calibration anchors (known-good and known-bad reference outputs), you can’t distinguish “quality declined” from “scoring became more accurate.”

What the Literature Says About LLM-as-Judge

MT-Bench (Zheng et al., NeurIPS 2023) is the foundational work. Key finding: few-shot calibration anchors improve scoring consistency from 65.0% to 77.5% for GPT-4. All LLM judges exhibit position bias, and prompt sensitivity varies by model.

Our heuristic scorer exhibited classic Goodhart’s Law — “when a measure becomes a target, it ceases to be a good measure.” The heuristic rewarded structure (headings, bullets, line count) as a proxy for quality. Research on RLHF reward model overoptimization (Gao et al., ICML 2023) confirms this is fundamental: increasing optimization against a given reward model eventually decreases actual quality.

AlpacaEval addresses calibration drift through length-controlled evaluation — penalizing models that score higher simply by producing longer outputs. This is exactly our problem: longer, well-structured outputs scored higher regardless of substance.

Our Fix (Deployed)

Inverted to 40% heuristic / 60% LLM-as-Judge, removed the score 5–8 gate so the judge always runs, added calibration anchors (anchors.yaml with 6 reference outputs across 3 output types), and enabled dual score logging to track heuristic vs LLM scores independently.
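The inverted blend is a one-liner; this sketch shows the arithmetic with the weights from the fix. The function and constant names are assumptions.

```python
# Weights from the fix: heuristic demoted, LLM judge promoted.
HEURISTIC_WEIGHT = 0.4   # was 0.6 before the inversion
LLM_WEIGHT = 0.6         # was 0.4 before the inversion

def blended_score(heuristic: float, llm_judge: float) -> float:
    """Blend the heuristic and LLM-as-Judge scores (40/60 after the fix)."""
    return round(HEURISTIC_WEIGHT * heuristic + LLM_WEIGHT * llm_judge, 2)
```

Under the old 60/40 weights, a heuristic "happy path" score of 8 dominated the blend; with 40/60, a skeptical judge score pulls the final number down where it belongs.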

Finding 3: Context Bloat Degrades Output

Opus tasks on March 4 averaged 138,000 tokens of context — and produced quality of 4.7/10. The same model on Feb 25 with 91,000 tokens scored 8.7/10.

| Context Size | Opus Quality |
|---|---|
| ~75K tokens | 7.0 |
| ~91K tokens | 8.7 |
| ~95K tokens | 5.2 |
| ~138K tokens | 4.7 |

The context engine loaded everything available: full codebase maps, two complete mentoring reviews (2000 chars each), dependency outputs from parent tasks, and the task prompt itself. Without a budget, context accumulated over time as more mentoring reviews and more features were added.

Key Insight

More context ≠ better output. There’s an optimal context window for each model, beyond which the model struggles to find the relevant signal in the noise.

What the Literature Says About Context Bloat

The “Lost in the Middle” paper (Liu et al., TACL 2024) showed that LLM performance drops by more than 30% when relevant information shifts to the middle of the input context. In 20- and 30-document settings, performance can be lower than having no input documents at all — meaning context actively hurts.

Chroma Research (2025) tested all 18 frontier models and found every single one exhibits “context rot” — even with 100% perfect retrieval, performance degrades 13.9% to 85% as input length increases. The causes: (1) RoPE-based attention bias toward beginning/end tokens, (2) quadratic attention scaling, (3) semantically similar distractors interfering with relevance identification.

“Good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome.” — Anthropic, 2025

Synthesizing the research, optimal working ranges are: Claude Opus/Sonnet 40K–80K, Claude Haiku 20K–40K, GPT-4 <64K, all far below the raw context window sizes. Night Shift's Opus tasks at 138K tokens were operating at roughly 1.7x the upper end of the optimal range.

Our Fix (Deployed)

Hard per-model context budgets (Opus 80K, Sonnet 60K, Haiku 30K), U-shaped position-aware placement (critical instructions at top and bottom, supplementary material in the middle where attention is weakest), and smart dependency compaction that extracts code blocks, headings, and conclusions — reducing MAX_DEPENDENCY_CHARS from 30K to 8K (73% reduction).
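A hard per-model budget only works with a deterministic trimming order. The sketch below enforces the budgets from the fix by dropping the lowest-priority context sections first; the function name, the priority-tagged section format, and the 4-chars-per-token estimate are assumptions.

```python
# Per-model context budgets from the fix (tokens).
CONTEXT_BUDGET = {"opus": 80_000, "sonnet": 60_000, "haiku": 30_000}

def fit_to_budget(sections, model, chars_per_token=4):
    """sections: (priority, text) pairs; lower number = more important.
    Drops the least important sections until the token estimate fits."""
    budget = CONTEXT_BUDGET.get(model, 30_000)
    kept = sorted(sections, key=lambda s: s[0])  # most important first
    while kept and sum(len(t) for _, t in kept) // chars_per_token > budget:
        kept.pop()  # discard the least important remaining section
    return "\n\n".join(t for _, t in kept)
```

The point is that trimming is a policy decision made once, in code, instead of letting every new feature silently append to the prompt.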

Finding 4: Auto-Generated Follow-Ups Are Lower Quality

Night Shift auto-generates follow-up tasks: research → implementation, code → tests. These follow-ups performed measurably worse:

| Task Origin | Week 1 Quality | Week 2 Quality | Truncation |
|---|---|---|---|
| Original (human-written) | 7.5 | 6.0 | 31% |
| Auto-generated | 6.8 | 5.1 | 40% |
| — write-tests | | | 39% |
| — implement-findings | | | 31% |

write-tests tasks were particularly problematic: they tested the LLM’s own generated code with no access to the real codebase, producing tests that validated mock implementations.

Key Insight

Autonomous follow-up generation needs quality gates. If the parent task was poor (truncated, low score), the follow-up will be worse. Gate on parent quality ≥7 before spawning children.

The Mathematics of Cascade Failure

Research on multi-agent reliability reveals a fundamental problem: the 0.95^n compound error effect. If each step in an agent workflow has 95% reliability, a 20-step workflow succeeds only 0.95^20 ≈ 36% of the time. A Towards Data Science analysis found that unvalidated “bag of agents” approaches create up to 17.2x error amplification.
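The compound-error arithmetic is worth spelling out, since it is the entire case for per-handoff gates:

```python
# Per-step reliability compounds multiplicatively across a workflow.
p_step = 0.95
p_workflow = p_step ** 20
print(round(p_workflow, 3))  # 0.358: only ~36% of 20-step runs fully succeed
```

Reliability per step has to be very high, or the chain has to be very short; Night Shift's fixes attack both sides.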

OWASP ASI08 (2026) classifies cascading failures as a top security risk in agentic AI: “dependent agents exponentially amplify load on downstream systems.” Microsoft Azure’s agent orchestration patterns recommend output validation at every handoff — exactly what Night Shift was missing.

Our Fix (Deployed)

Quality gate (parent score ≥7 before spawning children), truncation gate (never generate follow-ups from truncated parents), MAX_FOLLOWUP_DEPTH reduced from 2 to 1, write-tests killed for non-code categories, and a new parent output validation step that checks for minimum length, code block presence (for code tasks), and heading structure (for reports).
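The gates above compose into one predicate at the spawn point. This is an illustrative sketch: the field names on the parent result dict are assumptions; the thresholds (score ≥7, depth 1, no truncated parents) come from the fix.

```python
# Thresholds from the deployed fix.
MIN_PARENT_SCORE = 7
MAX_FOLLOWUP_DEPTH = 1

def may_spawn_followup(parent: dict) -> bool:
    """Gate follow-up generation on parent completeness, quality, and depth."""
    if parent.get("truncated", False):
        return False   # never generate follow-ups from truncated parents
    if parent.get("score", 0) < MIN_PARENT_SCORE:
        return False   # parent must score >= 7
    if parent.get("depth", 0) >= MAX_FOLLOWUP_DEPTH:
        return False   # error chains limited to depth 1
    return True
```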

Finding 5: The Mentoring Loop Was Decorative

We tracked 88 mentoring interventions across 8 review sessions. The context engine injected the 2 most recent reviews into every task’s system prompt. Measured effectiveness: 0%.

Why it failed:

  1. Generic injection: Same reviews given to ALL tasks regardless of category
  2. Too long: 2000 chars per review, mostly grade tables and infrastructure notes — models couldn’t extract actionable guidance
  3. Wrong audience: Interventions like “deploy depth limit” are system-level recommendations the model can’t implement
  4. No targeting: A BRIDGE task got RESEARCH feedback; a code task got report feedback

Key Insight

Mentoring feedback for autonomous agents must be (a) category-specific, (b) concise (<500 chars), (c) model-actionable (“always include a methodology section”), and (d) measured at the category level, not globally.

What the Literature Says About Agent Feedback Loops

Reflexion (Shinn et al., NeurIPS 2023) achieved 91% pass@1 on HumanEval (up from GPT-4’s 80%) using verbal reinforcement learning — the model converts feedback into natural language descriptions of what went wrong. The key: feedback is per-task, verbal, stored in episodic memory, and generated from specific failure analysis. Night Shift’s generic 2000-char reviews violated every one of these principles.

TextGrad (Yuksekgonul et al., Nature) treats AI systems as computation graphs where textual feedback serves as gradients for optimization. Each variable receives feedback specific to itself, not generic system-level observations.

OpenAI’s Self-Evolving Agents Cookbook (2025) emphasizes that “agentic systems often reach a plateau after proof-of-concept because they depend on humans to diagnose edge cases and correct failures” — exactly Night Shift’s situation.

Our Fix (Deployed)

Per-category guidance files (max 500 chars), a context engine that loads only the matching category’s guidance, Reflexion-style per-task reflections stored in mentoring/reflections/{category}.jsonl (last 20 per category), and feedback written as model-actionable instructions rather than system recommendations. Additionally, category budget caps (30% max per category per week) prevent any single category from dominating dispatch.
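The category budget cap reduces to a simple dispatch-time check. A minimal sketch, with the caveat that the function name and the small-sample exemption are assumptions; only the 30% cap is from the fix.

```python
# Cap from the fix: no category may exceed 30% of weekly dispatch.
CATEGORY_CAP = 0.30

def within_category_budget(dispatched, category, cap=CATEGORY_CAP, min_total=10):
    """dispatched: {category: task_count} for the current week.
    Allow a new task only if its category stays at or under the cap."""
    total = sum(dispatched.values()) + 1          # include the candidate task
    if total < min_total:
        return True   # too few tasks this week for the cap to be meaningful
    return (dispatched.get(category, 0) + 1) / total <= cap
```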

The Fixes: 16 Interventions Across Two Phases

All 16 fixes were deployed to production over two days (2026-03-08 to 2026-03-09). We organized them into two phases: Phase 1 addressed the 5 root causes directly, Phase 2 implemented deeper infrastructure changes informed by the literature.

Phase 1: Root Cause Fixes (6 Interventions)

| # | Fix | File | Impact |
|---|---|---|---|
| 1 | Assessor weight inversion: 40% heuristic / 60% LLM-as-Judge | quality_assessor.py | Fixes score inflation |
| 2 | Category-specific token overrides in dispatcher | dispatcher.py | Eliminates worst truncation |
| 3 | Task generator quality gates + depth=1 | task_generator.py | Prevents cascade failures |
| 4 | Per-category guidance in context engine (max 500 chars) | context_engine.py | Replaces 0%-effective reviews |
| 5 | Quota governor task-aware budget allocation | quota_governor.py | Prevents category saturation |
| 6 | MAX_FOLLOWUP_DEPTH reduced from 2 to 1 | task_generator.py | Limits error chain length |

Phase 2: Literature-Informed Infrastructure (10 Interventions)

| # | Fix | File | Research Basis |
|---|---|---|---|
| 7 | Calibration anchors for LLM judge | quality_assessor.py + anchors.yaml | MT-Bench: +12.5% consistency |
| 8 | Dual score logging (heuristic + LLM separate) | learning.py | Evidently AI: decompose metrics |
| 9 | Category budget caps (30% max per category) | quota_governor.py | Prevents META saturation |
| 10 | Continuations disabled (MAX_CONTINUATIONS=0) | dispatcher.py | Our data: q=5.1→7.4 without |
| 11 | Reflexion-style per-task reflections | dispatcher.py + context_engine.py | Reflexion: 91% HumanEval |
| 12 | Position-aware context (U-shaped attention) | context_engine.py | Lost in the Middle: -30% mid |
| 13 | Monthly human recalibration template | human_grades.yaml | AlpacaEval methodology |
| 14 | Compact dependency output | context_engine.py | Context rot: 13.9–85% decay |
| 15 | Parent output validation for follow-ups | task_generator.py | OWASP ASI08: validate handoffs |
| 16 | Task-aware output budget estimation | dispatcher.py | SelfBudgeter/TALE adaptive |

Key Implementation: Calibration Anchors (Fix 7)

The literature showed that few-shot calibration anchors improve LLM judge consistency from 65% to 77.5% (MT-Bench). We created anchors.yaml with reference outputs for 3 output types (code, report, research), each with a score=9 (exemplary) and score=3 (poor) example:

# quality_assessor.py - loads calibration anchors into the LLM judge prompt
import yaml
from pathlib import Path

@classmethod
def _load_calibration_anchors(cls, output_type: str) -> str:
    if cls._calibration_cache is None:
        anchors_path = Path(__file__).parent.parent / "mentoring" / "calibration" / "anchors.yaml"
        # ... load YAML from anchors_path into cls._calibration_cache ...
    anchors = cls._calibration_cache.get("anchors", {}).get(output_type, [])
    parts = ["Here are calibration examples to guide your scoring:"]
    for a in anchors[:2]:  # at most one exemplary and one poor example
        sample = a.get("sample", "")[:400]
        parts.append(f"\nExample ({a['score']}/10 - {a['rationale']}):\n```\n{sample}\n```")
    return "\n".join(parts)

The anchors provide the judge with concrete reference points: “this is what a 9/10 code output looks like” and “this is what a 3/10 looks like.” Without anchors, the judge drifts toward whatever scoring baseline it internalized during training.

Key Implementation: Reflexion-Style Feedback (Fix 11)

Inspired by Reflexion’s per-task verbal reinforcement, we store category-specific reflections after each task:

# dispatcher.py - stores verbal reflections per category
# (the task/quality field names shown here are illustrative)
def _store_task_reflection(self, task, quality):
    task_id = task.get("id", "unknown")
    score = quality.get("score", 0)
    strengths = quality.get("strengths") or ["n/a"]
    issues = quality.get("issues") or ["n/a"]
    if score >= 7:
        reflection = f"Task {task_id} scored {score}/10. Good: {strengths[0][:80]}."
    else:
        reflection = f"Task {task_id} scored {score}/10. Issue: {issues[0][:80]}."
    # Append to mentoring/reflections/{category}.jsonl, keep last 20

The context engine then loads the 3 most recent reflections for the matching category into the prompt — targeted, concise, and model-actionable, unlike the previous 2000-char generic reviews.

Key Implementation: U-Shaped Context Placement (Fix 12)

The “Lost in the Middle” paper showed 30%+ performance drop for information placed in the middle of context. We reorganized build_prompt():

TOP (high attention):    Task prompt header - title, category, priority
MIDDLE (low attention):  Context files, previous results, dependency outputs
END (high attention):    Codebase structure, mentor guidance, task reflections

Critical instructions that the model must follow go at the beginning and end. Supplementary reference material goes in the middle where attention is weakest anyway.
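The U-shaped layout amounts to a fixed assembly order. A minimal sketch of the idea, with the caveat that build_prompt's real signature and section names are assumptions:

```python
# U-shaped prompt assembly: critical material at the edges, filler in the middle.
def build_prompt(header: str, supplementary: list[str], footer: list[str]) -> str:
    top = [header]          # high attention: task title, category, priority
    middle = supplementary  # low attention: context files, dependency outputs
    end = footer            # high attention: guidance, reflections, structure
    return "\n\n".join(top + middle + end)
```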

Key Implementation: Compact Dependencies (Fix 14)

Instead of passing raw parent output (up to 30K chars), we extract high-signal content:

# context_engine.py - priority-based compaction
@staticmethod
def _compact_dependency(output: str, max_chars: int = 8000) -> str:
    if len(output) <= max_chars:
        return output
    parts, chars = [], 0
    # Priority 1: Code blocks (most actionable)
    # Priority 2: Headings + first paragraph (structure)
    # Priority 3: Conclusion/recommendation lines (key findings)
    # ... append sections to parts, in priority order, until chars hits max_chars ...
    return "\n\n".join(parts) + "\n\n... [compacted from full output]"

This reduced MAX_DEPENDENCY_CHARS from 30,000 to 8,000 — a 73% reduction in dependency context while preserving the highest-signal content.

The Unifying Principle: Precision Over Volume

Across all five findings and sixteen fixes, the same pattern emerges: more is not better.

The research literature converges on the same conclusion: replace volume with precision. Fewer tokens in context, but better-placed. Fewer follow-up tasks, but higher-quality parents. Fewer scoring dimensions, but better-calibrated.

Every fix we deployed follows this principle. The 16 interventions collectively target an estimated +3.5 quality points (from the 5.04 baseline).

Broader Implications for Autonomous AI Systems

For anyone building autonomous AI agents that operate continuously:

  1. Measure the measurer. Quality assessment tools need their own calibration. Without known-good/bad anchors, you’re measuring noise. Goodhart’s Law applies: when your heuristic becomes the optimization target, it stops measuring quality (Gao et al., ICML 2023).
  2. Truncation > capability. The limiting factor wasn’t model intelligence — it was output buffer management. SelfBudgeter and TALE show that dynamic, task-aware token budgets solve this without losing accuracy. This is an infrastructure problem, not an AI problem.
  3. Feedback loops require precision. Generic mentoring is worse than no mentoring (wastes context tokens). Reflexion (NeurIPS 2023) proved that per-task, verbal, episodic feedback achieves 91% HumanEval pass@1. Category-specific, concise, model-actionable feedback is the only kind that works.
  4. Auto-generated work amplifies problems. The 0.95^n compound effect means 95% per-step reliability yields only 0.95^20 ≈ 36% success over 20 steps. If your system generates follow-up tasks, quality gates at every handoff are mandatory.
  5. Context is a resource, not a free lunch. Every frontier model exhibits “context rot” (Chroma Research, 2025). Even with perfect retrieval, performance degrades 13.9–85% as input length increases. The optimal working range is 40–80K tokens for Claude models — far below the 200K context window.

What's Next

With all 16 fixes deployed, we’re monitoring for quality recovery over the next 7–14 days.

The monitoring infrastructure is already in place: dual score logging (heuristic vs LLM separate) enables us to track whether the assessor and the actual quality are converging. If the average gap between human grades and system scores exceeds 1.5 points, we recalibrate the assessor weights.
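The recalibration trigger is a one-line statistic over the dual score log. A sketch, assuming paired per-task scores; the function name is hypothetical, the 1.5-point threshold is from the text.

```python
# Mean absolute gap between human grades and system scores; above the
# threshold, assessor weights get recalibrated.
def needs_recalibration(human: list[float], system: list[float],
                        threshold: float = 1.5) -> bool:
    gaps = [abs(h - s) for h, s in zip(human, system)]
    return bool(gaps) and sum(gaps) / len(gaps) > threshold
```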

References

  1. Liu, N.F. et al. (2024). “Lost in the Middle: How Language Models Use Long Contexts.” TACL. arXiv:2307.03172
  2. Zheng, L. et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. arXiv:2306.05685
  3. Shinn, N. et al. (2023). “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023. arXiv:2303.11366
  4. Yuksekgonul, M. et al. (2024). “TextGrad: Automatic Differentiation via Text.” Nature. arXiv:2406.07496
  5. Gao, L. et al. (2023). “Scaling Laws for Reward Model Overoptimization.” ICML 2023. arXiv:2210.10760
  6. SelfBudgeter (2025). “Adaptive Token Allocation for Efficient LLM Reasoning.” arXiv:2505.11274
  7. TALE (2024). “Token-Budget-Aware LLM Reasoning.” arXiv:2412.18547
  8. Chroma Research (2025). “Context Rot: How Increasing Input Tokens Impacts LLM Performance.”
  9. Anthropic (2025). “Effective Context Engineering for AI Agents.”
  10. OpenAI (2025). “Self-Evolving Agents Cookbook.”
  11. OWASP ASI08 (2026). “Cascading Failures in Agentic AI.”
  12. Microsoft Azure (2025). “AI Agent Orchestration Patterns.”
  13. Evidently AI. “LLM-as-a-Judge: A Complete Guide.”
  14. Towards Data Science (2025). “Why Your Multi-Agent System is Failing: The 17x Error Trap.”
  15. AlpacaEval. Length-controlled evaluation methodology.

Build Autonomous AI That Learns From Its Mistakes

Night Shift is part of the NEXUS ecosystem — autonomous AI operations for teams of 5–200. See how quality control, budget governance, and self-improvement work in production.

