Why Our AI Agent's Quality Dropped 31% — And What the Data Revealed
Night Shift is an autonomous AI development system that runs 24/7, dispatching tasks to Claude, Gemini, and other LLMs during off-hours. Over its first 13 operational days, average quality scores declined from 7.37/10 to 5.04/10 — a 31% drop. This is the root cause analysis, driven by production data from 287 task executions across 5 models, and the 16 systemic fixes we deployed.
The findings challenge several intuitions: the biggest quality killer wasn't model capability — it was output truncation (costing 2.7 quality points per task). The apparent “decline” was partly an assessor calibration shift as our LLM-as-Judge activated. And our mentoring feedback loop, which injected human review into every task prompt, had 0% measurable effectiveness despite 88 tracked interventions.
The System
Night Shift operates as an autonomous development agent:
- Dispatch cycle: Every 2 hours, selects the highest-priority task from a 170-item backlog
- Budget governor: 3.5M tokens/week across Anthropic, Google, and open-source providers
- Quality assessor: Hybrid heuristic + LLM-as-Judge scoring (1–10 scale)
- Mentoring loop: Human reviews injected into system prompts as context
- Self-improvement: Follow-up tasks auto-generated from completed work
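As a rough illustration of the dispatch cycle, priority-plus-budget selection might look like the following minimal sketch. `select_next_task` and the task fields are hypothetical, not Night Shift's actual API:

```python
def select_next_task(backlog, budget_remaining):
    """Pick the highest-priority task whose estimated cost fits the remaining budget."""
    affordable = [t for t in backlog if t["est_tokens"] <= budget_remaining]
    if not affordable:
        return None  # nothing fits; wait for the next budget window
    return max(affordable, key=lambda t: t["priority"])

backlog = [
    {"id": "T1", "priority": 5, "est_tokens": 12_000},
    {"id": "T2", "priority": 9, "est_tokens": 50_000},
    {"id": "T3", "priority": 7, "est_tokens": 8_000},
]
print(select_next_task(backlog, budget_remaining=20_000)["id"])  # T3
```

Note that the budget governor, not raw priority, decides: the highest-priority task (T2) is skipped when it would blow the remaining token budget.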
The Data
287 assessed task executions over 10 days (Feb 24 – Mar 5, 2026):
| Period | Tasks | Avg Quality | Good (≥7) | Bad (≤3) | Truncated |
|---|---|---|---|---|---|
| Week 1 (Feb 24–28) | 169 | 7.1 | 106 (63%) | 15 (9%) | 49 (29%) |
| Week 2 (Mar 1–5) | 118 | 5.7 | 56 (47%) | 22 (19%) | 33 (28%) |
Finding 1: Truncation Is the #1 Quality Killer
The single strongest predictor of quality is whether the output was truncated:
| Output | Tasks | Avg Quality | Delta |
|---|---|---|---|
| Full | 198 | 7.3 | — |
| Truncated | 89 | 4.6 | -2.7 |
Categories with highest truncation had lowest quality:
- LIVINGCORP: 64% truncated → q=5.5
- RESEARCH: 47% truncated → q=5.9
- BRIDGE: 11% truncated → q=7.3
Root Cause
Fixed output token limits didn't account for task complexity. Research tasks naturally produce longer outputs but were given the same token budget as simple code fixes. The continuation mechanism (requesting the model to continue) didn't help — tasks with 1 continuation averaged q=5.1, worse than tasks with 0 continuations (q=7.4).
Insight: It's better to scope a task smaller than to truncate a larger one. Truncated output is almost always worse than complete-but-shorter output.
What the Literature Says About Truncation
Two recent papers validate our finding. SelfBudgeter (arXiv 2505.11274) shows that letting models self-estimate their token budget achieves 61% response compression while maintaining accuracy. TALE (arXiv 2412.18547) demonstrates 68.64% token reduction with <5% accuracy loss through task-complexity-aware budgets. The core insight from both: “the optimal token budget is not fixed but varies depending on the complexity of the problem.”
Devin (Cognition) solves this differently: time-bounded rather than token-bounded execution, letting the agent decide when to terminate. CrewAI explicitly acknowledges truncated outputs as an unsolved problem. LangGraph accumulates growing state history, reaching 15K+ tokens in complex tasks — the problem is universal.
Our Fix (Deployed)
Disabled continuations entirely (MAX_CONTINUATIONS=0), added task-aware output budget estimation that adjusts tokens based on output type, category, and prompt complexity (4K–20K range), and added category-specific token overrides in the dispatcher. When a task would need continuation, the system now scopes it smaller upfront.
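A budget estimator along these lines could be sketched as follows. The base values, category multiplier, and prompt-length heuristic are illustrative assumptions, not the deployed constants; only the 4K–20K clamp comes from the text:

```python
def estimate_output_budget(output_type: str, category: str, prompt_len: int) -> int:
    """Estimate an output token budget from output type, category, and prompt size."""
    base = {"code": 6_000, "report": 10_000, "research": 14_000}.get(output_type, 8_000)
    if category in ("RESEARCH", "LIVINGCORP"):  # historically truncation-heavy categories
        base = int(base * 1.4)
    base += min(prompt_len // 10, 4_000)  # longer prompts tend to need longer outputs
    return max(4_000, min(base, 20_000))  # clamp to the 4K-20K range

print(estimate_output_budget("research", "RESEARCH", prompt_len=12_000))  # 20000
```

When the estimate hits the 20K ceiling, that is the signal to split the task rather than let it truncate.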
Finding 2: The Assessor Changed, Not Just the Quality
Week 1 had an extreme score distribution: 48% of all tasks scored exactly 8/10. Week 2 had a more normal distribution across 4–8.
The cause: our LLM-as-Judge (a secondary scorer using Haiku) was introduced mid-Week 1 with a 0.6/0.4 blend (60% heuristic, 40% LLM). The heuristic scorer had a “happy path” that defaulted to 8 for any non-truncated output with basic structure. The LLM judge corrected these inflated scores downward.
Key Insight
When measuring quality trends in autonomous systems, the measurement instrument itself can shift. Without calibration anchors (known-good and known-bad reference outputs), you can’t distinguish “quality declined” from “scoring became more accurate.”
What the Literature Says About LLM-as-Judge
MT-Bench (Zheng et al., NeurIPS 2023) is the foundational work. Key finding: few-shot calibration anchors improve scoring consistency from 65.0% to 77.5% for GPT-4. All LLM judges exhibit position bias, and prompt sensitivity varies by model.
Our heuristic scorer exhibited classic Goodhart’s Law — “when a measure becomes a target, it ceases to be a good measure.” The heuristic rewarded structure (headings, bullets, line count) as a proxy for quality. Research on RLHF reward model overoptimization (Gao et al., ICML 2023) confirms this is fundamental: increasing optimization against a given reward model eventually decreases actual quality.
AlpacaEval addresses calibration drift through length-controlled evaluation — penalizing models that score higher simply by producing longer outputs. This is exactly our problem: longer, well-structured outputs scored higher regardless of substance.
Our Fix (Deployed)
Inverted to 40% heuristic / 60% LLM-as-Judge, removed the score 5–8 gate so the judge always runs, added calibration anchors (anchors.yaml with 6 reference outputs across 3 output types), and enabled dual score logging to track heuristic vs LLM scores independently.
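The inverted blend is simple to express; a minimal sketch, with the function name assumed:

```python
def blended_score(heuristic: float, llm_judge: float,
                  w_heuristic: float = 0.4, w_llm: float = 0.6) -> float:
    """Blend heuristic and LLM-as-Judge scores; the judge now carries more weight."""
    return round(w_heuristic * heuristic + w_llm * llm_judge, 2)

# An inflated heuristic 8 pulled down by a more critical judge score of 5:
print(blended_score(8.0, 5.0))  # 6.2
```

Under the old 0.6/0.4 weighting the same pair would have scored 6.8, so the inversion shifts roughly half a point toward the judge on disputed outputs.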
Finding 3: Context Bloat Degrades Output
Opus tasks on March 4 averaged 138,000 tokens of context — and produced quality of 4.7/10. The same model on Feb 25 with 91,000 tokens scored 8.7/10.
| Context Size | Opus Quality |
|---|---|
| ~75K tokens | 7.0 |
| ~91K tokens | 8.7 |
| ~95K tokens | 5.2 |
| ~139K tokens | 4.7 |
The context engine loaded everything available: full codebase maps, two complete mentoring reviews (2000 chars each), dependency outputs from parent tasks, and the task prompt itself. Without a budget, context accumulated over time as more mentoring reviews and more features were added.
Key Insight
More context ≠ better output. There’s an optimal context window for each model, beyond which the model struggles to find the relevant signal in the noise.
What the Literature Says About Context Bloat
The “Lost in the Middle” paper (Liu et al., TACL 2024) showed that LLM performance drops by more than 30% when relevant information shifts to the middle of the input context. In 20- and 30-document settings, performance can be lower than having no input documents at all — meaning context actively hurts.
Chroma Research (2025) tested 18 frontier models and found that every single one exhibits “context rot” — even with 100% perfect retrieval, performance degrades 13.9% to 85% as input length increases. The causes: (1) RoPE-based attention bias toward beginning/end tokens, (2) quadratic attention scaling, (3) semantically similar distractors interfering with relevance identification.
“Good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome.” — Anthropic, 2025
Synthesizing the research, optimal working ranges are: Claude Opus/Sonnet 40K–80K, Claude Haiku 20K–40K, GPT-4 <64K — far below the raw context window sizes. Night Shift’s Opus tasks at 138K tokens were operating at nearly 2x the optimal range.
Our Fix (Deployed)
Hard per-model context budgets (Opus 80K, Sonnet 60K, Haiku 30K), U-shaped position-aware placement (critical instructions at top and bottom, supplementary material in the middle where attention is weakest), and smart dependency compaction that extracts code blocks, headings, and conclusions — reducing MAX_DEPENDENCY_CHARS from 30K to 8K (73% reduction).
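The hard budgets can be enforced with a simple priority cutoff. A sketch assuming sections arrive pre-sorted by priority; `enforce_context_budget` and the section layout are illustrative, while the per-model limits come from the text:

```python
CONTEXT_BUDGETS = {"opus": 80_000, "sonnet": 60_000, "haiku": 30_000}  # tokens

def enforce_context_budget(model, sections):
    """Keep sections (pre-sorted by priority) until the model's token budget is spent."""
    budget = CONTEXT_BUDGETS.get(model, 30_000)
    kept, used = [], 0
    for name, tokens in sections:
        if used + tokens > budget:
            break  # everything below this priority is dropped
        kept.append(name)
        used += tokens
    return kept

sections = [("task_prompt", 5_000), ("mentor_guidance", 2_000),
            ("codebase_map", 40_000), ("dependency_outputs", 50_000)]
print(enforce_context_budget("opus", sections))  # drops dependency_outputs
```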
Finding 4: Auto-Generated Follow-Ups Are Lower Quality
Night Shift auto-generates follow-up tasks: research → implementation, code → tests. These follow-ups performed measurably worse:
| Task Origin | Week 1 Quality | Week 2 Quality | Truncation |
|---|---|---|---|
| Original (human-written) | 7.5 | 6.0 | 31% |
| Auto-generated | 6.8 | 5.1 | 40% |
| — write-tests | — | — | 39% |
| — implement-findings | — | — | 31% |
write-tests tasks were particularly problematic: they tested the LLM’s own generated code with no access to the real codebase, producing tests that validated mock implementations.
Key Insight
Autonomous follow-up generation needs quality gates. If the parent task was poor (truncated, low score), the follow-up will be worse. Gate on parent quality ≥7 before spawning children.
The Mathematics of Cascade Failure
Research on multi-agent reliability reveals a fundamental problem: the 0.95^n compound error effect. If each step in an agent workflow has 95% reliability, over 20 steps this yields only a 36% success rate. A Towards Data Science analysis found that unvalidated “bag of agents” approaches create up to 17.2x error amplification.
OWASP ASI08 (2026) classifies cascading failures as a top security risk in agentic AI: “dependent agents exponentially amplify load on downstream systems.” Microsoft Azure’s agent orchestration patterns recommend output validation at every handoff — exactly what Night Shift was missing.
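The compound-error arithmetic (95% per-step reliability over 20 steps) is easy to verify:

```python
# 95% per-step reliability compounds badly over a 20-step workflow.
p_step, n_steps = 0.95, 20
p_success = p_step ** n_steps
print(f"{p_success:.2f}")  # 0.36
```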
Our Fix (Deployed)
Quality gate (parent score ≥7 before spawning children), truncation gate (never generate follow-ups from truncated parents), MAX_FOLLOWUP_DEPTH reduced from 2 to 1, write-tests killed for non-code categories, and a new parent output validation step that checks for minimum length, code block presence (for code tasks), and heading structure (for reports).
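Taken together, the gates reduce to a short predicate. A sketch with assumed field names:

```python
MAX_FOLLOWUP_DEPTH = 1  # reduced from 2

def may_spawn_followup(parent: dict) -> bool:
    """Gate follow-up generation on parent completeness, quality, and chain depth."""
    if parent.get("truncated", False):
        return False  # never generate follow-ups from truncated parents
    if parent.get("quality", 0) < 7:
        return False  # quality gate: poor parents produce worse children
    if parent.get("depth", 0) >= MAX_FOLLOWUP_DEPTH:
        return False  # limit error-chain length
    return True

print(may_spawn_followup({"quality": 8.2, "truncated": False, "depth": 0}))  # True
print(may_spawn_followup({"quality": 8.2, "truncated": True, "depth": 0}))   # False
```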
Finding 5: The Mentoring Loop Was Decorative
We tracked 88 mentoring interventions across 8 review sessions. The context engine injected the 2 most recent reviews into every task’s system prompt. Measured effectiveness: 0%.
Why it failed:
- Generic injection: Same reviews given to ALL tasks regardless of category
- Too long: 2000 chars per review, mostly grade tables and infrastructure notes — models couldn’t extract actionable guidance
- Wrong audience: Interventions like “deploy depth limit” are system-level recommendations the model can’t implement
- No targeting: A BRIDGE task got RESEARCH feedback; a code task got report feedback
Key Insight
Mentoring feedback for autonomous agents must be (a) category-specific, (b) concise (<500 chars), (c) model-actionable (“always include a methodology section”), and (d) measured at the category level, not globally.
What the Literature Says About Agent Feedback Loops
Reflexion (Shinn et al., NeurIPS 2023) achieved 91% pass@1 on HumanEval (up from GPT-4’s 80%) using verbal reinforcement learning — the model converts feedback into natural language descriptions of what went wrong. The key: feedback is per-task, verbal, stored in episodic memory, and generated from specific failure analysis. Night Shift’s generic 2000-char reviews violated every one of these principles.
TextGrad (Yuksekgonul et al., Nature) treats AI systems as computation graphs where textual feedback serves as gradients for optimization. Each variable receives feedback specific to itself, not generic system-level observations.
OpenAI’s Self-Evolving Agents Cookbook (2025) emphasizes that “agentic systems often reach a plateau after proof-of-concept because they depend on humans to diagnose edge cases and correct failures” — exactly Night Shift’s situation.
Our Fix (Deployed)
Per-category guidance files (max 500 chars), context engine loads only the matching category’s guidance, Reflexion-style per-task reflections stored in mentoring/reflections/{category}.jsonl (last 20 per category), and feedback written as model-actionable instructions not system recommendations. Additionally, category budget caps (30% max per category per week) prevent any single category from dominating dispatch.
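Loading only the matching category's guidance might look like this sketch; the paths and file format are assumptions based on the description above:

```python
from pathlib import Path

def load_category_guidance(category: str, max_chars: int = 500) -> str:
    """Load the matching category's guidance file, capped at the 500-char limit."""
    path = Path("mentoring") / "guidance" / f"{category.lower()}.md"
    if not path.exists():
        return ""  # no guidance yet for this category: inject nothing, not generic reviews
    return path.read_text()[:max_chars]
```

The key contrast with the old loop: a BRIDGE task that has no BRIDGE guidance gets nothing at all, rather than someone else's feedback.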
The Fixes: 16 Interventions Across Two Phases
All 16 fixes were deployed to production over two days (2026-03-08 to 2026-03-09). We organized them into two phases: Phase 1 addressed the 5 root causes directly, Phase 2 implemented deeper infrastructure changes informed by the literature.
Phase 1: Root Cause Fixes (6 Interventions)
| # | Fix | File | Impact |
|---|---|---|---|
| 1 | Assessor weight inversion: 40% heuristic / 60% LLM-as-Judge | quality_assessor.py | Fixes score inflation |
| 2 | Category-specific token overrides in dispatcher | dispatcher.py | Eliminates worst truncation |
| 3 | Task generator quality gates + depth=1 | task_generator.py | Prevents cascade failures |
| 4 | Per-category guidance in context engine (max 500 chars) | context_engine.py | Replaces 0%-effective reviews |
| 5 | Quota governor task-aware budget allocation | quota_governor.py | Prevents category saturation |
| 6 | MAX_FOLLOWUP_DEPTH reduced from 2 to 1 | task_generator.py | Limits error chain length |
Phase 2: Literature-Informed Infrastructure (10 Interventions)
| # | Fix | File | Research Basis |
|---|---|---|---|
| 7 | Calibration anchors for LLM judge | quality_assessor.py + anchors.yaml | MT-Bench: +12.5% consistency |
| 8 | Dual score logging (heuristic + LLM separate) | learning.py | Evidently AI: decompose metrics |
| 9 | Category budget caps (30% max per category) | quota_governor.py | Prevents META saturation |
| 10 | Continuations disabled (MAX_CONTINUATIONS=0) | dispatcher.py | Our data: q=5.1→7.4 without |
| 11 | Reflexion-style per-task reflections | dispatcher.py + context_engine.py | Reflexion: 91% HumanEval |
| 12 | Position-aware context (U-shaped attention) | context_engine.py | Lost in the Middle: -30% mid |
| 13 | Monthly human recalibration template | human_grades.yaml | AlpacaEval methodology |
| 14 | Compact dependency output | context_engine.py | Context rot: 13.9–85% decay |
| 15 | Parent output validation for follow-ups | task_generator.py | OWASP ASI08: validate handoffs |
| 16 | Task-aware output budget estimation | dispatcher.py | SelfBudgeter/TALE adaptive |
Key Implementation: Calibration Anchors (Fix 7)
The literature showed that few-shot calibration anchors improve LLM judge consistency from 65% to 77.5% (MT-Bench). We created anchors.yaml with reference outputs for 3 output types (code, report, research), each with a score=9 (exemplary) and score=3 (poor) example:
```python
# quality_assessor.py - loads anchors into the LLM judge prompt
from pathlib import Path
import yaml

@classmethod
def _load_calibration_anchors(cls, output_type: str) -> str:
    if cls._calibration_cache is None:
        anchors_path = Path(__file__).parent.parent / "mentoring" / "calibration" / "anchors.yaml"
        cls._calibration_cache = yaml.safe_load(anchors_path.read_text())
    anchors = cls._calibration_cache.get("anchors", {}).get(output_type, [])
    parts = ["Here are calibration examples to guide your scoring:"]
    for a in anchors[:2]:  # at most one exemplary and one poor anchor
        sample = a.get("sample", "")[:400]
        parts.append(f"\nExample ({a['score']}/10 - {a['rationale']}):\n```\n{sample}\n```")
    return "\n".join(parts)
```
The anchors provide the judge with concrete reference points: “this is what a 9/10 code output looks like” and “this is what a 3/10 looks like.” Without anchors, the judge drifts toward whatever scoring baseline it internalized during training.
Key Implementation: Reflexion-Style Feedback (Fix 11)
Inspired by Reflexion’s per-task verbal reinforcement, we store category-specific reflections after each task:
```python
# dispatcher.py - stores verbal reflections per category
def _store_task_reflection(self, task, quality):
    score = quality.get("score", 0)
    task_id = task.get("id", "unknown")
    strengths = quality.get("strengths") or ["no strengths recorded"]
    issues = quality.get("issues") or ["no issues recorded"]
    if score >= 7:
        reflection = f"Task {task_id} scored {score}/10. Good: {strengths[0][:80]}."
    else:
        reflection = f"Task {task_id} scored {score}/10. Issue: {issues[0][:80]}."
    # Append to mentoring/reflections/{category}.jsonl, keep last 20
```
The context engine then loads the 3 most recent reflections for the matching category into the prompt — targeted, concise, and model-actionable, unlike the previous 2000-char generic reviews.
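A minimal sketch of that loading step, assuming one JSON object per line with a `reflection` field (the file layout follows the description above):

```python
import json
from pathlib import Path

def recent_reflections(category: str, n: int = 3) -> list[str]:
    """Return the n most recent reflections for a category from its JSONL file."""
    path = Path("mentoring") / "reflections" / f"{category.lower()}.jsonl"
    if not path.exists():
        return []  # no history yet for this category
    lines = path.read_text().splitlines()
    return [json.loads(ln)["reflection"] for ln in lines[-n:]]
```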
Key Implementation: U-Shaped Context Placement (Fix 12)
The “Lost in the Middle” paper showed 30%+ performance drop for information placed in the middle of context. We reorganized build_prompt():
```
TOP    (high attention): Task prompt header - title, category, priority
MIDDLE (low attention):  Context files, previous results, dependency outputs
END    (high attention): Codebase structure, mentor guidance, task reflections
```
Critical instructions that the model must follow go at the beginning and end. Supplementary reference material goes in the middle where attention is weakest anyway.
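The reordering itself is trivial once sections are classified; a minimal sketch of the U-shaped assembly, with section names illustrative:

```python
def build_prompt(header, middle_sections, critical_footer):
    """Place critical content in the high-attention top and bottom slots."""
    parts = [header]              # TOP: task header (high attention)
    parts += middle_sections     # MIDDLE: reference material (low attention)
    parts += critical_footer     # END: guidance, reflections (high attention)
    return "\n\n".join(parts)

prompt = build_prompt(
    "## Task: refactor dispatcher (priority 9)",
    ["<context files>", "<dependency outputs>"],
    ["<mentor guidance>", "<task reflections>"],
)
```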
Key Implementation: Compact Dependencies (Fix 14)
Instead of passing raw parent output (up to 30K chars), we extract high-signal content:
```python
# context_engine.py - priority-based compaction
import re

@staticmethod
def _compact_dependency(output: str, max_chars: int = 8000) -> str:
    if len(output) <= max_chars:
        return output
    # Priority 1: code blocks (most actionable)
    code_blocks = re.findall(r"```.*?```", output, re.DOTALL)
    # Priority 2: headings (structure); Priority 3: conclusion/recommendation lines
    key_lines = [ln for ln in output.splitlines()
                 if ln.startswith("#") or re.search(r"conclusion|recommend", ln, re.I)]
    parts, chars = [], 0
    for piece in code_blocks + key_lines:
        if chars + len(piece) > max_chars:
            break
        parts.append(piece)
        chars += len(piece)
    return "\n\n".join(parts) + "\n\n... [compacted from full output]"
```
This reduced MAX_DEPENDENCY_CHARS from 30,000 to 8,000 — a 73% reduction in dependency context while preserving the highest-signal content.
The Unifying Principle: Precision Over Volume
Across all five findings and sixteen fixes, the same pattern emerges: more is not better.
- More output tokens without budget management → truncation → Fix: task-aware budgets (SelfBudgeter/TALE)
- More scoring criteria without calibration → inflated scores → Fix: calibration anchors (MT-Bench)
- More mentoring feedback without targeting → 0% effectiveness → Fix: per-task reflections (Reflexion)
- More context without budgeting → quality degradation → Fix: U-shaped placement + compaction (Lost in the Middle)
- More follow-up tasks without quality gates → error amplification → Fix: parent validation + depth limits (OWASP ASI08)
The research literature converges on the same conclusion: replace volume with precision. Fewer tokens in context, but better-placed. Fewer follow-up tasks, but higher-quality parents. Fewer scoring dimensions, but better-calibrated.
Every fix we deployed follows this principle. The 16 interventions collectively target an estimated +3.5 quality points (from the 5.04 baseline).
Broader Implications for Autonomous AI Systems
For anyone building autonomous AI agents that operate continuously:
- Measure the measurer. Quality assessment tools need their own calibration. Without known-good/bad anchors, you’re measuring noise. Goodhart’s Law applies: when your heuristic becomes the optimization target, it stops measuring quality (Gao et al., ICML 2023).
- Truncation > capability. The limiting factor wasn’t model intelligence — it was output buffer management. SelfBudgeter and TALE show that dynamic, task-aware token budgets solve this without losing accuracy. This is an infrastructure problem, not an AI problem.
- Feedback loops require precision. Generic mentoring is worse than no mentoring (wastes context tokens). Reflexion (NeurIPS 2023) proved that per-task, verbal, episodic feedback achieves 91% HumanEval pass@1. Category-specific, concise, model-actionable feedback is the only kind that works.
- Auto-generated work amplifies problems. The 0.95^n compound effect means 95% per-step reliability yields only 36% over 20 steps. If your system generates follow-up tasks, quality gates at every handoff are mandatory.
- Context is a resource, not a free lunch. Every frontier model exhibits “context rot” (Chroma Research, 2025). Even with perfect retrieval, performance degrades 13.9–85% as input length increases. The optimal working range is 40–80K tokens for Claude models — far below the 200K context window.
What's Next
With all 16 fixes deployed, we’re monitoring for quality recovery over the next 7–14 days. Expected trajectory:
- Week 1 post-fix: Quality should recover to 6.5+ as calibration anchors and budget fixes take effect
- Week 2 post-fix: Quality should stabilize at 7.0+ as reflexion memory accumulates category-specific feedback
- Monthly: Human recalibration (20 random tasks, grade A–F, compare vs system scores)
The monitoring infrastructure is already in place: dual score logging (heuristic vs LLM separate) enables us to track whether the assessor and the actual quality are converging. If the average gap between human grades and system scores exceeds 1.5 points, we recalibrate the assessor weights.
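The recalibration trigger reduces to a mean-absolute-gap check. A sketch in which the letter-grade-to-score mapping is our assumption; only the 1.5-point threshold comes from the text:

```python
GRADE_TO_SCORE = {"A": 9, "B": 7.5, "C": 6, "D": 4.5, "F": 2}  # assumed mapping

def needs_recalibration(pairs, threshold=1.5):
    """pairs: (human letter grade, system score). True if mean |gap| exceeds threshold."""
    gaps = [abs(GRADE_TO_SCORE[grade] - score) for grade, score in pairs]
    return sum(gaps) / len(gaps) > threshold

# Humans graded three sampled tasks; system scores disagree noticeably:
print(needs_recalibration([("A", 6.0), ("C", 8.0), ("B", 7.0)]))  # True
```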
References
- Liu, N.F. et al. (2024). “Lost in the Middle: How Language Models Use Long Contexts.” TACL. arXiv:2307.03172
- Zheng, L. et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. arXiv:2306.05685
- Shinn, N. et al. (2023). “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023. arXiv:2303.11366
- Yuksekgonul, M. et al. (2024). “TextGrad: Automatic Differentiation via Text.” Nature. arXiv:2406.07496
- Gao, L. et al. (2023). “Scaling Laws for Reward Model Overoptimization.” ICML 2023. arXiv:2210.10760
- SelfBudgeter (2025). “Adaptive Token Allocation for Efficient LLM Reasoning.” arXiv:2505.11274
- TALE (2024). “Token-Budget-Aware LLM Reasoning.” arXiv:2412.18547
- Chroma Research (2025). “Context Rot: How Increasing Input Tokens Impacts LLM Performance.”
- Anthropic (2025). “Effective Context Engineering for AI Agents.”
- OpenAI (2025). “Self-Evolving Agents Cookbook.”
- OWASP ASI08 (2026). “Cascading Failures in Agentic AI.”
- Microsoft Azure (2025). “AI Agent Orchestration Patterns.”
- Evidently AI. “LLM-as-a-Judge: A Complete Guide.”
- Towards Data Science (2025). “Why Your Multi-Agent System is Failing: The 17x Error Trap.”
- AlpacaEval. Length-controlled evaluation methodology.
Build Autonomous AI That Learns From Its Mistakes
Night Shift is part of the NEXUS ecosystem — autonomous AI operations for teams of 5–200. See how quality control, budget governance, and self-improvement work in production.
Related Articles
- Night Shift: 300 Tasks in 14 Days — the production data from Night Shift’s first two weeks
- Night Shift: How AI Writes Code While You Sleep — the original Night Shift deep dive
- From 0 to 3,000 Tests: Building Quality into AI-Generated Code — how Night Shift maintains code quality
- Temporal Benchmarks for AI Agents — measuring what matters for autonomous systems
- Research Publications — papers on autonomous AI and evolutionary optimization
- Night Shift Product Page — autonomous AI development symbiont for your team