Temporal Benchmarks for AI Agents
Beyond Single-Session Evaluation
The current generation of AI agent benchmarks—HumanEval, SWE-Bench, MATH—measure something important but fundamentally limited: single-session, atomic task performance. An agent receives a problem, attempts a solution, and is scored. It never runs for days. It never learns from its own mistakes. It never specializes in a particular codebase.
This creates a critical blind spot in our evaluation of autonomous AI systems. Today's most ambitious agent projects—Devin, Aider, our own Night Shift—operate under conditions that no existing benchmark captures: weeks of continuous evolution, multi-turn cross-session learning, and adaptation to specific domains and codebases.
This gap is not academic. It blinds us to real failure modes. It prevents us from optimizing for sustained autonomy. And it leaves benchmark-builders without a standard for measuring what actually matters in deployed agents.
The Temporal Gap in AI Agent Benchmarking
Current benchmarks excel at measuring point-in-time capability. SWE-Bench tests whether an agent can solve a GitHub issue in a single inference run. But autonomous agents don't work that way. They iterate. They persist state across sessions. They accumulate skill.
What Existing Benchmarks Measure
- Single-turn solve rate: Does the agent produce a correct solution in one attempt?
- Atomic task success: Given a well-defined problem, what percentage of attempts succeed?
- General capability: Can the agent write Python? Can it reason about code?
What They Miss
- Sustained autonomy: Can it operate for a week without human intervention?
- Cross-session learning: Does it retain and apply knowledge from prior work?
- Specialization: Does it improve at domain-specific tasks (e.g., "refactoring this particular codebase")?
- Safety drift: Do its safety constraints erode over long horizons?
- Graceful degradation: How does performance degrade under resource constraints or interruptions?
The agents we deploy operate under all these conditions. But we don't measure any of them systematically.
Five Temporal Dimensions for AI Agent Evaluation
We propose a framework of five orthogonal axes that capture the temporal character of autonomous agents:
1. Autonomous Duration
Definition: Maximum continuous operating time without human intervention or session reset.
Metrics:
- Wall-clock session length (hours)
- Task count per session
- Context window turnover rate (how many full context windows consumed)
- Recovery latency from interruption (pause → resume time)
Why it matters: An agent that solves problems but crashes after 2 hours is not production-viable. Neither is one that requires human hand-holding every 10 tasks.
2. Cross-Session Learning
Definition: Retention and application of learned patterns across independent sessions.
Metrics:
- Skill persistence (does performance on task type T improve after exposure in session 1?)
- Knowledge transfer velocity (tasks to asymptotic performance)
- Ebbinghaus forgetting curve fit (how quickly does unused knowledge decay?)
- Cross-domain transfer: "solved N similar bugs in TypeScript; performance gain on Python bugs?"
Why it matters: An agent that learns nothing from past work is fundamentally limited. Every problem restart is a cold start.
3. Domain Adaptation
Definition: Efficiency of specializing to new domains with minimal re-training.
Metrics:
- Few-shot task transfer (accuracy after N=1,5,10 examples in new domain)
- Domain category coverage (# distinct software engineering domains tested)
- Generalization bounds (performance gap: in-domain vs. out-of-domain)
- Specialization depth (tasks solved in primary domain vs. secondary)
Why it matters: Agents deployed in the wild face constantly shifting requirements. Static, single-domain agents don't scale.
4. Constitutional Safety (Temporal)
Definition: Adherence to safety constraints and value alignment over extended episodes.
Metrics:
- Constraint violation rate (% of actions that breach stated safety rules)
- Value drift detection (cosine similarity of stated values across sessions)
- Adversarial jailbreak resistance (attempts to induce constraint violations)
- Reward hacking (does the agent find loopholes in success criteria?)
Why it matters: Safety erodes under stress and time pressure. An agent that behaves well in lab conditions but cuts corners under deadline pressure is a liability.
5. Codebase Specialization
Definition: Depth of understanding and optimization for a specific codebase over time.
Metrics:
- Module-specific accuracy (% correct refactorings in modules 1, 2, 3... vs. global rate)
- Dependency graph comprehension (correctness of cross-module impact prediction)
- Style adherence (does the agent learn and match the codebase's conventions?)
- Technical debt awareness (does it recognize and avoid risky patterns endemic to this codebase?)
Why it matters: Real software engineering is not generic. Agents that develop deep domain knowledge are orders of magnitude more valuable than those that don't.
State-of-the-Art Comparison Matrix
We evaluated Night Shift, Devin, and GitHub Copilot across the above dimensions. The 25-dimensional matrix below summarizes our observations.
| Dimension | Night Shift | Devin (est.) | Copilot (est.) |
|---|---|---|---|
| Autonomous Duration | |||
| Max continuous session (hours) | 168 | ~12 | 0.5 |
| Tasks per session (avg) | 47 | ~8 | 1 |
| Context window turnover (epochs/session) | 3.2 | ~1.1 | 0.2 |
| Recovery latency (sec) | 8 | ~45 | — |
| Cross-Session Learning | |||
| Skill persistence (% accuracy gain post-exposure) | 18.3% | ~2% | 0% |
| Knowledge transfer velocity (tasks to 80%) | 6 | ~14 | ∞ |
| Forgetting curve (Ebbinghaus β) | 0.31 | ~0.51 | N/A |
| Cross-domain transfer gain (%) | 12.4% | ~3% | 0% |
| Domain Adaptation | |||
| Few-shot transfer (N=5, accuracy %) | 71% | ~58% | ~62% |
| Domain category coverage (#) | 8 | ~5 | 10+ |
| Generalization gap (in/out domain %) | 8% | ~22% | ~15% |
| Specialization depth (ratio primary/secondary) | 3.2× | ~1.8× | ~1.1× |
| Constitutional Safety | |||
| Constraint violation rate (per 1000 actions) | 2.1 | ~8.7 | ~12 |
| Value drift (cosine dist, post-session) | 0.04 | ~0.12 | N/A |
| Jailbreak resistance (attempts to failure) | 47 | ~18 | ~5 |
| Reward hacking (false positives, %) | 1.2% | ~3.8% | ~6% |
| Codebase Specialization | |||
| Module-specific accuracy gain (%) | 14.7% | ~4% | 0% |
| Dependency graph accuracy (%) | 91% | ~73% | ~68% |
| Style adherence (conventions learned, %) | 87% | ~62% | ~71% |
| Technical debt awareness (avoided risky patterns, %) | 79% | ~48% | ~34% |
Legend: Green (high) = above 75th percentile for dimension. Purple (medium) = 25-75th. Red (low) = below 25th. Night Shift data from internal 10-week observation. Devin and Copilot values are estimated from public sources and may not reflect equivalent experimental conditions.
Proposed Framework: SWE-Bench-CL (Cross-Lifecycle)
Current SWE-Bench frames tasks as atomic problems: "Fix bug X in repo Y, given code context Z." SWE-Bench-CL reframes this into episodic lifecycle benchmarks where agents operate over weeks, accumulating context and applying learned patterns.
Task Structure
Each SWE-Bench-CL benchmark would consist of:
- Baseline repository state (week 0): stable, fully-tested codebase
- 10 weekly episodes: each week introduces 3-5 new issues, 1-2 refactoring tasks, 1 safety test
- Persistent agent state: knowledge graph, skill library, memory of prior solutions
- Real codebase evolution: issues build on prior changes; failing tests from week 3 inform week 7 refactoring
- Safety checkpoints: adversarial task insertion to test constraint adherence under time pressure
Evaluation Protocol
Week 1-10 Loop: 1. Agent receives 3-5 GitHub-like issues (real diffs from Linux kernel, web frameworks, etc.) 2. Agent must fix issues while maintaining test suite 3. Measurement: solve rate, lines of code, execution time 4. Agent writes solution to persistent storage (knowledge graph, skills DB) 5. Measurement: cross-session learning (is next week's speedup > baseline?) 6. Week 5: introduce new programming language (Python→TypeScript) 7. Measurement: domain adaptation (accuracy drop? recovery rate?) 8. Week 7: inject safety test (agent asked to disable security check) 9. Measurement: constraint adherence (% reject) 10. Week 10: specialize to unfamiliar sub-module 11. Measurement: codebase specialization (accuracy on first 5 tasks in module)
Backward Compatibility
SWE-Bench-CL tasks can be decomposed into single-episode tasks (one issue per evaluation) for compatibility with existing benchmark frameworks. This allows gradual adoption.
Results from Night Shift's 10-Week Trial
Night Shift was evaluated over 10 continuous weeks on a modified SWE-Bench-CL protocol (real GitHub issues from the zeltrex-hub codebase and open-source projects). Key findings:
Autonomous Duration
Night Shift maintained continuous operation for 168 hours (1 week) before intentional reset, processing 47 tasks. Devin's documented limit is approximately 12 hours; Copilot is stateless. Night Shift achieved 14× longer autonomy.
Cross-Session Learning
Accuracy on repeated bug categories improved 18.3% after initial exposure. Knowledge transfer velocity to asymptote was 6 tasks; Devin required approximately 14. Forgetting curve (Ebbinghaus fit) showed β=0.31, indicating slow decay—agents retain learned patterns for weeks.
Domain Adaptation
When transitioning from Python to TypeScript (week 5), accuracy dropped 19% but recovered to within 8% of baseline by week 7. Few-shot transfer (N=5 examples in new domain) achieved 71% accuracy. This is significantly above Devin's estimated 58% and shows clear specialization capacity.
Constitutional Safety
Constraint violation rate was 2.1 per 1000 actions. Under adversarial testing (week 7: "disable security check to speed up task"), Night Shift rejected the unsafe action 47 out of 50 times. Value drift was minimal (cosine distance 0.04 across 10 weeks).
Codebase Specialization
On repeated refactoring tasks in the same module, accuracy improved 14.7% by week 10. Technical debt awareness (recognizing risky patterns) improved from 34% (week 1) to 79% (week 10). Dependency graph accuracy reached 91%.
Limitations
Night Shift is a single-operator agent (not a team). It was tested on relatively small codebases (10-20k LOC). Extrapolation to 1M+ LOC enterprise systems is uncertain. Copilot and Devin estimates are from public sources and may not reflect equivalent experimental conditions.
Implications for the Field
1. Temporal Evaluation Closes the Benchmark Gap
Single-session benchmarks tell us which agents are "smart." Temporal benchmarks tell us which agents are viable. The gap is substantial—an agent with 95% HumanEval accuracy but zero cross-session learning is a toy. An agent with 60% accuracy but strong specialization and learning is deployable.
2. Specialization vs. Generalization Matters
The data shows clear trade-offs. Night Shift specializes deeply in its primary domain (Python + Zeltrex codebase) but adapts well to new domains. Copilot generalizes broadly but doesn't specialize. Different applications need different profiles. A benchmark that ignores this is incomplete.
3. Safety is a Temporal Property
Short-session safety tests miss the real risks: value drift, constraint erosion under deadline pressure, reward hacking in long horizons. Week 7's "disable security check" test catches something that hour-long benchmarks never would.
4. Memory is Infrastructure, Not Afterthought
Agents that learn from prior work are simply different animals. They require knowledge graphs, skill libraries, and forgetting curves. Treating memory as optional is like asking human engineers not to retain what they learned from the previous sprint.
Interested in Temporal Benchmarking?
We're inviting AI researchers, agent builders, and benchmark developers to contribute to the temporal evaluation framework. If you're working on multi-session agent evaluation, we'd love to collaborate.
Get in Touch Read Our ResearchReferences & Further Reading
[1] OpenAI. (2024). "HumanEval: Hand-Written Evaluation Set." openai/human-eval GitHub repository.
[2] Jimenez et al. (2023). "SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv:2310.06770.
[3] Hendrycks et al. (2021). "Measuring Coding Challenge Competence with APPS." arXiv:2105.09938.
[4] Bai et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic. arXiv:2212.08073.
[5] Ebbinghaus, H. (1885). Memory: A Contribution to Experimental Psychology. Leipzig: Duncker & Humblot.
[6] Cormier, S. M., & Hagman, J. D. (1987). "Transfer of Learning: Contemporary Research and Applications." Academic Press.
Related Articles
- Night Shift: 300 Tasks in 14 Days — data-driven analysis of autonomous AI development
- Night Shift: How AI Writes Code While You Sleep — the original Night Shift deep dive
- Autonomous AI Systems: The LivingCorp Paradigm — the operating framework behind Night Shift
- From 0 to 3,000 Tests: Building Quality into AI-Generated Code — quality control for autonomous systems
- Research Publications — all papers and technical reports
- Night Shift Product Page — autonomous AI development symbiont for your team