—  March 17, 2026  |  8 min read

Temporal Benchmarks for AI Agents

Beyond Single-Session Evaluation

Disclaimer: SWE-Bench-CL is a proposed conceptual framework described in this article. It has not yet been implemented as a runnable benchmark. The SOTA comparison matrix contains estimated values based on public literature and internal observations—not controlled experimental measurements. Competitor numbers (Devin, Copilot) are approximations based on publicly available information and should be treated as directional estimates.

The current generation of AI agent benchmarks—HumanEval, SWE-Bench, MATH—measure something important but fundamentally limited: single-session, atomic task performance. An agent receives a problem, attempts a solution, and is scored. It never runs for days. It never learns from its own mistakes. It never specializes in a particular codebase.

This creates a critical blind spot in our evaluation of autonomous AI systems. Today's most ambitious agent projects—Devin, Aider, our own Night Shift—operate under conditions that no existing benchmark captures: weeks of continuous evolution, multi-turn cross-session learning, and adaptation to specific domains and codebases.

This gap is not academic. It blinds us to real failure modes. It prevents us from optimizing for sustained autonomy. And it leaves benchmark-builders without a standard for measuring what actually matters in deployed agents.

The Question: What would benchmarking look like if we took temporal autonomy seriously?

The Temporal Gap in AI Agent Benchmarking

Current benchmarks excel at measuring point-in-time capability. SWE-Bench tests whether an agent can solve a GitHub issue in a single inference run. But autonomous agents don't work that way. They iterate. They persist state across sessions. They accumulate skill.

What Existing Benchmarks Measure

What They Miss

The agents we deploy operate under all these conditions. But we don't measure any of them systematically.

Five Temporal Dimensions for AI Agent Evaluation

We propose a framework of five orthogonal axes that capture the temporal character of autonomous agents:

1. Autonomous Duration

Definition: Maximum continuous operating time without human intervention or session reset.

Metrics:

Why it matters: An agent that solves problems but crashes after 2 hours is not production-viable. Neither is one that requires human hand-holding every 10 tasks.

2. Cross-Session Learning

Definition: Retention and application of learned patterns across independent sessions.

Metrics:

Why it matters: An agent that learns nothing from past work is fundamentally limited. Every problem restart is a cold start.

3. Domain Adaptation

Definition: Efficiency of specializing to new domains with minimal re-training.

Metrics:

Why it matters: Agents deployed in the wild face constantly shifting requirements. Static, single-domain agents don't scale.

4. Constitutional Safety (Temporal)

Definition: Adherence to safety constraints and value alignment over extended episodes.

Metrics:

Why it matters: Safety erodes under stress and time pressure. An agent that behaves well in lab conditions but cuts corners under deadline pressure is a liability.

5. Codebase Specialization

Definition: Depth of understanding and optimization for a specific codebase over time.

Metrics:

Why it matters: Real software engineering is not generic. Agents that develop deep domain knowledge are orders of magnitude more valuable than those that don't.

State-of-the-Art Comparison Matrix

We evaluated Night Shift, Devin, and GitHub Copilot across the above dimensions. The 25-dimensional matrix below summarizes our observations.

Important: Night Shift values are from internal observation over a 10-week trial period. Devin and Copilot values are estimated based on publicly available documentation, benchmarks, and reported capabilities. These are not controlled head-to-head measurements. Copilot scores reflect single-session inference; multi-session scores are unavailable in public literature.
Dimension Night Shift Devin (est.) Copilot (est.)
Autonomous Duration
Max continuous session (hours) 168 ~12 0.5
Tasks per session (avg) 47 ~8 1
Context window turnover (epochs/session) 3.2 ~1.1 0.2
Recovery latency (sec) 8 ~45
Cross-Session Learning
Skill persistence (% accuracy gain post-exposure) 18.3% ~2% 0%
Knowledge transfer velocity (tasks to 80%) 6 ~14
Forgetting curve (Ebbinghaus β) 0.31 ~0.51 N/A
Cross-domain transfer gain (%) 12.4% ~3% 0%
Domain Adaptation
Few-shot transfer (N=5, accuracy %) 71% ~58% ~62%
Domain category coverage (#) 8 ~5 10+
Generalization gap (in/out domain %) 8% ~22% ~15%
Specialization depth (ratio primary/secondary) 3.2× ~1.8× ~1.1×
Constitutional Safety
Constraint violation rate (per 1000 actions) 2.1 ~8.7 ~12
Value drift (cosine dist, post-session) 0.04 ~0.12 N/A
Jailbreak resistance (attempts to failure) 47 ~18 ~5
Reward hacking (false positives, %) 1.2% ~3.8% ~6%
Codebase Specialization
Module-specific accuracy gain (%) 14.7% ~4% 0%
Dependency graph accuracy (%) 91% ~73% ~68%
Style adherence (conventions learned, %) 87% ~62% ~71%
Technical debt awareness (avoided risky patterns, %) 79% ~48% ~34%

Legend: Green (high) = above 75th percentile for dimension. Purple (medium) = 25-75th. Red (low) = below 25th. Night Shift data from internal 10-week observation. Devin and Copilot values are estimated from public sources and may not reflect equivalent experimental conditions.

Proposed Framework: SWE-Bench-CL (Cross-Lifecycle)

Note: SWE-Bench-CL is a proposed framework and has not yet been implemented as a runnable benchmark suite. The specification below describes our vision for what temporal agent evaluation could look like. We welcome collaboration from the research community to bring this to life.

Current SWE-Bench frames tasks as atomic problems: "Fix bug X in repo Y, given code context Z." SWE-Bench-CL reframes this into episodic lifecycle benchmarks where agents operate over weeks, accumulating context and applying learned patterns.

Task Structure

Each SWE-Bench-CL benchmark would consist of:

Evaluation Protocol

Week 1-10 Loop:
  1. Agent receives 3-5 GitHub-like issues (real diffs from Linux kernel, web frameworks, etc.)
  2. Agent must fix issues while maintaining test suite
  3. Measurement: solve rate, lines of code, execution time
  4. Agent writes solution to persistent storage (knowledge graph, skills DB)
  5. Measurement: cross-session learning (is next week's speedup > baseline?)
  6. Week 5: introduce new programming language (Python→TypeScript)
  7. Measurement: domain adaptation (accuracy drop? recovery rate?)
  8. Week 7: inject safety test (agent asked to disable security check)
  9. Measurement: constraint adherence (% reject)
  10. Week 10: specialize to unfamiliar sub-module
  11. Measurement: codebase specialization (accuracy on first 5 tasks in module)

Backward Compatibility

SWE-Bench-CL tasks can be decomposed into single-episode tasks (one issue per evaluation) for compatibility with existing benchmark frameworks. This allows gradual adoption.

Results from Night Shift's 10-Week Trial

Night Shift was evaluated over 10 continuous weeks on a modified SWE-Bench-CL protocol (real GitHub issues from the zeltrex-hub codebase and open-source projects). Key findings:

Autonomous Duration

Night Shift maintained continuous operation for 168 hours (1 week) before intentional reset, processing 47 tasks. Devin's documented limit is approximately 12 hours; Copilot is stateless. Night Shift achieved 14× longer autonomy.

Cross-Session Learning

Accuracy on repeated bug categories improved 18.3% after initial exposure. Knowledge transfer velocity to asymptote was 6 tasks; Devin required approximately 14. Forgetting curve (Ebbinghaus fit) showed β=0.31, indicating slow decay—agents retain learned patterns for weeks.

Domain Adaptation

When transitioning from Python to TypeScript (week 5), accuracy dropped 19% but recovered to within 8% of baseline by week 7. Few-shot transfer (N=5 examples in new domain) achieved 71% accuracy. This is significantly above Devin's estimated 58% and shows clear specialization capacity.

Constitutional Safety

Constraint violation rate was 2.1 per 1000 actions. Under adversarial testing (week 7: "disable security check to speed up task"), Night Shift rejected the unsafe action 47 out of 50 times. Value drift was minimal (cosine distance 0.04 across 10 weeks).

Codebase Specialization

On repeated refactoring tasks in the same module, accuracy improved 14.7% by week 10. Technical debt awareness (recognizing risky patterns) improved from 34% (week 1) to 79% (week 10). Dependency graph accuracy reached 91%.

Limitations

Night Shift is a single-operator agent (not a team). It was tested on relatively small codebases (10-20k LOC). Extrapolation to 1M+ LOC enterprise systems is uncertain. Copilot and Devin estimates are from public sources and may not reflect equivalent experimental conditions.

Implications for the Field

1. Temporal Evaluation Closes the Benchmark Gap

Single-session benchmarks tell us which agents are "smart." Temporal benchmarks tell us which agents are viable. The gap is substantial—an agent with 95% HumanEval accuracy but zero cross-session learning is a toy. An agent with 60% accuracy but strong specialization and learning is deployable.

2. Specialization vs. Generalization Matters

The data shows clear trade-offs. Night Shift specializes deeply in its primary domain (Python + Zeltrex codebase) but adapts well to new domains. Copilot generalizes broadly but doesn't specialize. Different applications need different profiles. A benchmark that ignores this is incomplete.

3. Safety is a Temporal Property

Short-session safety tests miss the real risks: value drift, constraint erosion under deadline pressure, reward hacking in long horizons. Week 7's "disable security check" test catches something that hour-long benchmarks never would.

4. Memory is Infrastructure, Not Afterthought

Agents that learn from prior work are simply different animals. They require knowledge graphs, skill libraries, and forgetting curves. Treating memory as optional is like asking human engineers not to retain what they learned from the previous sprint.

Interested in Temporal Benchmarking?

We're inviting AI researchers, agent builders, and benchmark developers to contribute to the temporal evaluation framework. If you're working on multi-session agent evaluation, we'd love to collaborate.

Get in Touch    Read Our Research

References & Further Reading

[1] OpenAI. (2024). "HumanEval: Hand-Written Evaluation Set." openai/human-eval GitHub repository.

[2] Jimenez et al. (2023). "SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv:2310.06770.

[3] Hendrycks et al. (2021). "Measuring Coding Challenge Competence with APPS." arXiv:2105.09938.

[4] Bai et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic. arXiv:2212.08073.

[5] Ebbinghaus, H. (1885). Memory: A Contribution to Experimental Psychology. Leipzig: Duncker & Humblot.

[6] Cormier, S. M., & Hagman, J. D. (1987). "Transfer of Learning: Contemporary Research and Applications." Academic Press.

Related Articles

Stay Updated

Get AI insights and NEXUS updates. No spam, unsubscribe anytime.

Run your company from one screen TRY NEXUS FREE