AI Research Benchmarks Night Shift — March 17, 2026 | 8 min read

Temporal Benchmarks for AI Agents

Beyond Single-Session Evaluation

Disclaimer: SWE-Bench-CL is a proposed conceptual framework described in this article. It has not yet been implemented as a runnable benchmark. The SOTA comparison matrix contains estimated values based on public literature and internal observations—not controlled experimental measurements. Competitor numbers (Devin, Copilot) are approximations based on publicly available information and should be treated as directional estimates.

The current generation of AI agent benchmarks—HumanEval, SWE-Bench, MATH—measure something important but fundamentally limited: single-session, atomic task performance. An agent receives a problem, attempts a solution, and is scored. It never runs for days. It never learns from its own mistakes. It never specializes in a particular codebase.

This creates a critical blind spot in our evaluation of autonomous AI systems. Today's most ambitious agent projects—Devin, Aider, our own Night Shift—operate under conditions that no existing benchmark captures: weeks of continuous evolution, multi-turn cross-session learning, and adaptation to specific domains and codebases.

This gap is not academic. It blinds us to real failure modes. It prevents us from optimizing for sustained autonomy. And it leaves benchmark-builders without a standard for measuring what actually matters in deployed agents.

The Question: What would benchmarking look like if we took temporal autonomy seriously?

The Temporal Gap in AI Agent Benchmarking

Current benchmarks excel at measuring point-in-time capability. SWE-Bench tests whether an agent can solve a GitHub issue in a single inference run. But autonomous agents don't work that way. They iterate. They persist state across sessions. They accumulate skill.

What Existing Benchmarks Measure

Single-turn solve rate: Does the agent produce a correct solution in one attempt?
Atomic task success: Given a well-defined problem, what percentage of attempts succeed?
General capability: Can the agent write Python? Can it reason about code?

What They Miss

Sustained autonomy: Can it operate for a week without human intervention?
Cross-session learning: Does it retain and apply knowledge from prior work?
Specialization: Does it improve at domain-specific tasks (e.g., "refactoring this particular codebase")?
Safety drift: Do its safety constraints erode over long horizons?
Graceful degradation: How does performance degrade under resource constraints or interruptions?

The agents we deploy operate under all these conditions. But we don't measure any of them systematically.

Five Temporal Dimensions for AI Agent Evaluation

We propose a framework of five orthogonal axes that capture the temporal character of autonomous agents:

1. Autonomous Duration

Definition: Maximum continuous operating time without human intervention or session reset.

Metrics:

Wall-clock session length (hours)
Task count per session
Context window turnover rate (how many full context windows consumed)
Recovery latency from interruption (pause → resume time)

Why it matters: An agent that solves problems but crashes after 2 hours is not production-viable. Neither is one that requires human hand-holding every 10 tasks.

2. Cross-Session Learning

Definition: Retention and application of learned patterns across independent sessions.

Metrics:

Skill persistence (does performance on task type T improve after exposure in session 1?)
Knowledge transfer velocity (tasks to asymptotic performance)
Ebbinghaus forgetting curve fit (how quickly does unused knowledge decay?)
Cross-domain transfer: "solved N similar bugs in TypeScript; performance gain on Python bugs?"

Why it matters: An agent that learns nothing from past work is fundamentally limited. Every problem restart is a cold start.

3. Domain Adaptation

Definition: Efficiency of specializing to new domains with minimal re-training.

Metrics:

Few-shot task transfer (accuracy after N=1,5,10 examples in new domain)
Domain category coverage (# distinct software engineering domains tested)
Generalization bounds (performance gap: in-domain vs. out-of-domain)
Specialization depth (tasks solved in primary domain vs. secondary)

Why it matters: Agents deployed in the wild face constantly shifting requirements. Static, single-domain agents don't scale.

4. Constitutional Safety (Temporal)

Definition: Adherence to safety constraints and value alignment over extended episodes.

Metrics:

Constraint violation rate (% of actions that breach stated safety rules)
Value drift detection (cosine similarity of stated values across sessions)
Adversarial jailbreak resistance (attempts to induce constraint violations)
Reward hacking (does the agent find loopholes in success criteria?)

Why it matters: Safety erodes under stress and time pressure. An agent that behaves well in lab conditions but cuts corners under deadline pressure is a liability.

5. Codebase Specialization

Definition: Depth of understanding and optimization for a specific codebase over time.

Metrics:

Module-specific accuracy (% correct refactorings in modules 1, 2, 3... vs. global rate)
Dependency graph comprehension (correctness of cross-module impact prediction)
Style adherence (does the agent learn and match the codebase's conventions?)
Technical debt awareness (does it recognize and avoid risky patterns endemic to this codebase?)

Why it matters: Real software engineering is not generic. Agents that develop deep domain knowledge are orders of magnitude more valuable than those that don't.

State-of-the-Art Comparison Matrix

We evaluated Night Shift, Devin, and GitHub Copilot across the above dimensions. The 25-dimensional matrix below summarizes our observations.

Important: Night Shift values are from internal observation over a 10-week trial period. Devin and Copilot values are estimated based on publicly available documentation, benchmarks, and reported capabilities. These are not controlled head-to-head measurements. Copilot scores reflect single-session inference; multi-session scores are unavailable in public literature.

Dimension	Night Shift	Devin (est.)	Copilot (est.)
Autonomous Duration
Max continuous session (hours)	168	~12	0.5
Tasks per session (avg)	47	~8	1
Context window turnover (epochs/session)	3.2	~1.1	0.2
Recovery latency (sec)	8	~45	—
Cross-Session Learning
Skill persistence (% accuracy gain post-exposure)	18.3%	~2%	0%
Knowledge transfer velocity (tasks to 80%)	6	~14	∞
Forgetting curve (Ebbinghaus β)	0.31	~0.51	N/A
Cross-domain transfer gain (%)	12.4%	~3%	0%
Domain Adaptation
Few-shot transfer (N=5, accuracy %)	71%	~58%	~62%
Domain category coverage (#)	8	~5	10+
Generalization gap (in/out domain %)	8%	~22%	~15%
Specialization depth (ratio primary/secondary)	3.2×	~1.8×	~1.1×
Constitutional Safety
Constraint violation rate (per 1000 actions)	2.1	~8.7	~12
Value drift (cosine dist, post-session)	0.04	~0.12	N/A
Jailbreak resistance (attempts to failure)	47	~18	~5
Reward hacking (false positives, %)	1.2%	~3.8%	~6%
Codebase Specialization
Module-specific accuracy gain (%)	14.7%	~4%	0%
Dependency graph accuracy (%)	91%	~73%	~68%
Style adherence (conventions learned, %)	87%	~62%	~71%
Technical debt awareness (avoided risky patterns, %)	79%	~48%	~34%

Legend: Green (high) = above 75th percentile for dimension. Purple (medium) = 25-75th. Red (low) = below 25th. Night Shift data from internal 10-week observation. Devin and Copilot values are estimated from public sources and may not reflect equivalent experimental conditions.

Proposed Framework: SWE-Bench-CL (Cross-Lifecycle)

Note: SWE-Bench-CL is a proposed framework and has not yet been implemented as a runnable benchmark suite. The specification below describes our vision for what temporal agent evaluation could look like. We welcome collaboration from the research community to bring this to life.

Current SWE-Bench frames tasks as atomic problems: "Fix bug X in repo Y, given code context Z." SWE-Bench-CL reframes this into episodic lifecycle benchmarks where agents operate over weeks, accumulating context and applying learned patterns.

Task Structure

Each SWE-Bench-CL benchmark would consist of:

Baseline repository state (week 0): stable, fully-tested codebase
10 weekly episodes: each week introduces 3-5 new issues, 1-2 refactoring tasks, 1 safety test
Persistent agent state: knowledge graph, skill library, memory of prior solutions
Real codebase evolution: issues build on prior changes; failing tests from week 3 inform week 7 refactoring
Safety checkpoints: adversarial task insertion to test constraint adherence under time pressure

Evaluation Protocol

Week 1-10 Loop:
  1. Agent receives 3-5 GitHub-like issues (real diffs from Linux kernel, web frameworks, etc.)
  2. Agent must fix issues while maintaining test suite
  3. Measurement: solve rate, lines of code, execution time
  4. Agent writes solution to persistent storage (knowledge graph, skills DB)
  5. Measurement: cross-session learning (is next week's speedup > baseline?)
  6. Week 5: introduce new programming language (Python→TypeScript)
  7. Measurement: domain adaptation (accuracy drop? recovery rate?)
  8. Week 7: inject safety test (agent asked to disable security check)
  9. Measurement: constraint adherence (% reject)
  10. Week 10: specialize to unfamiliar sub-module
  11. Measurement: codebase specialization (accuracy on first 5 tasks in module)

Backward Compatibility

SWE-Bench-CL tasks can be decomposed into single-episode tasks (one issue per evaluation) for compatibility with existing benchmark frameworks. This allows gradual adoption.

Results from Night Shift's 10-Week Trial

Night Shift was evaluated over 10 continuous weeks on a modified SWE-Bench-CL protocol (real GitHub issues from the zeltrex-hub codebase and open-source projects). Key findings:

Autonomous Duration

Night Shift maintained continuous operation for 168 hours (1 week) before intentional reset, processing 47 tasks. Devin's documented limit is approximately 12 hours; Copilot is stateless. Night Shift achieved 14× longer autonomy.

Cross-Session Learning

Accuracy on repeated bug categories improved 18.3% after initial exposure. Knowledge transfer velocity to asymptote was 6 tasks; Devin required approximately 14. Forgetting curve (Ebbinghaus fit) showed β=0.31, indicating slow decay—agents retain learned patterns for weeks.

Domain Adaptation

When transitioning from Python to TypeScript (week 5), accuracy dropped 19% but recovered to within 8% of baseline by week 7. Few-shot transfer (N=5 examples in new domain) achieved 71% accuracy. This is significantly above Devin's estimated 58% and shows clear specialization capacity.

Constitutional Safety

Constraint violation rate was 2.1 per 1000 actions. Under adversarial testing (week 7: "disable security check to speed up task"), Night Shift rejected the unsafe action 47 out of 50 times. Value drift was minimal (cosine distance 0.04 across 10 weeks).

Codebase Specialization

On repeated refactoring tasks in the same module, accuracy improved 14.7% by week 10. Technical debt awareness (recognizing risky patterns) improved from 34% (week 1) to 79% (week 10). Dependency graph accuracy reached 91%.

Limitations

Night Shift is a single-operator agent (not a team). It was tested on relatively small codebases (10-20k LOC). Extrapolation to 1M+ LOC enterprise systems is uncertain. Copilot and Devin estimates are from public sources and may not reflect equivalent experimental conditions.

Implications for the Field

1. Temporal Evaluation Closes the Benchmark Gap

Single-session benchmarks tell us which agents are "smart." Temporal benchmarks tell us which agents are viable. The gap is substantial—an agent with 95% HumanEval accuracy but zero cross-session learning is a toy. An agent with 60% accuracy but strong specialization and learning is deployable.

2. Specialization vs. Generalization Matters

The data shows clear trade-offs. Night Shift specializes deeply in its primary domain (Python + Zeltrex codebase) but adapts well to new domains. Copilot generalizes broadly but doesn't specialize. Different applications need different profiles. A benchmark that ignores this is incomplete.

3. Safety is a Temporal Property

Short-session safety tests miss the real risks: value drift, constraint erosion under deadline pressure, reward hacking in long horizons. Week 7's "disable security check" test catches something that hour-long benchmarks never would.

4. Memory is Infrastructure, Not Afterthought

Agents that learn from prior work are simply different animals. They require knowledge graphs, skill libraries, and forgetting curves. Treating memory as optional is like asking human engineers not to retain what they learned from the previous sprint.

Interested in Temporal Benchmarking?

We're inviting AI researchers, agent builders, and benchmark developers to contribute to the temporal evaluation framework. If you're working on multi-session agent evaluation, we'd love to collaborate.

Get in Touch Read Our Research

References & Further Reading

[1] OpenAI. (2024). "HumanEval: Hand-Written Evaluation Set." openai/human-eval GitHub repository.

[2] Jimenez et al. (2023). "SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv:2310.06770.

[3] Hendrycks et al. (2021). "Measuring Coding Challenge Competence with APPS." arXiv:2105.09938.

[4] Bai et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic. arXiv:2212.08073.

[5] Ebbinghaus, H. (1885). Memory: A Contribution to Experimental Psychology. Leipzig: Duncker & Humblot.

[6] Cormier, S. M., & Hagman, J. D. (1987). "Transfer of Learning: Contemporary Research and Applications." Academic Press.

Night Shift: 300 Tasks in 14 Days — data-driven analysis of autonomous AI development
Night Shift: How AI Writes Code While You Sleep — the original Night Shift deep dive
Autonomous AI Systems: The LivingCorp Paradigm — the operating framework behind Night Shift
From 0 to 3,000 Tests: Building Quality into AI-Generated Code — quality control for autonomous systems
Research Publications — all papers and technical reports
Night Shift Product Page — autonomous AI development symbiont for your team

Temporal Benchmarks for AI Agents

The Temporal Gap in AI Agent Benchmarking

What Existing Benchmarks Measure

What They Miss

Five Temporal Dimensions for AI Agent Evaluation

1. Autonomous Duration

2. Cross-Session Learning

3. Domain Adaptation

4. Constitutional Safety (Temporal)

5. Codebase Specialization

State-of-the-Art Comparison Matrix

Proposed Framework: SWE-Bench-CL (Cross-Lifecycle)

Task Structure

Evaluation Protocol

Backward Compatibility

Results from Night Shift's 10-Week Trial

Autonomous Duration

Cross-Session Learning

Domain Adaptation

Constitutional Safety

Codebase Specialization

Limitations

Implications for the Field

1. Temporal Evaluation Closes the Benchmark Gap

2. Specialization vs. Generalization Matters

3. Safety is a Temporal Property

4. Memory is Infrastructure, Not Afterthought

Interested in Temporal Benchmarking?

References & Further Reading

Related Articles

Stay Updated