2026

CALIBRATION COLLAPSE: WHEN METACOGNITIVE CONFIDENCE PREDICTIONS DEGENERATE TO CONSTANTS IN AUTONOMOUS AI SYSTEMS

Vasyl Golubenko · TOV ZELTREX · April 2026

We report the first empirical documentation of calibration collapse — a failure mode in which a deployed metacognitive planning system degenerates to outputting a constant confidence value across all predictions, eliminating discriminative signal while preserving the structural appearance of self-awareness. In a production system operating over 14 days and 69 task executions, the metacognitive planner produced an identical confidence score of 0.7575 across all 60 recorded predictions, regardless of task category, model assignment, or historical quality outcomes. Root cause analysis revealed an empty calibration boundary table (0 rows despite 219 quality observations) — a metacognitive Potemkin village. We formalize calibration collapse as a failure class distinct from overconfidence, propose variance-based detection metrics, and outline a mandatory calibration bootstrapping protocol.

metacognition · confidence calibration · autonomous agents · failure modes · LLM
7 pages · 15 references
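The variance-based detection the abstract proposes can be sketched in a few lines: if a rolling window of confidence outputs has (near-)zero variance, the planner has collapsed to a constant. The function name, window size, and variance floor below are our illustrative assumptions, not the paper's implementation.

```python
import statistics

def detect_calibration_collapse(confidences, min_n=20, var_floor=1e-4):
    """Flag a metacognitive planner whose confidence outputs have
    degenerated to a (near-)constant value, e.g. 60 identical 0.7575s."""
    if len(confidences) < min_n:
        return False  # too little evidence to call collapse
    return statistics.pvariance(confidences) < var_floor
```

Sixty copies of 0.7575 trip the detector immediately, while any planner that still discriminates between tasks keeps its variance well above the floor.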

SILENT STARVATION: DEPENDENCY DEADLOCK AND PLANNING LOOP FAILURE MODES IN AUTONOMOUS AI TASK SCHEDULING

Vasyl Golubenko · TOV ZELTREX · April 2026

We report the first empirical documentation of silent starvation — a compound failure in which a dependency chain deadlock combines with a decoupled planning layer to produce an agent that runs hundreds of planning cycles while executing zero tasks, without generating any error signals. In a production system, we observed 452 identical task selections over 7 days, producing 0 task executions and 0 errors. Root cause analysis revealed status promotion failure, backlog desynchronization, and planning-execution decoupling. We formalize a taxonomy of four autonomous scheduling failure modes, propose detection heuristics based on selection entropy and execution ratio monitoring, and validate that correcting the root cause immediately restored task flow.

autonomous agents · task scheduling · dependency resolution · failure modes · deadlock detection
7 pages · 16 references
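The two detection heuristics named in the abstract — selection entropy and execution-ratio monitoring — compose naturally into one compound check. A minimal sketch, with thresholds and names that are our assumptions:

```python
import math
from collections import Counter

def selection_entropy(selections):
    """Shannon entropy (bits) of the task-selection history; near zero
    means the planner keeps choosing the same task every cycle."""
    total = len(selections)
    return -sum(
        (c / total) * math.log2(c / total)
        for c in Counter(selections).values()
    )

def is_silently_starving(selections, executed, entropy_floor=0.1):
    """Compound check: degenerate selections AND zero completed executions."""
    if not selections:
        return False
    ratio = executed / len(selections)
    return selection_entropy(selections) < entropy_floor and ratio == 0.0
```

The reported failure — 452 identical selections, 0 executions — yields entropy 0.0 and execution ratio 0.0, exactly the signature this monitor alarms on.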

WHY OUR AUTONOMOUS AI AGENT'S QUALITY DROPPED 31% — AND WHAT THE DATA REVEALED

Vasyl Golubenko · TOV ZELTREX · March 2026

Night Shift is an autonomous AI development system that runs 24/7. Over 31 operational days, we observed quality scores decline 31%, followed by a 4-day outage caused by a safety mechanism feedback loop. Root cause analysis of 287+ task executions revealed six findings, chief among them: output truncation is the primary quality determinant, LLM-as-Judge calibration shift can masquerade as quality decline, and safety mechanisms without escape hatches cause worse outages than the failures they prevent.

autonomous agents · quality assessment · LLM-as-Judge · safety mechanisms · root cause analysis
11 pages · 16 references

FROM SEARCH TO GROWTH: A COMPARATIVE TAXONOMY OF AUTONOMOUS AI EXPERIMENTATION PARADIGMS

Vasyl Golubenko · TOV ZELTREX · March 2026

We present a four-level taxonomy of autonomous AI experimentation paradigms: Hill Climber, Evolutionary Optimizer, Cognitive Architect, and Organizational Evolver. Comparing Karpathy's AutoResearch (greedy single-metric optimization) with GODEGEN/Night Shift (evolutionary multi-objective growth), we analyze 17 architectural dimensions and identify seven transferable design patterns. We demonstrate that constitutional safety constraints, not merely git-based rollback, are the critical differentiator for production deployment of self-modifying agents. Drawing on 287+ autonomous tasks and 46 references, we articulate the philosophical distinction between search (convergence to optima) and growth (expansion of capability frontiers).

Autonomous Experimentation · Evolutionary Growth · Constitutional Safety · AutoResearch · GODEGEN · Taxonomy
4 Paradigm Levels · 17 Dimensions · 7 Design Patterns · 46 References

THE LIVING AGENT: PRODUCTION EVOLUTIONARY SELF-IMPROVEMENT IN AUTONOMOUS AI SYSTEMS

Vasyl Golubenko · TOV ZELTREX · March 2026

We present GODEGEN, the first production AI agent system combining five capabilities previously seen only in isolation: evolutionary optimization, persistent digital identity, constitutional self-modification, autonomous multi-day operation, and agent distillation. Operating on a real codebase for 10+ consecutive days at $0.24/task, the system scores 80/100 when evaluated against 5 leading systems on a 25-dimension matrix. We propose four new temporal benchmarks (ImproveBench, AdaptBench, CostBench, AutonomyBench) and argue that evolutionary improvement at 4,380 steps/year creates a fundamentally different trajectory than static model upgrades.

Self-Improving Agents · Evolutionary Optimization · Digital Personality · Autonomous AI · Agent Distillation · Constitutional AI
5/5 Pillar Score · 25 Eval Dimensions · 280+ Tasks Completed · 32 References

DIGITAL DEVELOPMENTAL PSYCHOLOGY: A COGNITIVE FRAMEWORK FOR SELF-IMPROVING AI AGENTS

Vasyl Golubenko · TOV ZELTREX · March 2026

We propose Digital Developmental Psychology (DDP), a framework mapping nine theories from cognitive and developmental psychology onto the architectural components of a production autonomous agent. Piaget's stages, Vygotsky's ZPD, Bloom's taxonomy, Ebbinghaus's forgetting curve, Berlyne's curiosity theory, Kahneman's dual process theory, and four more are operationalized as eight software modules (2,363 LOC, 125 tests). We derive 21 testable predictions and validate three from 280+ production tasks: quality distributions are non-normal (Piaget), Create-level tasks score lower than Apply-level (Bloom), and skill usage follows a power law (ACT-R).

Developmental Psychology · Cognitive Architecture · Self-Improving Agents · Piaget Stages · ZPD · Intrinsic Motivation
9 Psychology Theories · 21 Predictions · 2,363 Lines of Code · 37 References

CONSTITUTIONAL SELF-MODIFICATION: A 7-LAYER SAFETY FRAMEWORK FOR AUTONOMOUS CODE-GENERATING AGENTS

Vasyl Golubenko · TOV ZELTREX · March 2026

Autonomous AI coding agents that can merge their own code introduce a novel safety challenge. We present a 7-layer defense-in-depth framework enforcing 19 constitutional rules across input validation, path constraints, quality gates, integration guards, merge analysis, deployment rollback, and circuit breakers. The framework was validated on 280+ tasks over 10 days with zero safety violations and a 73% auto-merge success rate. A systematic audit of 23 modules revealed 8 critical "dead integration" gaps: safety features fully coded but never activated. We introduce the Safety Activation Rate (SAR) metric and compare against 6 existing frameworks.

Constitutional AI · Safe Self-Modification · Defense in Depth · CI/CD Safety · RSI Safety · Dead Integration
7 Safety Layers · 19 Constitutional Rules · 0 Safety Violations · 25 References

BENCHMARKING SELF-IMPROVEMENT: TEMPORAL EVALUATION METRICS FOR AUTONOMOUS AI AGENTS

Vasyl Golubenko · TOV ZELTREX · March 2026

Existing benchmarks for AI coding agents measure point-in-time capability. We propose four novel temporal benchmarks — ImproveBench, AdaptBench, CostBench, and AutonomyBench — built on non-parametric statistical foundations (Mann-Kendall trend test, Theil-Sen slope estimator, EWMA smoothing). We validate the suite on 280+ production tasks from a 10-day autonomous deployment, achieving a composite score of 62.1/100 (C+). Our results demonstrate that temporal metrics reveal performance characteristics invisible to point-in-time evaluation: strong domain adaptation combined with stagnant quality improvement — a nuance no existing benchmark captures.

Benchmark · Self-Improving Agents · Temporal Evaluation · Mann-Kendall · Production Deployment · Autonomous AI
4 Novel Benchmarks · 280+ Tasks Validated · 1,168 LOC Implementation · 21 References

THE TEMPORAL DIMENSION GAP: WHY LIVING AGENTS DIVERGE FROM TOOL AGENTS IN AUTONOMOUS SOFTWARE ENGINEERING

Vasyl Golubenko · TOV ZELTREX · March 2026

We identify and analyze a previously unrecognized evaluation gap in AI coding agents: temporal capabilities — properties that emerge and compound over an agent's operational lifetime. Current benchmarks measure point-in-time task performance but ignore cross-session learning, autonomous operation duration, domain adaptation speed, codebase specialization, and constitutional safety for self-modification. We extend our 20-dimension SOTA comparison matrix with 5 temporal dimensions, scoring 10 production systems. GODEGEN scores 24/25 on temporal dimensions versus 13/25 for the best competitor (Devin 2.0). We survey 70+ papers, identify 6 literature gaps, and propose four novel temporal benchmarks (ImproveBench, AdaptBench, CostBench, AutonomyBench) with Mann-Kendall trend testing and Theil-Sen slope estimation. Analysis of GitHub Copilot's Agentic Memory launch (March 4, 2026) reveals it stores codebase facts but not self-improvement data, preserving GODEGEN's structural advantage.

Temporal Evaluation · Self-Improving Agents · Cross-Session Learning · Agent Benchmarks · Living Agents · Evolutionary Optimization
10 Systems Evaluated · 25 Eval Dimensions · 6 Literature Gaps · 70 References

NIGHT SHIFT: A PRODUCTION AUTONOMOUS AI DEVELOPMENT SYSTEM WITH EVOLUTIONARY TASK OPTIMIZATION

Vasyl Golubenko · TOV ZELTREX · March 2026

We present Night Shift, a production-deployed autonomous AI development system that operates continuously on a 2-hour dispatch cycle, generating code, specifications, research reports, and documentation without human intervention. Unlike benchmark-oriented agent systems, Night Shift has been deployed to a production server for 10+ consecutive days, completing 269 tasks at a total cost of $65.98. The system introduces four key innovations: an evolutionary task optimization engine based on genetic algorithms, a continuation-based anti-truncation mechanism achieving a 100% completion rate, cascade model routing across local GPU and cloud API tiers, and a human-AI symbiont loop where daily mentoring reviews shape system behavior. We compare against SWE-agent, Devin, MetaGPT, and AlphaEvolve, demonstrating that production deployment demands fundamentally different design priorities than benchmark performance.

Autonomous Agents · Software Engineering · Genetic Algorithms · LLM Agents · Cost Optimization · Human-AI Collaboration
269 Tasks Completed · $0.25 Avg Cost / Task · 100% Completion Rate · 33 References

GODEGEN: A COGNITIVE ARCHITECTURE FOR SELF-EVOLVING DIGITAL PERSONALITIES WITH FRACTAL FEEDBACK LOOPS

Vasyl Golubenko · TOV ZELTREX · March 2026

We present GODEGEN, a cognitive architecture that transforms autonomous AI development agents from task executors into self-evolving Digital Personalities. Building on Night Shift, we introduce six cognitive modules: a four-persona agent swarm, a SQLite-backed knowledge graph, a four-stage information refiner, a Voyager-inspired skill library, a DP-level epoch evolution manager, and a formal identity document. A systematic production-readiness audit revealed 8 critical wiring gaps — a failure mode we term "dead integration." Gap closure (91 tests) produced a system where all six modules initialized on the first production run. We compare against 12 state-of-the-art systems across 15 capabilities, finding GODEGEN leads in persistent identity evolution and fractal scaling, but lags in formal benchmarks and tool use.

Cognitive Architecture · Gap Analysis · Agent Swarm · Knowledge Graph · Production AI · Fractal Organization
6 Cognitive Modules · 8 Gaps Closed · 91 Tests Added · 25 References

SELF-IMPROVING AI SYSTEMS: MATHEMATICAL FOUNDATIONS AND PRODUCTION VALIDATION

Vasyl Golubenko, Viktor Zhelizko · TOV ZELTREX · March 2026

We formalize the mathematical foundations of self-improving AI systems through 12 equations validated against 323 production tasks over 17 days. We derive the Quality Decay Function (R²=0.91) linking infrastructure failures to output degradation, adapt Rechenberg's 1/5 evolutionary rule for AI configuration optimization, and apply Mann-Kendall trend detection for quality monitoring. Production data demonstrates a 40% cost reduction through multi-provider routing, and shows mentoring adoption following a logistic growth curve with structural change at τ=3 days and behavioral change at τ=7 days.

Self-Improving AI · Mathematical Foundations · Evolutionary Optimization · Quality Monitoring · Production Validation · Rechenberg 1/5
12 Equations · 323 Tasks Validated · 17 Days Production · 30 References

SELF-HEALING ARCHITECTURE FOR AUTONOMOUS AI AGENTS: A SIX-LAYER RESILIENCE FRAMEWORK

Vasyl Golubenko, Viktor Zhelizko · TOV ZELTREX · March 2026

We present a six-layer self-healing architecture for autonomous AI agents, validated in production over 17 days (323 tasks, $84 total cost). The framework combines WAL-mode persistence, exponential retry with jitter, idempotent task execution, watchdog monitoring, circuit breakers, and MAPE-K autonomic control. A fix cascade analysis reveals mean chain length E[L]=3.0 with coupling probability p=0.67. The system recovered from a 3-day outage autonomously, achieving 99.7% availability. We formalize self-healing response time and redundancy cost models, and demonstrate Goodhart's Law effects in autonomous maintenance.

Self-Healing · Resilience Framework · MAPE-K · Crash Recovery · Fix Cascade · Autonomic Computing
6 Resilience Layers · 99.7% Availability · 10 Equations · 27 References

WHEN YOUR AI AGENT WRITES JAVA FOR A PYTHON PROJECT: OUTPUT-TYPE-AWARE MODEL ROUTING

Vasyl Golubenko · TOV ZELTREX · March 2026

We document a catastrophic failure mode in multi-model autonomous development systems where a local 32B-parameter LLM consistently hallucinated wrong programming languages for code tasks — producing Java Spring Boot and TypeScript for a Python-only codebase. We introduce output-type-aware routing that routes code tasks to API-backed models while keeping prose on the local GPU, preserving 99.5% of cost savings while eliminating the failure mode. We formalize task-type sensitivity and propose a 2x2 routing matrix generalizable to any multi-model agent architecture.

Model Routing · Language Hallucination · Multi-Model Systems · Cost Optimization
0% Code Merge Rate (Before) · 98.8% Cost Reduction · 11 References
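The abstract does not spell out the axes of its 2x2 routing matrix; one plausible instantiation (our assumption) crosses output type with the local model's reliability for that type, sending code to API-backed models while keeping prose local.

```python
# Hypothetical 2x2 output-type-aware routing matrix:
# (output type, local model reliable for that type) -> backend.
ROUTING_MATRIX = {
    ("code", False): "cloud-api",   # local model hallucinates languages
    ("code", True): "local-gpu",
    ("prose", False): "cloud-api",
    ("prose", True): "local-gpu",
}

def route(output_type, local_reliable):
    """Pick an execution backend from the matrix above."""
    return ROUTING_MATRIX[(output_type, local_reliable)]
```

Because prose tasks dominate volume, routing only code to the cloud preserves nearly all of the local-GPU cost savings while eliminating the language-hallucination failure mode.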

WHEN EVOLUTION CAN'T EVOLVE: DISCONNECTED GENOMES IN AUTONOMOUS AI SYSTEMS

Vasyl Golubenko · TOV ZELTREX · March 2026

We report a failure mode where 477 genome mutations across 28 generations produced zero fitness improvement in an evolutionary optimization engine for autonomous AI development. Root cause: a 5-link causal disconnection chain between genotype (evolved parameters) and phenotype (task outcomes). We formalize Disconnected Evolution as Goodhart's Law applied to evolutionary optimization, survey related phenomena across evolutionary computation, NAS, meta-learning, and prompt optimization (25+ papers), and propose detection methods and architectural fixes.

Evolutionary Optimization · Goodhart's Law · Genotype-Phenotype · Disconnected Evolution · Autonomous Agents
0/477 Mutations Improved · 5 Causal Break Links · 23 References

ABOUT OUR RESEARCH

ZELTREX research focuses on production-grade autonomous AI systems — agents that operate reliably under budget constraints, integrate with real development workflows, and improve through sustained human-AI collaboration. We prioritize empirical results from deployed systems over benchmark performance.