2026

CALIBRATION COLLAPSE: WHEN METACOGNITIVE CONFIDENCE PREDICTIONS DEGENERATE TO CONSTANTS IN AUTONOMOUS AI SYSTEMS

Vasyl Golubenko · TOV ZELTREX · April 2026

We report the first empirical documentation of calibration collapse — a failure mode in which a deployed metacognitive planning system degenerates to outputting a constant confidence value across all predictions, eliminating discriminative signal while preserving the structural appearance of self-awareness. In a production system operating over 14 days and 69 task executions, the metacognitive planner produced an identical confidence score of 0.7575 across all 60 recorded predictions, regardless of task category, model assignment, or historical quality outcomes. Root cause analysis revealed an empty calibration boundary table (0 rows despite 219 quality observations) — a metacognitive Potemkin village. We formalize calibration collapse as a failure class distinct from overconfidence, propose variance-based detection metrics, and outline a mandatory calibration bootstrapping protocol.

metacognition · confidence calibration · autonomous agents · failure modes · LLM
7 pages · 15 references
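The variance-based detection the abstract proposes can be sketched in a few lines: if a rolling window of confidence outputs has (near-)zero variance, the planner has collapsed to a constant. The function name, window size, and variance floor below are our illustrative assumptions, not the paper's implementation.

```python
import statistics

def detect_calibration_collapse(confidences, min_n=20, var_floor=1e-4):
    """Flag a metacognitive planner whose confidence outputs have
    degenerated to a (near-)constant value, e.g. 60 identical 0.7575s."""
    if len(confidences) < min_n:
        return False  # too little evidence to call collapse
    return statistics.pvariance(confidences) < var_floor
```

Sixty copies of 0.7575 trip the detector immediately, while any planner that still discriminates between tasks keeps its variance well above the floor.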

SILENT STARVATION: DEPENDENCY DEADLOCK AND PLANNING LOOP FAILURE MODES IN AUTONOMOUS AI TASK SCHEDULING

Vasyl Golubenko · TOV ZELTREX · April 2026

We report the first empirical documentation of silent starvation — a compound failure in which a dependency chain deadlock combines with a decoupled planning layer to produce an agent that runs hundreds of planning cycles while executing zero tasks, without generating any error signals. In a production system, we observed 452 identical task selections over 7 days, producing 0 task executions and 0 errors. Root cause analysis revealed status promotion failure, backlog desynchronization, and planning-execution decoupling. We formalize a taxonomy of four autonomous scheduling failure modes, propose detection heuristics based on selection entropy and execution ratio monitoring, and validate that correcting the root cause immediately restored task flow.

autonomous agents · task scheduling · dependency resolution · failure modes · deadlock detection
7 pages · 16 references
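The two detection heuristics named in the abstract — selection entropy and execution-ratio monitoring — compose naturally into one compound check. A minimal sketch, with thresholds and names that are our assumptions:

```python
import math
from collections import Counter

def selection_entropy(selections):
    """Shannon entropy (bits) of the task-selection history; near zero
    means the planner keeps choosing the same task every cycle."""
    total = len(selections)
    return -sum(
        (c / total) * math.log2(c / total)
        for c in Counter(selections).values()
    )

def is_silently_starving(selections, executed, entropy_floor=0.1):
    """Compound check: degenerate selections AND zero completed executions."""
    if not selections:
        return False
    ratio = executed / len(selections)
    return selection_entropy(selections) < entropy_floor and ratio == 0.0
```

The reported failure — 452 identical selections, 0 executions — yields entropy 0.0 and execution ratio 0.0, exactly the signature this monitor alarms on.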

WHY OUR AUTONOMOUS AI AGENT'S QUALITY DROPPED 31% — AND WHAT THE DATA REVEALED

Vasyl Golubenko · TOV ZELTREX · March 2026

Night Shift is an autonomous AI development system that runs 24/7. Over 31 operational days, we observed quality scores decline 31%, followed by a 4-day outage caused by a safety mechanism feedback loop. Root cause analysis of 287+ task executions revealed six findings, chief among them: output truncation is the primary quality determinant, LLM-as-Judge calibration shift can masquerade as quality decline, and safety mechanisms without escape hatches cause worse outages than the failures they prevent.

autonomous agents · quality assessment · LLM-as-Judge · safety mechanisms · root cause analysis
11 pages · 16 references

FROM SEARCH TO GROWTH: A COMPARATIVE TAXONOMY OF AUTONOMOUS AI EXPERIMENTATION PARADIGMS

Vasyl Golubenko · TOV ZELTREX · March 2026

We present a four-level taxonomy of autonomous AI experimentation paradigms: Hill Climber, Evolutionary Optimizer, Cognitive Architect, and Organizational Evolver. Comparing Karpathy's AutoResearch (greedy single-metric optimization) with GODEGEN/Night Shift (evolutionary multi-objective growth), we analyze 17 architectural dimensions and identify seven transferable design patterns. We demonstrate that constitutional safety constraints, not merely git-based rollback, are the critical differentiator for production deployment of self-modifying agents. Drawing on 287+ autonomous tasks and 46 references, we articulate the philosophical distinction between search (convergence to optima) and growth (expansion of capability frontiers).

Autonomous Experimentation · Evolutionary Growth · Constitutional Safety · AutoResearch · GODEGEN · Taxonomy
4 Paradigm Levels · 17 Dimensions · 7 Design Patterns · 46 References

THE LIVING AGENT: PRODUCTION EVOLUTIONARY SELF-IMPROVEMENT IN AUTONOMOUS AI SYSTEMS

Vasyl Golubenko · TOV ZELTREX · March 2026

We present GODEGEN, the first production AI agent system combining five capabilities previously seen only in isolation: evolutionary optimization, persistent digital identity, constitutional self-modification, autonomous multi-day operation, and agent distillation. Operating on a real codebase for 10+ consecutive days at $0.24/task, the system scores 80/100 when evaluated against 5 leading systems on a 25-dimension matrix. We propose four new temporal benchmarks (ImproveBench, AdaptBench, CostBench, AutonomyBench) and argue that evolutionary improvement at 4,380 steps/year creates a fundamentally different trajectory than static model upgrades.

Self-Improving Agents · Evolutionary Optimization · Digital Personality · Autonomous AI · Agent Distillation · Constitutional AI
5/5 Pillar Score · 25 Eval Dimensions · 280+ Tasks Completed · 32 References

DIGITAL DEVELOPMENTAL PSYCHOLOGY: A COGNITIVE FRAMEWORK FOR SELF-IMPROVING AI AGENTS

Vasyl Golubenko · TOV ZELTREX · March 2026

We propose Digital Developmental Psychology (DDP), a framework mapping nine theories from cognitive and developmental psychology onto the architectural components of a production autonomous agent. Piaget's stages, Vygotsky's ZPD, Bloom's taxonomy, Ebbinghaus's forgetting curve, Berlyne's curiosity theory, Kahneman's dual process theory, and four more are operationalized as eight software modules (2,363 LOC, 125 tests). We derive 21 testable predictions and validate three from 280+ production tasks: quality distributions are non-normal (Piaget), Create-level tasks score lower than Apply-level (Bloom), and skill usage follows a power law (ACT-R).

Developmental Psychology · Cognitive Architecture · Self-Improving Agents · Piaget Stages · ZPD · Intrinsic Motivation
9 Psychology Theories · 21 Predictions · 2,363 Lines of Code · 37 References

CONSTITUTIONAL SELF-MODIFICATION: A 7-LAYER SAFETY FRAMEWORK FOR AUTONOMOUS CODE-GENERATING AGENTS

Vasyl Golubenko · TOV ZELTREX · March 2026

Autonomous AI coding agents that can merge their own code introduce a novel safety challenge. We present a 7-layer defense-in-depth framework enforcing 19 constitutional rules across input validation, path constraints, quality gates, integration guards, merge analysis, deployment rollback, and circuit breakers. The framework was validated on 280+ tasks over 10 days with zero safety violations and a 73% auto-merge success rate. A systematic audit of 23 modules revealed 8 critical "dead integration" gaps: safety features fully coded but never activated. We introduce the Safety Activation Rate (SAR) metric and compare against 6 existing frameworks.

Constitutional AI · Safe Self-Modification · Defense in Depth · CI/CD Safety · RSI Safety · Dead Integration
7 Safety Layers · 19 Constitutional Rules · 0 Safety Violations · 25 References

BENCHMARKING SELF-IMPROVEMENT: TEMPORAL EVALUATION METRICS FOR AUTONOMOUS AI AGENTS

Vasyl Golubenko · TOV ZELTREX · March 2026

Existing benchmarks for AI coding agents measure point-in-time capability. We propose four novel temporal benchmarks — ImproveBench, AdaptBench, CostBench, and AutonomyBench — built on non-parametric statistical foundations (Mann-Kendall trend test, Theil-Sen slope estimator, EWMA smoothing). We validate the suite on 280+ production tasks from a 10-day autonomous deployment, achieving a composite score of 62.1/100 (C+). Our results demonstrate that temporal metrics reveal performance characteristics invisible to point-in-time evaluation: strong domain adaptation combined with stagnant quality improvement — a nuance no existing benchmark captures.

Benchmark · Self-Improving Agents · Temporal Evaluation · Mann-Kendall · Production Deployment · Autonomous AI
4 Novel Benchmarks · 280+ Tasks Validated · 1,168 LOC Implementation · 21 References

THE TEMPORAL DIMENSION GAP: WHY LIVING AGENTS DIVERGE FROM TOOL AGENTS IN AUTONOMOUS SOFTWARE ENGINEERING

Vasyl Golubenko · TOV ZELTREX · March 2026

We identify and analyze a previously unrecognized evaluation gap in AI coding agents: temporal capabilities — properties that emerge and compound over an agent's operational lifetime. Current benchmarks measure point-in-time task performance but ignore cross-session learning, autonomous operation duration, domain adaptation speed, codebase specialization, and constitutional safety for self-modification. We extend our 20-dimension SOTA comparison matrix with 5 temporal dimensions, scoring 10 production systems. GODEGEN scores 24/25 on temporal dimensions versus 13/25 for the best competitor (Devin 2.0). We survey 70+ papers, identify 6 literature gaps, and propose four novel temporal benchmarks (ImproveBench, AdaptBench, CostBench, AutonomyBench) with Mann-Kendall trend testing and Theil-Sen slope estimation. Analysis of GitHub Copilot's Agentic Memory launch (March 4, 2026) reveals it stores codebase facts but not self-improvement data, preserving GODEGEN's structural advantage.

Temporal Evaluation · Self-Improving Agents · Cross-Session Learning · Agent Benchmarks · Living Agents · Evolutionary Optimization
10 Systems Evaluated · 25 Eval Dimensions · 6 Literature Gaps · 70 References

NIGHT SHIFT: A PRODUCTION AUTONOMOUS AI DEVELOPMENT SYSTEM WITH EVOLUTIONARY TASK OPTIMIZATION

Vasyl Golubenko · TOV ZELTREX · March 2026

We present Night Shift, a production-deployed autonomous AI development system that operates continuously on a 2-hour dispatch cycle, generating code, specifications, research reports, and documentation without human intervention. Unlike benchmark-oriented agent systems, Night Shift has been deployed to a production server for 10+ consecutive days, completing 269 tasks at a total cost of $65.98. The system introduces four key innovations: an evolutionary task optimization engine based on genetic algorithms, a continuation-based anti-truncation mechanism achieving a 100% completion rate, cascade model routing across local GPU and cloud API tiers, and a human-AI symbiont loop where daily mentoring reviews shape system behavior. We compare against SWE-agent, Devin, MetaGPT, and AlphaEvolve, demonstrating that production deployment demands fundamentally different design priorities than benchmark performance.

Autonomous Agents · Software Engineering · Genetic Algorithms · LLM Agents · Cost Optimization · Human-AI Collaboration
269 Tasks Completed · $0.25 Avg Cost / Task · 100% Completion Rate · 33 References

GODEGEN: A COGNITIVE ARCHITECTURE FOR SELF-EVOLVING DIGITAL PERSONALITIES WITH FRACTAL FEEDBACK LOOPS

Vasyl Golubenko · TOV ZELTREX · March 2026

We present GODEGEN, a cognitive architecture that transforms autonomous AI development agents from task executors into self-evolving Digital Personalities. Building on Night Shift, we introduce six cognitive modules: a four-persona agent swarm, a SQLite-backed knowledge graph, a four-stage information refiner, a Voyager-inspired skill library, a DP-level epoch evolution manager, and a formal identity document. A systematic production-readiness audit revealed 8 critical wiring gaps — a failure mode we term "dead integration." Gap closure (91 tests) produced a system where all six modules initialized on the first production run. We compare against 12 state-of-the-art systems across 15 capabilities, finding GODEGEN leads in persistent identity evolution and fractal scaling, but lags in formal benchmarks and tool use.

Cognitive Architecture · Gap Analysis · Agent Swarm · Knowledge Graph · Production AI · Fractal Organization
6 Cognitive Modules · 8 Gaps Closed · 91 Tests Added · 25 References

SELF-IMPROVING AI SYSTEMS: MATHEMATICAL FOUNDATIONS AND PRODUCTION VALIDATION

Vasyl Golubenko, Viktor Zhelizko · TOV ZELTREX · March 2026

We formalize the mathematical foundations of self-improving AI systems through 12 equations validated against 323 production tasks over 17 days. We derive the Quality Decay Function (R²=0.91) linking infrastructure failures to output degradation, adapt Rechenberg's 1/5 evolutionary rule for AI configuration optimization, and apply Mann-Kendall trend detection for quality monitoring. Production data demonstrates a 40% cost reduction through multi-provider routing, and shows mentoring adoption following a logistic growth curve with structural change at τ=3 days and behavioral change at τ=7 days.

Self-Improving AI · Mathematical Foundations · Evolutionary Optimization · Quality Monitoring · Production Validation · Rechenberg 1/5
12 Equations · 323 Tasks Validated · 17 Days Production · 30 References

SELF-HEALING ARCHITECTURE FOR AUTONOMOUS AI AGENTS: A SIX-LAYER RESILIENCE FRAMEWORK

Vasyl Golubenko, Viktor Zhelizko · TOV ZELTREX · March 2026

We present a six-layer self-healing architecture for autonomous AI agents, validated in production over 17 days (323 tasks, $84 total cost). The framework combines WAL-mode persistence, exponential retry with jitter, idempotent task execution, watchdog monitoring, circuit breakers, and MAPE-K autonomic control. A fix cascade analysis reveals mean chain length E[L]=3.0 with coupling probability p=0.67. The system recovered from a 3-day outage autonomously, achieving 99.7% availability. We formalize self-healing response time and redundancy cost models, and demonstrate Goodhart's Law effects in autonomous maintenance.

Self-Healing · Resilience Framework · MAPE-K · Crash Recovery · Fix Cascade · Autonomic Computing
6 Resilience Layers · 99.7% Availability · 10 Equations · 27 References

WHEN YOUR AI AGENT WRITES JAVA FOR A PYTHON PROJECT: OUTPUT-TYPE-AWARE MODEL ROUTING

Vasyl Golubenko · TOV ZELTREX · March 2026

We document a catastrophic failure mode in multi-model autonomous development systems where a local 32B-parameter LLM consistently hallucinated wrong programming languages for code tasks — producing Java Spring Boot and TypeScript for a Python-only codebase. We introduce output-type-aware routing that routes code tasks to API-backed models while keeping prose on the local GPU, preserving 99.5% of cost savings while eliminating the failure mode. We formalize task-type sensitivity and propose a 2x2 routing matrix generalizable to any multi-model agent architecture.

Model Routing · Language Hallucination · Multi-Model Systems · Cost Optimization
0% Code Merge Rate (Before) · 98.8% Cost Reduction · 11 References
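The abstract does not spell out the axes of its 2x2 routing matrix; one plausible instantiation (our assumption) crosses output type with the local model's reliability for that type, sending code to API-backed models while keeping prose local.

```python
# Hypothetical 2x2 output-type-aware routing matrix:
# (output type, local model reliable for that type) -> backend.
ROUTING_MATRIX = {
    ("code", False): "cloud-api",   # local model hallucinates languages
    ("code", True): "local-gpu",
    ("prose", False): "cloud-api",
    ("prose", True): "local-gpu",
}

def route(output_type, local_reliable):
    """Pick an execution backend from the matrix above."""
    return ROUTING_MATRIX[(output_type, local_reliable)]
```

Because prose tasks dominate volume, routing only code to the cloud preserves nearly all of the local-GPU cost savings while eliminating the language-hallucination failure mode.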

WHEN EVOLUTION CAN'T EVOLVE: DISCONNECTED GENOMES IN AUTONOMOUS AI SYSTEMS

Vasyl Golubenko · TOV ZELTREX · March 2026

We report a failure mode where 477 genome mutations across 28 generations produced zero fitness improvement in an evolutionary optimization engine for autonomous AI development. Root cause: a 5-link causal disconnection chain between genotype (evolved parameters) and phenotype (task outcomes). We formalize Disconnected Evolution as Goodhart's Law applied to evolutionary optimization, survey related phenomena across evolutionary computation, NAS, meta-learning, and prompt optimization (25+ papers), and propose detection methods and architectural fixes.

Evolutionary Optimization · Goodhart's Law · Genotype-Phenotype · Disconnected Evolution · Autonomous Agents
0/477 Mutations Improved · 5 Causal Break Links · 23 References

ABOUT OUR RESEARCH

ZELTREX research focuses on production-grade autonomous AI systems — agents that operate reliably under budget constraints, integrate with real development workflows, and improve through sustained human-AI collaboration. We prioritize empirical results from deployed systems over benchmark performance.