2026
Vasyl Golubenko
TOV ZELTREX
April 2026
We report the first empirical documentation of calibration collapse — a failure mode in which a deployed metacognitive planning system degenerates to outputting a constant confidence value across all predictions, eliminating discriminative signal while preserving the structural appearance of self-awareness. In a production system operating for 14 days across 69 task executions, the metacognitive planner produced an identical confidence score of 0.7575 on all 60 recorded predictions, regardless of task category, model assignment, or historical quality outcomes. Root cause analysis revealed an empty calibration boundary table (0 rows despite 219 quality observations) — a metacognitive Potemkin village. We formalize calibration collapse as a failure class distinct from overconfidence, propose variance-based detection metrics, and outline a mandatory calibration bootstrapping protocol.
metacognition
confidence calibration
autonomous agents
failure modes
LLM
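The variance-based detection metric the abstract proposes fits in a few lines. A minimal sketch: flag a confidence stream whose spread has collapsed to a near-constant value. The thresholds `min_std` and `min_unique` are illustrative assumptions, not values from the deployed system.

```python
import statistics

def detect_calibration_collapse(confidences, min_std=1e-3, min_unique=2):
    """Flag a confidence stream whose variance has collapsed.

    A healthy calibrator spreads confidence across tasks; a collapsed
    one emits a near-constant value (e.g. 0.7575 on every prediction).
    """
    if len(confidences) < 2:
        return False  # not enough evidence either way
    unique = len(set(round(c, 4) for c in confidences))
    return unique < min_unique or statistics.pstdev(confidences) < min_std

# A collapsed stream: the constant 0.7575 observed in production.
collapsed = [0.7575] * 60
# A healthy stream: confidence varies with the task.
healthy = [0.55, 0.81, 0.72, 0.64, 0.90, 0.78]

print(detect_calibration_collapse(collapsed))  # True
print(detect_calibration_collapse(healthy))    # False
```

Because the detector looks only at dispersion, it catches the Potemkin-village case even when the constant value itself looks plausible.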
Vasyl Golubenko
TOV ZELTREX
April 2026
We report the first empirical documentation of silent starvation — a compound failure in which a dependency chain deadlock combines with a decoupled planning layer to produce an agent that runs hundreds of planning cycles while executing zero tasks, without generating any error signals. In a production system, we observed 452 identical task selections over 7 days, producing 0 task executions and 0 errors. Root cause analysis revealed status promotion failure, backlog desynchronization, and planning-execution decoupling. We formalize a taxonomy of four autonomous scheduling failure modes, propose detection heuristics based on selection entropy and execution ratio monitoring, and validate that correcting the root cause immediately restored task flow.
autonomous agents
task scheduling
dependency resolution
failure modes
deadlock detection
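The two detection heuristics the abstract names (selection entropy and execution-ratio monitoring) combine into a compact check; the alert thresholds below are illustrative assumptions.

```python
import math
from collections import Counter

def selection_entropy(selections):
    """Shannon entropy (bits) of the planner's task choices.
    A deadlocked planner repeats one selection, driving entropy to 0."""
    counts = Counter(selections)
    n = len(selections)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def starvation_alert(selections, executions, entropy_floor=0.5, ratio_floor=0.1):
    """Silent-starvation heuristic: low selection entropy AND a
    near-zero executions-per-selection ratio. No error signal needed."""
    ratio = executions / max(len(selections), 1)
    return selection_entropy(selections) < entropy_floor and ratio < ratio_floor

# The observed failure: 452 identical selections, 0 executions, 0 errors.
print(starvation_alert(["task-17"] * 452, executions=0))          # True
print(starvation_alert(["a", "b", "c", "d"] * 10, executions=35))  # False
```

The key property is that neither signal depends on errors being raised, which is exactly what makes silent starvation invisible to conventional monitoring.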
Vasyl Golubenko
TOV ZELTREX
March 2026
Night Shift is an autonomous AI development system that runs 24/7. Over 31 operational days, we observed quality scores decline 31%, followed by a 4-day outage caused by a safety-mechanism feedback loop. Root cause analysis of 287+ task executions yielded six findings, among them: output truncation is the primary quality determinant; LLM-as-Judge calibration shift can masquerade as quality decline; and safety mechanisms without escape hatches cause worse outages than the failures they prevent.
autonomous agents
quality assessment
LLM-as-Judge
safety mechanisms
root cause analysis
PDF · 11 pages · 16 references
Vasyl Golubenko
TOV ZELTREX
March 2026
We present a four-level taxonomy of autonomous AI experimentation paradigms: Hill Climber, Evolutionary Optimizer, Cognitive Architect, and Organizational Evolver. Comparing Karpathy's AutoResearch (greedy single-metric optimization) with GODEGEN/Night Shift (evolutionary multi-objective growth), we analyze 17 architectural dimensions and identify seven transferable design patterns. We demonstrate that constitutional safety constraints, not merely git-based rollback, are the critical differentiator for production deployment of self-modifying agents. Drawing on 287+ autonomous tasks and 46 references, we articulate the philosophical distinction between search (convergence to optima) and growth (expansion of capability frontiers).
Autonomous Experimentation
Evolutionary Growth
Constitutional Safety
AutoResearch
GODEGEN
Taxonomy
Vasyl Golubenko
TOV ZELTREX
March 2026
We present GODEGEN, the first production AI agent system combining five capabilities previously seen only in isolation: evolutionary optimization, persistent digital identity, constitutional self-modification, autonomous multi-day operation, and agent distillation. Operating on a real codebase for 10+ consecutive days at $0.24/task, we score against 5 leading systems on a 25-dimension evaluation matrix, achieving 80/100. We propose four new temporal benchmarks (ImproveBench, AdaptBench, CostBench, AutonomyBench) and argue that evolutionary improvement at 4,380 steps/year creates a fundamentally different trajectory than static model upgrades.
Self-Improving Agents
Evolutionary Optimization
Digital Personality
Autonomous AI
Agent Distillation
Constitutional AI
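The 4,380 steps/year figure is consistent with the 2-hour dispatch cycle described in the Night Shift system paper below, assuming one evolution step per dispatch cycle (an assumption about how the two figures relate). A quick arithmetic check:

```python
# Sanity check on the 4,380 evolution-steps/year figure, assuming one
# evolution step per 2-hour dispatch cycle (illustrative assumption).
HOURS_PER_CYCLE = 2
steps_per_day = 24 // HOURS_PER_CYCLE   # 12 steps per day
steps_per_year = steps_per_day * 365    # 12 * 365 = 4380
print(steps_per_year)  # 4380
```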
Vasyl Golubenko
TOV ZELTREX
March 2026
We propose Digital Developmental Psychology (DDP), a framework mapping nine theories from cognitive and developmental psychology onto the architectural components of a production autonomous agent. Piaget's stages, Vygotsky's ZPD, Bloom's taxonomy, Ebbinghaus's forgetting curve, Berlyne's curiosity theory, Kahneman's dual process theory, and three more are operationalized as eight software modules (2,363 LOC, 125 tests). We derive 21 testable predictions and validate three from 280+ production tasks: quality distributions are non-normal (Piaget), Create-level tasks score lower than Apply-level (Bloom), and skill usage follows a power law (ACT-R).
Developmental Psychology
Cognitive Architecture
Self-Improving Agents
Piaget Stages
ZPD
Intrinsic Motivation
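The power-law prediction (ACT-R) can be checked with a log-log rank/frequency fit: plot log(usage count) against log(rank) and look for a roughly linear, negative-slope relationship. A minimal sketch on synthetic Zipf-like data (the real skill-usage counts are not reproduced here):

```python
import math

def loglog_slope(usage_counts):
    """Least-squares slope of log(count) vs log(rank). A negative,
    roughly linear fit is the power-law (Zipf-style) signature the
    ACT-R prediction expects for skill usage."""
    counts = sorted(usage_counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic Zipf-like usage (count ~ 1000 / rank); slope is roughly -1.
usage = [1000 // r for r in range(1, 21)]
print(round(loglog_slope(usage), 2))
```

A slope near -1 with a tight linear fit supports the prediction; a flat or curved fit would falsify it.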
Vasyl Golubenko
TOV ZELTREX
March 2026
Autonomous AI coding agents that can merge their own code introduce a novel safety challenge. We present a 7-layer defense-in-depth framework enforcing 19 constitutional rules across input validation, path constraints, quality gates, integration guards, merge analysis, deployment rollback, and circuit breakers. The framework was validated on 280+ tasks over 10 days with zero safety violations and a 73% auto-merge success rate. A systematic audit of 23 modules revealed 8 critical "dead integration" gaps: safety features fully coded but never activated. We introduce the Safety Activation Rate (SAR) metric and compare against 6 existing frameworks.
Constitutional AI
Safe Self-Modification
Defense in Depth
CI/CD Safety
RSI Safety
Dead Integration
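The abstract introduces the Safety Activation Rate but does not spell out its formula. A plausible minimal reading (my assumption, not the paper's definition) is the fraction of coded safety features actually wired into the execution path:

```python
def safety_activation_rate(features):
    """Safety Activation Rate: fraction of coded safety features that
    are actually activated at runtime. 'Dead integration' features are
    coded but never wired in, so they drag SAR below 1.0."""
    coded = [f for f in features if f["coded"]]
    if not coded:
        return 0.0
    activated = [f for f in coded if f["activated"]]
    return len(activated) / len(coded)

# Illustrative audit shape: 23 modules, 8 with dead-integration gaps.
audit = [{"coded": True, "activated": i >= 8} for i in range(23)]
print(round(safety_activation_rate(audit), 3))  # 15/23 ≈ 0.652
```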
Vasyl Golubenko
TOV ZELTREX
March 2026
Existing benchmarks for AI coding agents measure point-in-time capability. We propose four novel temporal benchmarks — ImproveBench, AdaptBench, CostBench, and AutonomyBench — built on non-parametric statistical foundations (Mann-Kendall trend test, Theil-Sen slope estimator, EWMA smoothing). We validate them on 280+ production tasks from a 10-day autonomous deployment, where the system achieves a composite score of 62.1/100 (C+). Our results demonstrate that temporal metrics reveal performance characteristics invisible to point-in-time evaluation: strong domain adaptation combined with stagnant quality improvement — a nuance no existing benchmark captures.
Benchmark
Self-Improving Agents
Temporal Evaluation
Mann-Kendall
Production Deployment
Autonomous AI
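Both non-parametric foundations named here have short pure-Python forms. This sketch omits the tie-corrected variance and z-score a full Mann-Kendall significance test includes:

```python
from itertools import combinations
from statistics import median

def mann_kendall_s(xs):
    """Mann-Kendall S statistic: concordant-minus-discordant pair count.
    S > 0 suggests an upward trend, S < 0 a downward one; the full test
    also needs a tie-corrected variance to produce a z-score."""
    return sum((b > a) - (b < a) for a, b in combinations(xs, 2))

def theil_sen_slope(xs):
    """Theil-Sen estimator: median pairwise slope, robust to outliers
    (unlike ordinary least squares)."""
    slopes = [(xs[j] - xs[i]) / (j - i)
              for i, j in combinations(range(len(xs)), 2)]
    return median(slopes)

# Quality scores trending up despite one outlier dip.
scores = [6.0, 6.3, 6.1, 6.8, 4.0, 7.1, 7.4]
print(mann_kendall_s(scores))            # 11 (net upward)
print(round(theil_sen_slope(scores), 3))
```

The outlier at 4.0 barely moves either statistic, which is exactly why these estimators suit noisy production quality series.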
Vasyl Golubenko
TOV ZELTREX
March 2026
We identify and analyze a previously unrecognized evaluation gap in AI coding agents: temporal capabilities — properties that emerge and compound over an agent's operational lifetime. Current benchmarks measure point-in-time task performance but ignore cross-session learning, autonomous operation duration, domain adaptation speed, codebase specialization, and constitutional safety for self-modification. We extend our 20-dimension SOTA comparison matrix with 5 temporal dimensions, scoring 10 production systems. GODEGEN scores 24/25 on temporal dimensions versus 13/25 for the best competitor (Devin 2.0). We survey 70+ papers, identify 6 literature gaps, and propose four novel temporal benchmarks (ImproveBench, AdaptBench, CostBench, AutonomyBench) with Mann-Kendall trend testing and Theil-Sen slope estimation. Analysis of GitHub Copilot's Agentic Memory launch (March 4, 2026) reveals it stores codebase facts but not self-improvement data, preserving GODEGEN's structural advantage.
Temporal Evaluation
Self-Improving Agents
Cross-Session Learning
Agent Benchmarks
Living Agents
Evolutionary Optimization
Vasyl Golubenko
TOV ZELTREX
March 2026
We present Night Shift, a production-deployed autonomous AI development system that operates continuously on a 2-hour dispatch cycle, generating code, specifications, research reports, and documentation without human intervention. Unlike benchmark-oriented agent systems, Night Shift has been deployed to a production server for 10+ consecutive days, completing 269 tasks at a total cost of $65.98 USD. The system introduces four key innovations: an evolutionary task optimization engine based on genetic algorithms, a continuation-based anti-truncation mechanism achieving 100% completion rate, cascade model routing across local GPU and cloud API tiers, and a human-AI symbiont loop where daily mentoring reviews shape system behavior. We compare against SWE-agent, Devin, MetaGPT, and AlphaEvolve, demonstrating that production deployment demands fundamentally different design priorities than benchmark performance.
Autonomous Agents
Software Engineering
Genetic Algorithms
LLM Agents
Cost Optimization
Human-AI Collaboration
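Cascade model routing's core loop is small: try the cheapest tier first and escalate on decline or failure. The tier definitions, decline predicate, and token budget below are illustrative assumptions, not the deployed configuration.

```python
def cascade_route(task, tiers):
    """Try tiers cheapest-first; escalate when a tier returns None
    (declined the task or failed its quality gate)."""
    for tier in tiers:
        result = tier["run"](task)
        if result is not None:
            return tier["name"], result
    raise RuntimeError("all tiers exhausted")

tiers = [
    # Local GPU model: free, but declines tasks above its size budget.
    {"name": "local-gpu", "run": lambda t: None if t["tokens"] > 2000 else "ok-local"},
    # Cloud API: paid fallback that accepts anything.
    {"name": "cloud-api", "run": lambda t: "ok-cloud"},
]

print(cascade_route({"tokens": 500}, tiers))   # ('local-gpu', 'ok-local')
print(cascade_route({"tokens": 8000}, tiers))  # ('cloud-api', 'ok-cloud')
```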
Vasyl Golubenko
TOV ZELTREX
March 2026
We present GODEGEN, a cognitive architecture that transforms autonomous AI development agents from task executors into self-evolving Digital Personalities. Building on Night Shift, we introduce six cognitive modules: a four-persona agent swarm, a SQLite-backed knowledge graph, a four-stage information refiner, a Voyager-inspired skill library, a DP-level epoch evolution manager, and a formal identity document. A systematic production-readiness audit revealed 8 critical wiring gaps — a failure mode we term "dead integration." Gap closure (91 tests) produced a system where all six modules initialized on the first production run. We compare against 12 state-of-the-art systems across 15 capabilities, finding GODEGEN leads in persistent identity evolution and fractal scaling, but lags in formal benchmarks and tool use.
Cognitive Architecture
Gap Analysis
Agent Swarm
Knowledge Graph
Production AI
Fractal Organization
Vasyl Golubenko, Viktor Zhelizko
TOV ZELTREX
March 2026
We formalize the mathematical foundations of self-improving AI systems through 12 equations validated against 323 production tasks over 17 days. We derive the Quality Decay Function (R²=0.91) linking infrastructure failures to output degradation, adapt Rechenberg's 1/5 evolutionary rule for AI configuration optimization, and apply Mann-Kendall trend detection for quality monitoring. Production data demonstrates a 40% cost reduction through multi-provider routing, and shows mentoring adoption following logistic growth, with structural changes at τ=3 days and behavioral changes at τ=7 days.
Self-Improving AI
Mathematical Foundations
Evolutionary Optimization
Quality Monitoring
Production Validation
Rechenberg 1/5
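Rechenberg's 1/5 success rule is simple enough to state in code: widen the mutation step when more than one fifth of recent mutations improved fitness, shrink it when fewer did. The damping constant c = 0.85 is the textbook value; its use here is a sketch, not the paper's exact adaptation.

```python
def rechenberg_step(sigma, successes, trials, c=0.85):
    """Rechenberg's 1/5 success rule for mutation step size:
    success rate > 1/5 -> widen sigma (explore more),
    success rate < 1/5 -> shrink sigma (exploit more),
    exactly 1/5        -> hold."""
    rate = successes / trials
    if rate > 0.2:
        return sigma / c  # widen
    if rate < 0.2:
        return sigma * c  # shrink
    return sigma

sigma = 1.0
print(round(rechenberg_step(sigma, successes=1, trials=10), 3))  # 0.85
print(round(rechenberg_step(sigma, successes=4, trials=10), 3))  # 1.176
```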
Vasyl Golubenko, Viktor Zhelizko
TOV ZELTREX
March 2026
We present a six-layer self-healing architecture for autonomous AI agents, validated in production over 17 days (323 tasks, $84 total cost). The framework combines WAL-mode persistence, exponential retry with jitter, idempotent task execution, watchdog monitoring, circuit breakers, and MAPE-K autonomic control. A fix cascade analysis reveals mean chain length E[L]=3.0 with coupling probability p=0.67. The system recovered from a 3-day outage autonomously, achieving 99.7% availability. We formalize self-healing response time, redundancy cost models, and demonstrate Goodhart's Law effects in autonomous maintenance.
Self-Healing
Resilience Framework
MAPE-K
Crash Recovery
Fix Cascade
Autonomic Computing
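One of the six layers, exponential retry with jitter, can be sketched directly; the base delay and cap below are illustrative assumptions. This is the "full jitter" variant, where each delay is drawn uniformly under an exponentially growing envelope:

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff: delay_k ~ U(0, min(cap, base * 2**k)).
    Jitter decorrelates retries so recovering workers do not stampede a
    shared resource (e.g. a WAL-mode SQLite database)."""
    return [rng() * min(cap, base * (2 ** k)) for k in range(attempts)]

# With rng pinned to 1.0 we see the deterministic upper envelope.
print(backoff_delays(5, rng=lambda: 1.0))  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

Pairing this with idempotent task execution (another of the six layers) is what makes blind retries safe: replaying a task after a crash cannot corrupt state.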
Vasyl Golubenko
TOV ZELTREX
March 2026
We document a catastrophic failure mode in multi-model autonomous development systems where a local 32B-parameter LLM consistently hallucinated wrong programming languages for code tasks — producing Java Spring Boot and TypeScript for a Python-only codebase. We introduce output-type-aware routing that routes code tasks to API-backed models while keeping prose on the local GPU, preserving 99.5% of cost savings while eliminating the failure mode. We formalize task-type sensitivity and propose a 2x2 routing matrix generalizable to any multi-model agent architecture.
Model Routing
Language Hallucination
Multi-Model Systems
Cost Optimization
Code Merge Rate (Before): 0%
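The proposed 2x2 routing matrix can be sketched as a lookup over output type crossed with model placement; the axis labels and model names here are illustrative assumptions. The key idea from the abstract is invariant: code never runs on the local model.

```python
# Sketch of a 2x2 routing matrix: output type x model placement.
# Cell values and model names are illustrative assumptions.
ALLOWED = {
    ("code",  "local-gpu"): False,  # local 32B model hallucinated languages
    ("code",  "api"):       True,
    ("prose", "local-gpu"): True,   # prose is safe locally, preserving savings
    ("prose", "api"):       True,
}

def route(output_type, models=("local-gpu", "api")):
    """Pick the first (cheapest) model permitted for this output type."""
    for m in models:
        if ALLOWED[(output_type, m)]:
            return m
    raise ValueError(f"no model permitted for {output_type!r}")

print(route("code"))   # api
print(route("prose"))  # local-gpu
```

Because prose tasks dominate the workload, keeping them on the local GPU is what preserves most of the cost savings while the matrix eliminates the failure mode.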
Vasyl Golubenko
TOV ZELTREX
March 2026
We report a failure mode where 477 genome mutations across 28 generations produced zero fitness improvement in an evolutionary optimization engine for autonomous AI development. Root cause: a 5-link causal disconnection chain between genotype (evolved parameters) and phenotype (task outcomes). We formalize Disconnected Evolution as Goodhart's Law applied to evolutionary optimization, survey related phenomena across evolutionary computation, NAS, meta-learning, and prompt optimization (25+ papers), and propose detection methods and architectural fixes.
Evolutionary Optimization
Goodhart's Law
Genotype-Phenotype
Disconnected Evolution
Autonomous Agents
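A cheap detector for disconnected evolution is to ask whether any evolved parameter co-varies with fitness at all. A Kendall-style rank correlation (tau-a, sketched here without tie correction) near zero for every genome parameter is the warning sign:

```python
from itertools import combinations

def tau_a(xs, ys):
    """Kendall tau-a: concordant-minus-discordant pair fraction.
    Near zero across every evolved parameter vs. fitness suggests
    the genotype never reaches the phenotype."""
    pairs = list(combinations(range(len(xs)), 2))
    s = sum(((xs[j] > xs[i]) - (xs[j] < xs[i])) *
            ((ys[j] > ys[i]) - (ys[j] < ys[i])) for i, j in pairs)
    return s / len(pairs)

# Disconnected: fitness is flat no matter what the genome does.
param   = [0.1, 0.9, 0.4, 0.7, 0.2, 0.6]
fitness = [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
print(tau_a(param, fitness))  # 0.0
```

Run per-parameter over a sliding window of generations; 477 mutations with every tau pinned at zero is exactly the 28-generation flatline the abstract describes.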
ABOUT OUR RESEARCH
ZELTREX research focuses on production-grade autonomous AI systems — agents that operate reliably under budget constraints, integrate with real development workflows, and improve through sustained human-AI collaboration. We prioritize empirical results from deployed systems over benchmark performance.