How Night Shift Runs 300+ Tasks Autonomously — Architecture Deep Dive
Night Shift is an autonomous AI development system that executes 62 tasks per day with constitutional merge gates, adaptive task selection, and evolutionary optimization. Over 25 days of continuous operation, it has completed 443 tasks across 23 modules at a total cost of $28. This is the architecture that makes it work.
Most autonomous AI demos run for a few minutes and produce a single output. Night Shift runs 24/7, selecting its own tasks from a backlog, routing them to the right model, assessing quality, and deciding whether to merge results into the codebase — all without human intervention. The system has produced 1,721 tests, contributed to 23 modules, and maintained an average quality score above 7.0/10 after calibration fixes.
1. Architecture Overview
Night Shift operates as a continuous dispatch loop running on a Hetzner dedicated server. The architecture has four layers:
- Dispatcher: The core loop that runs every hour, selects the highest-priority task, routes it to a model, and processes the output
- Budget Governor: Enforces weekly token limits (3.5M tokens/week) and per-category spending caps (30% max per category)
- Quality Assessor: Hybrid scorer combining heuristic analysis (40%) with LLM-as-Judge evaluation (60%) using calibration anchors
- Evolution Engine: Tracks genome mutations, fitness scores, and adaptive task allocation across 7 categories
Why Hourly Cycles?
Early versions used 4-hour dispatch cycles, resulting in only 6 tasks/day and 0.3% GPU utilization. Switching to hourly dispatch increased throughput to 62 tasks/day and GPU utilization to 11.3%. The key constraint is not compute — it’s API rate limits and budget. Hourly cycles hit the sweet spot between throughput and cost control.
The Dispatch Loop
Each hourly cycle follows a fixed sequence:
1. CHECK budget → Abort if weekly quota exhausted
2. SELECT task → Priority score = urgency × fitness × category_weight
3. ROUTE model → Match task complexity to model capability
4. EXECUTE → Build prompt, call API, collect output
5. ASSESS quality → Heuristic (40%) + LLM-as-Judge (60%)
6. GATE merge → Constitutional check: reject if score < 4.0
7. STORE results → Write output, update backlog, log metrics
8. EVOLVE → Update genome fitness, trigger mutations
The constitutional merge gate at step 6 is non-negotiable: any task scoring below 4.0/10 is automatically rejected and returned to the backlog with a “needs_improvement” flag. This prevents low-quality outputs from accumulating in the codebase.
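The eight steps above can be sketched as a single function. This is an illustrative reduction, not Night Shift's actual API: the task dict shape, the stubbed execute step, and the tuple return are assumptions for the sake of a runnable example.

```python
def run_cycle(backlog, tokens_remaining, score_fn, threshold=4.0):
    """One hourly dispatch cycle (illustrative sketch, not the real module)."""
    if tokens_remaining <= 0:
        return ("aborted", None)                      # 1. CHECK budget
    task = max(backlog, key=lambda t: t["priority"])  # 2. SELECT highest-priority task
    backlog.remove(task)
    output = f"output for {task['title']}"            # 3-4. ROUTE + EXECUTE (stubbed)
    score = score_fn(output)                          # 5. ASSESS quality
    if score < threshold:                             # 6. GATE: constitutional check
        task["flag"] = "needs_improvement"
        backlog.append(task)                          # rejected work returns to backlog
        return ("rejected", task)
    return ("merged", task)                           # 7-8. STORE + EVOLVE (omitted)
```

The key property the sketch preserves is that rejection is not terminal: a gated task goes back into the backlog with its flag set, so a later cycle can retry it.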
2. Task Lifecycle
Every task in Night Shift follows a defined lifecycle from creation to integration:
| Stage | Description | Duration |
|---|---|---|
| Backlog | Task defined with title, category, priority, output type | Indefinite |
| Selected | Dispatcher picks task based on priority × fitness score | <1 sec |
| Prompted | Context engine builds prompt with codebase map, reflections, guidance | 2–5 sec |
| Executing | API call to Claude, Gemini, or open-source model | 30–180 sec |
| Assessed | Quality scorer evaluates output (heuristic + LLM judge) | 5–15 sec |
| Merged / Rejected | Score ≥4.0 merged to codebase; <4.0 returned to backlog | <1 sec |
Task Selection Algorithm
Task selection is not random. The dispatcher computes a composite priority score:
```
priority_score = base_priority
    × category_weight(adaptive_tree)
    × staleness_bonus(days_since_created)
    × dependency_readiness(parent_tasks_complete)
    × (1.0 - category_saturation)
```
The category_weight comes from the adaptive tree (see Section 3), which dynamically adjusts allocation across 7 categories based on recent performance. The staleness_bonus prevents tasks from sitting in the backlog forever — older tasks get a linearly increasing priority boost. The category_saturation penalty ensures no single category consumes more than 30% of weekly budget.
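A minimal sketch of that composite score, assuming a linear staleness boost at the genome's `staleness_bonus_rate` (0.05/day) and a binary dependency gate; the parameter names are illustrative, not the dispatcher's real signature:

```python
def priority_score(base_priority, category_weight, days_stale,
                   deps_ready, category_spend_pct,
                   staleness_bonus_rate=0.05):
    """Composite priority: urgency x fitness x category weight (sketch)."""
    staleness_bonus = 1.0 + staleness_bonus_rate * days_stale  # linear anti-starvation boost
    dependency_readiness = 1.0 if deps_ready else 0.0          # blocked tasks score zero
    saturation_penalty = 1.0 - category_spend_pct              # pressure toward the 30% cap
    return (base_priority * category_weight * staleness_bonus
            * dependency_readiness * saturation_penalty)
```

Because the factors multiply rather than add, any single zero (an unmet dependency, a fully saturated category) removes the task from contention outright.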
Model Routing
Not every task needs the most capable model. Night Shift routes tasks based on complexity and output type:
| Model | Used For | Token Budget | Avg Quality |
|---|---|---|---|
| Claude Opus | Architecture, complex research, multi-file code | 80K context | 7.8 |
| Claude Sonnet | Standard code, reports, documentation | 60K context | 7.2 |
| Claude Haiku | Quality assessment (LLM-as-Judge), simple tasks | 30K context | 6.9 |
| Gemini 2.0 Flash | Research, content generation, analysis | 50K context | 6.5 |
Model routing uses a simple heuristic: tasks tagged as architecture or complex_research go to Opus. Tasks tagged write-tests or documentation go to Sonnet. Quality assessment always uses Haiku (cheaper, consistent enough for scoring). Gemini handles research tasks with its higher rate limits (1,250 requests/day across 5 API keys with round-robin).
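That heuristic fits in a few lines. The model labels and the extra `analysis` tag below are assumptions for illustration; the real router presumably maps to provider-specific model IDs:

```python
from itertools import cycle

def route_model(tags):
    """Tag-based routing heuristic: match task complexity to model capability."""
    tags = set(tags)
    if tags & {"architecture", "complex_research"}:
        return "opus"                 # most capable, most expensive
    if tags & {"write-tests", "documentation"}:
        return "sonnet"
    if tags & {"research", "content", "analysis"}:
        return "gemini-2.0-flash"     # benefits from higher rate limits
    return "sonnet"                   # default: standard code, reports

# Round-robin over the 5 Gemini API keys (429 failover omitted; key names hypothetical).
gemini_keys = cycle([f"GEMINI_KEY_{i}" for i in range(1, 6)])
```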
3. The Evolution Engine
Night Shift doesn’t just execute tasks — it evolves its own behavior over time. The evolution engine tracks a “genome” of operational parameters and mutates them based on fitness signals.
The Adaptive Tree
The adaptive tree is a 1,032-line module (evolution/adaptive_tree.py) that manages task category allocation dynamically. It operates on two feedback loops:
- Fast loop (hourly): Adjusts category weights based on the last 24 hours of quality scores. If `tech_research` tasks are scoring 8.5/10 while `content` tasks score 5.2/10, the tree shifts allocation toward tech_research.
- Medium loop (daily): Evaluates category-level fitness trends over 7 days. Categories with declining quality get reduced allocation; categories with improving quality get boosted.
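The fast loop can be sketched as a multiplicative nudge toward above-average categories followed by renormalization. The learning rate, the weight floor, and the update rule itself are assumptions; the real `adaptive_tree.py` logic is more involved:

```python
def adjust_weights(weights, recent_scores, learning_rate=0.1):
    """Fast-loop sketch: nudge category weights toward recent quality.

    weights:       {category: allocation weight, summing to 1.0}
    recent_scores: {category: mean quality over the last 24h, 0-10 scale}
    """
    mean = sum(recent_scores.values()) / len(recent_scores)
    nudged = {
        cat: max(0.01, w * (1 + learning_rate * (recent_scores[cat] - mean) / 10))
        for cat, w in weights.items()   # floor keeps no category at exactly zero
    }
    total = sum(nudged.values())
    return {cat: w / total for cat, w in nudged.items()}  # renormalize to 1.0
```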
The current allocation across 7 categories:
| Category | Allocation | Purpose |
|---|---|---|
| revenue | 20% | Client-facing features, sales materials |
| capability | 20% | Platform features, new modules |
| tech_research | 15% | Technical exploration, architecture |
| biz_research | 15% | Market analysis, competitor intelligence |
| self_improvement | 15% | Night Shift’s own code, tests, docs |
| content | 10% | Blog posts, documentation, reports |
| ops | 5% | CI/CD, infrastructure, monitoring |
Genome Mutations
The evolution engine tracks 12 operational parameters as a “genome” and applies mutations when fitness drops below threshold:
```python
genome = {
    "output_token_budget": 8000,     # Mutated: was 4000
    "context_budget_opus": 80000,    # Mutated: was 120000
    "max_followup_depth": 1,         # Mutated: was 2
    "category_cap_pct": 0.30,        # Mutated: was 0.50
    "assessor_llm_weight": 0.60,     # Mutated: was 0.40
    "staleness_bonus_rate": 0.05,    # Stable
    "continuation_enabled": False,   # Mutated: was True
    ...
}
```
Mutations are not random — they follow directed signals from evolution/branch_signals.py (323 LOC). When truncation rate exceeds 25%, the output token budget mutates upward. When context bloat correlates with quality drops, the context budget mutates downward. Each mutation is logged with rationale, enabling post-hoc analysis of what worked.
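A directed mutation pass might look like the following. The thresholds, the doubling/shrinking factors, and the metric names are illustrative assumptions, not lifted from `branch_signals.py`; only the two trigger conditions (truncation rate above 25%, context bloat correlating with quality drops) come from the article:

```python
def maybe_mutate(genome, metrics, log):
    """Directed (non-random) mutation sketch driven by branch signals."""
    if metrics["truncation_rate"] > 0.25:
        old = genome["output_token_budget"]
        genome["output_token_budget"] = old * 2           # mutate upward
        log.append(("output_token_budget", old, old * 2,
                    "truncation rate above 25%"))
    if metrics.get("context_bloat_quality_corr", 0.0) < -0.3:
        old = genome["context_budget_opus"]
        genome["context_budget_opus"] = int(old * 0.75)   # mutate downward
        log.append(("context_budget_opus", old, genome["context_budget_opus"],
                    "context bloat correlates with quality drops"))
    return genome    # every mutation lands in `log` with its rationale
```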
Evolution in Practice
Over 25 days, the genome has undergone 14 mutations. The most impactful was disabling continuations (continuation_enabled: true → false), which improved average quality from 5.1 to 7.4 for previously-continued tasks. The evolution engine discovered this fix autonomously — the branch signal detected that tasks with continuations scored 2.3 points lower than tasks without, and triggered the mutation.
4. Quality Gates
Quality assessment is the most critical component. Without reliable scoring, the system can’t distinguish good output from noise.
The Hybrid Scorer
Night Shift uses a two-stage scoring system:
- Heuristic scorer (40% weight): Evaluates structural properties — output length, presence of code blocks, heading structure, completeness markers, test coverage mentions. Fast (~100ms) but susceptible to Goodhart’s Law.
- LLM-as-Judge (60% weight): Claude Haiku evaluates the output against the original task prompt, using calibration anchors (`anchors.yaml`) that provide score=9 and score=3 reference examples for each output type (code, report, research).
The calibration anchors were critical. Without them, the LLM judge drifted toward whatever scoring baseline it internalized during training. With anchors, scoring consistency improved from 65% to approximately 77% agreement with human grades.
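The blend itself is a one-liner; the default weight matches the genome's `assessor_llm_weight` after its mutation from 0.40 to 0.60:

```python
def hybrid_score(heuristic, judge, llm_weight=0.60):
    """Final quality score: weighted blend of heuristic and LLM-as-Judge."""
    return (1 - llm_weight) * heuristic + llm_weight * judge
```

Shifting weight toward the judge was itself a mutation: the heuristic half is fast but gameable, so the genome learned to trust it less.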
Constitutional Merge Gate
The merge gate enforces three hard constraints:
- Quality floor: Score < 4.0/10 → auto-reject, return to backlog
- Safety check: No
eval(),exec(),pickle.loads(), or shell injection patterns in code output - Truncation check: If output was truncated, quality score is penalized by 1.5 points before the merge decision
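The safety and truncation checks can be sketched as follows. The pattern list is illustrative and deliberately incomplete (the article names `eval`, `exec`, and `pickle.loads`; the shell injection patterns shown are my assumption of what such a scan might include):

```python
import re

# Patterns the safety check rejects in code output (illustrative subset).
UNSAFE_PATTERNS = [
    r"\beval\s*\(",
    r"\bexec\s*\(",
    r"pickle\.loads\s*\(",
    r"os\.system\s*\(",                          # assumed shell-injection vector
    r"subprocess\.\w+\([^)]*shell\s*=\s*True",   # assumed shell-injection vector
]

def passes_safety_check(code: str) -> bool:
    """True if no forbidden pattern appears in the code output."""
    return not any(re.search(p, code) for p in UNSAFE_PATTERNS)

def gated_score(score: float, truncated: bool) -> float:
    """Apply the 1.5-point truncation penalty before the merge decision."""
    return score - 1.5 if truncated else score
```

Note how the penalty interacts with the 4.0 floor: a truncated output scoring 5.0 drops to 3.5 and is rejected, which is how truncation becomes the dominant rejection reason.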
Over 25 days, the merge gate has rejected 47 tasks (10.6% rejection rate). The most common rejection reason is truncation-induced low quality (68%), followed by structural incompleteness (22%), and safety violations (10%).
| Metric | Before Calibration | After Calibration |
|---|---|---|
| Avg quality score | 5.04 | 7.2 |
| Score inflation (heuristic) | 48% scored exactly 8/10 | Normal distribution 4–9 |
| Human-system score gap | 2.1 points | 0.8 points |
| Rejection rate | 6% | 10.6% |
The higher rejection rate after calibration is actually a positive signal — the system is now catching low-quality outputs that previously slipped through with inflated scores.
5. Lessons from 25 Days of Operation
Running an autonomous AI system for 25 consecutive days taught us things that no benchmark or demo could reveal.
Lesson 1: Model Routing Matters More Than Model Capability
Our most expensive model (Opus) doesn’t always produce the best results. Simple documentation tasks routed to Opus averaged 7.1/10 — the same as Sonnet at 40% of the cost. The lesson: match task complexity to model capability. Use Opus for architecture decisions and multi-file refactors. Use Sonnet for everything else. Use Haiku for assessment.
Lesson 2: Output Truncation Destroys Quality
This was the single biggest finding. Truncated outputs scored 2.7 points lower than complete outputs on average. The fix was counterintuitive: instead of increasing token limits or adding continuation mechanisms, we scoped tasks smaller upfront. A well-defined, narrow task with complete output beats a broad task with truncated output every time.
The Truncation Rule
If a task would require more than 8K output tokens, split it into subtasks. Never continue a truncated output — tasks with 1 continuation averaged quality 5.1/10, worse than the 7.4/10 average for zero-continuation tasks. Continuation doesn’t fix truncation; it compounds it.
Lesson 3: Budget Management Is Architecture
Night Shift operates on $28 total over 25 days — roughly $1.12/day. This isn’t a limitation; it’s a design constraint that forces efficiency. The budget governor allocates tokens across categories, models, and time periods. Without it, the system would exhaust its weekly quota in 2 days on expensive Opus calls.
Key budget parameters:
- Weekly token limit: 3.5M tokens across all providers
- Per-category cap: 30% of weekly budget (prevents any category from dominating)
- Gemini rate limit: 1,250 requests/day across 5 API keys with round-robin and 429 failover
- GPU utilization target: 10–15% (currently 11.3%)
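A minimal version of the governor's admission check, using the two numbers above (3.5M weekly tokens, 30% per-category cap); the function shape is an assumption, not `quota_governor.py`'s real interface:

```python
WEEKLY_TOKEN_LIMIT = 3_500_000
CATEGORY_CAP_PCT = 0.30

def can_dispatch(spent_total, spent_by_category, category, estimated_tokens):
    """Budget governor sketch: weekly quota plus per-category 30% cap."""
    if spent_total + estimated_tokens > WEEKLY_TOKEN_LIMIT:
        return False   # weekly quota would be exhausted
    cat_spent = spent_by_category.get(category, 0)
    if cat_spent + estimated_tokens > CATEGORY_CAP_PCT * WEEKLY_TOKEN_LIMIT:
        return False   # category would exceed its 1.05M-token slice
    return True
```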
Lesson 4: Mentoring Feedback Must Be Precise
We ran 88 mentoring interventions over 8 review sessions. The initial approach — injecting 2,000-character reviews into every task prompt — had 0% measurable effectiveness. Reviews were too long, too generic, and not actionable by the model.
The fix: per-category guidance files (max 500 chars), Reflexion-style per-task reflections stored in JSONL files (last 20 per category), and feedback written as model-actionable instructions. Instead of “quality needs improvement,” we write “always include a methodology section before results” or “test at least 3 edge cases per function.”
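Keeping only the last 20 reflections per category is a tail read over a JSONL file. The file layout below (one JSON object per line with a `reflection` field) is an assumption about how those files are structured:

```python
import json
from collections import deque

def load_recent_reflections(path, limit=20):
    """Read the last `limit` Reflexion-style reflections from a JSONL file.

    A bounded deque keeps memory constant: older lines fall off the front
    as the file is streamed, so only the most recent `limit` survive.
    """
    tail = deque(maxlen=limit)
    with open(path) as f:
        for line in f:
            if line.strip():
                tail.append(json.loads(line))
    return [entry["reflection"] for entry in tail]
```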
Lesson 5: Self-Improvement Requires Guardrails
Night Shift has a self_improvement category (15% allocation) where it works on its own code. This is powerful but dangerous: without guardrails, the system could modify its own quality assessor to give itself higher scores, or adjust budget parameters to consume more resources.
The guardrails:
- Self-improvement tasks cannot modify
quality_assessor.py,quota_governor.py, orconstitutional_checker.py - All self-improvement outputs require human review before merge (only category with this constraint)
- Budget allocation for self_improvement is capped at 15% and cannot be mutated by the evolution engine
The Numbers After 25 Days
| Metric | Day 1 | Day 25 | Change |
|---|---|---|---|
| Tasks/day | 15 | 62 | +313% |
| GPU utilization | 0.3% | 11.3% | +3,667% |
| Avg quality (calibrated) | 5.0 | 7.2 | +44% |
| Truncation rate | 29% | 8% | -72% |
| Dispatch cycle | 4 hours | 1 hour | 4x faster |
| Genome mutations | 0 | 14 | — |
The system is not just running tasks — it’s getting better at running tasks. The evolution engine has autonomously discovered and applied optimizations that would have taken weeks of manual tuning.
What’s Next
Night Shift is moving toward three capabilities:
- Multi-repo operation: Currently limited to the hub monorepo. Next milestone is dispatching tasks across project submodules with independent context and git branches.
- Slow loop (weekly): The adaptive tree currently operates on hourly and daily loops. A weekly loop will evaluate strategic allocation — should the system spend more time on revenue-generating tasks vs. infrastructure?
- Human-in-the-loop escalation: Instead of rejecting low-scoring tasks, route them to a human review queue with the LLM judge’s specific feedback. Convert rejections into mentoring opportunities.
See Night Shift in Action
Night Shift is part of the NEXUS ecosystem — autonomous AI operations for teams of 5–200. The GitLab Pages portal shows real-time pipeline status, task history, and quality metrics.
Related Articles
- Why Our AI Agent’s Quality Dropped 31% — root cause analysis of the quality decline and 15 fixes
- Night Shift: 300 Tasks in 14 Days — production data from the first two weeks
- Night Shift: How AI Writes Code While You Sleep — the original Night Shift deep dive
- From 0 to 3,000 Tests: Building Quality into AI-Generated Code — how Night Shift maintains code quality
- Enterprise AI Security Checklist for 2026 — the 8 security gates protecting autonomous AI