March 2026  |  12 min read

How Night Shift Runs 300+ Tasks Autonomously — Architecture Deep Dive

Night Shift is an autonomous AI development system that executes 62 tasks per day with constitutional merge gates, adaptive task selection, and evolutionary optimization. Over 25 days of continuous operation, it has completed 443 tasks across 23 modules at a total cost of $28. This is the architecture that makes it work.

Most autonomous AI demos run for a few minutes and produce a single output. Night Shift runs 24/7, selecting its own tasks from a backlog, routing them to the right model, assessing quality, and deciding whether to merge results into the codebase — all without human intervention. The system has produced 1,721 tests, contributed to 23 modules, and maintained an average quality score above 7.0/10 after calibration fixes.

443 Tasks Completed
25 Days Running
$28 Total API Cost
1,721 Tests Generated

1. Architecture Overview

Night Shift operates as a continuous dispatch loop running on a Hetzner dedicated server. The architecture has four layers: task selection and dispatch, model routing and execution, quality assessment behind a constitutional merge gate, and an evolution engine that tunes the other three.

Why Hourly Cycles?

Early versions used 4-hour dispatch cycles, resulting in only 6 tasks/day and 0.3% GPU utilization. Switching to hourly dispatch increased throughput to 62 tasks/day and GPU utilization to 11.3%. The key constraint is not compute — it’s API rate limits and budget. Hourly cycles hit the sweet spot between throughput and cost control.

The Dispatch Loop

Each hourly cycle follows a fixed sequence:

1. CHECK budget    → Abort if weekly quota exhausted
2. SELECT task     → Priority score = urgency × fitness × category_weight
3. ROUTE model     → Match task complexity to model capability
4. EXECUTE         → Build prompt, call API, collect output
5. ASSESS quality  → Heuristic (40%) + LLM-as-Judge (60%)
6. GATE merge      → Constitutional check: reject if score < 4.0
7. STORE results   → Write output, update backlog, log metrics
8. EVOLVE          → Update genome fitness, trigger mutations

The constitutional merge gate at step 6 is non-negotiable: any task scoring below 4.0/10 is automatically rejected and returned to the backlog with a “needs_improvement” flag. This prevents low-quality outputs from accumulating in the codebase.
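Reduced to code, step 6 is a hard threshold check. A minimal sketch, where only the 4.0 threshold and the needs_improvement flag come from the text; the function name and dict shape are illustrative:

```python
MERGE_THRESHOLD = 4.0  # step 6: constitutional merge gate

def gate(score):
    """Return the merge decision for a scored output.

    The threshold and flag name follow the article; the return
    shape is an illustrative assumption, not Night Shift's API.
    """
    if score < MERGE_THRESHOLD:
        return {"action": "reject", "flag": "needs_improvement"}
    return {"action": "merge", "flag": None}
```

Because the gate is a pure function of the score, it is trivially auditable: every rejection can be traced to a single number in the logs.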

2. Task Lifecycle

Every task in Night Shift follows a defined lifecycle from creation to integration:

| Stage | Description | Duration |
|---|---|---|
| Backlog | Task defined with title, category, priority, output type | Indefinite |
| Selected | Dispatcher picks task based on priority × fitness score | <1 sec |
| Prompted | Context engine builds prompt with codebase map, reflections, guidance | 2–5 sec |
| Executing | API call to Claude, Gemini, or open-source model | 30–180 sec |
| Assessed | Quality scorer evaluates output (heuristic + LLM judge) | 5–15 sec |
| Merged / Rejected | Score ≥4.0 merged to codebase; <4.0 returned to backlog | <1 sec |
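The lifecycle can be modeled as a small state machine. A sketch under the assumption that rejected tasks return directly to the backlog, as the table describes; the enum and transition names are illustrative:

```python
from enum import Enum

class TaskStage(Enum):
    """Lifecycle stages from the table above; names are illustrative."""
    BACKLOG = "backlog"
    SELECTED = "selected"
    PROMPTED = "prompted"
    EXECUTING = "executing"
    ASSESSED = "assessed"
    MERGED = "merged"
    REJECTED = "rejected"

# Allowed forward transitions; rejection loops back to the backlog.
TRANSITIONS = {
    TaskStage.BACKLOG: {TaskStage.SELECTED},
    TaskStage.SELECTED: {TaskStage.PROMPTED},
    TaskStage.PROMPTED: {TaskStage.EXECUTING},
    TaskStage.EXECUTING: {TaskStage.ASSESSED},
    TaskStage.ASSESSED: {TaskStage.MERGED, TaskStage.REJECTED},
    TaskStage.REJECTED: {TaskStage.BACKLOG},
}
```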

Task Selection Algorithm

Task selection is not random. The dispatcher computes a composite priority score:

priority_score = base_priority
              × category_weight(adaptive_tree)
              × staleness_bonus(days_since_created)
              × dependency_readiness(parent_tasks_complete)
              × (1.0 - category_saturation)

The category_weight comes from the adaptive tree (see Section 3), which dynamically adjusts allocation across 7 categories based on recent performance. The staleness_bonus prevents tasks from sitting in the backlog forever — older tasks get a linearly increasing priority boost. The category_saturation penalty ensures no single category consumes more than 30% of weekly budget.
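Put together, the formula might look like this in code. The multiplicative structure follows the formula above; the linear staleness bonus and the hard 0/1 dependency gate are assumptions:

```python
def priority_score(base_priority, category_weight, days_stale,
                   deps_ready, category_saturation,
                   staleness_rate=0.05):
    """Composite priority; factor names mirror the formula above.

    Assumed forms (not confirmed against Night Shift's source):
    - staleness bonus grows linearly: 1 + rate * days_stale
    - dependency readiness is a hard gate: 1.0 if ready, else 0.0
    """
    staleness_bonus = 1.0 + staleness_rate * days_stale
    readiness = 1.0 if deps_ready else 0.0
    return (base_priority
            * category_weight
            * staleness_bonus
            * readiness
            * (1.0 - category_saturation))

# A 10-day-old ready task in a 20%-weighted, half-saturated category:
score = priority_score(1.0, 0.20, 10, True, 0.5)
```

Note that any task with incomplete dependencies scores exactly zero, so it can never be selected ahead of a ready task.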

Model Routing

Not every task needs the most capable model. Night Shift routes tasks based on complexity and output type:

| Model | Used For | Token Budget | Avg Quality |
|---|---|---|---|
| Claude Opus | Architecture, complex research, multi-file code | 80K context | 7.8 |
| Claude Sonnet | Standard code, reports, documentation | 60K context | 7.2 |
| Claude Haiku | Quality assessment (LLM-as-Judge), simple tasks | 30K context | 6.9 |
| Gemini 2.0 Flash | Research, content generation, analysis | 50K context | 6.5 |

Model routing uses a simple heuristic: tasks tagged as architecture or complex_research go to Opus. Tasks tagged write-tests or documentation go to Sonnet. Quality assessment always uses Haiku (cheaper, consistent enough for scoring). Gemini handles research tasks with its higher rate limits (1,250 requests/day across 5 API keys with round-robin).
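A sketch of that heuristic as a tag-to-model lookup. The tags and model assignments come from the text; the table structure and the Sonnet fallback are assumptions:

```python
# Illustrative routing table; not Night Shift's actual config format.
ROUTES = {
    "architecture": "claude-opus",
    "complex_research": "claude-opus",
    "write-tests": "claude-sonnet",
    "documentation": "claude-sonnet",
    "quality_assessment": "claude-haiku",
    "research": "gemini-2.0-flash",
}

def route(tags, default="claude-sonnet"):
    """First matching tag wins; untagged tasks fall back to Sonnet."""
    for tag in tags:
        if tag in ROUTES:
            return ROUTES[tag]
    return default
```

Keeping the routing declarative makes it a natural target for the evolution engine: reassigning a tag is a one-line genome change.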

3. The Evolution Engine

Night Shift doesn’t just execute tasks — it evolves its own behavior over time. The evolution engine tracks a “genome” of operational parameters and mutates them based on fitness signals.

The Adaptive Tree

The adaptive tree is a 1,032-line module (evolution/adaptive_tree.py) that manages task category allocation dynamically. It operates on two feedback loops, one hourly and one daily.

The current allocation across 7 categories:

| Category | Allocation | Purpose |
|---|---|---|
| revenue | 20% | Client-facing features, sales materials |
| capability | 20% | Platform features, new modules |
| tech_research | 15% | Technical exploration, architecture |
| biz_research | 15% | Market analysis, competitor intelligence |
| self_improvement | 15% | Night Shift's own code, tests, docs |
| content | 10% | Blog posts, documentation, reports |
| ops | 5% | CI/CD, infrastructure, monitoring |
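The allocation table feeds the category_saturation penalty in the selection formula. A sketch assuming saturation is the fraction of a category's capped share of the weekly budget already spent; the 30% cap comes from the text, the linear form does not:

```python
# Allocation mirrors the table above; the dict layout is illustrative.
ALLOCATION = {
    "revenue": 0.20, "capability": 0.20, "tech_research": 0.15,
    "biz_research": 0.15, "self_improvement": 0.15,
    "content": 0.10, "ops": 0.05,
}

def category_saturation(spent, weekly_budget, cap=0.30):
    """How close a category is to its capped weekly share, in [0, 1].

    At the 30% cap the selection penalty (1 - saturation) reaches
    zero, so the category stops winning dispatches entirely.
    """
    return min(1.0, spent / (cap * weekly_budget))
```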

Genome Mutations

The evolution engine tracks 12 operational parameters as a “genome” and applies mutations when fitness drops below threshold:

genome = {
    "output_token_budget": 8000,      # Mutated: was 4000
    "context_budget_opus": 80000,     # Mutated: was 120000
    "max_followup_depth": 1,          # Mutated: was 2
    "category_cap_pct": 0.30,         # Mutated: was 0.50
    "assessor_llm_weight": 0.60,      # Mutated: was 0.40
    "staleness_bonus_rate": 0.05,     # Stable
    "continuation_enabled": False,    # Mutated: was True
    ...
}

Mutations are not random — they follow directed signals from evolution/branch_signals.py (323 LOC). When truncation rate exceeds 25%, the output token budget mutates upward. When context bloat correlates with quality drops, the context budget mutates downward. Each mutation is logged with rationale, enabling post-hoc analysis of what worked.
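A directed mutation might be expressed as a rule table over branch signals. The 25% truncation threshold and the budget values mirror the text; the rule shape, the correlation threshold, and the signal names are assumptions, not the contents of branch_signals.py:

```python
def propose_mutations(signals, genome):
    """Return (param, new_value, rationale) tuples from branch signals.

    Illustrative rules: thresholds and scaling factors are assumptions
    except where the article states them (truncation rate > 25%).
    """
    mutations = []
    if signals.get("truncation_rate", 0.0) > 0.25:
        mutations.append(("output_token_budget",
                          genome["output_token_budget"] * 2,
                          "truncation rate above 25%"))
    if signals.get("context_quality_corr", 0.0) < -0.3:
        mutations.append(("context_budget_opus",
                          int(genome["context_budget_opus"] * 0.75),
                          "context bloat correlates with quality drop"))
    return mutations
```

Carrying the rationale alongside each mutation is what makes the post-hoc analysis possible: the log explains why a parameter moved, not just that it did.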

Evolution in Practice

Over 25 days, the genome has undergone 14 mutations. The most impactful was disabling continuations (continuation_enabled: true → false), which improved average quality from 5.1 to 7.4 for previously-continued tasks. The evolution engine discovered this fix autonomously — the branch signal detected that tasks with continuations scored 2.3 points lower than tasks without, and triggered the mutation.

4. Quality Gates

Quality assessment is the most critical component. Without reliable scoring, the system can’t distinguish good output from noise.

The Hybrid Scorer

Night Shift uses a two-stage scoring system:

  1. Heuristic scorer (40% weight): Evaluates structural properties — output length, presence of code blocks, heading structure, completeness markers, test coverage mentions. Fast (~100ms) but susceptible to Goodhart’s Law.
  2. LLM-as-Judge (60% weight): Claude Haiku evaluates the output against the original task prompt, using calibration anchors (anchors.yaml) that provide score=9 and score=3 reference examples for each output type (code, report, research).

The calibration anchors were critical. Without them, the LLM judge drifted toward whatever scoring baseline it internalized during training. With anchors, scoring consistency improved from 65% to approximately 77% agreement with human grades.
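The blend itself is a weighted average. A sketch with the 40/60 weights from the text, both scores on the same 0–10 scale; the function name is illustrative:

```python
def hybrid_score(heuristic, llm_judge, llm_weight=0.60):
    """Weighted blend of the two scorers (heuristic 40%, judge 60%).

    Keeping the weight as a parameter matters: assessor_llm_weight
    is one of the genome parameters the evolution engine mutates.
    """
    return (1.0 - llm_weight) * heuristic + llm_weight * llm_judge
```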

Constitutional Merge Gate

The merge gate enforces three hard constraints: a minimum quality score of 4.0/10, structural completeness of the output, and the absence of safety violations.

Over 25 days, the merge gate has rejected 47 tasks (10.6% rejection rate). The most common rejection reason is truncation-induced low quality (68%), followed by structural incompleteness (22%), and safety violations (10%).

| Metric | Before Calibration | After Calibration |
|---|---|---|
| Avg quality score | 5.0 | 7.2 |
| Score inflation (heuristic) | 48% scored exactly 8/10 | Normal distribution 4–9 |
| Human-system score gap | 2.1 points | 0.8 points |
| Rejection rate | 6% | 10.6% |

The higher rejection rate after calibration is actually a positive signal — the system is now catching low-quality outputs that previously slipped through with inflated scores.

5. Lessons from 25 Days of Operation

Running an autonomous AI system for 25 consecutive days taught us things that no benchmark or demo could reveal.

Lesson 1: Model Routing Matters More Than Model Capability

Our most expensive model (Opus) doesn’t always produce the best results. Simple documentation tasks routed to Opus averaged 7.1/10 — the same as Sonnet at 40% of the cost. The lesson: match task complexity to model capability. Use Opus for architecture decisions and multi-file refactors. Use Sonnet for everything else. Use Haiku for assessment.

Lesson 2: Output Truncation Destroys Quality

This was the single biggest finding. Truncated outputs scored 2.7 points lower than complete outputs on average. The fix was counterintuitive: instead of increasing token limits or adding continuation mechanisms, we scoped tasks smaller upfront. A well-defined, narrow task with complete output beats a broad task with truncated output every time.

The Truncation Rule

If a task would require more than 8K output tokens, split it into subtasks. Never continue a truncated output — tasks with 1 continuation averaged quality 5.1/10, worse than the 7.4/10 average for zero-continuation tasks. Continuation doesn’t fix truncation; it compounds it.
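Applied mechanically, the rule is a ceiling division over the 8K budget. A sketch in which the upstream token estimate is assumed to exist; only the 8K limit comes from the text:

```python
import math

TOKEN_LIMIT = 8_000  # output budget from the truncation rule

def plan_subtasks(estimated_output_tokens, limit=TOKEN_LIMIT):
    """How many subtasks a task must be split into so each stays
    under the output budget. Per the rule above, the alternative
    (continuing a truncated output) is never taken."""
    return max(1, math.ceil(estimated_output_tokens / limit))
```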

Lesson 3: Budget Management Is Architecture

Night Shift operates on $28 total over 25 days — roughly $1.12/day. This isn’t a limitation; it’s a design constraint that forces efficiency. The budget governor allocates tokens across categories, models, and time periods. Without it, the system would exhaust its weekly quota in 2 days on expensive Opus calls.
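A minimal sketch of such a governor at the weekly level, assuming a simple spend ledger. The $1.12/day figure is from the text; the class shape and method names are illustrative:

```python
DAILY_BUDGET_USD = 1.12  # observed average from the article

class BudgetGovernor:
    """Tracks weekly spend and vetoes calls that would exceed it.

    A real governor would also split the quota across categories,
    models, and time periods; this shows only the top-level check
    performed at step 1 of every dispatch cycle.
    """
    def __init__(self, weekly_budget=7 * DAILY_BUDGET_USD):
        self.weekly_budget = weekly_budget
        self.spent = 0.0

    def can_spend(self, estimated_cost):
        return self.spent + estimated_cost <= self.weekly_budget

    def record(self, actual_cost):
        self.spent += actual_cost
```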

Key budget parameters: the weekly spending quota checked at the top of every cycle, the 8K output token budget, per-model context budgets (80K for Opus after mutation), and the 30% per-category spending cap.

Lesson 4: Mentoring Feedback Must Be Precise

We ran 88 mentoring interventions over 8 review sessions. The initial approach — injecting 2,000-character reviews into every task prompt — had 0% measurable effectiveness. Reviews were too long, too generic, and not actionable by the model.

The fix: per-category guidance files (max 500 chars), Reflexion-style per-task reflections stored in JSONL files (last 20 per category), and feedback written as model-actionable instructions. Instead of “quality needs improvement,” we write “always include a methodology section before results” or “test at least 3 edge cases per function.”
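The last-20 reflection window is a natural fit for a bounded deque over a JSONL stream. A sketch in which the record schema and per-category file layout are assumptions:

```python
import json
from collections import deque

def recent_reflections(jsonl_lines, limit=20):
    """Keep the last `limit` Reflexion-style reflections from a JSONL
    stream (one JSON object per line). The per-category file layout
    and record fields are assumptions, not Night Shift's schema."""
    records = (json.loads(line) for line in jsonl_lines if line.strip())
    return list(deque(records, maxlen=limit))

# Usage sketch (hypothetical path):
# with open("reflections/content.jsonl") as f:
#     context = recent_reflections(f)
```

The `maxlen` bound means memory stays constant no matter how long the reflection history grows, which matters for a process that runs for weeks.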

Lesson 5: Self-Improvement Requires Guardrails

Night Shift has a self_improvement category (15% allocation) where it works on its own code. This is powerful but dangerous: without guardrails, the system could modify its own quality assessor to give itself higher scores, or adjust budget parameters to consume more resources.

The guardrails: self_improvement outputs pass through the same constitutional merge gate as every other task, the budget governor caps the category at its 15% allocation, and every genome mutation is logged with its rationale for post-hoc review.

The Numbers After 25 Days

443 Tasks Completed
23 Modules Built
1,721 Tests Generated
$1.12 Cost Per Day
| Metric | Day 1 | Day 25 | Change |
|---|---|---|---|
| Tasks/day | 15 | 62 | +313% |
| GPU utilization | 0.3% | 11.3% | +3,667% |
| Avg quality (calibrated) | 5.0 | 7.2 | +44% |
| Truncation rate | 29% | 8% | -72% |
| Dispatch cycle | 4 hours | 1 hour | 4x faster |
| Genome mutations | 0 | 14 | +14 |

The system is not just running tasks — it’s getting better at running tasks. The evolution engine has autonomously discovered and applied optimizations that would have taken weeks of manual tuning.

What’s Next

Night Shift is moving toward three capabilities:

  1. Multi-repo operation: Currently limited to the hub monorepo. Next milestone is dispatching tasks across project submodules with independent context and git branches.
  2. Slow loop (weekly): The adaptive tree currently operates on hourly and daily loops. A weekly loop will evaluate strategic allocation — should the system spend more time on revenue-generating tasks vs. infrastructure?
  3. Human-in-the-loop escalation: Instead of rejecting low-scoring tasks, route them to a human review queue with the LLM judge’s specific feedback. Convert rejections into mentoring opportunities.

See Night Shift in Action

Night Shift is part of the NEXUS ecosystem — autonomous AI operations for teams of 5–200. The GitLab Pages portal shows real-time pipeline status, task history, and quality metrics.

