How Night Shift Runs 300+ Tasks Autonomously — Architecture Deep Dive
Night Shift is an autonomous AI development system that executes 62 tasks per day with constitutional merge gates, adaptive task selection, and evolutionary optimization. Over 25 days of continuous operation, it has completed 443 tasks across 23 modules at a total cost of $28. This is the architecture that makes it work.
Most autonomous AI demos run for a few minutes and produce a single output. Night Shift runs 24/7, selecting its own tasks from a backlog, routing them to the right model, assessing quality, and deciding whether to merge results into the codebase — all without human intervention. The system has produced 1,721 tests, contributed to 23 modules, and maintained an average quality score above 7.0/10 after calibration fixes.
1. Architecture Overview
Night Shift operates as a continuous dispatch loop running on a Hetzner dedicated server. The architecture has four layers:
- Dispatcher: The core loop that runs every hour, selects the highest-priority task, routes it to a model, and processes the output
- Budget Governor: Enforces weekly token limits (3.5M tokens/week) and per-category spending caps (30% max per category)
- Quality Assessor: Hybrid scorer combining heuristic analysis (40%) with LLM-as-Judge evaluation (60%) using calibration anchors
- Evolution Engine: Tracks genome mutations, fitness scores, and adaptive task allocation across 7 categories
Why Hourly Cycles?
Early versions used 4-hour dispatch cycles, resulting in only 6 tasks/day and 0.3% GPU utilization. Switching to hourly dispatch increased throughput to 62 tasks/day and GPU utilization to 11.3%. The key constraint is not compute — it’s API rate limits and budget. Hourly cycles hit the sweet spot between throughput and cost control.
The Dispatch Loop
Each hourly cycle follows a fixed sequence:
1. CHECK budget → Abort if weekly quota exhausted
2. SELECT task → Priority score = urgency × fitness × category_weight
3. ROUTE model → Match task complexity to model capability
4. EXECUTE → Build prompt, call API, collect output
5. ASSESS quality → Heuristic (40%) + LLM-as-Judge (60%)
6. GATE merge → Constitutional check: reject if score < 4.0
7. STORE results → Write output, update backlog, log metrics
8. EVOLVE → Update genome fitness, trigger mutations
The constitutional merge gate at step 6 is non-negotiable: any task scoring below 4.0/10 is automatically rejected and returned to the backlog with a “needs_improvement” flag. This prevents low-quality outputs from accumulating in the codebase.
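The eight steps above can be sketched as a single function. This is an illustrative reduction, not Night Shift's actual API: the task dict shape, the stubbed execute step, and the tuple return are assumptions for the sake of a runnable example.

```python
def run_cycle(backlog, tokens_remaining, score_fn, threshold=4.0):
    """One hourly dispatch cycle (illustrative sketch, not the real module)."""
    if tokens_remaining <= 0:
        return ("aborted", None)                      # 1. CHECK budget
    task = max(backlog, key=lambda t: t["priority"])  # 2. SELECT highest-priority task
    backlog.remove(task)
    output = f"output for {task['title']}"            # 3-4. ROUTE + EXECUTE (stubbed)
    score = score_fn(output)                          # 5. ASSESS quality
    if score < threshold:                             # 6. GATE: constitutional check
        task["flag"] = "needs_improvement"
        backlog.append(task)                          # rejected work returns to backlog
        return ("rejected", task)
    return ("merged", task)                           # 7-8. STORE + EVOLVE (omitted)
```

The key property the sketch preserves is that rejection is not terminal: a gated task goes back into the backlog with its flag set, so a later cycle can retry it.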
2. Task Lifecycle
Every task in Night Shift follows a defined lifecycle from creation to integration:
| Stage | Description | Duration |
|---|---|---|
| Backlog | Task defined with title, category, priority, output type | Indefinite |
| Selected | Dispatcher picks task based on priority × fitness score | <1 sec |
| Prompted | Context engine builds prompt with codebase map, reflections, guidance | 2–5 sec |
| Executing | API call to Claude, Gemini, or open-source model | 30–180 sec |
| Assessed | Quality scorer evaluates output (heuristic + LLM judge) | 5–15 sec |
| Merged / Rejected | Score ≥4.0 merged to codebase; <4.0 returned to backlog | <1 sec |
Task Selection Algorithm
Task selection is not random. The dispatcher computes a composite priority score:
```
priority_score = base_priority
    × category_weight(adaptive_tree)
    × staleness_bonus(days_since_created)
    × dependency_readiness(parent_tasks_complete)
    × (1.0 - category_saturation)
```
The category_weight comes from the adaptive tree (see Section 3), which dynamically adjusts allocation across 7 categories based on recent performance. The staleness_bonus prevents tasks from sitting in the backlog forever — older tasks get a linearly increasing priority boost. The category_saturation penalty ensures no single category consumes more than 30% of weekly budget.
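A minimal sketch of that composite score, assuming a linear staleness boost at the genome's `staleness_bonus_rate` (0.05/day) and a binary dependency gate; the parameter names are illustrative, not the dispatcher's real signature:

```python
def priority_score(base_priority, category_weight, days_stale,
                   deps_ready, category_spend_pct,
                   staleness_bonus_rate=0.05):
    """Composite priority: urgency x fitness x category weight (sketch)."""
    staleness_bonus = 1.0 + staleness_bonus_rate * days_stale  # linear anti-starvation boost
    dependency_readiness = 1.0 if deps_ready else 0.0          # blocked tasks score zero
    saturation_penalty = 1.0 - category_spend_pct              # pressure toward the 30% cap
    return (base_priority * category_weight * staleness_bonus
            * dependency_readiness * saturation_penalty)
```

Because the factors multiply rather than add, any single zero (an unmet dependency, a fully saturated category) removes the task from contention outright.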
Model Routing
Not every task needs the most capable model. Night Shift routes tasks based on complexity and output type:
| Model | Used For | Token Budget | Avg Quality |
|---|---|---|---|
| Claude Opus | Architecture, complex research, multi-file code | 80K context | 7.8 |
| Claude Sonnet | Standard code, reports, documentation | 60K context | 7.2 |
| Claude Haiku | Quality assessment (LLM-as-Judge), simple tasks | 30K context | 6.9 |
| Gemini 2.0 Flash | Research, content generation, analysis | 50K context | 6.5 |
Model routing uses a simple heuristic: tasks tagged as architecture or complex_research go to Opus. Tasks tagged write-tests or documentation go to Sonnet. Quality assessment always uses Haiku (cheaper, consistent enough for scoring). Gemini handles research tasks with its higher rate limits (1,250 requests/day across 5 API keys with round-robin).
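That heuristic fits in a few lines. The model labels and the extra `analysis` tag below are assumptions for illustration; the real router presumably maps to provider-specific model IDs:

```python
from itertools import cycle

def route_model(tags):
    """Tag-based routing heuristic: match task complexity to model capability."""
    tags = set(tags)
    if tags & {"architecture", "complex_research"}:
        return "opus"                 # most capable, most expensive
    if tags & {"write-tests", "documentation"}:
        return "sonnet"
    if tags & {"research", "content", "analysis"}:
        return "gemini-2.0-flash"     # benefits from higher rate limits
    return "sonnet"                   # default: standard code, reports

# Round-robin over the 5 Gemini API keys (429 failover omitted; key names hypothetical).
gemini_keys = cycle([f"GEMINI_KEY_{i}" for i in range(1, 6)])
```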
3. The Evolution Engine
Night Shift doesn’t just execute tasks — it evolves its own behavior over time. The evolution engine tracks a “genome” of operational parameters and mutates them based on fitness signals.
The Adaptive Tree
The adaptive tree is a 1,032-line module (evolution/adaptive_tree.py) that manages task category allocation dynamically. It operates on two feedback loops:
- Fast loop (hourly): Adjusts category weights based on the last 24 hours of quality scores. If `tech_research` tasks are scoring 8.5/10 while `content` tasks score 5.2/10, the tree shifts allocation toward tech_research.
- Medium loop (daily): Evaluates category-level fitness trends over 7 days. Categories with declining quality get reduced allocation; categories with improving quality get boosted.
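The fast loop can be sketched as a multiplicative nudge toward above-average categories followed by renormalization. The learning rate, the weight floor, and the update rule itself are assumptions; the real `adaptive_tree.py` logic is more involved:

```python
def adjust_weights(weights, recent_scores, learning_rate=0.1):
    """Fast-loop sketch: nudge category weights toward recent quality.

    weights:       {category: allocation weight, summing to 1.0}
    recent_scores: {category: mean quality over the last 24h, 0-10 scale}
    """
    mean = sum(recent_scores.values()) / len(recent_scores)
    nudged = {
        cat: max(0.01, w * (1 + learning_rate * (recent_scores[cat] - mean) / 10))
        for cat, w in weights.items()   # floor keeps no category at exactly zero
    }
    total = sum(nudged.values())
    return {cat: w / total for cat, w in nudged.items()}  # renormalize to 1.0
```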
The current allocation across 7 categories:
| Category | Allocation | Purpose |
|---|---|---|
| revenue | 20% | Client-facing features, sales materials |
| capability | 20% | Platform features, new modules |
| tech_research | 15% | Technical exploration, architecture |
| biz_research | 15% | Market analysis, competitor intelligence |
| self_improvement | 15% | Night Shift’s own code, tests, docs |
| content | 10% | Blog posts, documentation, reports |
| ops | 5% | CI/CD, infrastructure, monitoring |
Genome Mutations
The evolution engine tracks 12 operational parameters as a “genome” and applies mutations when fitness drops below threshold:
```python
genome = {
    "output_token_budget": 8000,     # Mutated: was 4000
    "context_budget_opus": 80000,    # Mutated: was 120000
    "max_followup_depth": 1,         # Mutated: was 2
    "category_cap_pct": 0.30,        # Mutated: was 0.50
    "assessor_llm_weight": 0.60,     # Mutated: was 0.40
    "staleness_bonus_rate": 0.05,    # Stable
    "continuation_enabled": False,   # Mutated: was True
    ...
}
```
Mutations are not random — they follow directed signals from evolution/branch_signals.py (323 LOC). When truncation rate exceeds 25%, the output token budget mutates upward. When context bloat correlates with quality drops, the context budget mutates downward. Each mutation is logged with rationale, enabling post-hoc analysis of what worked.
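A directed mutation pass might look like the following. The thresholds, the doubling/shrinking factors, and the metric names are illustrative assumptions, not lifted from `branch_signals.py`; only the two trigger conditions (truncation rate above 25%, context bloat correlating with quality drops) come from the article:

```python
def maybe_mutate(genome, metrics, log):
    """Directed (non-random) mutation sketch driven by branch signals."""
    if metrics["truncation_rate"] > 0.25:
        old = genome["output_token_budget"]
        genome["output_token_budget"] = old * 2           # mutate upward
        log.append(("output_token_budget", old, old * 2,
                    "truncation rate above 25%"))
    if metrics.get("context_bloat_quality_corr", 0.0) < -0.3:
        old = genome["context_budget_opus"]
        genome["context_budget_opus"] = int(old * 0.75)   # mutate downward
        log.append(("context_budget_opus", old, genome["context_budget_opus"],
                    "context bloat correlates with quality drops"))
    return genome    # every mutation lands in `log` with its rationale
```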
Evolution in Practice
Over 25 days, the genome has undergone 14 mutations. The most impactful was disabling continuations (continuation_enabled: true → false), which improved average quality from 5.1 to 7.4 for previously-continued tasks. The evolution engine discovered this fix autonomously — the branch signal detected that tasks with continuations scored 2.3 points lower than tasks without, and triggered the mutation.
4. Quality Gates
Quality assessment is the most critical component. Without reliable scoring, the system can’t distinguish good output from noise.
The Hybrid Scorer
Night Shift uses a two-stage scoring system:
- Heuristic scorer (40% weight): Evaluates structural properties — output length, presence of code blocks, heading structure, completeness markers, test coverage mentions. Fast (~100ms) but susceptible to Goodhart’s Law.
- LLM-as-Judge (60% weight): Claude Haiku evaluates the output against the original task prompt, using calibration anchors (`anchors.yaml`) that provide score=9 and score=3 reference examples for each output type (code, report, research).
The calibration anchors were critical. Without them, the LLM judge drifted toward whatever scoring baseline it internalized during training. With anchors, scoring consistency improved from 65% to approximately 77% agreement with human grades.
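The blend itself is a one-liner; the default weight matches the genome's `assessor_llm_weight` after its mutation from 0.40 to 0.60:

```python
def hybrid_score(heuristic, judge, llm_weight=0.60):
    """Final quality score: weighted blend of heuristic and LLM-as-Judge."""
    return (1 - llm_weight) * heuristic + llm_weight * judge
```

Shifting weight toward the judge was itself a mutation: the heuristic half is fast but gameable, so the genome learned to trust it less.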
Constitutional Merge Gate
The merge gate enforces three hard constraints:
- Quality floor: Score < 4.0/10 → auto-reject, return to backlog
- Safety check: No
eval(),exec(),pickle.loads(), or shell injection patterns in code output - Truncation check: If output was truncated, quality score is penalized by 1.5 points before the merge decision
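The safety and truncation checks can be sketched as follows. The pattern list is illustrative and deliberately incomplete (the article names `eval`, `exec`, and `pickle.loads`; the shell injection patterns shown are my assumption of what such a scan might include):

```python
import re

# Patterns the safety check rejects in code output (illustrative subset).
UNSAFE_PATTERNS = [
    r"\beval\s*\(",
    r"\bexec\s*\(",
    r"pickle\.loads\s*\(",
    r"os\.system\s*\(",                          # assumed shell-injection vector
    r"subprocess\.\w+\([^)]*shell\s*=\s*True",   # assumed shell-injection vector
]

def passes_safety_check(code: str) -> bool:
    """True if no forbidden pattern appears in the code output."""
    return not any(re.search(p, code) for p in UNSAFE_PATTERNS)

def gated_score(score: float, truncated: bool) -> float:
    """Apply the 1.5-point truncation penalty before the merge decision."""
    return score - 1.5 if truncated else score
```

Note how the penalty interacts with the 4.0 floor: a truncated output scoring 5.0 drops to 3.5 and is rejected, which is how truncation becomes the dominant rejection reason.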
Over 25 days, the merge gate has rejected 47 tasks (10.6% rejection rate). The most common rejection reason is truncation-induced low quality (68%), followed by structural incompleteness (22%), and safety violations (10%).
| Metric | Before Calibration | After Calibration |
|---|---|---|
| Avg quality score | 5.04 | 7.2 |
| Score inflation (heuristic) | 48% scored exactly 8/10 | Normal distribution 4–9 |
| Human-system score gap | 2.1 points | 0.8 points |
| Rejection rate | 6% | 10.6% |
The higher rejection rate after calibration is actually a positive signal — the system is now catching low-quality outputs that previously slipped through with inflated scores.
5. Lessons from 25 Days of Operation
Running an autonomous AI system for 25 consecutive days taught us things that no benchmark or demo could reveal.
Lesson 1: Model Routing Matters More Than Model Capability
Our most expensive model (Opus) doesn’t always produce the best results. Simple documentation tasks routed to Opus averaged 7.1/10 — the same as Sonnet at 40% of the cost. The lesson: match task complexity to model capability. Use Opus for architecture decisions and multi-file refactors. Use Sonnet for everything else. Use Haiku for assessment.
Lesson 2: Output Truncation Destroys Quality
This was the single biggest finding. Truncated outputs scored 2.7 points lower than complete outputs on average. The fix was counterintuitive: instead of increasing token limits or adding continuation mechanisms, we scoped tasks smaller upfront. A well-defined, narrow task with complete output beats a broad task with truncated output every time.
The Truncation Rule
If a task would require more than 8K output tokens, split it into subtasks. Never continue a truncated output — tasks with 1 continuation averaged quality 5.1/10, worse than the 7.4/10 average for zero-continuation tasks. Continuation doesn’t fix truncation; it compounds it.
Lesson 3: Budget Management Is Architecture
Night Shift operates on $28 total over 25 days — roughly $1.12/day. This isn’t a limitation; it’s a design constraint that forces efficiency. The budget governor allocates tokens across categories, models, and time periods. Without it, the system would exhaust its weekly quota in 2 days on expensive Opus calls.
Key budget parameters:
- Weekly token limit: 3.5M tokens across all providers
- Per-category cap: 30% of weekly budget (prevents any category from dominating)
- Gemini rate limit: 1,250 requests/day across 5 API keys with round-robin and 429 failover
- GPU utilization target: 10–15% (currently 11.3%)
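A minimal version of the governor's admission check, using the two numbers above (3.5M weekly tokens, 30% per-category cap); the function shape is an assumption, not `quota_governor.py`'s real interface:

```python
WEEKLY_TOKEN_LIMIT = 3_500_000
CATEGORY_CAP_PCT = 0.30

def can_dispatch(spent_total, spent_by_category, category, estimated_tokens):
    """Budget governor sketch: weekly quota plus per-category 30% cap."""
    if spent_total + estimated_tokens > WEEKLY_TOKEN_LIMIT:
        return False   # weekly quota would be exhausted
    cat_spent = spent_by_category.get(category, 0)
    if cat_spent + estimated_tokens > CATEGORY_CAP_PCT * WEEKLY_TOKEN_LIMIT:
        return False   # category would exceed its 1.05M-token slice
    return True
```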
Lesson 4: Mentoring Feedback Must Be Precise
We ran 88 mentoring interventions over 8 review sessions. The initial approach — injecting 2,000-character reviews into every task prompt — had 0% measurable effectiveness. Reviews were too long, too generic, and not actionable by the model.
The fix: per-category guidance files (max 500 chars), Reflexion-style per-task reflections stored in JSONL files (last 20 per category), and feedback written as model-actionable instructions. Instead of “quality needs improvement,” we write “always include a methodology section before results” or “test at least 3 edge cases per function.”
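Keeping only the last 20 reflections per category is a tail read over a JSONL file. The file layout below (one JSON object per line with a `reflection` field) is an assumption about how those files are structured:

```python
import json
from collections import deque

def load_recent_reflections(path, limit=20):
    """Read the last `limit` Reflexion-style reflections from a JSONL file.

    A bounded deque keeps memory constant: older lines fall off the front
    as the file is streamed, so only the most recent `limit` survive.
    """
    tail = deque(maxlen=limit)
    with open(path) as f:
        for line in f:
            if line.strip():
                tail.append(json.loads(line))
    return [entry["reflection"] for entry in tail]
```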
Lesson 5: Self-Improvement Requires Guardrails
Night Shift has a self_improvement category (15% allocation) where it works on its own code. This is powerful but dangerous: without guardrails, the system could modify its own quality assessor to give itself higher scores, or adjust budget parameters to consume more resources.
The guardrails:
- Self-improvement tasks cannot modify
quality_assessor.py,quota_governor.py, orconstitutional_checker.py - All self-improvement outputs require human review before merge (only category with this constraint)
- Budget allocation for self_improvement is capped at 15% and cannot be mutated by the evolution engine
The Numbers After 25 Days
| Metric | Day 1 | Day 25 | Change |
|---|---|---|---|
| Tasks/day | 15 | 62 | +313% |
| GPU utilization | 0.3% | 11.3% | +3,667% |
| Avg quality (calibrated) | 5.0 | 7.2 | +44% |
| Truncation rate | 29% | 8% | -72% |
| Dispatch cycle | 4 hours | 1 hour | 4x faster |
| Genome mutations | 0 | 14 | — |
The system is not just running tasks — it’s getting better at running tasks. The evolution engine has autonomously discovered and applied optimizations that would have taken weeks of manual tuning.
What’s Next
Night Shift is moving toward three capabilities:
- Multi-repo operation: Currently limited to the hub monorepo. Next milestone is dispatching tasks across project submodules with independent context and git branches.
- Slow loop (weekly): The adaptive tree currently operates on hourly and daily loops. A weekly loop will evaluate strategic allocation — should the system spend more time on revenue-generating tasks vs. infrastructure?
- Human-in-the-loop escalation: Instead of rejecting low-scoring tasks, route them to a human review queue with the LLM judge’s specific feedback. Convert rejections into mentoring opportunities.
See Night Shift in Action
Night Shift is part of the NEXUS ecosystem — autonomous AI operations for teams of 5–200. The GitLab Pages portal shows real-time pipeline status, task history, and quality metrics.
Related Articles
- Why Our AI Agent’s Quality Dropped 31% — root cause analysis of the quality decline and 15 fixes
- Night Shift: 300 Tasks in 14 Days — production data from the first two weeks
- Night Shift: How AI Writes Code While You Sleep — the original Night Shift deep dive
- From 0 to 3,000 Tests: Building Quality into AI-Generated Code — how Night Shift maintains code quality
- Enterprise AI Security Checklist for 2026 — the 8 security gates protecting autonomous AI