Claude Code Agent Orchestration Guide

Complete
Developer & Technical Author April 2026

The Problem

The AI-assisted development landscape has a productivity paradox. According to JetBrains’ 2024 developer survey, 84% of professional developers use AI coding tools. Yet METR’s controlled trial found that experienced open-source contributors took 19% longer on tasks when using AI assistants, despite believing they were 20% faster. The bottleneck has shifted: writing code is no longer the constraint. Reviewing, verifying, and coordinating AI-generated output is.

Multi-agent systems seem like the answer, but naive agent multiplication makes things worse. Research from Kim et al. (2025) at Google DeepMind documented a 17.2x error amplification rate when multiple agents operate without structured coordination. Their study established a ~45% capability ceiling for uncoordinated multi-agent systems, meaning that adding more agents to an unstructured workflow actively degrades quality beyond a threshold.

The review bottleneck compounds this. Google’s internal data shows developers spend 6.4 hours per week on code review. Microsoft reports an average of 15 hours from pull request creation to first reviewer comment. These numbers represent the actual constraint on development velocity, not the speed of writing code.

No structured methodology existed for determining when a single agent should handle a task versus when multi-agent orchestration is warranted, how to scale review effort proportionally to task complexity, or how to prevent the error amplification that destroys the value of AI assistance. This project was built to fill that gap.

What Was Built

The project is a complete implementation guide that transforms a 41-page technical runbook into an executable, phase-by-phase automation system for building multi-agent AI development workflows. The guide covers every layer of the Claude Code platform, from initial environment setup through autonomous operation with safety guardrails.

The system spans 11 phases (Phase 0 through Phase 10), each with dedicated automation scripts, validation checks, and documentation. The core architecture uses four layers that compose independently:

  • Configuration layer: CLAUDE.md hierarchy (global, project, local), path-scoped rules, auto-memory system, and settings.json for hooks and permissions
  • Enforcement and routing layer: 5 deterministic hook events plus a threshold escalation engine that scores task complexity and routes to the appropriate execution tier
  • Execution layer: Three tiers of agent coordination, from solo execution for simple tasks to full multi-agent orchestration for complex changes
  • Review agent layer: Security reviewer, quality reviewer, and fixer agents with a steelman evaluation process that resists false-positive dismissals

The project includes 28 shell automation scripts, 10 diagrams (Mermaid, Vega-Lite, and D2), and a 7-exercise interactive tutorial, and it is grounded in 31 academic and industry citations. Every phase can be reproduced from scratch on Windows 11 WSL 2 or macOS.

Architecture

The system follows a lean-first dynamic activation model: start with the minimum viable agent configuration and escalate only when task complexity warrants it. This is the opposite of most multi-agent frameworks, which default to maximum coordination regardless of task size.

Figure: four-layer architecture.

  • Layer 1 (Configuration): CLAUDE.md, Rules, Auto Memory, Settings
  • Layer 2 (Enforcement and Routing): Hooks (5 events), Threshold Router (0-10 score)
  • Layer 3 (Execution): Tier 1 (0-3) Solo Opus agent; Tier 2 (4-7) 3 parallel reviewers; Tier 3 (8+) Ultrathink + OCR + Codex review
  • Layer 4 (Review Agents): Security, Quality, Fixer, OCR, Codex

The Four Layers

Layer 1 (Configuration) establishes the instruction hierarchy. CLAUDE.md files operate at three additive levels: global (~150 lines of high-impact rules), project (codebase-specific conventions), and local (personal preferences, git-ignored). Path-scoped rules provide directory-level overrides without polluting the main configuration. Auto-memory persists learned context across sessions. Settings.json defines hooks, permissions, and MCP server connections. The combined instruction budget is roughly 200 lines, forcing prioritization of high-signal rules.

Layer 2 (Enforcement) provides deterministic guardrails that run at fixed lifecycle points, independent of the model's reasoning. Five hook events cover the full tool lifecycle:

Hook Event | Fires When | Example Use
PreToolUse | Before any tool executes | Block rm -rf, guard sensitive files
PostToolUse | After successful tool execution | Auto-format with Prettier, git attribution
PostToolUseFail | After a tool fails | Re-read modified files, error recovery
Notification | Model needs human input | Desktop toast alert
Stop | Task completes | Desktop completion notification

The critical property of hooks is determinism: they fire regardless of conversation context, model reasoning, or prompt content. A PreToolUse hook that blocks rm -rf cannot be overridden by a persuasive prompt. This makes hooks the only truly reliable enforcement layer in the system. CLAUDE.md instructions are influential but can be overridden by strong context. Hooks cannot.
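A minimal sketch of such a blocker is shown here. It assumes the hook command receives the tool call as a JSON object with the shell command under tool_input.command, and that exit status 2 tells Claude Code to block the call; both assumptions should be checked against the hooks documentation for the installed version. The names is_destructive and decide are hypothetical, and the check is deliberately narrow (it would miss /usr/bin/rm, for example), so it illustrates the mechanism rather than the guide's actual script.

```python
import json
import shlex
import sys


def is_destructive(command: str) -> bool:
    """Return True if the command runs `rm` with both recursive and force flags."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: fall back to a naive split
        tokens = command.split()
    for i, token in enumerate(tokens):
        if token != "rm":
            continue
        flags = set()
        # Collect flags after `rm`; this errs on the side of blocking.
        for arg in tokens[i + 1:]:
            if arg == "--recursive":
                flags.add("r")
            elif arg == "--force":
                flags.add("f")
            elif arg.startswith("-") and not arg.startswith("--"):
                flags.update(arg[1:].lower())  # -rf, -fr, -R, ...
        if {"r", "f"} <= flags:
            return True
    return False


def decide(payload: dict) -> int:
    """Return the hook's exit status for one tool-call payload."""
    command = payload.get("tool_input", {}).get("command", "")
    if is_destructive(command):
        print("Blocked: recursive force delete.", file=sys.stderr)
        return 2  # assumed "block this tool call" status
    return 0


# Demo with an in-memory payload. In a real hook script the payload would
# come from stdin: sys.exit(decide(json.load(sys.stdin)))
sample = json.loads('{"tool_input": {"command": "rm -rf build"}}')
print(decide(sample))
```

Because the decision depends only on the payload, the same input always produces the same exit status, which is the determinism property described above.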

Layer 3 (Execution) implements the threshold routing engine. Every prompt is scored on five dimensions (file count, distinct directories, keyword signals, security-sensitive paths, code diff size). The score maps to one of three tiers. Tier 1 (score 0-3) uses a solo Opus agent for direct execution. Tier 2 (score 4-7) adds three parallel review agents after implementation. Tier 3 (score 8+) activates the full orchestra: ultrathink planning, OCR multi-agent discourse review, and Codex cross-model adversarial review. Users can override with “just do it” (downgrade) or “full review” (upgrade to Tier 3).
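The tier boundaries and override phrases can be sketched as a small function. This is an illustration only: the function name route is hypothetical, and two behaviors are assumptions not stated in the guide, namely that "just do it" downgrades by exactly one tier and that "full review" wins if both phrases appear.

```python
def route(score: int, prompt: str = "") -> int:
    """Map a complexity score to a tier (1-3), honoring the override phrases."""
    if score <= 3:
        tier = 1
    elif score <= 7:
        tier = 2
    else:
        tier = 3

    text = prompt.lower()
    if "full review" in text:
        return 3                 # explicit upgrade to Tier 3
    if "just do it" in text:
        return max(1, tier - 1)  # assumption: downgrade by one tier
    return tier


print(route(2), route(5), route(9))             # 1 2 3
print(route(9, "Just do it: rename the flag"))  # 2
print(route(1, "full review of this change"))   # 3
```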

Layer 4 (Review Agents) provides specialized analysis. The security reviewer (Sonnet 4.6) scans for OWASP vulnerabilities, IAM issues, and credential exposure. The quality reviewer (Sonnet 4.6) checks KISS, DRY, and separation of concerns. The fixer agent (Opus 4.6) uses a steelman process: it assumes findings are valid and argues against dismissing them before issuing ACCEPT/REJECT/DEFER verdicts. This prevents the cascading false-positive problem where legitimate findings get rationalized away. All review agents run in isolated Git worktrees to prevent context pollution.

Threshold Escalation Engine

The threshold router is the central innovation of this system. Rather than applying the same orchestration to every task, it dynamically matches coordination effort to task complexity. A one-line typo fix gets solo execution in seconds. A cross-module refactoring gets the full review pipeline.

Figure: threshold routing flow. Prompt (user input) → Analyze (5 dimensions) → Score (0 to 10+) → Tier 1 (0-3): solo execution, Tier 2 (4-7): + 3 reviewers, or Tier 3 (8+): full orchestra. A user override (“just do it” / “full”) can change the selected tier.

The scoring algorithm analyzes each prompt across five dimensions:

Signal | Points | Rationale
Files likely to change | +1 per 3 files | More files means more coordination needed
Distinct directories | +1 per 2 dirs | Cross-module work needs broader review
Security-sensitive paths | +3 each | Highest cost of failure per change
Complexity keywords | +2 each | Signals like “refactor”, “architect”, “migrate”
Diff size | +1 per 100 lines | Larger changes need proportionally more scrutiny
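The point values in the table can be combined into a single score as in the following sketch. The TaskSignals type and the complexity_score function are hypothetical names, and the use of floor division for the "per N" rows (partial groups earn no points) is an assumption; the guide's actual skill definition may round differently.

```python
from dataclasses import dataclass


@dataclass
class TaskSignals:
    files: int = 0            # files likely to change
    directories: int = 0      # distinct directories touched
    sensitive_paths: int = 0  # security-sensitive paths touched
    keyword_hits: int = 0     # complexity keywords found in the prompt
    diff_lines: int = 0       # estimated diff size in lines


def complexity_score(s: TaskSignals) -> int:
    """Sum the five signals using the point values from the table."""
    return (
        s.files // 3              # +1 per 3 files (assumed floor)
        + s.directories // 2      # +1 per 2 directories (assumed floor)
        + 3 * s.sensitive_paths   # +3 each
        + 2 * s.keyword_hits      # +2 each
        + s.diff_lines // 100     # +1 per 100 lines (assumed floor)
    )


typo_fix = TaskSignals(files=1, directories=1, diff_lines=1)
refactor = TaskSignals(files=9, directories=4, keyword_hits=1, diff_lines=350)
print(complexity_score(typo_fix))  # 0
print(complexity_score(refactor))  # 10
```

Under these assumptions the one-line typo fix scores 0 (Tier 1 range) and the cross-module refactoring scores 10 (Tier 3 range), matching the two examples in the text.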

The token economics validate this approach. Three independent agents cost 3.75x a single agent’s input, while three forked agents sharing a cached base context cost only 1.55x, yielding a 2.4x savings. Claude Max’s 1-hour cache TTL means that agents spawned within the same session share ~90% of their context at cached-token prices. This makes Tier 2 (3 reviewers) economically viable for medium-complexity tasks where independent sessions would be wasteful.
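One way to reproduce the 3.75x and 1.55x figures is to assume a cache-write price of 1.25x the base input price and a cache-read price of 0.10x. These multipliers are an assumption chosen because they yield the document's numbers; the guide's own derivation may differ, actual prices depend on the cache TTL and plan, and the sketch ignores the roughly 10% of context that is not shared.

```python
CACHE_WRITE = 1.25  # assumed cost of writing the base context to cache (x base input)
CACHE_READ = 0.10   # assumed cost of reading the cached context (x base input)


def independent_cost(agents: int) -> float:
    """Each independent agent writes its own copy of the base context."""
    return agents * CACHE_WRITE


def forked_cost(agents: int) -> float:
    """The base context is written once; each forked agent reads it from cache."""
    return CACHE_WRITE + agents * CACHE_READ


ind = independent_cost(3)
fork = forked_cost(3)
print(f"{ind:.2f}x vs {fork:.2f}x -> {ind / fork:.1f}x savings")
# 3.75x vs 1.55x -> 2.4x savings
```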

How Each Tier Operates

Tier 1 (Score 0-3): Solo Opus agent handles the task directly. No subagents, no review pipeline. This is the fastest path and covers the majority of development work: single-file edits, documentation updates, test additions, small bug fixes. The threshold router announces “[T1] Routine edit, proceeding directly.” and then executes without overhead.

Tier 2 (Score 4-7): After the primary Opus agent implements the change, three review subagents are spawned in parallel. The security reviewer (Sonnet 4.6, ~20K tokens per run) scans for OWASP vulnerabilities, credential exposure, and permission escalation. The quality reviewer (Sonnet 4.6, ~15K tokens per run) checks KISS, DRY, separation of concerns, and naming conventions. The fixer agent (Opus 4.6, ~25K tokens per run) evaluates both reviewers’ findings using the steelman process, then applies accepted fixes. This tier covers multi-file feature additions, refactoring across 2-3 modules, and configuration changes with security implications.

Tier 3 (Score 8+): The full orchestra activates. Implementation begins with ultrathink planning (extended reasoning for architectural decisions). After implementation, the OCR multi-agent discourse review runs 2-3 rounds of deliberation between specialized reviewers. Then Codex cross-model adversarial review provides an independent perspective from GPT-5.4. All findings are presented to the developer before any commit. This tier covers architecture-level changes, security-critical modifications, large refactoring, and infrastructure changes.

The three tiers are not just about adding more reviewers. Each tier changes the type of analysis applied. Tier 1 trusts the primary agent’s judgment. Tier 2 adds parallel verification. Tier 3 adds deliberation (OCR discourse) and cross-model challenge (Codex). The escalation is qualitative, not just quantitative.

Implementation Phases

The guide is structured as 11 sequential-then-parallel phases. Phases 0 through 3 form the critical path and must be completed in order. Phases 4 through 9 can be completed in any order once the foundation is in place. Phase 10 validates the complete system.

Phase | Name | Time | Key Deliverable
0 | Pre-Flight Checks | 30 min | 7 environment validation scripts
1 | Core Configuration | 1-2 hrs | CLAUDE.md hierarchy + path-scoped rules
2 | Hooks System | 1 hr | settings.json with 5 hook event types
3 | Threshold Router | 2-3 hrs | Complexity scoring skill definition
4 | Turbo Skills + MCP | 1-2 hrs | 60+ workflow skills + MCP server config
5 | Open Code Review | 30 min | OCR multi-agent review plugin
6 | Codex Plugin | 30 min | Cross-model adversarial review (GPT-5.4)
7 | Custom Subagents | 1-2 hrs | Security, Quality, and Fixer agent defs
8 | Skills Library | 1-2 hrs | Domain-specific custom skill builders
9 | Auto Mode | 30 min | Autonomous operation with safety limits
10 | Integration Testing | 1-2 hrs | 10-test validation suite

Total estimated wall-clock time: 14-17 hours across multiple sessions (per-phase active time sums to 10-16 hours; the remainder is session overhead, context switching, and debugging between phases). The critical path through Phases 0-3 accounts for 4.5-6.5 hours of active time, or approximately 5-7 hours of wall-clock time, and provides the foundation for all subsequent phases.
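The active-time totals can be checked directly from the per-phase estimates in the phase table. The sketch below only sums the table's low and high values; the dictionary and function names are illustrative.

```python
# (low, high) active hours per phase, copied from the phase table
PHASE_HOURS = {
    0: (0.5, 0.5), 1: (1, 2), 2: (1, 1), 3: (2, 3),
    4: (1, 2), 5: (0.5, 0.5), 6: (0.5, 0.5), 7: (1, 2),
    8: (1, 2), 9: (0.5, 0.5), 10: (1, 2),
}


def active_hours(phases):
    """Return the (low, high) sum of active hours over the given phases."""
    phases = list(phases)
    low = sum(PHASE_HOURS[p][0] for p in phases)
    high = sum(PHASE_HOURS[p][1] for p in phases)
    return low, high


print(active_hours(range(11)))  # (10.0, 16.0) for all phases
print(active_hours(range(4)))   # (4.5, 6.5) for the critical path, Phases 0-3
```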

Each phase includes a dedicated README with step-by-step instructions, automation scripts that handle the setup, and validation checks that confirm successful completion before moving to the next phase.

Critical Path and Parallelization

Phases 0-3 form a strict sequential dependency chain:

  • Phase 0 (Pre-Flight) establishes the WSL 2 environment, Node.js, npm, git, and Claude Code installation. Without a working environment, nothing else can proceed.
  • Phase 1 (Core Configuration) creates the CLAUDE.md hierarchy and path-scoped rules. These define the instruction budget that all subsequent phases operate within.
  • Phase 2 (Hooks System) configures the 5 hook events in settings.json. Hooks are the enforcement mechanism for Phase 3’s routing decisions.
  • Phase 3 (Threshold Router) builds the complexity scoring engine. This is the coordination layer that determines how Phases 4-9 components are activated.

Once Phase 3 is complete, Phases 4 through 9 can be completed in any order. Each adds a capability (skills, OCR, Codex, subagents, custom skills, auto mode) that the threshold router orchestrates. Phase 10 validates the complete system with a 10-test integration suite.
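The dependency structure described here can be expressed as a graph and resolved into batches of phases that may run in parallel. This sketch uses the standard-library graphlib module (Python 3.9+); the helper name phase_batches is hypothetical, and treating Phase 10 as dependent on every one of Phases 4-9 is an interpretation of "validates the complete system".

```python
from graphlib import TopologicalSorter

# phase -> set of phases that must be complete first
deps = {1: {0}, 2: {1}, 3: {2}}
deps.update({phase: {3} for phase in range(4, 10)})  # Phases 4-9 need only Phase 3
deps[10] = set(range(4, 10))                         # Phase 10 needs Phases 4-9


def phase_batches(graph):
    """Group phases into successive batches; phases within a batch are independent."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(phase_batches(deps))
# [[0], [1], [2], [3], [4, 5, 6, 7, 8, 9], [10]]
```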

Automation Scripts

The repository contains 28 shell scripts organized into root-level utilities and phase-specific automation.

Five root-level scripts handle cross-cutting concerns:

  • platform-detect.sh: Detects WSL 2 vs macOS and sets environment-specific paths for Docker, Node.js, and shell configuration
  • new-project.sh: Auto-bootstraps Claude Code tooling for any new project directory (rules symlink, MCP config, OCR config, worktree include, settings)
  • clone-wrapper.sh: Combines git clone with automatic tooling setup so new repos are production-ready immediately
  • check-project.sh: Health check that validates all Claude Code configuration is present and correctly linked
  • validate-repo.sh: Pre-push validation that checks for secrets, broken symlinks, and missing configuration

The remaining 23 scripts are phase-specific, each creating the exact configuration files, skill definitions, or settings entries for its phase. All scripts are idempotent (safe to re-run), use set -euo pipefail for strict error handling, and support both WSL 2 (Ubuntu 24.04) and macOS (Apple Silicon). Scripts use VAR=$((VAR + 1)) arithmetic instead of ((VAR++)) because ((VAR++)) returns exit status 1 whenever the expression evaluates to zero (for a post-increment, whenever VAR was 0 beforehand), which aborts a script running under set -e.
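The arithmetic trap can be observed by running both forms under strict mode and comparing exit statuses. The sketch below drives bash through subprocess; it assumes a bash executable is on the PATH and returns None otherwise. The helper name exit_status is illustrative.

```python
import shutil
import subprocess


def exit_status(snippet: str):
    """Run a bash snippet under strict mode; return its exit status, or None without bash."""
    bash = shutil.which("bash")
    if bash is None:
        return None
    script = "set -euo pipefail\n" + snippet
    return subprocess.run([bash, "-c", script], capture_output=True).returncode


# ((COUNTER++)) evaluates to the old value 0, so its status is 1 and set -e aborts.
unsafe = exit_status("COUNTER=0; ((COUNTER++)); echo reached")
# A plain assignment always succeeds, so the script runs to completion.
safe = exit_status("COUNTER=0; COUNTER=$((COUNTER + 1)); echo reached")
print(unsafe, safe)  # with bash available: 1 0
```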

Every script exits with a clear success or failure message. No script modifies system-level configuration outside the project directory. The validate-repo.sh pre-push script is designed for integration with git pre-push hooks, ensuring secrets and broken symlinks never reach the remote repository.

Script naming follows a consistent convention: phase-NN-description.sh for phase scripts, and short hyphenated two-word names for root utilities (e.g., platform-detect.sh, validate-repo.sh). This makes discovery straightforward when browsing the repository.

Script distribution across phases:

Phase | Scripts | Purpose
Phase 0 | 7 | Environment validation (Claude, Node, git, WSL memory, notifications, directories)
Phase 1 | 3 | CLAUDE.md creation, rules setup, memory initialization
Phase 2 | 2 | Settings.json configuration, hook testing
Phase 3 | 2 | Threshold skill creation, routing validation
Phase 4 | 2 | Turbo skills installation, MCP server setup
Phase 5-6 | 2 | OCR and Codex plugin installation
Phase 7-8 | 2 | Custom subagent and skill creation
Phase 9 | 1 | Auto-mode enablement with safety limits
Phase 10 | 2 | Integration test suite and validation runner

Tutorial System

The PetShelter tutorial provides 7 hands-on exercises using a fictional pet shelter data management project. Each exercise builds on the previous one, progressing from basic configuration to full autonomous orchestration.

Exercise | Title | Concept
1 | Hello Pets | First Claude Code project + CLAUDE.md creation
2 | Guard the Kennel | Hooks for security enforcement
3 | Sort the Litter | Threshold router complexity classification
4 | Fetch Skills | Turbo skills + MCP server integration
5 | Breed the Agents | Custom subagent creation and review pipelines
6 | Auto Walkies | Autonomous mode with safety guardrails
7 | Vet Checkup | Integration testing and system validation

The tutorial includes a starter project at tutorial/assets/sample-project/ with a README, CLAUDE.md template, and pets.json sample data. Each exercise takes 15-25 minutes, with the full tutorial requiring approximately 2-3 hours.

The progression is designed for developers with at least 1 year of GitHub experience. No prior Claude Code experience is required. Exercises can be completed independently, though they are most valuable when done in sequence.

What the Tutorial Teaches

Each exercise targets a specific orchestration concept:

  • Exercises 1-2 establish the configuration and enforcement layers. By the end of Exercise 2, the developer has a working CLAUDE.md hierarchy and active hooks that block dangerous operations in real time.
  • Exercise 3 introduces the threshold router. The developer classifies 5 sample tasks by complexity and verifies the router assigns the correct tier to each one.
  • Exercises 4-5 add skills, MCP servers, and custom subagents. The developer creates a working review pipeline that can evaluate code changes across security, quality, and correctness dimensions.
  • Exercises 6-7 enable autonomous operation and validate the complete system. The developer runs a 10-test integration suite that verifies every component works together.

The PetShelter theme was chosen because pet data is inherently low-stakes, letting developers experiment with hooks, subagents, and auto mode without anxiety about breaking production systems.

All exercises include expected output samples so developers can verify their setup is working correctly at each step. Error recovery guidance is embedded in each exercise for the most common failure modes encountered during testing.

The companion Mermaid Hybrid Stack Guide includes a separate “Dogs and Cats” tutorial track with 7 additional exercises focused on diagram generation, completing the full 14-exercise learning path across both guides.

Research Foundation

This project is grounded in 31 academic and industry citations that informed every architectural decision. The research spans multi-agent coordination theory, developer productivity measurement, and AI tool adoption patterns.

Key findings that shaped the design:

  • Multi-agent scaling ceiling (Kim et al. 2025, Google DeepMind): Naive multi-agent systems hit a ~45% capability ceiling. Beyond this threshold, adding more agents degrades rather than improves output quality. The 17.2x error amplification rate means that unstructured agent coordination is actively counterproductive for complex tasks.

  • Productivity paradox (METR 2025, Faros 2024): METR’s randomized controlled trial showed experienced developers took 19% longer with AI assistance. Faros Engineering’s analysis of 1,000+ teams found that while individual AI tool adoption was high, measurable team-level productivity gains were absent. The gap between individual perception and measured impact is the core problem this guide addresses.

  • Review bottleneck (Google, Microsoft internal data): Code review consumes 6.4 hours per week per developer at Google. Microsoft’s data shows 15-hour median time to first PR comment. These numbers represent the actual constraint on development velocity once AI handles initial code generation.

  • Adaptive complexity routing (NeurIPS 2024): Research on adaptive collaboration systems demonstrates that matching coordination effort to task complexity outperforms both fixed-high and fixed-low approaches. This directly validates the threshold escalation model.

  • Token economics (Anthropic prompt caching documentation): Forked agents sharing a cached base context cost 2.4x less than independent agents. With Claude Max’s 1-hour cache TTL, multi-agent review becomes economically viable for medium-complexity tasks that would otherwise be too expensive to justify.

The full reference list with links is available in the REFERENCES.md file.

Debugging Log

Implementation surfaced 12 real errors, each documented with exact symptoms, root causes, and fixes. This is not a hypothetical troubleshooting guide. Every error was encountered during the actual build process and resolved through systematic investigation.

Selected highlights:

Error | Root Cause | Fix
ANSI corruption in model fields | Terminal escape codes embedded in config values | Strip ANSI sequences before writing to JSON
autoMode.environment type mismatch | settings.json expected object, received string | Wrap environment values in proper object structure
Threshold router silent on T1/T2 | Skill not loaded when complexity score was low | Add mandatory “always activate” trigger condition
GitHub push protection blocking | Secrets detected in example config snippets | Move examples to documentation, use placeholder values
Rules path detection failure | Symlink resolution failed across WSL/Windows boundary | Use readlink -f with fallback to manual path resolution
PostToolUse hook exit code 1 | ((COUNTER++)) returns exit 1 when counter is 0 | Replace with COUNTER=$((COUNTER + 1)) for set -e safety

Each error in the full log includes four fields: what happened (symptoms), why it happened (root cause analysis), how it was fixed (exact commands or config changes), and the lesson extracted (what to do differently next time). The complete debugging journal is in the repository’s phase directories, with a consolidated summary in the runbook.
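As an illustration of the first fix in the table (stripping ANSI sequences before writing to JSON), the following sketch removes CSI escape sequences from a value. The regular expression covers the common color and cursor sequences only, and the value "example-model" is a placeholder, not a name from the guide.

```python
import json
import re

# CSI sequences: ESC [ parameter bytes, intermediate bytes, one final byte.
# This covers colors and cursor movement, not every escape type (e.g. OSC).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(value: str) -> str:
    """Remove ANSI CSI escape sequences from a string."""
    return ANSI_CSI.sub("", value)


raw = "\x1b[1;32mexample-model\x1b[0m"  # placeholder value wrapped in color codes
clean = strip_ansi(raw)
print(json.dumps({"model": clean}))  # {"model": "example-model"}
```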

Error Classification

The 12 errors fall into three categories. Configuration errors (5 of 12) involve mismatched types, missing fields, or incorrect paths in settings files. These are the most common and the quickest to resolve once identified. Platform errors (4 of 12) stem from differences between WSL 2 and macOS environments: path resolution, Docker socket locations, and shell behavior. Integration errors (3 of 12) emerge when two correctly configured components interact in unexpected ways, such as the PostToolUse hook exit code propagation issue and the GitHub push protection conflict with example secrets.

Outcome

The completed system delivers structured AI-assisted development workflows across every complexity tier:

  • 65 automation scripts across this guide and its companion Mermaid pipeline, covering environment setup, configuration generation, validation, rendering, and testing
  • 20 diagrams across this guide and its companion Mermaid pipeline, visualizing the architecture, decision flows, token economics, phase dependencies, and error resolution patterns
  • 14 hands-on tutorial exercises split across two progressive learning tracks (PetShelter for orchestration, Dogs and Cats for diagram generation)
  • 31 academic and industry citations grounding every architectural decision in published research
  • 12 documented real errors with root cause analysis and verified fixes

The review pipeline reduces unstructured review time by an estimated 40-78% through OCR’s multi-agent discourse phase, where multiple specialized reviewers debate findings before presenting consolidated results.

Token Budget Management

The system treats the CLAUDE.md instruction budget as a finite resource. Anthropic’s system prompt consumes a significant portion of the context window before user instructions are processed. The guide quantifies approximately 150 usable instruction slots and provides a prioritization framework: security rules first, then workflow automation, then style conventions. Lower-priority instructions are moved to path-scoped rules files that load only when relevant directories are edited, keeping the main instruction set lean.
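The prioritization framework can be sketched as a stable sort followed by a cut at the budget. This is an illustration under stated assumptions: one rule is treated as one instruction slot, the Rule type and split_rules function are hypothetical, and the overflow list stands in for rules that would move to path-scoped rules files.

```python
from dataclasses import dataclass

# Priority order from the text: security, then workflow, then style.
PRIORITY = {"security": 0, "workflow": 1, "style": 2}
BUDGET = 150  # approximate usable instruction slots quoted in the text


@dataclass
class Rule:
    text: str
    category: str  # "security", "workflow", or "style"


def split_rules(rules, budget=BUDGET):
    """Return (main, overflow): rules kept in CLAUDE.md and rules to move out."""
    ranked = sorted(rules, key=lambda r: PRIORITY[r.category])  # stable sort
    return ranked[:budget], ranked[budget:]


rules = [
    Rule("Prefer single quotes", "style"),
    Rule("Never commit .env files", "security"),
    Rule("Run tests before committing", "workflow"),
]
main, overflow = split_rules(rules, budget=2)
print([r.category for r in main])      # ['security', 'workflow']
print([r.category for r in overflow])  # ['style']
```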

Review Agent Coordination

The Tier 2 and Tier 3 review pipelines use a steelman process inspired by adversarial collaboration in academic peer review. The security reviewer, quality reviewer, and fixer agent each operate independently. The fixer does not simply accept all findings. Instead, it evaluates each finding against the actual code, rejects false positives with evidence, and only applies changes that survive scrutiny. This prevents the review pipeline from introducing unnecessary churn while ensuring genuine issues are addressed.
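The core of the steelman rule is that a finding is treated as valid unless there is concrete evidence against it. A minimal sketch of that decision follows; the Finding fields, the steelman_verdict function, and the condition used for DEFER (a finding flagged as needing a human decision) are assumptions made for illustration, not the fixer agent's actual prompt or logic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"
    DEFER = "DEFER"


@dataclass
class Finding:
    reviewer: str                           # e.g. "security" or "quality"
    summary: str
    counter_evidence: Optional[str] = None  # concrete proof of a false positive
    needs_human: bool = False               # assumed trigger for DEFER


def steelman_verdict(finding: Finding) -> Verdict:
    """Default to ACCEPT; reject only when evidence against the finding exists."""
    if finding.needs_human:
        return Verdict.DEFER
    if finding.counter_evidence:
        return Verdict.REJECT
    return Verdict.ACCEPT


print(steelman_verdict(Finding("security", "Unsanitized query parameter")).value)
# ACCEPT
```

The asymmetry is the point of the design: dismissing a finding requires evidence, while accepting it does not.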

Reproducibility

The system is fully reproducible: any developer with a Claude Max subscription can build it from scratch following the phase-by-phase instructions. The guide was tested on both Windows 11 WSL 2 (Ubuntu 24.04) and macOS (Apple Silicon). Every command, configuration snippet, and script is copy-paste ready. The 28 automation scripts eliminate manual configuration entirely for developers who prefer automated setup over step-by-step walkthroughs.

Cross-Platform Compatibility

Component | Windows 11 WSL 2 | macOS Apple Silicon
Node.js v20+ | Native in WSL | Homebrew
Claude Code CLI | npm global install | npm global install
Docker Desktop | WSL 2 backend | Native
Git + GitHub CLI | apt-get | Homebrew
Shell scripts | Bash 5.x | Bash 5.x (Homebrew)

The platform-detect.sh script auto-detects the operating environment and configures paths, Docker socket locations, and shell settings accordingly. No manual platform-specific configuration is required after initial setup.

Lessons Learned

  • Complexity-based routing beats naive multiplication. The threshold escalation engine is the single highest-value component. Matching orchestration effort to task complexity avoids both the overhead of unnecessary review (Tier 1 tasks) and the quality gaps of under-reviewed work (Tier 3 tasks). Kim et al.’s 17.2x error amplification data validates this approach quantitatively.

  • Hooks provide the only deterministic enforcement layer. CLAUDE.md instructions are influential but not guaranteed. Hooks fire before permission checks and cannot be bypassed by model reasoning. The PreToolUse bash-blocker hook has prevented more accidental damage than any other single component in the system.

  • Configuration is an instruction budget, not a wish list. CLAUDE.md has roughly 150 usable instruction slots after the system prompt consumes its share. Treating this as a scarce resource and prioritizing high-impact rules produces better results than exhaustive documentation that dilutes the model’s attention.

  • Error documentation is first-class output. The 12-error debugging log is as valuable as the working system. Future developers will encounter similar edge cases, and the root cause analysis saves hours of investigation. Documentation was written at the point of failure, not reconstructed afterward.

  • Academic grounding validates engineering choices. The threshold escalation model, steelman review process, and token caching strategy all have published research supporting them. Grounding engineering decisions in peer-reviewed work transforms “I think this works” into “this approach is consistent with Kim et al.’s findings on multi-agent coordination.”

  • The distinction between skills and plugins matters. Skills are CLI tools that provide composable development workflows (formatting, review, testing). Plugins are Claude extensions that add entire capabilities (OCR multi-agent review, Codex cross-model challenge). They use different models, consume different token budgets, and serve different purposes. Conflating them leads to misconfigured systems that either waste tokens on unnecessary plugin invocations or miss critical review by relying on skills alone.

  • Cross-platform testing is non-negotiable. WSL 2 cross-filesystem I/O runs at approximately 6% of native performance. Operations like npm install take 10-20x longer when run on Windows-mounted paths versus native Linux paths. Every script was tested on both WSL 2 and macOS to ensure correct behavior across environments. Three of the 12 documented errors were platform-specific issues that only surfaced during cross-platform validation.

Companion Project

This guide works in tandem with the Mermaid Hybrid Stack Integration Guide, which provides the diagram-as-code pipeline used to create all 10 visualizations in this project.

The Mermaid guide covers the full rendering infrastructure: Docker-based Mermaid CLI, Vega-Lite data charts, D2 architecture diagrams, AI-assisted generation via Claude Code skills, and PostToolUse hooks for automatic rendering. The system architecture diagram, threshold flowchart, token economics comparison, and phase dependency graph in this project were all produced using that pipeline.

Together, the two guides form an integrated platform for AI-assisted development. The orchestration guide structures how AI agents coordinate and review work. The Mermaid guide structures how that work is visualized and documented. Both share the same Claude Code foundation, automation patterns, and documentation methodology.

GitHub: claude-agent-orchestration-guide

Runbook: Claude_Code_Orchestration_Runbook.pdf – The 41-page source document covering the full 4-layer architecture, all 11 implementation phases, research compendium with 31 citations, visual diagrams, automation scripts, and complete debugging log.