Claude Code Agent Orchestration Guide
The Problem
The AI-assisted development landscape has a productivity paradox. According to JetBrains’ 2024 developer survey, 84% of professional developers use AI coding tools. Yet METR’s controlled trial found that experienced open-source contributors took 19% longer on tasks when using AI assistants, despite believing they were 20% faster. The bottleneck has shifted: writing code is no longer the constraint. Reviewing, verifying, and coordinating AI-generated output is.
Multi-agent systems seem like the answer, but naive agent multiplication makes things worse. Research from Kim et al. (2025) at Google DeepMind documented a 17.2x error amplification rate when multiple agents operate without structured coordination. Their study established a ~45% capability ceiling for uncoordinated multi-agent systems, meaning that adding more agents to an unstructured workflow actively degrades quality beyond a threshold.
The review bottleneck compounds this. Google’s internal data shows developers spend 6.4 hours per week on code review. Microsoft reports an average of 15 hours from pull request creation to first reviewer comment. These numbers represent the actual constraint on development velocity, not the speed of writing code.
No structured methodology existed for determining when a single agent should handle a task versus when multi-agent orchestration is warranted, how to scale review effort proportionally to task complexity, or how to prevent the error amplification that destroys the value of AI assistance. This project was built to fill that gap.
What Was Built
The result is a complete implementation guide that transforms a 41-page technical runbook into an executable, phase-by-phase automation system for building multi-agent AI development workflows. The guide covers every layer of the Claude Code platform, from initial environment setup through autonomous operation with safety guardrails.
The system spans 11 phases (Phase 0 through Phase 10), each with dedicated automation scripts, validation checks, and documentation. The core architecture uses four layers that compose independently:
- Configuration layer: CLAUDE.md hierarchy (global, project, local), path-scoped rules, auto-memory system, and settings.json for hooks and permissions
- Enforcement and routing layer: 5 deterministic hook events plus a threshold escalation engine that scores task complexity and routes to the appropriate execution tier
- Execution layer: Three tiers of agent coordination, from solo execution for simple tasks to full multi-agent orchestration for complex changes
- Review agent layer: Security reviewer, quality reviewer, and fixer agents with a steelman evaluation process that resists false-positive dismissals
The project includes 28 shell automation scripts, 10 diagrams (Mermaid, Vega-Lite, and D2), and a 7-exercise interactive tutorial, and it is grounded in 31 academic and industry citations. Every phase can be reproduced from scratch on Windows 11 WSL 2 or macOS.
Architecture
The system follows a lean-first dynamic activation model: start with the minimum viable agent configuration and escalate only when task complexity warrants it. This is the opposite of most multi-agent frameworks, which default to maximum coordination regardless of task size.
The Four Layers
Layer 1 (Configuration) establishes the instruction hierarchy. CLAUDE.md files operate at three additive levels: global (~150 lines of high-impact rules), project (codebase-specific conventions), and local (personal preferences, git-ignored). Path-scoped rules provide directory-level overrides without polluting the main configuration. Auto-memory persists learned context across sessions. Settings.json defines hooks, permissions, and MCP server connections. The combined instruction budget is roughly 200 lines, forcing prioritization of high-signal rules.
Layer 2 (Enforcement) provides deterministic guardrails that run outside the model’s reasoning. Five hook events cover tool execution and session signals:
| Hook Event | Fires When | Example Use |
|---|---|---|
| PreToolUse | Before any tool executes | Block rm -rf, guard sensitive files |
| PostToolUse | After successful tool execution | Auto-format with Prettier, git attribution |
| PostToolUseFail | After a tool fails | Re-read modified files, error recovery |
| Notification | Model needs human input | Desktop toast alert |
| Stop | Task completes | Desktop completion notification |
The critical property of hooks is determinism: they fire regardless of conversation context, model reasoning, or prompt content. A PreToolUse hook that blocks rm -rf cannot be overridden by a persuasive prompt. This makes hooks the only truly reliable enforcement layer in the system. CLAUDE.md instructions are influential but can be overridden by strong context. Hooks cannot.
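The blocking behaviour described above can be sketched as a small guard function. This is a minimal illustration, not the project's actual hook: is_destructive_rm and guard_command are hypothetical names, the pattern only catches combined flags such as -rf or -fr, and the use of exit status 2 to signal a block is an assumption that should be checked against the current Claude Code hooks documentation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds when a command line contains an rm call
# with recursive and force flags combined in one option (-rf, -fr, -Rf).
is_destructive_rm() {
  local cmd="$1"
  local pattern='(^|[;&|[:space:]])rm[[:space:]]+-[a-zA-Z]*([rR][a-zA-Z]*f|f[a-zA-Z]*[rR])'
  [[ "$cmd" =~ $pattern ]]
}

# Hypothetical decision wrapper: prints ALLOW, or reports a block on stderr
# and returns 2 (assumed here to be the status that blocks the tool call).
guard_command() {
  if is_destructive_rm "$1"; then
    echo "BLOCK: refusing destructive rm" >&2
    return 2
  fi
  echo "ALLOW"
}

guard_command "ls -la"
guard_command "rm -rf build/" || echo "hook would exit with status $?"
```

In a real hook script the command string would come from the tool-call payload rather than a literal argument. Because the decision is a plain pattern match in a separate process, nothing in the prompt can change its outcome, which is the determinism property the paragraph above relies on.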
Layer 3 (Execution) implements the threshold routing engine. Every prompt is scored on five dimensions (file count, distinct directories, keyword signals, security-sensitive paths, code diff size). The score maps to one of three tiers. Tier 1 (score 0-3) uses a solo Opus agent for direct execution. Tier 2 (score 4-7) adds three parallel review agents after implementation. Tier 3 (score 8+) activates the full orchestra: ultrathink planning, OCR multi-agent discourse review, and Codex cross-model adversarial review. Users can override with “just do it” (downgrade) or “full review” (upgrade to Tier 3).
Layer 4 (Review Agents) provides specialized analysis. The security reviewer (Sonnet 4.6) scans for OWASP vulnerabilities, IAM issues, and credential exposure. The quality reviewer (Sonnet 4.6) checks KISS, DRY, and separation of concerns. The fixer agent (Opus 4.6) uses a steelman process: it assumes findings are valid and argues against dismissing them before issuing ACCEPT/REJECT/DEFER verdicts. This prevents the cascading false-positive problem where legitimate findings get rationalized away. All review agents run in isolated Git worktrees to prevent context pollution.
Threshold Escalation Engine
The threshold router is the central innovation of this system. Rather than applying the same orchestration to every task, it dynamically matches coordination effort to task complexity. A one-line typo fix gets solo execution in seconds. A cross-module refactoring gets the full review pipeline.
The scoring algorithm analyzes each prompt across five dimensions:
| Signal | Points | Rationale |
|---|---|---|
| Files likely to change | +1 per 3 files | More files means more coordination needed |
| Distinct directories | +1 per 2 dirs | Cross-module work needs broader review |
| Security-sensitive paths | +3 each | Highest cost of failure per change |
| Complexity keywords | +2 each | Signals like “refactor”, “architect”, “migrate” |
| Diff size | +1 per 100 lines | Larger changes need proportionally more scrutiny |
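The scoring table can be read as a simple sum. The sketch below is one possible rendering, not the project's skill definition: complexity_score and tier_for_score are hypothetical names, the five inputs are assumed to be estimates the router has already derived from the prompt, and integer division is assumed for the "+1 per N" rows, so partial groups score nothing.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scorer. Arguments: files likely to change, distinct
# directories, security-sensitive paths, complexity keywords, diff lines.
complexity_score() {
  local files="$1" dirs="$2" sensitive="$3" keywords="$4" diff_lines="$5"
  echo $(( files / 3 + dirs / 2 + sensitive * 3 + keywords * 2 + diff_lines / 100 ))
}

# Map a score to a tier using the documented ranges: 0-3, 4-7, 8+.
tier_for_score() {
  local score="$1"
  if (( score <= 3 )); then
    echo "T1"
  elif (( score <= 7 )); then
    echo "T2"
  else
    echo "T3"
  fi
}

# Three made-up tasks, from a typo fix to a cross-module migration.
for task in "1 1 0 0 10" "6 2 0 1 150" "12 4 1 2 400"; do
  # Word splitting of $task into five arguments is intentional here.
  score=$(complexity_score $task)
  echo "inputs=[$task] score=$score tier=$(tier_for_score "$score")"
done
```

With these sample inputs the scores are 0, 6, and 17, which land in Tier 1, Tier 2, and Tier 3 respectively.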
The token economics validate this approach. Three independent agents cost 3.75x a single agent’s input, while three forked agents sharing a cached base context cost only 1.55x, yielding a 2.4x savings. Claude Max’s 1-hour cache TTL means that agents spawned within the same session share ~90% of their context at cached-token prices. This makes Tier 2 (3 reviewers) economically viable for medium-complexity tasks where independent sessions would be wasteful.
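One way to reproduce the 3.75x and 1.55x figures is shown below. The multipliers are assumptions: a cache write is taken to cost 1.25x an uncached input pass and a cache read 0.10x. Those values match the quoted numbers, but the source does not state that this is how they were derived, and current pricing should be checked before relying on them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed cost multipliers relative to one uncached input pass, stored in
# hundredths so plain integer arithmetic is enough (125 means 1.25x).
CACHE_WRITE=125
CACHE_READ=10
AGENTS=3

# Format a value in hundredths as a decimal, e.g. 155 -> 1.55
fmt_hundredths() { printf '%d.%02d' $(( $1 / 100 )) $(( $1 % 100 )); }

# Independent agents: each one writes its own copy of the context to cache.
independent=$(( AGENTS * CACHE_WRITE ))
# Forked agents: the base context is written once, then each agent reads it.
forked=$(( CACHE_WRITE + AGENTS * CACHE_READ ))
# Ratio in tenths, truncated.
savings_tenths=$(( independent * 10 / forked ))

echo "independent: $(fmt_hundredths "$independent")x"
echo "forked:      $(fmt_hundredths "$forked")x"
echo "savings:     $(( savings_tenths / 10 )).$(( savings_tenths % 10 ))x"
```

Under these assumptions the script prints 3.75x, 1.55x, and 2.4x, matching the figures quoted above.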
How Each Tier Operates
Tier 1 (Score 0-3): Solo Opus agent handles the task directly. No subagents, no review pipeline. This is the fastest path and covers the majority of development work: single-file edits, documentation updates, test additions, small bug fixes. The threshold router announces “[T1] Routine edit, proceeding directly.” and executes without overhead.
Tier 2 (Score 4-7): After the primary Opus agent implements the change, three review subagents are spawned in parallel. The security reviewer (Sonnet 4.6, ~20K tokens per run) scans for OWASP vulnerabilities, credential exposure, and permission escalation. The quality reviewer (Sonnet 4.6, ~15K tokens per run) checks KISS, DRY, separation of concerns, and naming conventions. The fixer agent (Opus 4.6, ~25K tokens per run) evaluates both reviewers’ findings using the steelman process, then applies accepted fixes. This tier covers multi-file feature additions, refactoring across 2-3 modules, and configuration changes with security implications.
Tier 3 (Score 8+): The full orchestra activates. Implementation begins with ultrathink planning (extended reasoning for architectural decisions). After implementation, the OCR multi-agent discourse review runs 2-3 rounds of deliberation between specialized reviewers. Then Codex cross-model adversarial review provides an independent perspective from GPT-5.4. All findings are presented to the developer before any commit. This tier covers architecture-level changes, security-critical modifications, large refactoring, and infrastructure changes.
The three tiers are not just about adding more reviewers. Each tier changes the type of analysis applied. Tier 1 trusts the primary agent’s judgment. Tier 2 adds parallel verification. Tier 3 adds deliberation (OCR discourse) and cross-model challenge (Codex). The escalation is qualitative, not just quantitative.
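The user overrides mentioned in the architecture section (“just do it” and “full review”) can sit on top of the scored tier. The sketch below is illustrative only. apply_override is a hypothetical name, and the source does not say how far “just do it” downgrades, so a single-tier step down is assumed here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical override handler. $1 is the tier (1-3) chosen by scoring,
# $2 is the raw prompt text. Matching is case-insensitive.
apply_override() {
  local tier="$1"
  local prompt
  prompt=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case "$prompt" in
    *"full review"*)
      tier=3 ;;                              # documented: upgrade to Tier 3
    *"just do it"*)
      if (( tier > 1 )); then                # assumption: step down one tier
        tier=$(( tier - 1 ))
      fi ;;
  esac
  echo "$tier"
}

echo "scored T2 + 'just do it'  -> T$(apply_override 2 'Just do it: rename the flag')"
echo "scored T1 + 'full review' -> T$(apply_override 1 'Please run a full review')"
echo "scored T2, no override    -> T$(apply_override 2 'Add a field to the form')"
```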
Implementation Phases
The guide is structured as 11 sequential-then-parallel phases. Phases 0 through 3 form the critical path and must be completed in order. Phases 4 through 9 can be completed in any order once the foundation is in place. Phase 10 validates the complete system.
| Phase | Name | Time | Key Deliverable |
|---|---|---|---|
| 0 | Pre-Flight Checks | 30 min | 7 environment validation scripts |
| 1 | Core Configuration | 1-2 hrs | CLAUDE.md hierarchy + path-scoped rules |
| 2 | Hooks System | 1 hr | settings.json with 5 hook event types |
| 3 | Threshold Router | 2-3 hrs | Complexity scoring skill definition |
| 4 | Turbo Skills + MCP | 1-2 hrs | 60+ workflow skills + MCP server config |
| 5 | Open Code Review | 30 min | OCR multi-agent review plugin |
| 6 | Codex Plugin | 30 min | Cross-model adversarial review (GPT-5.4) |
| 7 | Custom Subagents | 1-2 hrs | Security, Quality, and Fixer agent defs |
| 8 | Skills Library | 1-2 hrs | Domain-specific custom skill builders |
| 9 | Auto Mode | 30 min | Autonomous operation with safety limits |
| 10 | Integration Testing | 1-2 hrs | 10-test validation suite |
Total estimated wall-clock time: 14-17 hours across multiple sessions (per-phase active time sums to 10-16 hours; the remainder is session overhead, context switching, and debugging between phases). The critical path through Phases 0-3 takes approximately 5-7 hours and provides the foundation for all subsequent phases.
Each phase includes a dedicated README with step-by-step instructions, automation scripts that handle the setup, and validation checks that confirm successful completion before moving to the next phase.
Critical Path and Parallelization
Phases 0-3 form a strict sequential dependency chain:
- Phase 0 (Pre-Flight) establishes the WSL 2 environment, Node.js, npm, git, and Claude Code installation. Without a working environment, nothing else can proceed.
- Phase 1 (Core Configuration) creates the CLAUDE.md hierarchy and path-scoped rules. These define the instruction budget that all subsequent phases operate within.
- Phase 2 (Hooks System) configures the 5 hook events in settings.json. Hooks are the enforcement mechanism for Phase 3’s routing decisions.
- Phase 3 (Threshold Router) builds the complexity scoring engine. This is the coordination layer that determines how Phases 4-9 components are activated.
Once Phase 3 is complete, Phases 4 through 9 can be completed in any order. Each adds a capability (skills, OCR, Codex, subagents, custom skills, auto mode) that the threshold router orchestrates. Phase 10 validates the complete system with a 10-test integration suite.
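The dependency rules above can be expressed as a small gate. This is a sketch under stated assumptions: prerequisites_for and can_start are hypothetical helpers, and Phase 10 is assumed to require every earlier phase because it validates the complete system.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the phases that must be complete before phase $1 can start.
prerequisites_for() {
  local phase="$1" last i=0 out=""
  if (( phase <= 3 )); then
    last=$(( phase - 1 ))     # strict chain: every earlier phase
  elif (( phase <= 9 )); then
    last=3                    # only the Phase 0-3 foundation
  else
    last=9                    # assumption: Phase 10 needs everything
  fi
  while [ "$i" -le "$last" ]; do
    out="${out:+$out }$i"
    i=$(( i + 1 ))
  done
  echo "$out"
}

# Succeed when every prerequisite of phase $1 appears in the
# space-separated list of completed phases given as $2.
can_start() {
  local phase="$1" done_list=" $2 " p
  for p in $(prerequisites_for "$phase"); do
    case "$done_list" in
      *" $p "*) ;;
      *) return 1 ;;
    esac
  done
}

echo "phase 2 needs: $(prerequisites_for 2)"
echo "phase 6 needs: $(prerequisites_for 6)"
if can_start 6 "0 1 2 3"; then echo "phase 6 may start"; fi
if ! can_start 10 "0 1 2 3 4"; then echo "phase 10 must wait"; fi
```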
Automation Scripts
The repository contains 28 shell scripts organized into root-level utilities and phase-specific automation.
Five root-level scripts handle cross-cutting concerns:
- platform-detect.sh: Detects WSL 2 vs macOS and sets environment-specific paths for Docker, Node.js, and shell configuration
- new-project.sh: Auto-bootstraps Claude Code tooling for any new project directory (rules symlink, MCP config, OCR config, worktree include, settings)
- clone-wrapper.sh: Combines git clone with automatic tooling setup so new repos are production-ready immediately
- check-project.sh: Health check that validates all Claude Code configuration is present and correctly linked
- validate-repo.sh: Pre-push validation that checks for secrets, broken symlinks, and missing configuration
The remaining 23 scripts are phase-specific, each creating the exact configuration files, skill definitions, or settings entries for its phase. All scripts are idempotent (safe to re-run), use set -euo pipefail for strict error handling, and support both WSL 2 (Ubuntu 24.04) and macOS (Apple Silicon). Scripts use $((VAR + 1)) arithmetic instead of ((VAR++)) to avoid a strict-mode trap: ((VAR++)) returns exit status 1 when the expression evaluates to zero, which aborts a script running under set -e.
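The arithmetic trap is easy to reproduce. The sketch below runs each form in a fresh strict-mode shell and reports the exit status. run_strict is a hypothetical helper introduced only for this demonstration, and it assumes bash is on the PATH.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run a snippet in a separate bash process with strict mode enabled and
# print the exit status of that process.
run_strict() {
  local status=0
  bash -c "set -euo pipefail; $1" >/dev/null 2>&1 || status=$?
  echo "$status"
}

# Single quotes are deliberate: the snippets are expanded by the child shell.
# Post-increment evaluates to the old value (0), so (( )) reports failure.
unsafe='COUNTER=0; ((COUNTER++)); echo reached'
# An assignment with arithmetic expansion always succeeds.
safe='COUNTER=0; COUNTER=$((COUNTER + 1)); echo reached'

echo "((COUNTER++)) starting from 0 -> exit status $(run_strict "$unsafe")"
echo "COUNTER=\$((COUNTER + 1))      -> exit status $(run_strict "$safe")"
```

The first form stops the child shell with status 1 before it reaches the echo, while the second completes with status 0.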
Every script exits with a clear success or failure message. No script modifies system-level configuration outside the project directory. The validate-repo.sh pre-push script is designed for integration with git pre-push hooks, ensuring secrets and broken symlinks never reach the remote repository.
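A pre-push check of this kind might look roughly like the following. It is a minimal sketch, not the contents of validate-repo.sh: the function names are hypothetical, and the secret scan looks for a single illustrative pattern (strings shaped like an AWS access key ID), whereas a real scanner covers far more.

```shell
#!/usr/bin/env bash
set -euo pipefail

# List symlinks under $1 whose targets do not exist.
find_broken_symlinks() {
  find "$1" -type l ! -exec test -e {} \; -print
}

# List files under $1 containing a string shaped like an AWS access key ID.
# grep exits non-zero when nothing matches, so that case is tolerated.
find_key_like_strings() {
  grep -rlE 'AKIA[0-9A-Z]{16}' "$1" 2>/dev/null || true
}

# Print OK, or report the offending paths on stderr and return 1.
validate_tree() {
  local dir="$1" broken leaks
  broken=$(find_broken_symlinks "$dir")
  leaks=$(find_key_like_strings "$dir")
  if [ -n "$broken" ] || [ -n "$leaks" ]; then
    printf 'FAIL\n%s\n%s\n' "$broken" "$leaks" >&2
    return 1
  fi
  echo "OK"
}

# Demonstration on a throwaway directory with one dangling link.
demo=$(mktemp -d)
ln -s "$demo/does-not-exist" "$demo/stale-link"
validate_tree "$demo" || echo "push would be rejected"
rm -r "$demo"
```

Called from a git pre-push hook, a non-zero return from a check like validate_tree would abort the push.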
Script naming follows a consistent convention: phase-NN-description.sh for phase scripts, and short hyphenated names for root utilities (e.g., platform-detect.sh, validate-repo.sh). This makes discovery straightforward when browsing the repository.
Script distribution across phases:
| Phase | Scripts | Purpose |
|---|---|---|
| Phase 0 | 7 | Environment validation (Claude, Node, git, WSL memory, notifications, directories) |
| Phase 1 | 3 | CLAUDE.md creation, rules setup, memory initialization |
| Phase 2 | 2 | Settings.json configuration, hook testing |
| Phase 3 | 2 | Threshold skill creation, routing validation |
| Phase 4 | 2 | Turbo skills installation, MCP server setup |
| Phase 5-6 | 2 | OCR and Codex plugin installation |
| Phase 7-8 | 2 | Custom subagent and skill creation |
| Phase 9 | 1 | Auto-mode enablement with safety limits |
| Phase 10 | 2 | Integration test suite and validation runner |
Tutorial System
The PetShelter tutorial provides 7 hands-on exercises using a fictional pet shelter data management project. Each exercise builds on the previous one, progressing from basic configuration to full autonomous orchestration.
| Exercise | Title | Concept |
|---|---|---|
| 1 | Hello Pets | First Claude Code project + CLAUDE.md creation |
| 2 | Guard the Kennel | Hooks for security enforcement |
| 3 | Sort the Litter | Threshold router complexity classification |
| 4 | Fetch Skills | Turbo skills + MCP server integration |
| 5 | Breed the Agents | Custom subagent creation and review pipelines |
| 6 | Auto Walkies | Autonomous mode with safety guardrails |
| 7 | Vet Checkup | Integration testing and system validation |
The tutorial includes a starter project at tutorial/assets/sample-project/ with a README, CLAUDE.md template, and pets.json sample data. Each exercise takes 15-25 minutes, with the full tutorial requiring approximately 2-3 hours.
The progression is designed for developers with at least 1 year of GitHub experience. No prior Claude Code experience is required. Exercises can be completed independently, though they are most valuable when done in sequence.
What the Tutorial Teaches
Each exercise targets a specific orchestration concept:
- Exercises 1-2 establish the configuration and enforcement layers. By the end of Exercise 2, the developer has a working CLAUDE.md hierarchy and active hooks that block dangerous operations in real time.
- Exercise 3 introduces the threshold router. The developer classifies 5 sample tasks by complexity and verifies the router assigns the correct tier to each one.
- Exercises 4-5 add skills, MCP servers, and custom subagents. The developer creates a working review pipeline that can evaluate code changes across security, quality, and correctness dimensions.
- Exercises 6-7 enable autonomous operation and validate the complete system. The developer runs a 10-test integration suite that verifies every component works together.
The PetShelter theme was chosen because pet data is inherently low-stakes, letting developers experiment with hooks, subagents, and auto mode without anxiety about breaking production systems.
All exercises include expected output samples so developers can verify their setup is working correctly at each step. Error recovery guidance is embedded in each exercise for the most common failure modes encountered during testing.
The companion Mermaid Hybrid Stack Guide includes a separate “Dogs and Cats” tutorial track with 7 additional exercises focused on diagram generation, completing the full 14-exercise learning path across both guides.
Research Foundation
This project is grounded in 31 academic and industry citations that informed every architectural decision. The research spans multi-agent coordination theory, developer productivity measurement, and AI tool adoption patterns.
Key findings that shaped the design:
- Multi-agent scaling ceiling (Kim et al. 2025, Google DeepMind): Naive multi-agent systems hit a ~45% capability ceiling. Beyond this threshold, adding more agents degrades rather than improves output quality. The 17.2x error amplification rate means that unstructured agent coordination is actively counterproductive for complex tasks.
- Productivity paradox (METR 2025, Faros 2024): METR’s randomized controlled trial showed experienced developers took 19% longer with AI assistance. Faros Engineering’s analysis of 1,000+ teams found that while individual AI tool adoption was high, measurable team-level productivity gains were absent. The gap between individual perception and measured impact is the core problem this guide addresses.
- Review bottleneck (Google, Microsoft internal data): Code review consumes 6.4 hours per week per developer at Google. Microsoft’s data shows 15-hour median time to first PR comment. These numbers represent the actual constraint on development velocity once AI handles initial code generation.
- Adaptive complexity routing (NeurIPS 2024): Research on adaptive collaboration systems demonstrates that matching coordination effort to task complexity outperforms both fixed-high and fixed-low approaches. This directly validates the threshold escalation model.
- Token economics (Anthropic prompt caching documentation): Forked agents sharing a cached base context cost 2.4x less than independent agents. With Claude Max’s 1-hour cache TTL, multi-agent review becomes economically viable for medium-complexity tasks that would otherwise be too expensive to justify.
The full reference list with links is available in the REFERENCES.md file.
Debugging Log
Implementation surfaced 12 real errors, each documented with exact symptoms, root causes, and fixes. This is not a hypothetical troubleshooting guide. Every error was encountered during the actual build process and resolved through systematic investigation.
Selected highlights:
| Error | Root Cause | Fix |
|---|---|---|
| ANSI corruption in model fields | Terminal escape codes embedded in config values | Strip ANSI sequences before writing to JSON |
| autoMode.environment type mismatch | settings.json expected object, received string | Wrap environment values in proper object structure |
| Threshold router silent on T1/T2 | Skill not loaded when complexity score was low | Add mandatory “always activate” trigger condition |
| GitHub push protection blocking | Secrets detected in example config snippets | Move examples to documentation, use placeholder values |
| Rules path detection failure | Symlink resolution failed across WSL/Windows boundary | Use readlink -f with fallback to manual path resolution |
| PostToolUse hook exit code 1 | ((COUNTER++)) returns exit 1 when counter is 0 | Replace with COUNTER=$((COUNTER + 1)) for set -e safety |
Each error in the full log includes four fields: what happened (symptoms), why it happened (root cause analysis), how it was fixed (exact commands or config changes), and the lesson extracted (what to do differently next time). The complete debugging journal is in the repository’s phase directories, with a consolidated summary in the runbook.
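Two of the fixes in the table can be sketched in a few lines. The helpers below are illustrative, not the project's code: strip_ansi and resolve_dir are hypothetical names, the escape-sequence pattern covers common CSI sequences such as colors only, and resolve_dir handles directories rather than arbitrary files.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Remove ANSI CSI escape sequences (colors, cursor moves) from stdin.
# The ESC byte is built with $'...' so the pattern works in any sed.
strip_ansi() {
  local esc=$'\033'
  sed "s/${esc}\[[0-9;?]*[A-Za-z]//g"
}

# Resolve a directory to its physical path. Tries readlink -f first and
# falls back to cd plus pwd -P where that option is unavailable.
resolve_dir() {
  readlink -f "$1" 2>/dev/null || (cd "$1" && pwd -P)
}

# A model name captured from colored terminal output, cleaned before it
# would be written into a JSON config value.
model=$(printf '\033[1;32mopus\033[0m' | strip_ansi)
echo "clean model field: $model"
echo "current directory resolves to: $(resolve_dir .)"
```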
Error Classification
The 12 errors fall into three categories. Configuration errors (5 of 12) involve mismatched types, missing fields, or incorrect paths in settings files. These are the most common and the quickest to resolve once identified. Platform errors (4 of 12) stem from differences between WSL 2 and macOS environments: path resolution, Docker socket locations, and shell behavior. Integration errors (3 of 12) emerge when two correctly configured components interact in unexpected ways, such as the PostToolUse hook exit code propagation issue and the GitHub push protection conflict with example secrets.
Outcome
The completed system delivers structured AI-assisted development workflows across every complexity tier:
- 65 automation scripts across this guide and its companion Mermaid pipeline, covering environment setup, configuration generation, validation, rendering, and testing
- 20 diagrams across this guide and its companion Mermaid pipeline, visualizing the architecture, decision flows, token economics, phase dependencies, and error resolution patterns
- 14 hands-on tutorial exercises split across two progressive learning tracks (PetShelter for orchestration, Dogs and Cats for diagram generation)
- 31 academic and industry citations grounding every architectural decision in published research
- 12 documented real errors with root cause analysis and verified fixes
The review pipeline reduces unstructured review time by an estimated 40-78% through OCR’s multi-agent discourse phase, where multiple specialized reviewers debate findings before presenting consolidated results.
Token Budget Management
The system treats the CLAUDE.md instruction budget as a finite resource. Anthropic’s system prompt consumes a significant portion of the context window before user instructions are processed. The guide quantifies approximately 150 usable instruction slots and provides a prioritization framework: security rules first, then workflow automation, then style conventions. Lower-priority instructions are moved to path-scoped rules files that load only when relevant directories are edited, keeping the main instruction set lean.
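A budget of this kind can be checked mechanically. The sketch below assumes that every non-blank line that is not a Markdown heading counts as one instruction, which is a simplification. count_instructions and check_budget are hypothetical names, and the default limit of 150 is taken from the figure above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count lines that are neither blank nor Markdown headings.
# grep -c exits non-zero when the count is 0, so that case is tolerated.
count_instructions() {
  grep -cvE '^[[:space:]]*(#.*)?$' "$1" || true
}

# Compare the count against a limit (default 150) and report the result.
check_budget() {
  local file="$1" limit="${2:-150}" used
  used=$(count_instructions "$file")
  if (( used > limit )); then
    echo "over budget: $used/$limit"
    return 1
  fi
  echo "within budget: $used/$limit"
}

# Demonstration on a throwaway file with two instructions.
sample=$(mktemp)
printf '%s\n' '# Security' 'Never commit secrets.' '' '# Style' 'Prefer small functions.' > "$sample"
check_budget "$sample"
check_budget "$sample" 1 || echo "move low-priority rules to path-scoped files"
rm "$sample"
```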
Review Agent Coordination
The Tier 2 and Tier 3 review pipelines use a steelman process inspired by adversarial collaboration in academic peer review. The security reviewer, quality reviewer, and fixer agent each operate independently. The fixer does not simply accept all findings. Instead, it evaluates each finding against the actual code, rejects false positives with evidence, and only applies changes that survive scrutiny. This prevents the review pipeline from introducing unnecessary churn while ensuring genuine issues are addressed.
Reproducibility
The system is fully reproducible: any developer with a Claude Max subscription can build it from scratch following the phase-by-phase instructions. The guide was tested on both Windows 11 WSL 2 (Ubuntu 24.04) and macOS (Apple Silicon). Every command, configuration snippet, and script is copy-paste ready. The 28 automation scripts eliminate manual configuration entirely for developers who prefer automated setup over step-by-step walkthroughs.
Cross-Platform Compatibility
| Component | Windows 11 WSL 2 | macOS Apple Silicon |
|---|---|---|
| Node.js v20+ | Native in WSL | Homebrew |
| Claude Code CLI | npm global install | npm global install |
| Docker Desktop | WSL 2 backend | Native |
| Git + GitHub CLI | apt-get | Homebrew |
| Shell scripts | Bash 5.x | Bash 5.x (Homebrew) |
The platform-detect.sh script auto-detects the operating environment and configures paths, Docker socket locations, and shell settings accordingly. No manual platform-specific configuration is required after initial setup.
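The detection step could be approached as follows. This is a sketch rather than the contents of platform-detect.sh: classify_platform is a hypothetical name, and it assumes WSL kernels can be recognised by the word "microsoft" in the kernel release string.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Classify a platform from the kernel name and release, as printed by
# uname -s and uname -r. Taking them as arguments keeps the logic testable
# on any machine.
classify_platform() {
  local kernel="$1" release="$2"
  case "$kernel" in
    Darwin)
      echo "macos" ;;
    Linux)
      case "$release" in
        *[Mm]icrosoft*) echo "wsl" ;;    # assumption: WSL marker in release
        *)              echo "linux" ;;
      esac ;;
    *)
      echo "unknown" ;;
  esac
}

platform=$(classify_platform "$(uname -s)" "$(uname -r)")
echo "detected platform: $platform"
```

A script built on this could then branch on the result to choose Docker socket locations and shell configuration paths for each environment.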
Lessons Learned
- Complexity-based routing beats naive multiplication. The threshold escalation engine is the single highest-value component. Matching orchestration effort to task complexity avoids both the overhead of unnecessary review (Tier 1 tasks) and the quality gaps of under-reviewed work (Tier 3 tasks). Kim et al.’s 17.2x error amplification data validates this approach quantitatively.
- Hooks provide the only deterministic enforcement layer. CLAUDE.md instructions are influential but not guaranteed. Hooks fire before permission checks and cannot be bypassed by model reasoning. The PreToolUse bash-blocker hook has prevented more accidental damage than any other single component in the system.
- Configuration is an instruction budget, not a wish list. CLAUDE.md has roughly 150 usable instruction slots after the system prompt consumes its share. Treating this as a scarce resource and prioritizing high-impact rules produces better results than exhaustive documentation that dilutes the model’s attention.
- Error documentation is first-class output. The 12-error debugging log is as valuable as the working system. Future developers will encounter similar edge cases, and the root cause analysis saves hours of investigation. Documentation was written at the point of failure, not reconstructed afterward.
- Academic grounding validates engineering choices. The threshold escalation model, steelman review process, and token caching strategy all have published research supporting them. Grounding engineering decisions in peer-reviewed work transforms “I think this works” into “this approach is consistent with Kim et al.’s findings on multi-agent coordination.”
- The distinction between skills and plugins matters. Skills are CLI tools that provide composable development workflows (formatting, review, testing). Plugins are Claude extensions that add entire capabilities (OCR multi-agent review, Codex cross-model challenge). They use different models, consume different token budgets, and serve different purposes. Conflating them leads to misconfigured systems that either waste tokens on unnecessary plugin invocations or miss critical review by relying on skills alone.
- Cross-platform testing is non-negotiable. WSL 2 cross-filesystem I/O runs at approximately 6% of native performance. Operations like npm install take 10-20x longer when run on Windows-mounted paths versus native Linux paths. Every script was tested on both WSL 2 and macOS to ensure correct behavior across environments. Three of the 12 documented errors were platform-specific issues that only surfaced during cross-platform validation.
Companion Project
This guide works in tandem with the Mermaid Hybrid Stack Integration Guide, which provides the diagram-as-code pipeline used to create all 10 visualizations in this project.
The Mermaid guide covers the full rendering infrastructure: Docker-based Mermaid CLI, Vega-Lite data charts, D2 architecture diagrams, AI-assisted generation via Claude Code skills, and PostToolUse hooks for automatic rendering. The system architecture diagram, threshold flowchart, token economics comparison, and phase dependency graph in this project were all produced using that pipeline.
Together, the two guides form an integrated platform for AI-assisted development. The orchestration guide structures how AI agents coordinate and review work. The Mermaid guide structures how that work is visualized and documented. Both share the same Claude Code foundation, automation patterns, and documentation methodology.
GitHub: claude-agent-orchestration-guide
Runbook: Claude_Code_Orchestration_Runbook.pdf – The 41-page source document covering the full 4-layer architecture, all 11 implementation phases, research compendium with 31 citations, visual diagrams, automation scripts, and complete debugging log.