Tutorial

Building a Multi-Agent System with AKIOS

Multi-agent orchestration is the hardest problem in applied AI. Not because the models are weak — they are extraordinarily capable. The difficulty is coordination: how do five autonomous systems share state, respect boundaries, recover from failures, and produce a coherent result without a human babysitting every interaction? In this tutorial, we build a governed research team of five agents using the AKIOS SDK. Every agent runs inside a policy cage. Every message is traced. Every failure triggers structured recovery.

The Research Team Architecture

Our system is not a flat collection of agents calling each other. It is a directed acyclic graph (DAG) with explicit message channels, policy boundaries, and a coordinator that owns the workflow state machine:

[Architecture diagram — REF: RESEARCH-TEAM. The AKIOS control plane (policy, tracing, budgets) sits above the Project Manager, which coordinates the DAG. Beneath it: the Research Analyst (10K tok/h), Code Reviewer (8K tok/h), and Documentation agent (6K tok/h), feeding into the QA Validator as the final gate (4K tok/h).]

The five agents have distinct roles, scoped permissions, and dedicated token budgets:

  • Project Manager: Owns the workflow DAG. Assigns tasks, collects results, decides when to retry or escalate. Cannot access external APIs directly.
  • Research Analyst: Web search and data gathering. Allowed: GitHub API, Google Scholar, arXiv. Blocked: everything else. Budget: 10K tokens/hour.
  • Code Reviewer: Static analysis and security review. Allowed: repository read access. No network egress. Budget: 8K tokens/hour.
  • Documentation Agent: Synthesizes findings into technical reports. No tool access — pure text generation. Budget: 6K tokens/hour.
  • QA Validator: Final gate. Checks completeness, consistency, and policy compliance. Can reject and trigger a full re-run. Budget: 4K tokens/hour.
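Note that the roles above imply a hub-and-spoke messaging topology: every worker reports only to the Project Manager, and only the Project Manager assigns tasks. That routing table can be sketched as plain data (illustrative names, mirroring the `allowed_receivers` idea used in the policy manifests below):

```python
# Hub-and-spoke routing: workers may only message the coordinator,
# and only the coordinator may message workers. Illustrative sketch.
ALLOWED_ROUTES = {
    "project_manager": {"research_analyst", "code_reviewer",
                        "documentation", "qa_validator"},
    "research_analyst": {"project_manager"},
    "code_reviewer": {"project_manager"},
    "documentation": {"project_manager"},
    "qa_validator": {"project_manager"},
}

def may_send(sender: str, receiver: str) -> bool:
    """Deny by default: a route must be explicitly declared."""
    return receiver in ALLOWED_ROUTES.get(sender, set())
```

Keeping routing out of the prompts and in data like this is what lets the control plane enforce it mechanically.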

The Message Envelope

Agents do not send raw strings to each other. Every inter-agent message is a typed envelope that the control plane can inspect, trace, and gate:

from dataclasses import dataclass, field
from typing import Any, Literal
from datetime import datetime, timezone
import uuid

@dataclass
class AgentMessage:
    """Typed envelope for all inter-agent communication."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    sender: str = ""           # e.g. "research_analyst"
    receiver: str = ""         # e.g. "project_manager"
    msg_type: Literal[
        "task_assignment",
        "result",
        "error",
        "escalation",
        "approval_request"
    ] = "result"
    payload: dict[str, Any] = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    token_count: int = 0       # Filled by control plane
    trace_id: str = ""         # Propagated across the DAG

    def to_context(self) -> str:
        """Serialize for injection into agent prompt context."""
        return (
            f"[FROM: {self.sender}] [TYPE: {self.msg_type}]\n"
            f"{self.payload.get('content', '')}"
        )

Policy Manifests

Each agent gets a YAML policy manifest loaded at initialization. The control plane enforces these constraints at the network layer — not via prompt instructions. An agent literally cannot violate its policy:

apiVersion: akios/v1
kind: AgentPolicy
metadata:
  name: research-analyst
  team: research-squad
spec:
  governance:
    allowed_tools:
      - web_search
      - data_analysis
      - arxiv_fetch
    network_access:
      allowlist:
        - host: "api.github.com"
          methods: ["GET"]
        - host: "scholar.google.com"
        - host: "export.arxiv.org"
          methods: ["GET"]
      denylist:
        - host: "*"  # Everything not explicitly allowed
    budget:
      max_tokens_per_hour: 10000
      max_cost_per_session_usd: 2.50
      max_tool_calls_per_minute: 20

  messaging:
    allowed_receivers:
      - "project_manager"
    max_message_size_tokens: 2000
    require_structured_output: true

  observability:
    log_level: "detailed"
    trace_propagation: true
    alert_on:
      - pattern: "security_vulnerability"
        action: "escalate_to_human"
      - pattern: "budget_80_percent"
        action: "notify_coordinator"

The Coordinator: A State Machine

The Project Manager agent is not just another LLM call. It is backed by a deterministic state machine that tracks which DAG stages have completed, which have failed, and what to do next. The LLM generates task descriptions; the state machine controls the flow:

from akios import Agent, Policy, Swarm, WorkflowDAG
from akios.recovery import RetryPolicy, CircuitBreaker
from enum import Enum

class StageStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"

# Load all five agent policies
policies = {
    "project_manager": Policy.from_file("policies/project-manager.yaml"),
    "research_analyst": Policy.from_file("policies/research-analyst.yaml"),
    "code_reviewer":    Policy.from_file("policies/code-reviewer.yaml"),
    "documentation":    Policy.from_file("policies/documentation.yaml"),
    "qa_validator":     Policy.from_file("policies/qa-validator.yaml"),
}

# Build the agent swarm with per-agent models and policies
swarm = Swarm(
    agents={
        name: Agent(
            model="gpt-4-turbo",
            policy=policy,
            temperature=0.2,  # Low temp for deterministic outputs
        )
        for name, policy in policies.items()
    },
    coordinator_policy=policies["project_manager"],
)

# Define the workflow as a DAG — not a flat list
dag = WorkflowDAG(
    stages={
        "research": {
            "agents": ["research_analyst"],
            "depends_on": [],
            "retry": RetryPolicy(max_retries=2, backoff="exponential"),
        },
        "review": {
            "agents": ["code_reviewer"],
            "depends_on": ["research"],
            "retry": RetryPolicy(max_retries=1),
        },
        "documentation": {
            "agents": ["documentation"],
            "depends_on": ["research", "review"],  # Needs both inputs
            "retry": RetryPolicy(max_retries=2),
        },
        "validation": {
            "agents": ["qa_validator"],
            "depends_on": ["documentation"],
            "circuit_breaker": CircuitBreaker(
                failure_threshold=3,
                action="escalate_to_human",
            ),
        },
    }
)

# Execute with full governance
result = swarm.run(
    task="Analyze the security implications of the new OAuth2 "
         "implementation in the payments service",
    workflow=dag,
    trace_id="research-20250115-001",
)

Failure Recovery

In production multi-agent systems, failures are not exceptions — they are expected events. An agent might hit its token budget, an API might be rate-limited, or the model might produce malformed output. AKIOS provides four layers of recovery:

from akios.recovery import (
    RetryPolicy,
    CircuitBreaker,
    FallbackChain,
    DeadLetterQueue,
)

# Layer 1: Automatic retry with exponential backoff
retry = RetryPolicy(
    max_retries=3,
    backoff="exponential",
    base_delay_ms=500,
    max_delay_ms=10000,
    retryable_errors=[
        "rate_limit_exceeded",
        "model_timeout",
        "malformed_output",
    ],
)

# Layer 2: Circuit breaker — stop retrying if the agent is broken
breaker = CircuitBreaker(
    failure_threshold=5,           # 5 failures in window
    window_seconds=300,            # 5-minute window
    action="escalate_to_human",    # Don't keep burning tokens
    cooldown_seconds=600,          # Wait 10 min before auto-retry
)

# Layer 3: Fallback chain — try alternative agents/models
fallback = FallbackChain(
    primary="gpt-4-turbo",
    fallbacks=[
        {"model": "gpt-4o-mini", "budget_multiplier": 0.5},
        {"model": "claude-3-haiku", "budget_multiplier": 0.3},
    ],
    condition="primary_budget_exhausted",
)

# Layer 4: Dead letter queue — nothing is silently lost
dlq = DeadLetterQueue(
    destination="akios-radar",
    retention_days=30,
    alert_on_enqueue=True,
)

Observability: The Full Trace

Every message, every policy evaluation, every token spent is captured in a structured trace. This is not optional logging — it is a first-class data structure that you can query, visualize, and audit:

{
  "trace_id": "research-20250115-001",
  "duration_ms": 34520,
  "total_tokens": 28340,
  "total_cost_usd": 0.87,
  "stages": [
    {
      "name": "research",
      "agent": "research_analyst",
      "status": "completed",
      "duration_ms": 12300,
      "tokens_used": 9200,
      "tool_calls": [
        {"tool": "web_search", "query": "OAuth2 PKCE vulnerabilities 2024"},
        {"tool": "arxiv_fetch", "paper_id": "2401.09876"}
      ],
      "policy_evaluations": 14,
      "policy_denials": 1,
      "denial_details": [
        {
          "action": "network_request",
          "target": "api.openai.com",
          "reason": "Host not in allowlist"
        }
      ]
    },
    {
      "name": "review",
      "agent": "code_reviewer",
      "status": "completed",
      "retries": 1,
      "retry_reason": "malformed_output"
    },
    {
      "name": "documentation",
      "agent": "documentation",
      "status": "completed"
    },
    {
      "name": "validation",
      "agent": "qa_validator",
      "status": "completed",
      "verdict": "approved",
      "quality_score": 0.92
    }
  ]
}
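Because the trace is structured JSON, post-hoc auditing is ordinary data manipulation. For example, surfacing every policy denial across stages (a sketch over the trace shape shown above):

```python
import json

def policy_denials(trace: dict) -> list[dict]:
    """Flatten all denial records, annotated with their stage name."""
    return [
        {"stage": stage["name"], **denial}
        for stage in trace["stages"]
        for denial in stage.get("denial_details", [])
    ]

# Abbreviated version of the trace above
trace = json.loads("""{
  "stages": [
    {"name": "research",
     "denial_details": [{"action": "network_request",
                         "target": "api.openai.com",
                         "reason": "Host not in allowlist"}]},
    {"name": "review"}
  ]
}""")
# -> one denial, from the research stage, targeting api.openai.com
```

The same pattern works for token accounting, retry analysis, or flagging stages that came close to their budgets.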

Cost Control at the Swarm Level

Individual agent budgets are not enough. You need swarm-level cost controls that aggregate spending across all agents and enforce hard limits:

apiVersion: akios/v1
kind: SwarmBudget
metadata:
  name: research-squad-budget
spec:
  total_budget:
    max_cost_per_run_usd: 5.00
    max_tokens_per_run: 50000
    alert_at_percent: [50, 80, 95]
  
  per_agent_caps:
    research_analyst:
      max_percent_of_total: 40
    code_reviewer:
      max_percent_of_total: 25
    documentation:
      max_percent_of_total: 20
    qa_validator:
      max_percent_of_total: 15
  
  overflow_policy:
    action: "complete_current_stage_then_stop"
    notify: ["engineering-team@akioud.ai"]
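The percent caps translate into hard dollar ceilings per run. A quick sanity check using the manifest's values (note the caps deliberately sum to 100%):

```python
# Values from the SwarmBudget manifest above
MAX_COST_PER_RUN_USD = 5.00
CAPS_PERCENT = {
    "research_analyst": 40,
    "code_reviewer": 25,
    "documentation": 20,
    "qa_validator": 15,
}

# Per-agent dollar ceilings for a single run
dollar_caps = {
    agent: round(MAX_COST_PER_RUN_USD * pct / 100, 2)
    for agent, pct in CAPS_PERCENT.items()
}
# research_analyst 2.00, code_reviewer 1.25, documentation 1.00, qa_validator 0.75
```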

Production Patterns

From our engineering work on multi-agent coordination, here are the key operational patterns we have identified:

  • Low temperature everywhere: Multi-agent systems amplify randomness. If each agent has temperature 0.7, the combined output variance is multiplicative, not additive. Use 0.1-0.3 for deterministic coordination.
  • Structured outputs only: Agents communicating via free-form text will eventually produce unparseable messages. Require JSON schema validation on every inter-agent message.
  • Budget the coordinator separately: The Project Manager agent should have its own generous budget — it is the cheapest agent but the most critical. A starved coordinator cannot recover from downstream failures.
  • Circuit breakers save money: Without a circuit breaker, a broken agent will retry indefinitely, burning through your entire budget in minutes. Set failure thresholds aggressively low (3-5 failures).
  • Trace everything, query later: You will not know what to look for until something breaks. Capture full traces, store them cheaply, and build queries retroactively.
  • Prefer DAGs over hierarchies: Tree-structured agent coordination creates bottlenecks. DAGs allow parallel execution of independent stages and make dependencies explicit.
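The "structured outputs only" rule above can be enforced with a small envelope validator run before any message enters the bus. A hand-rolled sketch (in practice you would likely use a JSON Schema library):

```python
# Required envelope fields and their expected Python types
REQUIRED_FIELDS = {"id": str, "sender": str, "receiver": str,
                   "msg_type": str, "payload": dict}
VALID_TYPES = {"task_assignment", "result", "error",
               "escalation", "approval_request"}

def validate_envelope(msg: dict) -> list[str]:
    """Return a list of validation errors; an empty list means it passes."""
    errors = [
        f"missing or mistyped field: {field}"
        for field, ftype in REQUIRED_FIELDS.items()
        if not isinstance(msg.get(field), ftype)
    ]
    if msg.get("msg_type") not in VALID_TYPES:
        errors.append(f"unknown msg_type: {msg.get('msg_type')!r}")
    return errors
```

Rejecting a malformed message at the bus — rather than letting a downstream agent choke on it — is what makes the retry layer's "malformed_output" handling tractable.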

Multi-agent systems are not a pattern you adopt casually. They are distributed systems with all the complexity that implies — network partitions, partial failures, state divergence, and cascading errors. The AKIOS SDK does not eliminate this complexity. It gives you the infrastructure to manage it: deterministic policies, structured messaging, automatic recovery, full observability, and hard budget limits. Build the hard thing, but build it on solid ground.