Building a Multi-Agent System with AKIOS
Multi-agent orchestration is the hardest problem in applied AI. Not because the models are weak — they are extraordinarily capable. The difficulty is coordination: how do five autonomous systems share state, respect boundaries, recover from failures, and produce a coherent result without a human babysitting every interaction? In this tutorial, we build a governed research team of five agents using the AKIOS SDK. Every agent runs inside a policy cage. Every message is traced. Every failure triggers structured recovery.
The Research Team Architecture
Our system is not a flat collection of agents calling each other. It is a directed acyclic graph (DAG) with explicit message channels, policy boundaries, and a coordinator that owns the workflow state machine:
The five agents have distinct roles, scoped permissions, and dedicated token budgets:
- Project Manager: Owns the workflow DAG. Assigns tasks, collects results, decides when to retry or escalate. Cannot access external APIs directly.
- Research Analyst: Web search and data gathering. Allowed: GitHub API, Google Scholar, arXiv. Blocked: everything else. Budget: 10K tokens/hour.
- Code Reviewer: Static analysis and security review. Allowed: repository read access. No network egress. Budget: 8K tokens/hour.
- Documentation Agent: Synthesizes findings into technical reports. No tool access — pure text generation. Budget: 6K tokens/hour.
- QA Validator: Final gate. Checks completeness, consistency, and policy compliance. Can reject and trigger a full re-run. Budget: 4K tokens/hour.
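The budget figures above can be captured in a simple registry for quick sanity checks. This is an illustrative sketch using plain Python; the dictionary shape is an assumption, not the AKIOS API:

```python
# Hypothetical registry mirroring the team description above.
# Budget figures come from the text; the structure is illustrative only.
AGENT_BUDGETS = {
    "project_manager": None,     # coordinator, budgeted separately
    "research_analyst": 10_000,  # tokens/hour
    "code_reviewer": 8_000,
    "documentation": 6_000,
    "qa_validator": 4_000,
}

# Combined hourly token ceiling of the four worker agents
worker_total = sum(v for v in AGENT_BUDGETS.values() if v is not None)
print(worker_total)  # 28000
```

Summing the worker budgets up front makes it easy to verify that per-agent limits stay inside any swarm-level cap you set later.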
The Message Envelope
Agents do not send raw strings to each other. Every inter-agent message is a typed envelope that the control plane can inspect, trace, and gate:
from dataclasses import dataclass, field
from typing import Any, Literal
from datetime import datetime, timezone
import uuid

@dataclass
class AgentMessage:
    """Typed envelope for all inter-agent communication."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    sender: str = ""    # e.g. "research_analyst"
    receiver: str = ""  # e.g. "project_manager"
    msg_type: Literal[
        "task_assignment",
        "result",
        "error",
        "escalation",
        "approval_request",
    ] = "result"
    payload: dict[str, Any] = field(default_factory=dict)
    timestamp: str = field(
        # timezone-aware replacement for the deprecated datetime.utcnow()
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    token_count: int = 0  # Filled by control plane
    trace_id: str = ""    # Propagated across the DAG

    def to_context(self) -> str:
        """Serialize for injection into agent prompt context."""
        return (
            f"[FROM: {self.sender}] [TYPE: {self.msg_type}]\n"
            f"{self.payload.get('content', '')}"
        )
Policy Manifests
Each agent gets a YAML policy manifest loaded at initialization. The control plane enforces these constraints at the network layer — not via prompt instructions. An agent literally cannot violate its policy:
apiVersion: akios/v1
kind: AgentPolicy
metadata:
  name: research-analyst
  team: research-squad
spec:
  governance:
    allowed_tools:
      - web_search
      - data_analysis
      - arxiv_fetch
    network_access:
      allowlist:
        - host: "api.github.com"
          methods: ["GET"]
        - host: "scholar.google.com"
        - host: "export.arxiv.org"
          methods: ["GET"]
      denylist:
        - host: "*"  # Everything not explicitly allowed
  budget:
    max_tokens_per_hour: 10000
    max_cost_per_session: "$2.50"
    max_tool_calls_per_minute: 20
  messaging:
    allowed_receivers:
      - "project_manager"
    max_message_size_tokens: 2000
    require_structured_output: true
  observability:
    log_level: "detailed"
    trace_propagation: true
    alert_on:
      - pattern: "security_vulnerability"
        action: "escalate_to_human"
      - pattern: "budget_80_percent"
        action: "notify_coordinator"
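The allowlist semantics are worth internalizing: a request passes only if its host appears explicitly, and only with a listed method. A hand-rolled sketch of that evaluation logic (this mirrors the manifest above but is not the SDK's actual enforcement code):

```python
# Illustrative allowlist/denylist evaluation, matching the manifest semantics:
# a request is permitted only if its host is explicitly allowed, and only with
# a listed method. A rule with no "methods" key permits any method.
ALLOWLIST = [
    {"host": "api.github.com", "methods": ["GET"]},
    {"host": "scholar.google.com"},            # no methods key -> any method
    {"host": "export.arxiv.org", "methods": ["GET"]},
]

def is_allowed(host: str, method: str) -> bool:
    for rule in ALLOWLIST:
        # Defaulting to [method] makes a method-less rule match everything
        if rule["host"] == host and method in rule.get("methods", [method]):
            return True
    return False  # denylist catch-all: anything not explicitly allowed

print(is_allowed("api.github.com", "GET"))   # True
print(is_allowed("api.github.com", "POST"))  # False
print(is_allowed("api.openai.com", "GET"))   # False
```

Because the control plane enforces this at the network layer, a prompt-injected instruction to "fetch api.openai.com" fails at the socket, not at the model's discretion.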
The Coordinator: A State Machine
The Project Manager agent is not just another LLM call. It is backed by a deterministic state machine that tracks which DAG stages have completed, which have failed, and what to do next. The LLM generates task descriptions; the state machine controls the flow:
from akios import Agent, Policy, Swarm, WorkflowDAG
from akios.recovery import RetryPolicy, CircuitBreaker
from enum import Enum

class StageStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"

# Load all five agent policies
policies = {
    "project_manager": Policy.from_file("policies/project-manager.yaml"),
    "research_analyst": Policy.from_file("policies/research-analyst.yaml"),
    "code_reviewer": Policy.from_file("policies/code-reviewer.yaml"),
    "documentation": Policy.from_file("policies/documentation.yaml"),
    "qa_validator": Policy.from_file("policies/qa-validator.yaml"),
}

# Build the agent swarm with per-agent models and policies
swarm = Swarm(
    agents={
        name: Agent(
            model="gpt-4-turbo",
            policy=policy,
            temperature=0.2,  # Low temp for deterministic outputs
        )
        for name, policy in policies.items()
    },
    coordinator_policy=policies["project_manager"],
)

# Define the workflow as a DAG — not a flat list
dag = WorkflowDAG(
    stages={
        "research": {
            "agents": ["research_analyst"],
            "depends_on": [],
            "retry": RetryPolicy(max_retries=2, backoff="exponential"),
        },
        "review": {
            "agents": ["code_reviewer"],
            "depends_on": ["research"],
            "retry": RetryPolicy(max_retries=1),
        },
        "documentation": {
            "agents": ["documentation"],
            "depends_on": ["research", "review"],  # Needs both inputs
            "retry": RetryPolicy(max_retries=2),
        },
        "validation": {
            "agents": ["qa_validator"],
            "depends_on": ["documentation"],
            "circuit_breaker": CircuitBreaker(
                failure_threshold=3,
                action="escalate_to_human",
            ),
        },
    }
)

# Execute with full governance
result = swarm.run(
    task="Analyze the security implications of the new OAuth2 "
         "implementation in the payments service",
    workflow=dag,
    trace_id="research-20250115-001",
)
Failure Recovery
In production multi-agent systems, failures are not exceptions — they are expected events. An agent might hit its token budget, an API might be rate-limited, or the model might produce malformed output. AKIOS provides four layers of recovery:
from akios.recovery import (
    RetryPolicy,
    CircuitBreaker,
    FallbackChain,
    DeadLetterQueue,
)

# Layer 1: Automatic retry with exponential backoff
retry = RetryPolicy(
    max_retries=3,
    backoff="exponential",
    base_delay_ms=500,
    max_delay_ms=10000,
    retryable_errors=[
        "rate_limit_exceeded",
        "model_timeout",
        "malformed_output",
    ],
)

# Layer 2: Circuit breaker — stop retrying if the agent is broken
breaker = CircuitBreaker(
    failure_threshold=5,          # 5 failures in window
    window_seconds=300,           # 5-minute window
    action="escalate_to_human",   # Don't keep burning tokens
    cooldown_seconds=600,         # Wait 10 min before auto-retry
)

# Layer 3: Fallback chain — try alternative agents/models
fallback = FallbackChain(
    primary="gpt-4-turbo",
    fallbacks=[
        {"model": "gpt-4o-mini", "budget_multiplier": 0.5},
        {"model": "claude-3-haiku", "budget_multiplier": 0.3},
    ],
    condition="primary_budget_exhausted",
)

# Layer 4: Dead letter queue — nothing is silently lost
dlq = DeadLetterQueue(
    destination="akios-radar",
    retention_days=30,
    alert_on_enqueue=True,
)
Observability: The Full Trace
Every message, every policy evaluation, every token spent is captured in a structured trace. This is not optional logging — it is a first-class data structure that you can query, visualize, and audit:
{
  "trace_id": "research-20250115-001",
  "duration_ms": 34520,
  "total_tokens": 28340,
  "total_cost_usd": 0.87,
  "stages": [
    {
      "name": "research",
      "agent": "research_analyst",
      "status": "completed",
      "duration_ms": 12300,
      "tokens_used": 9200,
      "tool_calls": [
        {"tool": "web_search", "query": "OAuth2 PKCE vulnerabilities 2024"},
        {"tool": "arxiv_fetch", "paper_id": "2401.09876"}
      ],
      "policy_evaluations": 14,
      "policy_denials": 1,
      "denial_details": [
        {
          "action": "network_request",
          "target": "api.openai.com",
          "reason": "Host not in allowlist"
        }
      ]
    },
    {
      "name": "review",
      "agent": "code_reviewer",
      "status": "completed",
      "retries": 1,
      "retry_reason": "malformed_output"
    },
    {
      "name": "documentation",
      "agent": "documentation",
      "status": "completed"
    },
    {
      "name": "validation",
      "agent": "qa_validator",
      "status": "completed",
      "verdict": "approved",
      "quality_score": 0.92
    }
  ]
}
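Because the trace is plain structured data, ad-hoc queries take a few lines. A sketch using a trimmed version of the trace (in practice you would load the full document from wherever your traces are stored):

```python
import json

# Trimmed version of the trace above; a real query would json.load() the
# full document from trace storage.
trace = json.loads("""
{
  "trace_id": "research-20250115-001",
  "total_tokens": 28340,
  "stages": [
    {"name": "research", "status": "completed", "tokens_used": 9200,
     "policy_denials": 1},
    {"name": "review", "status": "completed", "retries": 1},
    {"name": "validation", "status": "completed", "verdict": "approved"}
  ]
}
""")

# Which stages needed retries or hit policy denials?
flagged = [s["name"] for s in trace["stages"]
           if s.get("retries", 0) > 0 or s.get("policy_denials", 0) > 0]
print(flagged)  # ['research', 'review']
```

This is the "trace everything, query later" pattern in miniature: the denial on the research stage and the retry on the review stage surface in one pass, with no special tooling required.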
Cost Control at the Swarm Level
Individual agent budgets are not enough. You need swarm-level cost controls that aggregate spending across all agents and enforce hard limits:
apiVersion: akios/v1
kind: SwarmBudget
metadata:
  name: research-squad-budget
spec:
  total_budget:
    max_cost_per_run: "$5.00"
    max_tokens_per_run: 50000
    alert_at_percent: [50, 80, 95]
  per_agent_caps:
    research_analyst:
      max_percent_of_total: 40
    code_reviewer:
      max_percent_of_total: 25
    documentation:
      max_percent_of_total: 20
    qa_validator:
      max_percent_of_total: 15
  overflow_policy:
    action: "complete_current_stage_then_stop"
    notify: ["engineering-team@akios.ai"]
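The percentage caps translate into hard per-run token ceilings. A quick derivation from the 50,000-token budget above:

```python
# Per-agent token ceilings derived from the SwarmBudget percentages above.
max_tokens_per_run = 50_000
caps_percent = {
    "research_analyst": 40,
    "code_reviewer": 25,
    "documentation": 20,
    "qa_validator": 15,
}

token_caps = {agent: max_tokens_per_run * pct // 100
              for agent, pct in caps_percent.items()}
print(token_caps["research_analyst"])  # 20000

# The percentages should partition the whole budget, leaving no dead headroom
assert sum(caps_percent.values()) == 100
```

Keeping the percentages summing to exactly 100 is a deliberate choice here: any slack would be budget that no agent can spend but that still counts against the run.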
Production Patterns
From our engineering work on multi-agent coordination, here are the key operational patterns we have identified:
- Low temperature everywhere: Multi-agent systems amplify randomness. If each agent has temperature 0.7, the combined output variance is multiplicative, not additive. Use 0.1-0.3 for deterministic coordination.
- Structured outputs only: Agents communicating via free-form text will eventually produce unparseable messages. Require JSON schema validation on every inter-agent message.
- Budget the coordinator separately: The Project Manager agent should have its own generous budget — it is the cheapest agent but the most critical. A starved coordinator cannot recover from downstream failures.
- Circuit breakers save money: Without a circuit breaker, a broken agent will retry indefinitely, burning through your entire budget in minutes. Set failure thresholds aggressively low (3-5 failures).
- Trace everything, query later: You will not know what to look for until something breaks. Capture full traces, store them cheaply, and build queries retroactively.
- Prefer DAGs over hierarchies: Tree-structured agent coordination creates bottlenecks. DAGs allow parallel execution of independent stages and make dependencies explicit.
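The "structured outputs only" rule can be enforced with a lightweight gate before any message enters the swarm. A hand-rolled sketch (a production system would more likely reach for jsonschema or pydantic):

```python
import json

# Minimal schema gate for inter-agent messages, per the "structured outputs
# only" pattern above. Hand-rolled for illustration.
REQUIRED = {"sender": str, "receiver": str, "msg_type": str, "payload": dict}
VALID_TYPES = {"task_assignment", "result", "error",
               "escalation", "approval_request"}

def validate_message(raw: str) -> bool:
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return False  # free-form text is rejected outright
    if not isinstance(msg, dict):
        return False
    return (all(isinstance(msg.get(k), t) for k, t in REQUIRED.items())
            and msg["msg_type"] in VALID_TYPES)

ok = validate_message('{"sender": "research_analyst", '
                      '"receiver": "project_manager", '
                      '"msg_type": "result", "payload": {"content": "done"}}')
bad = validate_message("Here are my findings: ...")
print(ok, bad)  # True False
```

Rejecting at the boundary, rather than letting a downstream agent choke on unparseable input, turns a silent coordination failure into a retryable "malformed_output" error.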
Multi-agent systems are not a pattern you adopt casually. They are distributed systems with all the complexity that implies — network partitions, partial failures, state divergence, and cascading errors. The AKIOS SDK does not eliminate this complexity. It gives you the infrastructure to manage it: deterministic policies, structured messaging, automatic recovery, full observability, and hard budget limits. Build the hard thing, but build it on solid ground.