How to Build a Production-Ready Multi-Agent Incident Response System Using OpenAI Swarm and Tool-Augmented Agents

In this tutorial, we build an advanced yet practical multi-agent system using OpenAI Swarm that runs in Colab. We demonstrate how we can orchestrate specialized agents, such as a triage agent, an SRE agent, a communications agent, and a critic, to collaboratively handle a real-world production incident scenario. By structuring agent handoffs, integrating lightweight tools for knowledge retrieval and decision ranking, and keeping the implementation clean and modular, we show how Swarm enables us to design controllable, agentic workflows without heavy frameworks or complex infrastructure. Check out the FULL CODES HERE.

!pip -q install -U openai
!pip -q install -U "git+https://github.com/openai/swarm.git"

import os

def load_openai_key():
    try:
        from google.colab import userdata
        key = userdata.get("OPENAI_API_KEY")
    except Exception:
        key = None
    if not key:
        import getpass
        key = getpass.getpass("Enter OPENAI_API_KEY (hidden): ").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY not provided")
    return key


os.environ["OPENAI_API_KEY"] = load_openai_key()

We set up the environment and securely load the OpenAI API key so the notebook can run safely in Google Colab. We ensure the key is fetched from Colab secrets when available and fall back to a hidden prompt otherwise. This keeps authentication simple and reusable across sessions.

import json
import re
from typing import List, Dict
from swarm import Swarm, Agent

client = Swarm()

We import the core Python utilities and initialize the Swarm client that orchestrates all agent interactions. This snippet establishes the runtime backbone that allows agents to communicate, hand off tasks, and execute tool calls. It serves as the entry point for the multi-agent workflow.

KB_DOCS = [
    {
        "id": "kb-incident-001",
        "title": "API Latency Incident Playbook",
        "text": "If p95 latency spikes, validate deploys, dependencies, and error rates. Rollback, cache, rate-limit, scale. Compare p50 vs p99 and inspect upstream timeouts."
    },
    {
        "id": "kb-risk-001",
        "title": "Risk Communication Guidelines",
        "text": "Updates must include impact, scope, mitigation, owner, and next update. Avoid blame and separate internal vs external messaging."
    },
    {
        "id": "kb-ops-001",
        "title": "On-call Handoff Template",
        "text": "Include summary, timeline, current status, mitigations, open questions, next actions, and owners."
    },
]

def _normalize(s: str) -> List[str]:
    return re.sub(r"[^a-z0-9\s]", " ", s.lower()).split()

def search_kb(query: str, top_k: int = 3) -> str:
    q = set(_normalize(query))
    scored = []
    for d in KB_DOCS:
        score = len(q.intersection(set(_normalize(d["title"] + " " + d["text"]))))
        scored.append((score, d))
    scored.sort(key=lambda x: x[0], reverse=True)
    docs = [d for s, d in scored[:top_k] if s > 0] or [scored[0][1]]
    return json.dumps(docs, indent=2)

We define a lightweight internal knowledge base and implement a retrieval function to surface relevant context during agent reasoning. By using simple token-based matching, we allow agents to ground their responses in predefined operational documents. This demonstrates how Swarm can be augmented with domain-specific memory without external dependencies.
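To see how this overlap scoring behaves in isolation, here is a minimal, self-contained sketch of the same token-matching idea; the two sample documents are illustrative stand-ins, not entries from the tutorial's KB:

```python
import re

def normalize(s):
    # Lowercase, strip punctuation, and split into tokens
    return re.sub(r"[^a-z0-9\s]", " ", s.lower()).split()

docs = [
    {"id": "doc-a", "text": "rollback the deploy and inspect upstream timeouts"},
    {"id": "doc-b", "text": "rotate credentials and audit access logs"},
]

def score(query, doc):
    # Count distinct query tokens that also appear in the document
    return len(set(normalize(query)) & set(normalize(doc["text"])))

query = "p95 latency spike after deploy, upstream timeouts rising"
ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
print([d["id"] for d in ranked])  # doc-a shares "deploy", "upstream", "timeouts"
```

Because the score is just set intersection on tokens, documents with no shared vocabulary score zero, which is why `search_kb` falls back to the top-scored document when nothing matches.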

def estimate_mitigation_impact(options_json: str) -> str:
    try:
        options = json.loads(options_json)
    except Exception as e:
        return json.dumps({"error": str(e)})
    ranking = []
    for o in options:
        conf = float(o.get("confidence", 0.5))
        risk = o.get("risk", "medium")
        penalty = {"low": 0.1, "medium": 0.25, "high": 0.45}.get(risk, 0.25)
        ranking.append({
            "option": o.get("option"),
            "confidence": conf,
            "risk": risk,
            "score": round(conf - penalty, 3)
        })
    ranking.sort(key=lambda x: x["score"], reverse=True)
    return json.dumps(ranking, indent=2)

We introduce a structured tool that evaluates and ranks mitigation strategies based on confidence and risk. This allows agents to move beyond free-form reasoning and produce semi-quantitative decisions. We show how tools can enforce consistency and decision discipline in agent outputs.
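The scoring rule is simple: subtract a fixed penalty per risk tier from the stated confidence. A self-contained sketch of that rule, with illustrative sample options, makes the ordering easy to verify:

```python
import json

def rank_options(options):
    # Score = confidence minus a fixed penalty per risk tier
    penalties = {"low": 0.1, "medium": 0.25, "high": 0.45}
    ranked = [
        {
            "option": o["option"],
            "score": round(o["confidence"] - penalties.get(o.get("risk"), 0.25), 3),
        }
        for o in options
    ]
    return sorted(ranked, key=lambda r: r["score"], reverse=True)

options = [
    {"option": "rollback deploy", "confidence": 0.9, "risk": "low"},      # 0.9 - 0.10 = 0.80
    {"option": "scale out pods", "confidence": 0.7, "risk": "medium"},    # 0.7 - 0.25 = 0.45
    {"option": "disable feature flag", "confidence": 0.8, "risk": "high"} # 0.8 - 0.45 = 0.35
]
print(json.dumps(rank_options(options), indent=2))
```

Note how a high-confidence but high-risk option ("disable feature flag") can rank below a lower-confidence, medium-risk one, which is exactly the discipline the tool imposes on the agent.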

def handoff_to_sre():
    return sre_agent

def handoff_to_comms():
    return comms_agent

def handoff_to_handoff_writer():
    return handoff_writer_agent

def handoff_to_critic():
    return critic_agent

We define explicit handoff functions that enable one agent to transfer control to another. This snippet illustrates how we model delegation and specialization within Swarm. It makes agent-to-agent routing transparent and easy to extend.
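The convention behind these functions is that a tool call returning an Agent object signals the runtime to switch the active agent. A minimal sketch of that dispatch pattern, using hypothetical stand-ins rather than the real Swarm internals, looks like this:

```python
# Sketch of Swarm-style handoffs: a tool that returns an Agent object
# signals the runtime to transfer control. The Agent class and dispatch
# loop here are illustrative stand-ins, not the actual Swarm internals.

class Agent:
    def __init__(self, name):
        self.name = name

sre_agent = Agent("SRE")

def handoff_to_sre():
    return sre_agent

def dispatch(current_agent, tool_result):
    # If a tool call returned an Agent, the loop switches to it;
    # otherwise the current agent keeps control.
    return tool_result if isinstance(tool_result, Agent) else current_agent

triage = Agent("Triage")
active = dispatch(triage, handoff_to_sre())
print(active.name)  # "SRE"
```

Keeping each handoff as a named, zero-argument function is what makes the routing table legible: the triage agent's `functions` list doubles as a directory of everywhere control can go.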

triage_agent = Agent(
    name="Triage",
    model="gpt-4o-mini",
    instructions="""
    Decide which agent should handle the request.
    Use SRE for incident response.
    Use Comms for customer or executive messaging.
    Use HandoffWriter for on-call notes.
    Use Critic for review or improvement.
    """,
    functions=[search_kb, handoff_to_sre, handoff_to_comms, handoff_to_handoff_writer, handoff_to_critic]
)

sre_agent = Agent(
    name="SRE",
    model="gpt-4o-mini",
    instructions="""
    Produce a structured incident response with triage steps,
    ranked mitigations, ranked hypotheses, and a 30-minute plan.
    """,
    functions=[search_kb, estimate_mitigation_impact]
)

comms_agent = Agent(
    name="Comms",
    model="gpt-4o-mini",
    instructions="""
    Produce an external customer update and an internal technical update.
    """,
    functions=[search_kb]
)

handoff_writer_agent = Agent(
    name="HandoffWriter",
    model="gpt-4o-mini",
    instructions="""
    Produce a clean on-call handoff document with standard headings.
    """,
    functions=[search_kb]
)

critic_agent = Agent(
    name="Critic",
    model="gpt-4o-mini",
    instructions="""
    Critique the previous answer, then produce a refined final version and a checklist.
    """
)

We configure multiple specialized agents, each with a clearly scoped responsibility and instruction set. By separating triage, incident response, communications, handoff writing, and critique, we demonstrate a clean division of labor.

def run_pipeline(user_request: str):
    messages = [{"role": "user", "content": user_request}]
    r1 = client.run(agent=triage_agent, messages=messages, max_turns=8)
    messages2 = r1.messages + [{"role": "user", "content": "Review and improve the last answer"}]
    r2 = client.run(agent=critic_agent, messages=messages2, max_turns=4)
    return r2.messages[-1]["content"]

request = """
Production p95 latency jumped from 250ms to 2.5s after a deploy.
Errors slightly increased, DB CPU stable, upstream timeouts rising.
Provide a 30-minute action plan and a customer update.
"""

print(run_pipeline(request))

We assemble the full orchestration pipeline that executes triage, specialist reasoning, and critical refinement in sequence. This snippet shows how we run the end-to-end workflow with a single function call. It ties together all agents and tools into a coherent, production-style agentic system.
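The two-stage control flow can be exercised without any API calls by swapping in a stub client. Everything here (StubSwarm, StubResult, and the canned replies) is a hypothetical stand-in for the real Swarm client, kept only to make the message-passing shape visible:

```python
# Dry-run sketch of the two-stage pipeline with a stub client, so the
# control flow can be inspected without API calls. StubSwarm, StubResult,
# and the canned replies are hypothetical stand-ins for the real Swarm.

class StubResult:
    def __init__(self, messages):
        self.messages = messages

class StubSwarm:
    def run(self, agent, messages, max_turns=8):
        # Echo back how many messages this "agent" saw
        reply = {"role": "assistant", "content": f"[{agent}] handled {len(messages)} msgs"}
        return StubResult(messages + [reply])

def run_pipeline(client, user_request):
    messages = [{"role": "user", "content": user_request}]
    r1 = client.run(agent="Triage", messages=messages)    # stage 1: triage + specialist
    messages2 = r1.messages + [{"role": "user", "content": "Review and improve the last answer"}]
    r2 = client.run(agent="Critic", messages=messages2)   # stage 2: critic refinement
    return r2.messages[-1]["content"]

print(run_pipeline(StubSwarm(), "p95 latency spiked"))  # "[Critic] handled 3 msgs"
```

Because the critic stage receives the full transcript of the first run plus an explicit "review and improve" instruction, the final answer always reflects one round of self-correction, which is the key design choice of this pipeline.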

In conclusion, we established a clear pattern for designing agent-oriented systems with OpenAI Swarm that emphasizes clarity, separation of responsibilities, and iterative refinement. We showed how to route tasks intelligently, enrich agent reasoning with local tools, and improve output quality via a critic loop, all while maintaining a simple, Colab-friendly setup. This approach allows us to scale from experimentation to real operational use cases, making Swarm a powerful foundation for building reliable, production-grade agentic AI workflows.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among audiences.


