Slo Architect

Name: Slo Architect
Author: alirezarezvani

Use when defining, reviewing, or operating SLOs/SLIs/error budgets. Triggers on "define an SLO", "what should our SLO be", "error budget", "burn rate", "SLI", "service level objective", "Google SRE workbook", "multi-window burn-rate alert", or any reliability-target question. Sh…

gokubernetes

By alirezarezvani

19k 2.7kUpdated 3 days agoPythonMIT

Skill Content

# SLO Architect

Define SLOs that mean something. Most "SLOs" in the wild are arbitrary numbers no one believes — 99.9% on every endpoint, no SLI definition, no error budget, no policy for what happens when budget burns. This skill enforces the discipline from Google's SRE Workbook: pick the right SLI, set a target users actually care about, calculate the error budget, wire multi-window burn-rate alerts, and have a written policy for when budget runs out.

## When to use

- Defining a new SLO for a service or feature
- Reviewing existing SLOs for common bugs
- Picking the right SLI (event-based vs time-window based vs request-based)
- Computing error budgets and burn-rate alert thresholds
- Tying SLOs to existing controls — feature flags abort, chaos blast radius, operator capability levels

## When NOT to use

- General observability strategy (metrics + logs + traces) → use `observability-designer`
- Customer-facing SLAs with legal teeth → that's contract drafting, not engineering
- Performance load testing (capacity, not reliability) → use `performance-profiler`
- Active incident response → use `incident-response`

## Core principle: an SLO is a promise about user experience

```
SLI  ⟶  measurable signal of user-perceived health (e.g., HTTP 2xx rate)
SLO  ⟶  target for the SLI over a window (e.g., 99.9% over 30 days)
SLA  ⟶  customer-facing commitment with consequences (separate concern)
EB   ⟶  error budget: 100% − SLO target = how much "bad" you can spend
BR   ⟶  burn rate: how fast you're consuming the error budget
```

The four cardinal mistakes:

1. **Target too high** (99.99%+ on services that can't support it) — every minor blip violates SLO; alerts become noise.
2. **Wrong SLI** (CPU usage as proxy for user experience) — system can be "green" while users suffer.
3. **No error budget policy** — burning budget means nothing if there's no agreed action.
4. **Single-window burn-rate alert** — either too noisy (page on a 5-min spike) or too slow (notice budget exhausted after the fact).

The 3 tools below catch each of these.

## Quick start

```bash
SKILL=engineering/slo-architect/skills/slo-architect

# 1. Design an SLO
python "$SKILL/scripts/slo_designer.py" \
  --service checkout-svc \
  --sli-type request-success-rate \
  --target 99.9 \
  --window-days 30

# 2. Compute error budget + multi-window burn-rate alerts
python "$SKILL/scripts/error_budget_calculator.py" \
  --target 99.9 --window-days 30

# 3. Review existing SLO definitions for common bugs
python "$SKILL/scripts/slo_review.py" --slo-doc docs/slos/
```

## The 3 Python tools

All stdlib-only.

### `slo_designer.py`

Generates a structured SLO definition with required fields. Refuses to render if any required field is missing (`exit 1`).

```bash
python scripts/slo_designer.py \
  --service checkout-svc \
  --sli-type request-success-rate \
  --target 99.9 \
  --window-days 30 \
  --owner team-checkout
```

**SLI types supported:**
- `request-success-rate` — `(total_requests - bad_requests) / total_requests`
- `request-latency` — `count(requests < threshold) / total_requests`
- `availability-time` — `(window - downtime) / window`
- `data-freshness` — `count(data_age < threshold) / total_data_points`
- `correctness` — `count(correct_outputs) / total_outputs`

Output is markdown by default with all required fields filled or marked `<must define>`. JSON output (`--format json`) is consumed by `slo_review.py`.

### `error_budget_calculator.py`

Given target availability + window, computes:
- Allowed downtime in the window
- Multi-window burn-rate thresholds per Google SRE Workbook (Chapter 5):
  - **Fast burn** — page if 2% of monthly budget consumed in 1 hour
  - **Slow burn** — page if 10% consumed in 6 hours, ticket if 10% in 3 days
- Recommended alerting rules (PromQL-shaped output)

```bash
python scripts/error_budget_calculator.py --target 99.9 --window-days 30
python scripts/error_budget_calculator.py --target 99.95 --window-days 7 --format json
```

### `slo_review.py`

Audits a directory of SLO definitions (markdown or JSON) for the common bugs.

```bash
python scripts/slo_review.py --slo-doc docs/slos/
```

**Checks:**
- `target_too_high`: target ≥ 99.99% (sustainable only with massive engineering investment)
- `target_too_low`: target ≤ 99.0% (probably wrong SLI; users will notice)
- `window_too_short`: window < 7 days (statistical noise dominates)
- `window_too_long`: window > 90 days (slow feedback)
- `no_sli_definition`: SLI section missing or vague ("everything OK")
- `no_error_budget_policy`: no documented action when budget burns
- `cpu_as_sli`: CPU/memory used as user-experience proxy (wrong signal)

## SLI selection cheatsheet

| User experience | SLI type | What you measure |
|---|---|---|
| "Did the request succeed?" | request-success-rate | `2xx / total` |
| "Was the response fast?" | request-latency | `count(p99 < threshold) / total` |
| "Was the service up?" | availability-time | `(window - downtime) / window` |
| "Is the data current?" | data-freshness | `count(data_age < threshold) / total` |
| "Was the answer correct?" | correctness | `count(correct) / total` |

See `references/sli_design.md` for examples and anti-patterns.

## Error budget math (the basics)

For 99.9% SLO over 30 days:
- Allowed unavailability: `0.1% × 30 × 24 × 60 = 43.2 minutes`
- 1-hour fast-burn threshold (2% of monthly budget burned): `2% × 43.2 / 60 ≈ 1.44 ratio multiplier`
- 6-hour slow-burn threshold (10% in 6h): `10% × 43.2 / 360 ≈ 0.6 ratio multiplier`

`error_budget_calculator.py` does this math for you and emits ready-to-paste alert rules.

## Composition with the rest of the portfolio

This skill explicitly composes with three others:

| Skill | Composition |
|---|---|
| `feature-flags-architect` | Rollout abort criteria reference SLO burn-rate thresholds |
| `chaos-engineering` | Blast-radius calculator already takes monthly error budget as input — define it here |
| `kubernetes-operator` | Operator capability L4 (Deep Insights) requires SLOs + Prometheus rules |

The `error_budget_calculator.py` output is in the same shape `engineering/skills/chaos-engineering/scripts/blast_radius_calculator.py` expects on stdin.

## Workflows

### Workflow 1: Define a new SLO

```
1. Pick the user journey to protect (e.g., "checkout completion").
2. Choose SLI type (request-success-rate, latency, availability, freshness, correctness).
3. Define the SLI precisely: numerator/denominator with concrete labels.
4. Pick a target by measuring 30 days of historical SLI value:
     target = floor(p50 of last 30 days × 100) / 100
   This avoids targets the system has never sustained.
5. Pick a window (28 days = 4 calendar weeks, recommended).
6. Run slo_designer.py to render the SLO definition.
7. Run error_budget_calculator.py to get burn-rate alerts.
8. Write the error budget policy (what happens when budget burns).
9. Run slo_review.py — must pass before the SLO is "live".
```

### Workflow 2: Quarterly SLO review

```
1. For every active SLO, run slo_review.py — fix any FAIL findings.
2. Look at last quarter's data:
   - Was the SLO too easy (never burned budget)? Tighten target.
   - Was it too hard (frequently burned)? Loosen target OR fix the system.
   - Did burn-rate alerts fire usefully (not too noisy, not too late)? Adjust thresholds.
3. Audit error budget policies — were they actually followed when budget burned?
4. Commit revised SLOs; archive old versions with date stamps.
```

### Workflow 3: SLO-driven rollback

```
1. New deploy starts burning error budget faster than baseline.
2. Burn-rate alert fires (from error_budget_calculator.py thresholds).
3. Auto-rollback via feature flag (kill switch from feature-flags-architect).
4. Postmortem feeds into next SLO revision.
```

## References

- `references/slo_principles.md` — SLI vs SLO vs SLA, Google SRE Workbook canon
- `references/sli_design.md` — picking the right SLI; 5 types with examples
- `references/error_budget.md` — error budget math, burn-rate alerts, budget policy
- `references/composition.md` — how SLOs feed feature flags, chaos, operators

## Slash command

`/slo-design` — interactive SLO design wizard that runs all 3 tools.

## Asset templates

- `assets/slo_template.yaml` — fillable SLO YAML
- `assets/error_budget_policy.md` — fillable policy template

## Anti-patterns

- **99.99% on every endpoint** — copy-paste SLOs that nobody verified the system can sustain
- **CPU usage as SLI** — system metrics aren't user experience
- **Single-window burn-rate alert** — too noisy if 5-min, too slow if 30-day
- **No error budget policy** — burning budget means nothing without an action
- **SLOs without owners** — no one is responsible; they bit-rot
- **SLOs reviewed once a year** — system characteristics change faster than that
- **SLAs in the SLO doc** — different audience, different stakes; keep them separate
- **SLO target = SLA target** — SLO must be tighter (you should beat your contract before customers notice)

## Verifiable success

A team using this skill should achieve:

- 100% of SLOs pass `slo_review.py` with 0 FAIL findings
- Every SLO has a documented owner, error budget, burn-rate alerts, and policy
- Burn-rate alerts fire ≤2 times/month per SLO that's hit (signal, not noise)
- Mean time to detect SLO violation: <30 min (multi-window burn-rate alerts working)
- Quarterly SLO review happens every quarter (not annually)

How to use

Copy the skill content above
Create a .claude/skills directory in your project
Save as .claude/skills/claude-skills-slo-architect.md
Use /claude-skills-slo-architect in Claude Code to invoke this skill

README

View on GitHub

Claude Code Skills & Plugins — Agent Skills for Every Coding Tool

345 production-ready Claude Code skills, plugins, and agent skills for 13 AI coding tools.

The most comprehensive open-source library of Claude Code skills and agent plugins — also works with OpenAI Codex, Gemini CLI, Cursor, and 9 more coding agents. Reusable expertise packages covering engineering, DevOps, marketing (incl. AEO — Answer Engine Optimization for LLM citation), security (PreToolUse hooks), compliance, C-level advisory (incl. founder-mode CFO/CMO/CRO/CPO/COO/CHRO/CISO/GC/CDO/CAIO/CCO/VPE personas + 21 /cs:* slash commands), productivity (capture/email/reflect), an academic research stack (litreview/grants/dossier/patent/syllabus/pulse/notebooklm + hybrid router), and enterprise Research Operations (clinical-research/research-finance/market-research/product-research, v2.9.0).

Works with: Claude Code · OpenAI Codex · Gemini CLI · OpenClaw · Hermes Agent¹ · Mistral Vibe² · Cursor · Aider · Windsurf · Kilo Code · OpenCode · Augment · Antigravity

5,200+ GitHub stars — the most comprehensive open-source Claude Code skills & agent plugins library.

What Are Claude Code Skills & Agent Plugins?

Claude Code skills (also called agent skills or coding agent plugins) are modular instruction packages that give AI coding agents domain expertise they don't have out of the box. Each skill includes:

SKILL.md — structured instructions, workflows, and decision frameworks
Python tools — 579 CLI scripts (all stdlib-only, zero pip installs)
Reference docs — 702 templates, checklists, and domain-specific knowledge files

One repo, thirteen platforms. Works natively as Claude Code plugins, Codex agent skills, Gemini CLI skills, Hermes Agent skills, Mistral Vibe skills, and converts to more tools via scripts/convert.sh. All 579 Python tools run anywhere Python runs.

Skills vs Agents vs Personas

	Skills	Agents	Personas
Purpose	How to execute a task	What task to do	Who is thinking
Scope	Single domain	Single domain	Cross-domain
Voice	Neutral	Professional	Personality-driven
Example	"Follow these steps for SEO"	"Run a security audit"	"Think like a startup CTO"

All three work together. See Orchestration for how to combine them.

Quick Install

Gemini CLI (New)

# Clone the repository
git clone https://github.com/alirezarezvani/claude-skills.git
cd claude-skills

# Run the setup script
./scripts/gemini-install.sh

# Start using skills
> activate_skill(name="senior-architect")

Claude Code (Recommended)

# Add the marketplace
/plugin marketplace add alirezarezvani/claude-skills

# Install by domain
/plugin install engineering-skills@claude-code-skills          # 24 core engineering
/plugin install engineering-advanced-skills@claude-code-skills  # 25 POWERFUL-tier
/plugin install product-skills@claude-code-skills               # 12 product skills
/plugin install marketing-skills@claude-code-skills             # 43 marketing skills
/plugin install ra-qm-skills@claude-code-skills                 # 12 regulatory/quality
/plugin install pm-skills@claude-code-skills                    # 6 project management
/plugin install c-level-skills@claude-code-skills               # 28 C-level advisory (full C-suite)
/plugin install business-growth-skills@claude-code-skills       # 4 business & growth
/plugin install finance-skills@claude-code-skills               # 2 finance (analyst + SaaS metrics)

# Or install individual skills
/plugin install skill-security-auditor@claude-code-skills       # Security scanner
/plugin install playwright-pro@claude-code-skills                  # Playwright testing toolkit
/plugin install self-improving-agent@claude-code-skills         # Auto-memory curation
/plugin install content-creator@claude-code-skills              # Single skill

OpenAI Codex

npx agent-skills-cli add alirezarezvani/claude-skills --agent codex
# Or: git clone + ./scripts/codex-install.sh

OpenClaw

bash <(curl -s https://raw.githubusercontent.com/alirezarezvani/claude-skills/main/scripts/openclaw-install.sh)

Manual Installation

git clone https://github.com/alirezarezvani/claude-skills.git
# Copy any skill folder to ~/.claude/skills/ (Claude Code) or ~/.codex/skills/ (Codex)

Multi-Tool Support (New)

Convert all 345 skills to 9 AI coding tools with a single script:

Tool	Format	Install
Cursor	`.mdc` rules	`./scripts/install.sh --tool cursor --target .`
Aider	`CONVENTIONS.md`	`./scripts/install.sh --tool aider --target .`
Kilo Code	`.kilocode/rules/`	`./scripts/install.sh --tool kilocode --target .`
Windsurf	`.windsurf/skills/`	`./scripts/install.sh --tool windsurf --target .`
OpenCode	`.opencode/skills/`	`./scripts/install.sh --tool opencode --target .`
Augment	`.augment/rules/`	`./scripts/install.sh --tool augment --target .`
Antigravity	`~/.gemini/antigravity/skills/`	`./scripts/install.sh --tool antigravity`
Hermes Agent	`~/.hermes/skills/`	`python scripts/sync-hermes-skills.py --verbose`
Mistral Vibe	`~/.vibe/skills/`	`./scripts/vibe-install.sh`

How it works:

# 1. Convert all skills to all tools (takes ~15 seconds)
./scripts/convert.sh --tool all

# 2. Install into your project (with confirmation)
./scripts/install.sh --tool cursor --target /path/to/project

# Or use --force to skip confirmation:
./scripts/install.sh --tool aider --target . --force

# 3. Verify
find .cursor/rules -name "*.mdc" | wc -l  # Should show 346

Each tool gets:

✅ All 345 skills converted to native format
✅ Per-tool README with install/verify/update steps
✅ Support for scripts, references, templates where applicable
✅ Zero manual conversion work

Run ./scripts/convert.sh --tool all to generate tool-specific outputs locally.

Skills Overview

345 skills across 17 domains:

Domain	Skills	Highlights	Details
🔧 Engineering — Core	51	Architecture, frontend, backend, fullstack, QA, DevOps, SecOps, AI/ML, data, Playwright Pro (test gen, flaky fix, migrations), self-improving agent (auto-memory curation), security suite, a11y audit	engineering-team/
⚡ Engineering — POWERFUL	78	Agent designer, RAG architect, database designer, CI/CD builder, security auditor, MCP builder, AgentHub, Helm charts, Terraform, self-eval, llm-wiki, tc-tracker, autoresearch-agent, reliability portfolio (feature-flags-architect, kubernetes-operator, chaos-engineering, slo-architect), ship-gate, security-guidance PreToolUse hook, Matt Pocock skills (write-a-skill, caveman, grill-me, handoff, grill-with-docs)	engineering/
🎯 Product	17	Product manager, agile PO, strategist, UX researcher, UI design, landing pages, SaaS scaffolder, analytics, experiment designer, discovery, roadmap communicator, code-to-prd, apple-hig-expert	product-team/
📣 Marketing	46	8 pods: Content, SEO + AEO (`aeo` — E-E-A-T audit, citation tracking across 5 LLMs), CRO, Channels, Growth, Intelligence, Sales + context foundation + orchestration router	marketing-skill/
🚀 Productivity	6	`capture` (brain-dump-to-action), `email` pair (inbox-setup + inbox-triage), `reflect` (journal), `handoff` (Matt Pocock-inspired), `andreessen` (market-first decision mode)	productivity/
🎨 Marketing (top-level)	1	`landing` — single-file HTML landing-page generator (4 design styles, GSAP patterns, brand palette validator)	marketing/
🔬 Research (academic)	8	`research` orchestrator (hybrid router + fallback) + 7 specialists: `pulse`, `litreview`, `grants` (NIH), `dossier`, `patent`, `syllabus`, `notebooklm`	research/
🧪 Research Operations ✨v2.9.0	5	Enterprise/cross-functional research: orchestrator + `clinical-research` (study design), `research-finance` (R&D program finance), `market-research` (sizing/survey/segmentation), `product-research` (user research) — each with onboarding + customization + opt-in autoresearch bridge	research-ops/
📋 Project Management	9	Senior PM, scrum master, Jira, Confluence, Atlassian admin, templates + bundled Atlassian Remote MCP	project-management/
🏥 Regulatory & QM	18	ISO 13485, MDR 2017/745, FDA, ISO 27001, GDPR, SOC 2, CAPA, risk management	ra-qm-team/
🛡️ Compliance OS	9	Compliance operating system — controls, evidence, audit-readiness workflows	compliance-os/
💼 C-Level Advisory	66	Full C-suite (CEO/CTO/CFO/CMO/CRO/CPO/COO/CHRO/CISO/GC/CDO/CAIO/CCO/VPE) + founder-mode agents + orchestration + board meetings + culture & collaboration	c-level-advisor/
📈 Business & Growth	5	Customer success, sales engineer, revenue ops, contracts & proposals, BizDev toolkit	business-growth/
🏭 Business Operations	7	Orchestrator + process-mapper, vendor-management, capacity-planner, internal-comms, knowledge-ops, procurement-optimizer	business-operations/
🤝 Commercial	8	Orchestrator + pricing-strategist, deal-desk, partnerships-architect, channel-economics, commercial-policy, rfp-responder, commercial-forecaster	commercial/
💰 Finance	4	Financial analyst (DCF, budgeting, forecasting), SaaS metrics coach, business investment advisor	finance/

Personas

Pre-configured agent identities with curated skill loadouts, workflows, and distinct communication styles. Personas go beyond "use these skills" — they define how an agent thinks, prioritizes, and communicates.

Persona	Domain	Best For
Startup CTO	Engineering + Strategy	Architecture decisions, tech stack selection, team building, technical due diligence
Growth Marketer	Marketing + Growth	Content-led growth, launch strategy, channel optimization, bootstrapped marketing
Solo Founder	Cross-domain	One-person sta

…

Footnotes

Hermes Agent is BYO-sync tier: the repo ships a pre-generated .hermes/skills/claude-skills/ tree, but you run python scripts/sync-hermes-skills.py once locally to install into ~/.hermes/skills/. Uses the same agentskills.io SKILL.md standard — no format conversion. ↩
Mistral Vibe is also BYO-sync tier: the repo ships a pre-generated .vibe/skills/claude-skills/ tree, run ./scripts/vibe-install.sh once locally to install into ~/.vibe/skills/. Same agentskills.io SKILL.md standard — no format conversion. Docs: https://docs.mistral.ai/mistral-vibe/agents-skills. ↩