Event Store

The event store is the source of truth for all coordination. It is built on PostgreSQL and follows the same pattern as the Tablez platform (Marten event sourcing).

Why

Full audit trail of everything that happens
Status queries from one place ("what's the status of issue #42?")
Metrics (time from approved to production, agent efficiency)
Debugging ("what happened during the outage?")
Replay (reconstruct state at any point in time)

Schema

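-- gen_random_uuid() is built in from PostgreSQL 13; older versions need the pgcrypto extension
CREATE TABLE events (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    type            TEXT NOT NULL,                      -- event type, e.g. 'ISSUE_APPROVED'
    status          TEXT NOT NULL DEFAULT 'ACCEPTED',
    actor           TEXT NOT NULL,                      -- who wrote the event
    data            JSONB NOT NULL,                     -- event payload
    parent_id       UUID,                               -- NULL for the first event in a chain
    correlation_id  UUID NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()  -- NOT NULL: queries order and diff on this column
);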

correlation_id groups all events for one issue
parent_id chains related events
data contains issue details, agent info, error context
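
As a rough illustration of how these three columns fit together, the sketch below builds event records in Python. The `Event` dataclass and the `follow_up` helper are hypothetical names introduced here, not part of the platform; the fields mirror the schema above, and the `issueId` and `agentId` keys are taken from the query examples later on this page.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Event:
    """In-memory mirror of one row in the events table (hypothetical helper)."""
    type: str
    actor: str
    data: Dict[str, Any]
    correlation_id: uuid.UUID
    parent_id: Optional[uuid.UUID] = None
    status: str = "ACCEPTED"
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def follow_up(parent: Event, type: str, actor: str, data: Dict[str, Any]) -> Event:
    """Create an event chained to `parent` and grouped under the same issue."""
    return Event(
        type=type,
        actor=actor,
        data=data,
        correlation_id=parent.correlation_id,  # same issue
        parent_id=parent.id,                   # chained to the previous event
    )


# The first event of an issue starts a new correlation_id and has no parent.
approved = Event("ISSUE_APPROVED", "cto", {"issueId": "42"}, uuid.uuid4())
assigned = follow_up(approved, "ISSUE_ASSIGNED", "conductor",
                     {"issueId": "42", "agentId": "agent-1"})
```

Every later event for the issue reuses the `correlation_id` of the first one, while `parent_id` points only at the event it directly follows.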

Event Types

Lifecycle events

Event	Written by	Meaning
`ISSUE_APPROVED`	CTO	Issue is approved for work
`ISSUE_ASSIGNED`	Conductor	Assigned to a specific agent
`ISSUE_UNASSIGNED`	Conductor	Removed from agent (for reassignment)
`WORK_STARTED`	Dev agent	Agent has started working
`BRANCH_CREATED`	Dev agent	Feature branch created
`PR_CREATED`	Dev agent	Pull request opened
`CI_PASSED`	Dev agent / CI	Build and unit tests passed
`CI_FAILED`	Dev agent / CI	Build or tests failed
`REVIEW_PASSED`	Review agent	Code review approved
`REVIEW_DISPUTED`	CTO	CTO disagrees with a review (post-merge)
`HUMAN_GATE_TRIGGERED`	CI	Sensitive change, needs human review
`MERGED`	Dev agent	PR merged to main
`DEPLOYED_STAGING`	Dev agent	Deployed to staging
`SMOKE_PASSED`	Dev agent	Smoke tests passed
`SMOKE_FAILED`	Dev agent	Smoke tests failed. data includes `reason: "code" | "external_dependency"`
`DEPLOYED_PRODUCTION`	Dev agent	Deployed to production
`VERIFIED`	Dev agent	Production verified
`ISSUE_DONE`	Dev agent	Issue complete (only valid if VERIFIED exists)

Health and escalation events

Event	Written by	Meaning
`HEARTBEAT`	Dev agent	Agent is alive (every 60 seconds)
`AGENT_STUCK`	Dev agent / Conductor	Agent can't make progress
`ESCALATED`	Conductor / Dev agent	Escalated to CTO (two agents stuck on same issue, or smoke failure with reason "external_dependency")
`HUMAN_GATE_REMINDER`	Conductor	Slack reminder sent (deduplicated, every 30 min)
`MILESTONE_COMPLETE`	Conductor	All issues in milestone done

Validation rules

ISSUE_DONE is only valid if VERIFIED exists in the same correlation_id
ISSUE_ASSIGNED followed by ISSUE_UNASSIGNED = not currently assigned
Reassignment always means fresh branch, previous branch abandoned
SMOKE_FAILED retry counting applies only when reason is "code". The dev agent counts the SMOKE_FAILED events with reason: "code" in the correlation_id and writes AGENT_STUCK once the count reaches 3. When reason is "external_dependency", the agent does not retry and writes ESCALATED immediately
Custom implementation SLA per issue via GitHub label: sla:45m, sla:1h, sla:2h. Default is 45 minutes. The higher default accounts for pre-PR test execution time (unit + mocked E2E). CI SLA is always fixed at 15 minutes
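
The rules above could be checked along the following lines. This is a minimal sketch, assuming each event is available as a dict with `type` and `data` keys and that the history for one correlation_id is passed in chronological order; the function and constant names are hypothetical.

```python
from typing import Any, Iterable, Mapping, Optional, Sequence

MAX_CODE_SMOKE_FAILURES = 3   # rule above: count >= 3 leads to AGENT_STUCK
DEFAULT_SLA_MINUTES = 45      # default implementation SLA
SLA_LABELS = {"sla:45m": 45, "sla:1h": 60, "sla:2h": 120}

EventRow = Mapping[str, Any]


def can_write_issue_done(events: Sequence[EventRow]) -> bool:
    """ISSUE_DONE is only valid if VERIFIED exists in the same correlation_id."""
    return any(e["type"] == "VERIFIED" for e in events)


def after_smoke_failure(events: Sequence[EventRow]) -> Optional[str]:
    """Decide what the dev agent writes after the latest SMOKE_FAILED.

    Returns "ESCALATED", "AGENT_STUCK", or None (meaning: retry).
    `events` includes the SMOKE_FAILED event that was just written.
    """
    failures = [e for e in events if e["type"] == "SMOKE_FAILED"]
    if not failures:
        return None
    # External dependency failures are never retried.
    if failures[-1]["data"].get("reason") == "external_dependency":
        return "ESCALATED"
    # Only failures caused by code count towards the retry limit.
    code_failures = sum(
        1 for e in failures if e["data"].get("reason") == "code"
    )
    if code_failures >= MAX_CODE_SMOKE_FAILURES:
        return "AGENT_STUCK"
    return None


def implementation_sla_minutes(labels: Iterable[str]) -> int:
    """Map the issue's GitHub labels to an implementation SLA in minutes."""
    for label in labels:
        if label in SLA_LABELS:
            return SLA_LABELS[label]
    return DEFAULT_SLA_MINUTES
```

Keeping these checks as pure functions over the event history means the same code can run in the agent before it writes an event and in a test against a recorded history.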

Storage notes

HEARTBEAT events should be stored in a separate table or pruned aggressively. They are high-volume and not needed for audit after a short retention period
All other events are append-only and retained indefinitely

Issue Lifecycle

%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#e2e8f0", "primaryTextColor": "#1e293b", "primaryBorderColor": "#64748b", "lineColor": "#94a3b8", "secondaryColor": "#e2e8f0", "tertiaryColor": "#cbd5e1", "background": "#0f172a", "mainBkg": "#e2e8f0", "nodeBorder": "#64748b", "clusterBkg": "#1e293b", "clusterBorder": "#475569", "titleColor": "#e2e8f0", "edgeLabelBackground": "#1e293b", "nodeTextColor": "#1e293b"}}}%%
graph LR
    A["ISSUE_APPROVED"] --> B["ISSUE_ASSIGNED"]
    B --> C["WORK_STARTED"]
    C --> D["PR_CREATED"]
    D --> E["CI_PASSED"]
    E --> F["MERGED"]
    F --> G["DEPLOYED_PRODUCTION"]
    G --> H["ISSUE_DONE"]
    H -->|"unblocks next"| A

    style A fill:#34d399,color:#000
    style B fill:#34d399,color:#000
    style C fill:#60a5fa,color:#000
    style D fill:#a78bfa,color:#000
    style E fill:#a78bfa,color:#000
    style F fill:#34d399,color:#000
    style G fill:#fbbf24,color:#000
    style H fill:#94a3b8,color:#000

Recovery paths

The lifecycle is not always linear. These loops can occur:

CI fails: CI_FAILED → agent fixes → back to PR_CREATED
Smoke fails: SMOKE_FAILED → agent fixes on top of main (new PR) → back to PR_CREATED (max 3 retries)
Agent stuck: AGENT_STUCK → ISSUE_UNASSIGNED → ISSUE_ASSIGNED (new agent, fresh branch)

Resilience

Event store downtime: Agents buffer events locally and retry with backoff. Audit gaps may occur but correctness is not affected. The Conductor pauses until the store is reachable.
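
One way an agent might buffer events and retry with backoff is sketched below. The `EventBuffer` class, its parameters, and the choice of `ConnectionError` as the failure signal are assumptions made for illustration; the real agents may persist the buffer to disk or use a different retry policy.

```python
import time
from collections import deque
from typing import Callable, Deque


class EventBuffer:
    """Holds events locally until the store accepts them (hypothetical helper)."""

    def __init__(
        self,
        send: Callable[[dict], None],
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._send = send              # writes one event to the store
        self._pending: Deque[dict] = deque()
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._sleep = sleep            # injectable so tests need not wait

    def write(self, event: dict) -> None:
        """Queue an event; nothing is sent until flush() is called."""
        self._pending.append(event)

    def flush(self, max_attempts: int = 5) -> bool:
        """Deliver buffered events in order; return True if the buffer is empty."""
        delay = self._base_delay
        attempts = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                attempts += 1
                if attempts >= max_attempts:
                    return False       # give up for now, events stay buffered
                self._sleep(delay)
                delay = min(delay * 2, self._max_delay)  # exponential backoff
                continue
            self._pending.popleft()    # drop the event only after a successful send
            delay, attempts = self._base_delay, 0
        return True
```

Events leave the buffer only after a successful send, so ordering within one agent is preserved and a failed flush can simply be repeated later.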

Agent silent death: If no HEARTBEAT for 5 minutes, the Conductor writes AGENT_STUCK on the agent's behalf and triggers reassignment.
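
The timeout check itself is small. The sketch below assumes the Conductor has already collected the timestamp of each agent's most recent HEARTBEAT; `silent_agents` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import List, Mapping

HEARTBEAT_TIMEOUT = timedelta(minutes=5)


def silent_agents(last_heartbeat: Mapping[str, datetime], now: datetime) -> List[str]:
    """Return agents with no HEARTBEAT for at least HEARTBEAT_TIMEOUT."""
    return sorted(
        agent
        for agent, seen in last_heartbeat.items()
        if now - seen >= HEARTBEAT_TIMEOUT
    )
```

For each agent returned, the Conductor would write AGENT_STUCK on its behalf and start reassignment. With a heartbeat every 60 seconds, the 5 minute window tolerates several missed beats before an agent is declared dead.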

Conductor crash: Stateless. Restarts, reads event store, resumes. No data loss. If it crashes after writing ISSUE_ASSIGNED but before the agent reads it, the agent polls for its own assignments on the next cycle.
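
Because the Conductor keeps no state of its own, resuming amounts to folding over the stored events. The sketch below rebuilds the current assignments that way, assuming events are read oldest first as dicts and that ISSUE_ASSIGNED carries the agent under `data.agentId` (the key used in the query examples); `current_assignments` is a hypothetical name.

```python
from typing import Any, Dict, Iterable, Mapping, Optional


def current_assignments(
    events: Iterable[Mapping[str, Any]],
) -> Dict[str, Optional[str]]:
    """Fold events (oldest first) into {correlation_id: agent id or None}."""
    assigned: Dict[str, Optional[str]] = {}
    for event in events:
        cid = event["correlation_id"]
        if event["type"] == "ISSUE_ASSIGNED":
            assigned[cid] = event["data"]["agentId"]
        elif event["type"] == "ISSUE_UNASSIGNED":
            # ISSUE_ASSIGNED followed by ISSUE_UNASSIGNED = not currently assigned
            assigned[cid] = None
    return assigned
```

Running the same fold after every restart always yields the same result for the same event history, which is what makes a crash harmless.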

Known Risks

Mock contract drift: Mocked external services (Twilio, Stripe, Claude API) can diverge from real APIs over time. Mocked E2E will keep passing while real smoke tests start failing. Mitigation: contract tests (Pact or similar) in CI to verify mocks match live API responses. Planned for a future phase.

E2E environment isolation: Each agent runs its own Docker Compose stack for mocked E2E. Shared staging env is only used for post-merge smoke tests (real integrations). Do not run mocked E2E against shared state.

Query Examples

What's the status of issue #42?

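-- The latest non-heartbeat event for the issue is its current status
SELECT type, actor, created_at FROM events
WHERE correlation_id = (
    SELECT correlation_id FROM events
    WHERE data->>'issueId' = '42' LIMIT 1
)
AND type <> 'HEARTBEAT'  -- only matters if heartbeats share this table
ORDER BY created_at DESC LIMIT 1;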

How long from approved to production?

SELECT
    approved.correlation_id,
    (deployed.created_at - approved.created_at) AS duration
FROM events approved
JOIN events deployed ON approved.correlation_id = deployed.correlation_id
WHERE approved.type = 'ISSUE_APPROVED'
AND deployed.type = 'DEPLOYED_PRODUCTION';

Which agent is most efficient?

SELECT
    assigned.data->>'agentId' AS agent,
    AVG(done.created_at - assigned.created_at) AS avg_time
FROM events assigned
JOIN events done ON assigned.correlation_id = done.correlation_id
WHERE assigned.type = 'ISSUE_ASSIGNED'
AND done.type = 'ISSUE_DONE'
AND NOT EXISTS (
    -- Only count the final assignment (not reassigned ones)
    SELECT 1 FROM events un
    WHERE un.correlation_id = assigned.correlation_id
    AND un.type = 'ISSUE_UNASSIGNED'
    AND un.created_at > assigned.created_at
)
GROUP BY agent ORDER BY avg_time;

What happened during the outage?

SELECT type, actor, data, created_at FROM events
WHERE created_at BETWEEN '2026-03-28 14:00' AND '2026-03-28 15:00'
ORDER BY created_at;

How many times was an issue reassigned?

-- Issues that were never reassigned have no ISSUE_UNASSIGNED rows and do not appear
SELECT correlation_id, COUNT(*) AS reassignments
FROM events
WHERE type = 'ISSUE_UNASSIGNED'
GROUP BY correlation_id
ORDER BY reassignments DESC;