Event Store¶
Source of truth for all coordination. Built on PostgreSQL, following the same pattern as the Tablez platform (Marten event sourcing).
Why¶
- Full audit trail of everything that happens
- Status queries from one place ("what's the status of issue #42?")
- Metrics (time from approved to production, agent efficiency)
- Debugging ("what happened during the outage?")
- Replay (reconstruct state at any point in time)
Schema¶
```sql
CREATE TABLE events (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    type           TEXT NOT NULL,
    status         TEXT NOT NULL DEFAULT 'ACCEPTED',
    actor          TEXT NOT NULL,
    data           JSONB NOT NULL,
    parent_id      UUID,
    correlation_id UUID NOT NULL,
    created_at     TIMESTAMPTZ DEFAULT NOW()
);
```
- `correlation_id` groups all events for one issue
- `parent_id` chains related events
- `data` contains issue details, agent info, error context
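To make the roles of `correlation_id` and `parent_id` concrete, here is a minimal Python sketch of an in-memory mirror of one row. The `Event` class, the `follow_up` helper, and the actor and agent names are illustrative assumptions, not part of the platform's code.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Event:
    """In-memory mirror of one row in the events table (illustrative only)."""
    type: str
    actor: str
    data: Dict[str, Any]
    correlation_id: uuid.UUID
    parent_id: Optional[uuid.UUID] = None
    status: str = "ACCEPTED"
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def follow_up(parent: Event, event_type: str, actor: str, data: Dict[str, Any]) -> Event:
    """Create an event that continues the parent's chain for the same issue."""
    return Event(
        type=event_type,
        actor=actor,
        data=data,
        correlation_id=parent.correlation_id,  # same issue
        parent_id=parent.id,                   # direct predecessor
    )


# Hypothetical actors and ids, for illustration only.
approved = Event("ISSUE_APPROVED", "cto", {"issueId": "42"}, uuid.uuid4())
assigned = follow_up(approved, "ISSUE_ASSIGNED", "conductor",
                     {"issueId": "42", "agentId": "agent-1"})
```

Both events share one `correlation_id`, while `assigned.parent_id` points at `approved.id`.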
Event Types¶
Lifecycle events¶
| Event | Written by | Meaning |
|---|---|---|
| `ISSUE_APPROVED` | CTO | Issue is approved for work |
| `ISSUE_ASSIGNED` | Conductor | Assigned to a specific agent |
| `ISSUE_UNASSIGNED` | Conductor | Removed from agent (for reassignment) |
| `WORK_STARTED` | Dev agent | Agent has started working |
| `BRANCH_CREATED` | Dev agent | Feature branch created |
| `PR_CREATED` | Dev agent | Pull request opened |
| `CI_PASSED` | Dev agent / CI | Build and unit tests passed |
| `CI_FAILED` | Dev agent / CI | Build or tests failed |
| `REVIEW_PASSED` | Review agent | Code review approved |
| `REVIEW_DISPUTED` | CTO | CTO disagrees with a review (post-merge) |
| `HUMAN_GATE_TRIGGERED` | CI | Sensitive change, needs human review |
| `MERGED` | Dev agent | PR merged to main |
| `DEPLOYED_STAGING` | Dev agent | Deployed to staging |
| `SMOKE_PASSED` | Dev agent | Smoke tests passed |
| `SMOKE_FAILED` | Dev agent | Smoke tests failed. `data` includes reason: "code" \| "external_dependency" |
| `DEPLOYED_PRODUCTION` | Dev agent | Deployed to production |
| `VERIFIED` | Dev agent | Production verified |
| `ISSUE_DONE` | Dev agent | Issue complete (only valid if `VERIFIED` exists) |
Health and escalation events¶
| Event | Written by | Meaning |
|---|---|---|
| `HEARTBEAT` | Dev agent | Agent is alive (every 60 seconds) |
| `AGENT_STUCK` | Dev agent / Conductor | Agent can't make progress |
| `ESCALATED` | Conductor | Escalated to CTO (two agents stuck on same issue) |
| `HUMAN_GATE_REMINDER` | Conductor | Slack reminder sent (deduplicated, every 30 min) |
| `MILESTONE_COMPLETE` | Conductor | All issues in milestone done |
Validation rules¶
- `ISSUE_DONE` is only valid if `VERIFIED` exists in the same correlation_id
- `ISSUE_ASSIGNED` followed by `ISSUE_UNASSIGNED` = not currently assigned
- Reassignment always means fresh branch, previous branch abandoned
- SMOKE_FAILED retry counting: only applies when `reason: "code"`. The dev agent counts `SMOKE_FAILED(reason: "code")` events in the correlation_id. At count >= 3, writes `AGENT_STUCK`. If `reason: "external_dependency"`, no retries, agent writes `ESCALATED` immediately
- Custom implementation SLA per issue via GitHub label: `sla:45m`, `sla:1h`, `sla:2h`. Default is 45 minutes. The higher default accounts for pre-PR test execution time (unit + mocked E2E). CI SLA is always fixed at 15 minutes
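The rules above can be read as small pure functions over the events of one correlation_id. The following Python sketch is one possible reading, not the platform's implementation. It assumes events are plain dicts with `type` and `data` keys, ordered oldest first; all function names are hypothetical.

```python
from typing import Any, Dict, List, Optional

# Assumption: each event is a dict with "type" and "data", oldest first,
# already filtered to a single correlation_id.
Event = Dict[str, Any]

MAX_SMOKE_RETRIES = 3   # "At count >= 3, writes AGENT_STUCK"
DEFAULT_SLA_MINUTES = 45
SLA_LABELS = {"sla:45m": 45, "sla:1h": 60, "sla:2h": 120}


def issue_done_is_valid(events: List[Event]) -> bool:
    """ISSUE_DONE is only valid if VERIFIED exists in the same correlation_id."""
    return any(e["type"] == "VERIFIED" for e in events)


def is_currently_assigned(events: List[Event]) -> bool:
    """The most recent assignment-related event decides the current state."""
    for e in reversed(events):
        if e["type"] == "ISSUE_ASSIGNED":
            return True
        if e["type"] == "ISSUE_UNASSIGNED":
            return False
    return False


def next_event_after_smoke_failure(events: List[Event]) -> Optional[str]:
    """Return the event type to write after the latest SMOKE_FAILED, or None to retry."""
    failures = [e for e in events if e["type"] == "SMOKE_FAILED"]
    if not failures:
        return None
    if failures[-1]["data"].get("reason") == "external_dependency":
        return "ESCALATED"  # no retries for external causes
    code_failures = sum(1 for e in failures if e["data"].get("reason") == "code")
    return "AGENT_STUCK" if code_failures >= MAX_SMOKE_RETRIES else None


def sla_minutes(labels: List[str]) -> int:
    """Read the implementation SLA from GitHub labels, falling back to the default."""
    for label in labels:
        if label in SLA_LABELS:
            return SLA_LABELS[label]
    return DEFAULT_SLA_MINUTES
```

Keeping these checks free of I/O means the same functions can run in an agent, in the Conductor, or in a test against a replayed event list.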
Storage notes¶
- `HEARTBEAT` events should be stored in a separate table or pruned aggressively. They are high-volume and not needed for audit after a short retention period
- All other events are append-only and retained indefinitely
Issue Lifecycle¶
```mermaid
%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#e2e8f0", "primaryTextColor": "#1e293b", "primaryBorderColor": "#64748b", "lineColor": "#94a3b8", "secondaryColor": "#e2e8f0", "tertiaryColor": "#cbd5e1", "background": "#0f172a", "mainBkg": "#e2e8f0", "nodeBorder": "#64748b", "clusterBkg": "#1e293b", "clusterBorder": "#475569", "titleColor": "#e2e8f0", "edgeLabelBackground": "#1e293b", "nodeTextColor": "#1e293b"}}}%%
graph LR
    A["ISSUE_APPROVED"] --> B["ISSUE_ASSIGNED"]
    B --> C["WORK_STARTED"]
    C --> D["PR_CREATED"]
    D --> E["CI_PASSED"]
    E --> F["MERGED"]
    F --> G["DEPLOYED_PRODUCTION"]
    G --> H["ISSUE_DONE"]
    H -->|"unblocks next"| A
    style A fill:#34d399,color:#000
    style B fill:#34d399,color:#000
    style C fill:#60a5fa,color:#000
    style D fill:#a78bfa,color:#000
    style E fill:#a78bfa,color:#000
    style F fill:#34d399,color:#000
    style G fill:#fbbf24,color:#000
    style H fill:#94a3b8,color:#000
```
Recovery paths¶
The lifecycle is not always linear. These loops can occur:
- CI fails: `CI_FAILED` → agent fixes → back to `PR_CREATED`
- Smoke fails: `SMOKE_FAILED` → agent fixes on top of main (new PR) → back to `PR_CREATED` (max 3 retries)
- Agent stuck: `AGENT_STUCK` → `ISSUE_UNASSIGNED` → `ISSUE_ASSIGNED` (new agent, fresh branch)
Resilience¶
Event store downtime: Agents buffer events locally and retry with backoff. Audit gaps may occur but correctness is not affected. The Conductor pauses until the store is reachable.
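One way an agent might buffer events and retry with backoff is sketched below in Python. `BufferedWriter`, its parameters, and the use of `ConnectionError` as the failure signal are assumptions of this sketch; a real client would catch whatever exception its database driver raises.

```python
import random
import time
from collections import deque
from typing import Any, Callable, Deque, Dict

Event = Dict[str, Any]


class BufferedWriter:
    """Queue events locally and flush them in order once the store is reachable."""

    def __init__(self, send: Callable[[Event], None],
                 base_delay: float = 1.0, max_delay: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._send = send            # hypothetical function that writes one event
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._sleep = sleep          # injectable so tests do not really wait
        self._pending: Deque[Event] = deque()

    def write(self, event: Event) -> None:
        """Buffer the event; nothing is sent until flush() is called."""
        self._pending.append(event)

    def flush(self, max_attempts: int = 5) -> bool:
        """Try to drain the buffer; return True if it is empty afterwards."""
        attempt = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                attempt += 1
                if attempt >= max_attempts:
                    return False  # keep events buffered for a later flush
                # Exponential backoff with jitter, capped at max_delay.
                delay = min(self._max_delay, self._base_delay * 2 ** (attempt - 1))
                self._sleep(delay + random.uniform(0, delay / 2))
                continue
            self._pending.popleft()  # remove only after a successful send
            attempt = 0
        return True
```

Events leave the buffer only after a successful send, so ordering within one agent is preserved across an outage.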
Agent silent death: If no HEARTBEAT for 5 minutes, the Conductor writes AGENT_STUCK on the agent's behalf and triggers reassignment.
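The timeout check itself is simple. As a hedged Python sketch, assuming the Conductor has already loaded the latest `HEARTBEAT` timestamp per agent into a dict (the `silent_agents` name and the agent ids are hypothetical):

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

HEARTBEAT_TIMEOUT = timedelta(minutes=5)


def silent_agents(last_heartbeat: Dict[str, datetime], now: datetime) -> List[str]:
    """Return ids of agents whose latest HEARTBEAT is at least 5 minutes old."""
    return sorted(
        agent for agent, seen in last_heartbeat.items()
        if now - seen >= HEARTBEAT_TIMEOUT
    )


# Illustrative data: agent-1 has been silent for 6 minutes, agent-2 for 4.
now = datetime(2026, 3, 28, 14, 0, tzinfo=timezone.utc)
beats = {
    "agent-1": now - timedelta(minutes=6),
    "agent-2": now - timedelta(minutes=4),
}
stuck = silent_agents(beats, now)  # only agent-1 exceeds the timeout
```

For each id returned, the Conductor would then write `AGENT_STUCK` and start reassignment as described above.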
Conductor crash: Stateless. Restarts, reads event store, resumes. No data loss. If it crashes after writing ISSUE_ASSIGNED but before the agent reads it, the agent polls for its own assignments on the next cycle.
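Because the Conductor keeps no state of its own, recovery amounts to folding the event log back into a view of each issue. The Python sketch below illustrates the idea under stated assumptions: events are dicts ordered oldest first, `agentId` lives in `data` as in the query examples, and `rebuild_state` and `open_assignments` are hypothetical names.

```python
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]
IssueState = Dict[str, Any]  # {"agent": str or None, "done": bool}


def rebuild_state(events: Iterable[Event]) -> Dict[str, IssueState]:
    """Fold events (oldest first) into the current state per correlation_id."""
    state: Dict[str, IssueState] = {}
    for e in events:
        issue = state.setdefault(e["correlation_id"], {"agent": None, "done": False})
        if e["type"] == "ISSUE_ASSIGNED":
            issue["agent"] = e["data"].get("agentId")
        elif e["type"] == "ISSUE_UNASSIGNED":
            issue["agent"] = None
        elif e["type"] == "ISSUE_DONE":
            issue["done"] = True
    return state


def open_assignments(state: Dict[str, IssueState], agent_id: str) -> List[str]:
    """What an agent would find when it polls for its own unfinished assignments."""
    return sorted(
        cid for cid, issue in state.items()
        if issue["agent"] == agent_id and not issue["done"]
    )
```

Passing only the events up to a given `created_at` yields the state as of that moment, which is the same mechanism as replay.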
Known Risks¶
Mock contract drift: Mocked external services (Twilio, Stripe, Claude API) can diverge from real APIs over time. Mocked E2E will keep passing while real smoke tests start failing. Mitigation: contract tests (Pact or similar) in CI to verify mocks match live API responses. Planned for a future phase.
E2E environment isolation: Each agent runs its own Docker Compose stack for mocked E2E. Shared staging env is only used for post-merge smoke tests (real integrations). Do not run mocked E2E against shared state.
Query Examples¶
What's the status of issue #42?
```sql
SELECT type, actor, created_at FROM events
WHERE correlation_id = (
    SELECT correlation_id FROM events
    WHERE data->>'issueId' = '42' LIMIT 1
)
  -- Skip heartbeats in case they are stored in this table
  AND type <> 'HEARTBEAT'
ORDER BY created_at DESC LIMIT 1;
```
How long from approved to production?
```sql
-- ISSUE_DONE marks the end; it is only valid once VERIFIED exists
SELECT
    (done.created_at - approved.created_at) AS duration
FROM events approved
JOIN events done ON approved.correlation_id = done.correlation_id
WHERE approved.type = 'ISSUE_APPROVED'
  AND done.type = 'ISSUE_DONE';
```
Which agent is most efficient?
```sql
SELECT
    assigned.data->>'agentId' AS agent,
    AVG(done.created_at - assigned.created_at) AS avg_time
FROM events assigned
JOIN events done ON assigned.correlation_id = done.correlation_id
WHERE assigned.type = 'ISSUE_ASSIGNED'
  AND done.type = 'ISSUE_DONE'
  AND NOT EXISTS (
      -- Only count the final assignment (not reassigned ones)
      SELECT 1 FROM events un
      WHERE un.correlation_id = assigned.correlation_id
        AND un.type = 'ISSUE_UNASSIGNED'
        AND un.created_at > assigned.created_at
  )
GROUP BY agent ORDER BY avg_time;
```
What happened during the outage?
```sql
-- Literals without an offset are read in the session time zone
SELECT type, actor, data, created_at FROM events
WHERE created_at BETWEEN '2026-03-28 14:00' AND '2026-03-28 15:00'
ORDER BY created_at;
```
How many times was an issue reassigned?