Event Store

Source of truth for all coordination. Built on PostgreSQL, same pattern as the Tablez platform (Marten event sourcing).


Why

  • Full audit trail of everything that happens
  • Status queries from one place ("what's the status of issue #42?")
  • Metrics (time from approved to production, agent efficiency)
  • Debugging ("what happened during the outage?")
  • Replay (reconstruct state at any point in time)

Schema

CREATE TABLE events (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    type            TEXT NOT NULL,
    status          TEXT NOT NULL DEFAULT 'ACCEPTED',
    actor           TEXT NOT NULL,
    data            JSONB NOT NULL,
    parent_id       UUID,
    correlation_id  UUID NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

  • correlation_id groups all events for one issue
  • parent_id chains related events
  • data contains issue details, agent info, error context
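
To illustrate how correlation_id and parent_id relate, here is a minimal Python sketch that mirrors the columns above in memory. The Event class, the follow_up helper, and the "agent-1" value are hypothetical and exist only for illustration; the issueId and agentId keys are taken from the query examples on this page.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Event:
    """In-memory mirror of one row in the events table (illustrative only)."""
    type: str
    actor: str
    data: Dict[str, Any]
    correlation_id: uuid.UUID
    parent_id: Optional[uuid.UUID] = None
    status: str = "ACCEPTED"
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def follow_up(parent: Event, type: str, actor: str, data: Dict[str, Any]) -> Event:
    """Create an event for the same issue, chained to `parent`."""
    return Event(
        type=type,
        actor=actor,
        data=data,
        correlation_id=parent.correlation_id,  # same issue
        parent_id=parent.id,                   # explicit chain to the previous event
    )


# The first event of an issue starts a new correlation_id.
approved = Event("ISSUE_APPROVED", "CTO", {"issueId": "42"}, correlation_id=uuid.uuid4())
assigned = follow_up(approved, "ISSUE_ASSIGNED", "Conductor",
                     {"issueId": "42", "agentId": "agent-1"})
```

Both events share one correlation_id, while parent_id on the second event points at the id of the first.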

Event Types

Lifecycle events

| Event | Written by | Meaning |
|---|---|---|
| ISSUE_APPROVED | CTO | Issue is approved for work |
| ISSUE_ASSIGNED | Conductor | Assigned to a specific agent |
| ISSUE_UNASSIGNED | Conductor | Removed from agent (for reassignment) |
| WORK_STARTED | Dev agent | Agent has started working |
| BRANCH_CREATED | Dev agent | Feature branch created |
| PR_CREATED | Dev agent | Pull request opened |
| CI_PASSED | Dev agent / CI | Build and unit tests passed |
| CI_FAILED | Dev agent / CI | Build or tests failed |
| REVIEW_PASSED | Review agent | Code review approved |
| REVIEW_DISPUTED | CTO | CTO disagrees with a review (post-merge) |
| HUMAN_GATE_TRIGGERED | CI | Sensitive change, needs human review |
| MERGED | Dev agent | PR merged to main |
| DEPLOYED_STAGING | Dev agent | Deployed to staging |
| SMOKE_PASSED | Dev agent | Smoke tests passed |
| SMOKE_FAILED | Dev agent | Smoke tests failed. data includes reason: "code" \| "external_dependency" |
| DEPLOYED_PRODUCTION | Dev agent | Deployed to production |
| VERIFIED | Dev agent | Production verified |
| ISSUE_DONE | Dev agent | Issue complete (only valid if VERIFIED exists) |

Health and escalation events

| Event | Written by | Meaning |
|---|---|---|
| HEARTBEAT | Dev agent | Agent is alive (every 60 seconds) |
| AGENT_STUCK | Dev agent / Conductor | Agent can't make progress |
| ESCALATED | Conductor | Escalated to CTO (two agents stuck on same issue) |
| HUMAN_GATE_REMINDER | Conductor | Slack reminder sent (deduplicated, every 30 min) |
| MILESTONE_COMPLETE | Conductor | All issues in milestone done |

Validation rules

  • ISSUE_DONE is only valid if VERIFIED exists in the same correlation_id
  • ISSUE_ASSIGNED followed by ISSUE_UNASSIGNED = not currently assigned
  • Reassignment always means fresh branch, previous branch abandoned
  • SMOKE_FAILED retry counting applies only when reason is "code". The dev agent counts the SMOKE_FAILED events with reason: "code" in the correlation_id and writes AGENT_STUCK once the count reaches 3. If reason is "external_dependency", there are no retries and the agent writes ESCALATED immediately
  • Custom implementation SLA per issue via GitHub label: sla:45m, sla:1h, sla:2h. Default is 45 minutes. The higher default accounts for pre-PR test execution time (unit + mocked E2E). CI SLA is always fixed at 15 minutes
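
The ISSUE_DONE, SMOKE_FAILED, and SLA rules above can be sketched as small pure functions. This is a sketch under stated assumptions, not the actual implementation: the function names are hypothetical, events are plain dicts with type and data keys, and SLA labels are assumed to follow the sla:<number>m or sla:<number>h shape shown above.

```python
import re
from typing import Dict, Iterable, List, Optional

DEFAULT_SLA_MINUTES = 45       # default implementation SLA from the rules above
MAX_CODE_SMOKE_FAILURES = 3    # AGENT_STUCK is written at this count


def can_write_issue_done(event_types: Iterable[str]) -> bool:
    """ISSUE_DONE is only valid if VERIFIED exists in the same correlation_id."""
    return "VERIFIED" in set(event_types)


def after_smoke_failure(events: List[Dict]) -> Optional[str]:
    """Decide what to write after the latest SMOKE_FAILED.

    `events` holds all events of one correlation_id in chronological order,
    ending with the SMOKE_FAILED that just happened. Returns the event type
    to write next, or None if the agent should retry.
    """
    latest = events[-1]
    if latest["data"].get("reason") == "external_dependency":
        return "ESCALATED"  # no retries for external failures
    code_failures = sum(
        1 for e in events
        if e["type"] == "SMOKE_FAILED" and e["data"].get("reason") == "code"
    )
    return "AGENT_STUCK" if code_failures >= MAX_CODE_SMOKE_FAILURES else None


def implementation_sla_minutes(labels: Iterable[str]) -> int:
    """Read the first sla:<n>m / sla:<n>h label, else fall back to the default."""
    for label in labels:
        match = re.fullmatch(r"sla:(\d+)([mh])", label)
        if match:
            value = int(match.group(1))
            return value * 60 if match.group(2) == "h" else value
    return DEFAULT_SLA_MINUTES
```

Keeping these checks free of I/O means they can run against any list of events already loaded for a correlation_id.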

Storage notes

  • HEARTBEAT events should be stored in a separate table or pruned aggressively. They are high-volume and not needed for audit after a short retention period
  • All other events are append-only and retained indefinitely

Issue Lifecycle

%%{init: {"theme": "base", "themeVariables": {"primaryColor": "#e2e8f0", "primaryTextColor": "#1e293b", "primaryBorderColor": "#64748b", "lineColor": "#94a3b8", "secondaryColor": "#e2e8f0", "tertiaryColor": "#cbd5e1", "background": "#0f172a", "mainBkg": "#e2e8f0", "nodeBorder": "#64748b", "clusterBkg": "#1e293b", "clusterBorder": "#475569", "titleColor": "#e2e8f0", "edgeLabelBackground": "#1e293b", "nodeTextColor": "#1e293b"}}}%%
graph LR
    A["ISSUE_APPROVED"] --> B["ISSUE_ASSIGNED"]
    B --> C["WORK_STARTED"]
    C --> D["PR_CREATED"]
    D --> E["CI_PASSED"]
    E --> F["MERGED"]
    F --> G["DEPLOYED_PRODUCTION"]
    G --> H["ISSUE_DONE"]
    H -->|"unblocks next"| A

    style A fill:#34d399,color:#000
    style B fill:#34d399,color:#000
    style C fill:#60a5fa,color:#000
    style D fill:#a78bfa,color:#000
    style E fill:#a78bfa,color:#000
    style F fill:#34d399,color:#000
    style G fill:#fbbf24,color:#000
    style H fill:#94a3b8,color:#000

Recovery paths

The lifecycle is not always linear. These loops can occur:

  • CI fails: CI_FAILED → agent fixes → back to PR_CREATED
  • Smoke fails: SMOKE_FAILED → agent fixes on top of main (new PR) → back to PR_CREATED (max 3 retries)
  • Agent stuck: AGENT_STUCK → ISSUE_UNASSIGNED → ISSUE_ASSIGNED (new agent, fresh branch)

Resilience

Event store downtime: Agents buffer events locally and retry with backoff. Audit gaps may occur but correctness is not affected. The Conductor pauses until the store is reachable.
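
One possible shape for the local buffering described above, as a hedged sketch: the BufferedEventWriter class and its parameters are hypothetical, the send callable stands in for whatever actually writes to the store, and a ConnectionError is assumed to signal that the store is unreachable.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict


class BufferedEventWriter:
    """Keeps unsent events in memory and retries with exponential backoff."""

    def __init__(self, send: Callable[[Dict], None], base_delay: float = 1.0,
                 max_delay: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep):
        self._send = send
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._sleep = sleep
        self.pending: Deque[Dict] = deque()

    def write(self, event: Dict, max_attempts: int = 5) -> bool:
        """Buffer the event, then try to drain the buffer in order."""
        self.pending.append(event)
        return self.flush(max_attempts)

    def flush(self, max_attempts: int = 5) -> bool:
        """Send buffered events oldest first; return True once the buffer is empty."""
        delay = self._base_delay
        failures = 0
        while self.pending:
            try:
                self._send(self.pending[0])
            except ConnectionError:
                failures += 1
                if failures >= max_attempts:
                    return False  # give up for now; events stay buffered
                self._sleep(delay)
                delay = min(delay * 2, self._max_delay)
                continue
            self.pending.popleft()
            failures = 0
            delay = self._base_delay
        return True
```

An event is removed from the buffer only after a successful send, so write order is preserved across an outage. The buffer lives in memory here, so events would be lost if the agent process itself died before flushing.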

Agent silent death: If no HEARTBEAT for 5 minutes, the Conductor writes AGENT_STUCK on the agent's behalf and triggers reassignment.
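
The timeout check can be sketched as follows, assuming the Conductor already knows the most recent HEARTBEAT timestamp per agent. The silent_agents function and the agent names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

HEARTBEAT_TIMEOUT = timedelta(minutes=5)


def silent_agents(last_heartbeat: Dict[str, datetime], now: datetime) -> List[str]:
    """Return agents whose latest HEARTBEAT is at least HEARTBEAT_TIMEOUT old."""
    return sorted(
        agent for agent, seen in last_heartbeat.items()
        if now - seen >= HEARTBEAT_TIMEOUT
    )


now = datetime(2026, 3, 28, 14, 0, tzinfo=timezone.utc)
beats = {
    "agent-1": now - timedelta(minutes=6),   # missed several heartbeats
    "agent-2": now - timedelta(seconds=30),  # healthy
}
stale = silent_agents(beats, now)  # candidates for AGENT_STUCK and reassignment
```

With heartbeats every 60 seconds, a 5-minute threshold tolerates a few missed beats before the Conductor steps in.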

Conductor crash: Stateless. Restarts, reads event store, resumes. No data loss. If it crashes after writing ISSUE_ASSIGNED but before the agent reads it, the agent polls for its own assignments on the next cycle.
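
Because the Conductor holds no state of its own, a restart amounts to folding over the stored events. The sketch below rebuilds the current assignments that way; rebuild_assignments is a hypothetical name, events are assumed to arrive ordered by created_at, and the agentId key in data is taken from the efficiency query on this page.

```python
from typing import Dict, Iterable


def rebuild_assignments(events: Iterable[Dict]) -> Dict[str, str]:
    """Replay events in chronological order into {correlation_id: agentId}.

    Only issues that are assigned and not yet done remain in the result.
    """
    assignments: Dict[str, str] = {}
    for event in events:
        cid = event["correlation_id"]
        if event["type"] == "ISSUE_ASSIGNED":
            assignments[cid] = event["data"]["agentId"]
        elif event["type"] in ("ISSUE_UNASSIGNED", "ISSUE_DONE"):
            assignments.pop(cid, None)
    return assignments


history = [
    {"correlation_id": "c1", "type": "ISSUE_ASSIGNED", "data": {"agentId": "agent-1"}},
    {"correlation_id": "c1", "type": "ISSUE_UNASSIGNED", "data": {}},
    {"correlation_id": "c1", "type": "ISSUE_ASSIGNED", "data": {"agentId": "agent-2"}},
    {"correlation_id": "c2", "type": "ISSUE_ASSIGNED", "data": {"agentId": "agent-3"}},
    {"correlation_id": "c2", "type": "ISSUE_DONE", "data": {}},
]
current = rebuild_assignments(history)
```

The same fold also expresses the rule that ISSUE_ASSIGNED followed by ISSUE_UNASSIGNED means the issue is not currently assigned.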


Known Risks

Mock contract drift: Mocked external services (Twilio, Stripe, Claude API) can diverge from real APIs over time. Mocked E2E will keep passing while real smoke tests start failing. Mitigation: contract tests (Pact or similar) in CI to verify mocks match live API responses. Planned for a future phase.

E2E environment isolation: Each agent runs its own Docker Compose stack for mocked E2E. Shared staging env is only used for post-merge smoke tests (real integrations). Do not run mocked E2E against shared state.


Query Examples

What's the status of issue #42?

SELECT type, actor, created_at FROM events
WHERE correlation_id = (
    SELECT correlation_id FROM events
    WHERE data->>'issueId' = '42' LIMIT 1
)
AND type <> 'HEARTBEAT'  -- skip liveness pings so the latest lifecycle event is returned
ORDER BY created_at DESC LIMIT 1;

How long from approved to production?

SELECT
    approved.correlation_id,
    (deployed.created_at - approved.created_at) AS duration
FROM events approved
JOIN events deployed ON approved.correlation_id = deployed.correlation_id
WHERE approved.type = 'ISSUE_APPROVED'
AND deployed.type = 'DEPLOYED_PRODUCTION';

Which agent is most efficient?

SELECT
    assigned.data->>'agentId' AS agent,
    AVG(done.created_at - assigned.created_at) AS avg_time
FROM events assigned
JOIN events done ON assigned.correlation_id = done.correlation_id
WHERE assigned.type = 'ISSUE_ASSIGNED'
AND done.type = 'ISSUE_DONE'
AND NOT EXISTS (
    -- Only count the final assignment (not reassigned ones)
    SELECT 1 FROM events un
    WHERE un.correlation_id = assigned.correlation_id
    AND un.type = 'ISSUE_UNASSIGNED'
    AND un.created_at > assigned.created_at
)
GROUP BY agent ORDER BY avg_time;

What happened during the outage?

SELECT type, actor, data, created_at FROM events
WHERE created_at BETWEEN '2026-03-28 14:00' AND '2026-03-28 15:00'
ORDER BY created_at;

How many times was an issue reassigned?

SELECT correlation_id, COUNT(*) AS reassignments
FROM events
WHERE type = 'ISSUE_UNASSIGNED'
GROUP BY correlation_id
ORDER BY reassignments DESC;