
Tablez — Technical Architecture

Version: 0.3
Date: March 2026
Author: Stig-Johnny Stoebakk / Claude-3


Design Principles

  1. Clean Architecture — Dependencies point inward. Domain has no knowledge of infrastructure.
  2. IDesign Method (Juval Löwy / "Righting Software") — 4-layer decomposition: Clients → Managers → Engines → Resource Access. Layers ordered by volatility.
  3. Mediator Pattern — All cross-cutting communication goes through MediatR. No direct service-to-service calls.
  4. CQRS + Event Sourcing — Commands append events. Queries read from projections. Full audit trail.
  5. Domain-Driven Design — Aggregates, value objects, domain events. The domain layer is pure C# with no framework dependencies.
  6. Event-Driven Architecture — Services communicate through domain events, not direct calls.

Tech Stack

| Layer | Technology | Why |
|---|---|---|
| Runtime | .NET 10 / ASP.NET Core | Team expertise, performance, ecosystem |
| API | Minimal APIs + MediatR | Clean routing, mediator pattern |
| Realtime | SignalR | Floor view, live table status |
| Database | PostgreSQL | Proven, JSON support, row-level locking |
| Event Store | Marten (on PostgreSQL) | Event sourcing + document store, no extra infra |
| ORM | EF Core 10 (migrations only) | Schema migrations for non-event-sourced tables |
| Cache | Valkey (Redis fork, BSD) | No vendor lock-in, pub/sub for events + cache invalidation |
| Background Jobs | Hangfire (PostgreSQL storage) | Queues, delayed jobs, recurring tasks, dashboard |
| State Machine | Stateless (NuGet) | Reservation lifecycle, waitlist flows |
| AI/LLM | Semantic Kernel + Claude API | Tool calling, function calling for AI channels |
| SMS | Twilio | Confirmations, waitlist notifications, 2FA |
| Payment | Stripe | No-show fees, ticketed events |
| Auth | ASP.NET Identity + JWT | Restaurant staff auth |
| Hosting | Kubernetes (managed or k3s) | Scalable, cloud-portable |
| CI/CD | GitHub Actions + ARC (self-hosted) | Build, test, push images on k3s runners |
| GitOps | Flux CD | Lightweight, pure GitOps, image automation, Discord alerts |
| Cluster UI | vCluster Platform (free tier) | Web dashboard at vcluster.invotek.no, Cloudflare Zero Trust |
| DNS/Ingress | Cloudflare Tunnels | No exposed ports, DDoS protection, free TLS |
| Infra as Code | Terraform (Cloudflare provider) | Tunnel, DNS, Zero Trust Access as code |
| Secrets | Bitwarden (personal) | Password management, shared via Bitwarden Send |
| Git Hosting | GitHub (tablez-dev org) | Separate from personal repos |

Volatility-Based Decomposition (IDesign)

Services ordered by change frequency. Top layers change often and auto-deploy. Bottom layers change rarely and require gates.

block-beta
    columns 1
    block:high["HIGH VOLATILITY — auto-deploy"]
        web["tablez-web (dashboard)"] gateway["tablez-api-gateway (routes)"]
    end
    block:freq["FREQUENT CHANGES"]
        ai["tablez-ai (prompts, tools)"]
    end
    block:mod["MODERATE CHANGES"]
        reservation["tablez-reservation"] guest["tablez-guest"] notification["tablez-notification"]
    end
    block:low["LOW VOLATILITY"]
        restaurant["tablez-restaurant"] contracts["tablez-contracts"]
    end
    block:gate["HUMAN GATE"]
        migration["tablez-migration (DB schema)"]
    end

    style high fill:#4CAF50,color:#fff
    style freq fill:#8BC34A,color:#fff
    style mod fill:#FFC107,color:#000
    style low fill:#FF9800,color:#fff
    style gate fill:#F44336,color:#fff

Repository Structure

Separate GitHub organization: tablez-dev/. Multi-repo for AI agent productivity — each repo is a bounded context an agent can own completely.

| Repo | Purpose | Auto-deploy |
|---|---|---|
| tablez-dev/tablez-contracts | Shared DTOs, events, interfaces → NuGet | Yes |
| tablez-dev/tablez-api-gateway | YARP API gateway, routing, auth | Yes |
| tablez-dev/tablez-reservation | Core booking engine + event store | Yes |
| tablez-dev/tablez-guest | Guest CRM, profiles | Yes |
| tablez-dev/tablez-restaurant | Restaurant config, floor plan, schedule | Yes |
| tablez-dev/tablez-ai | LLM gateway, Semantic Kernel, tool API | Yes |
| tablez-dev/tablez-notification | SMS, email, push (Hangfire workers) | Yes |
| tablez-dev/tablez-web | Staff dashboard frontend (Next.js) | Yes |
| tablez-dev/tablez-migration | EF Core + Marten schema migrations | Human gate |
| tablez-dev/tablez-gitops | Flux manifests, overlays, notifications | Human gate on prod |
| tablez-dev/tablez-docs | Specs, architecture, ADRs | N/A |

Shared types: tablez-contracts contains two projects: Tablez.Contracts (DTOs, events, interfaces) and Tablez.Observability (shared OpenTelemetry setup). All services reference contracts via ProjectReference using multi-repo Docker builds — CI checks out tablez-contracts alongside the service repo using a CONTRACTS_TOKEN org secret. See LOCAL-DEV.md section 8 for details.


Architecture Overview (IDesign + Clean Architecture)

Each service follows the same internal 4-layer structure:

graph TB
    subgraph Clients["Clients Layer"]
        REST["REST API<br/>(Minimal APIs)"]
        SignalR["SignalR Hub<br/>(Floor View)"]
        AIGateway["AI Agent Gateway<br/>(Chat/Phone/Email)"]
    end

    subgraph Managers["Managers Layer"]
        RM["ReservationManager"]
        GM["GuestManager"]
        WM["WaitlistManager"]
        NM["NotificationManager"]
        ResM["RestaurantManager"]
    end

    subgraph Engines["Engines Layer — Pure Logic, No I/O"]
        AE["AvailabilityEngine"]
        TAE["TableAssignmentEngine"]
        PE["PricingEngine"]
        WME["WaitlistMatchingEngine"]
        SE["ScheduleEngine"]
        VE["ValidationEngine"]
    end

    subgraph ResourceAccess["Resource Access Layer — 1:1 with External Systems"]
        ES["EventStore<br/>(Marten)"]
        CA["CacheAccessor<br/>(Valkey)"]
        SMS["SmsAccessor<br/>(Twilio)"]
        PA["PaymentAccessor<br/>(Stripe)"]
        LLM["LlmAccessor<br/>(Claude API)"]
    end

    subgraph External["External Systems"]
        PG[(PostgreSQL)]
        VK[(Valkey)]
        APIs["Twilio / Stripe / Claude"]
    end

    Clients -->|MediatR| Managers
    Managers --> Engines
    Managers --> ResourceAccess
    ES --> PG
    CA --> VK
    SMS --> APIs
    PA --> APIs
    LLM --> APIs

Event Sourcing (Marten)

All state changes are stored as immutable events. Current state is derived by replaying events. Marten uses PostgreSQL as the event store — no extra infrastructure.

Domain Events

// Reservation aggregate events
ReservationRequested    { GuestId, PartySize, DateTime, Channel }
ReservationConfirmed    { ReservationId, TableId, ConfirmedBy }
ReservationCancelled    { ReservationId, Reason, CancelledBy }
GuestArrived            { ReservationId, ArrivedAt }
GuestSeated             { ReservationId, TableId, SeatedAt }
GuestCompleted          { ReservationId, CompletedAt }
NoShowMarked            { ReservationId, MarkedAt }

// Waitlist aggregate events
WaitlistEntryCreated    { GuestId, PartySize, TimeWindow }
WaitlistSlotOffered     { WaitlistId, ReservationSlot, ExpiresAt }
WaitlistOfferAccepted   { WaitlistId }
WaitlistOfferExpired    { WaitlistId }
WaitlistOfferDeclined   { WaitlistId }

// Guest aggregate events
GuestProfileCreated     { GuestId, Name, Phone, Email }
GuestProfileUpdated     { GuestId, Field, OldValue, NewValue }
GuestPreferenceAdded    { GuestId, Preference }

// Table/floor events
TableStatusChanged      { TableId, OldStatus, NewStatus, ChangedBy }
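In code, these events are natural candidates for immutable C# records. A sketch of a few representative events (field types are assumptions; the source lists only field names):

```csharp
using System;

// Sketch: domain events as immutable records. Names mirror the list above;
// concrete types (Guid, DateTimeOffset) are illustrative assumptions.
public record ReservationRequested(Guid GuestId, int PartySize, DateTimeOffset DateTime, string Channel);
public record ReservationConfirmed(Guid ReservationId, Guid TableId, string ConfirmedBy);
public record ReservationCancelled(Guid ReservationId, string Reason, string CancelledBy);
public record GuestSeated(Guid ReservationId, Guid TableId, DateTimeOffset SeatedAt);
```

Records give value equality and are immutable by default, which matches the "events are facts, never mutated" constraint of event sourcing.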

Write Side (Commands)

sequenceDiagram
    participant C as Client
    participant M as Manager (MediatR)
    participant E as Engine
    participant ES as EventStore (Marten)
    participant PG as PostgreSQL
    participant VK as Valkey pub/sub

    C->>M: Send(Command)
    M->>E: Validate business rules
    E-->>M: Valid / Invalid
    M->>ES: Append event
    ES->>PG: Persist to event stream
    ES->>VK: Publish domain event
    VK-->>M: Other services react async
    M-->>C: Result
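The write-side flow above can be sketched as a Manager-level MediatR handler. This is a hedged sketch, not the actual implementation: `ConfirmReservationCommand` and `IEventBus` are illustrative names standing in for the real command and the Valkey publish abstraction.

```csharp
using Marten;
using MediatR;

// Assumed event shape (mirrors the Domain Events list).
public record ReservationConfirmed(Guid ReservationId, Guid TableId, string ConfirmedBy);

// Illustrative abstraction over Valkey pub/sub.
public interface IEventBus
{
    Task PublishAsync<T>(string channel, T @event, CancellationToken ct);
}

public record ConfirmReservationCommand(Guid ReservationId, Guid TableId, string ConfirmedBy)
    : IRequest<Guid>;

public class ConfirmReservationHandler(IDocumentSession session, IEventBus bus)
    : IRequestHandler<ConfirmReservationCommand, Guid>
{
    public async Task<Guid> Handle(ConfirmReservationCommand cmd, CancellationToken ct)
    {
        // Engine validation (pure logic, no I/O) would run here.
        var evt = new ReservationConfirmed(cmd.ReservationId, cmd.TableId, cmd.ConfirmedBy);

        session.Events.Append(cmd.ReservationId, evt);   // append to the event stream
        await session.SaveChangesAsync(ct);              // persist to PostgreSQL

        await bus.PublishAsync("reservation.confirmed", evt, ct); // Valkey pub/sub fan-out
        return cmd.ReservationId;
    }
}
```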

Read Side (Projections)

Marten automatically builds read models from events:

flowchart LR
    ES["Event Stream"] --> MP["Marten Projection"] --> RM["Read Model<br/>(PostgreSQL table)"]

| Projection | Built from | Used by |
|---|---|---|
| ReservationView | Reservation events | Staff dashboard, availability check |
| FloorView | Table + reservation events | Live floor view (SignalR) |
| GuestHistory | Guest + reservation events | CRM, AI agent context |
| DailyAvailability | Reservation + schedule events | Booking widget, AI agent |
| WaitlistQueue | Waitlist events | Staff dashboard, waitlist management |
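As a sketch of how `ReservationView` might be built, Marten can maintain a self-aggregating read model whose `Apply(...)` methods it discovers by convention (event/field shapes below are illustrative assumptions; see the Marten docs for registration details):

```csharp
using System;

// Assumed event shapes (mirror the Domain Events list).
public record ReservationConfirmed(Guid ReservationId, Guid TableId, string ConfirmedBy);
public record ReservationCancelled(Guid ReservationId, string Reason, string CancelledBy);
public record GuestSeated(Guid ReservationId, Guid TableId, DateTimeOffset SeatedAt);

public class ReservationView
{
    public Guid Id { get; set; }
    public string Status { get; set; } = "Requested";
    public Guid? TableId { get; set; }

    // Marten discovers Apply(...) methods by convention and replays
    // the stream's events through them to build the read model.
    public void Apply(ReservationConfirmed e) { Status = "Confirmed"; TableId = e.TableId; }
    public void Apply(ReservationCancelled e) { Status = "Cancelled"; }
    public void Apply(GuestSeated e)          { Status = "Seated"; }
}

// Registered at startup (inline = updated in the same transaction as the append):
// options.Projections.Snapshot<ReservationView>(SnapshotLifecycle.Inline);
```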

Cross-Service Event Flow

flowchart LR
    R["tablez-reservation"] -->|ReservationConfirmed| VK["Valkey pub/sub"]
    VK --> N["tablez-notification<br/>SMS confirmation"]
    VK --> G["tablez-guest<br/>update visit count"]
    VK --> W["tablez-web<br/>SignalR floor view"]
    VK --> AI["tablez-ai<br/>update context"]

No direct service-to-service calls. Services communicate exclusively through domain events.

Benefits

| Benefit | Tablez use case |
|---|---|
| Full audit trail | "Who changed this reservation and when?" |
| Temporal queries | "What did the floor look like at 19:30?" |
| Rebuild state | Replay events to debug or recover |
| Event-driven | Services react to events, no coupling |
| Undo/compensation | Cancellation = new event, not DELETE |
| Analytics | Stream events to build dashboards |
| GDPR | Find all events for a guest, redact/delete |

Mediator Flow (MediatR)

Every request flows through the mediator pipeline:

sequenceDiagram
    participant HTTP as HTTP Request
    participant EP as Endpoint (Client)
    participant Val as ValidationBehavior
    participant Log as LoggingBehavior
    participant Cache as CachingBehavior
    participant H as Handler (Manager)
    participant E as Engine
    participant DB as Marten (PostgreSQL)

    HTTP->>EP: Request
    EP->>Val: MediatR.Send(Command)
    Val->>Log: Validated
    Log->>Cache: Logged
    Cache->>H: Cache miss
    H->>E: Business rules
    E-->>H: Result
    H->>DB: Append events / query
    DB-->>H: Persisted / data
    H-->>Cache: Result
    Cache-->>Log: Cached
    Log-->>Val: Logged
    Val-->>EP: Response DTO
    EP-->>HTTP: HTTP Response
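A pipeline behavior like the `LoggingBehavior` step above is a small MediatR class. A minimal sketch (the real behavior presumably does structured logging and timing; this shows only the wrapping pattern):

```csharp
using MediatR;
using Microsoft.Extensions.Logging;

// Sketch: a cross-cutting behavior that wraps every command/query.
public class LoggingBehavior<TRequest, TResponse>(
    ILogger<LoggingBehavior<TRequest, TResponse>> logger)
    : IPipelineBehavior<TRequest, TResponse> where TRequest : notnull
{
    public async Task<TResponse> Handle(
        TRequest request,
        RequestHandlerDelegate<TResponse> next,
        CancellationToken ct)
    {
        logger.LogInformation("Handling {Request}", typeof(TRequest).Name);
        var response = await next();   // hand off to the next behavior, then the handler
        logger.LogInformation("Handled {Request}", typeof(TRequest).Name);
        return response;
    }
}

// Registered once for all requests:
// services.AddTransient(typeof(IPipelineBehavior<,>), typeof(LoggingBehavior<,>));
```

Validation and caching behaviors follow the same shape, which is why the sequence diagram shows them as a chain.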

Caching Strategy (Valkey)

| Data | Strategy | TTL | Invalidation |
|---|---|---|---|
| Availability slots | Event-driven | None | Invalidate on ReservationConfirmed/Cancelled |
| Restaurant config | Write-through | 1 hour | Invalidate on admin update |
| Floor plan / tables | Write-through | 1 hour | Invalidate on admin update |
| Guest profiles | Cache-aside | 15 min | TTL expiry |
| Service schedule | Cache-aside | 30 min | Invalidate on schedule change |
| Active table status | Write-behind | No TTL | Real-time SignalR updates |

Invalidation mechanism: Domain events published to Valkey pub/sub trigger cache eviction. The same events that drive service communication also drive cache invalidation.
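A sketch of that invalidation subscriber using StackExchange.Redis (which speaks to Valkey unchanged). The channel name, event fields (`RestaurantId`, `Date`), and cache key format are illustrative assumptions:

```csharp
using System.Text.Json;
using StackExchange.Redis;

// Assumed event shape: RestaurantId/Date are illustrative fields.
public record ReservationConfirmedEvent(Guid RestaurantId, DateOnly Date);

public static class AvailabilityCacheInvalidator
{
    // Subscribes to the same channel that drives service communication
    // and evicts the availability key for the affected restaurant/date.
    public static async Task StartAsync(ConnectionMultiplexer redis)
    {
        var sub = redis.GetSubscriber();
        var db = redis.GetDatabase();

        await sub.SubscribeAsync(RedisChannel.Literal("reservation.confirmed"), (_, message) =>
        {
            var evt = JsonSerializer.Deserialize<ReservationConfirmedEvent>((string)message!);
            db.KeyDelete($"availability:{evt!.RestaurantId}:{evt.Date:yyyy-MM-dd}");
        });
    }
}
```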


State Machines (Stateless)

Reservation Lifecycle

stateDiagram-v2
    [*] --> Requested
    Requested --> Confirmed : confirm
    Requested --> Cancelled : cancel
    Confirmed --> Arrived : arrive
    Confirmed --> Cancelled : cancel
    Confirmed --> NoShow : no_show
    Arrived --> Seated : seat
    Arrived --> Cancelled : cancel
    Seated --> Completed : complete
    Completed --> [*]
    Cancelled --> [*]
    NoShow --> [*]

Each state transition appends an event to the Marten event store. The state machine validates transitions; the event store records them.
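The lifecycle above maps directly onto Stateless's fluent API. A sketch (enum and trigger names mirror the diagram; this is not the production configuration):

```csharp
using Stateless;

public enum ResState { Requested, Confirmed, Arrived, Seated, Completed, Cancelled, NoShow }
public enum ResTrigger { Confirm, Cancel, Arrive, NoShow, Seat, Complete }

public static class ReservationLifecycle
{
    // Builds the transition table from the state diagram above.
    public static StateMachine<ResState, ResTrigger> Create(ResState initial)
    {
        var sm = new StateMachine<ResState, ResTrigger>(initial);

        sm.Configure(ResState.Requested)
          .Permit(ResTrigger.Confirm, ResState.Confirmed)
          .Permit(ResTrigger.Cancel, ResState.Cancelled);

        sm.Configure(ResState.Confirmed)
          .Permit(ResTrigger.Arrive, ResState.Arrived)
          .Permit(ResTrigger.Cancel, ResState.Cancelled)
          .Permit(ResTrigger.NoShow, ResState.NoShow);

        sm.Configure(ResState.Arrived)
          .Permit(ResTrigger.Seat, ResState.Seated)
          .Permit(ResTrigger.Cancel, ResState.Cancelled);

        sm.Configure(ResState.Seated)
          .Permit(ResTrigger.Complete, ResState.Completed);

        return sm;
    }
}
```

Invalid transitions (e.g. `Complete` from `Confirmed`) throw, so the machine guards the event store: only legal transitions ever become events.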

Waitlist Lifecycle

stateDiagram-v2
    [*] --> Queued
    Queued --> Offered : slot_available
    Queued --> Cancelled : cancel
    Offered --> Converted : accept
    Offered --> Expired : timeout
    Offered --> Declined : decline
    Expired --> Queued : requeue
    Converted --> [*]
    Cancelled --> [*]
    Declined --> [*]

Concurrent Booking (Race Condition Prevention)

Marten supports optimistic concurrency on event streams:

// Marten appends to the reservation stream with expected version
session.Events.Append(reservationId, expectedVersion, new ReservationConfirmed { ... });
await session.SaveChangesAsync();
// Throws ConcurrencyException if another event was appended first

For availability, combine with PostgreSQL advisory locks:

SELECT pg_advisory_xact_lock(hashtext(@restaurant_id || @date || @time));

No distributed locks needed. PostgreSQL handles serialization.
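From C#, the advisory lock can be taken with Npgsql inside a transaction. A sketch with placeholder values; the composite key from the SQL above is collapsed into a single `@key` parameter for brevity:

```csharp
using Npgsql;

// Placeholder values for the sketch.
var connectionString = "Host=postgres;Database=tablez";
var restaurantId = Guid.NewGuid();
var date = "2026-03-14";
var time = "19:00";

await using var conn = new NpgsqlConnection(connectionString);
await conn.OpenAsync();
await using var tx = await conn.BeginTransactionAsync();

// hashtext() maps the composite key into the advisory-lock integer space;
// pg_advisory_xact_lock blocks until free and releases at COMMIT/ROLLBACK.
await using var cmd = new NpgsqlCommand(
    "SELECT pg_advisory_xact_lock(hashtext(@key))", conn, tx);
cmd.Parameters.AddWithValue("key", $"{restaurantId}{date}{time}");
await cmd.ExecuteNonQueryAsync();

// ... check availability and append the reservation event here ...
await tx.CommitAsync();
```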


AI Agent Architecture (LLM Tool API)

Based on Tablez Spec v1.2 Section 11. The AI agent receives minimal static context and fetches everything via tools.

sequenceDiagram
    participant G as Guest (phone/chat/email)
    participant AI as AI Gateway (tablez-ai)
    participant SK as Semantic Kernel
    participant LLM as Claude API
    participant SVC as Backend Services (MediatR)

    G->>AI: Natural language request
    AI->>SK: Orchestrate
    SK->>LLM: Reason + decide tool calls
    LLM-->>SK: Tool call: check_availability
    SK->>SVC: CheckAvailabilityQuery
    SVC-->>SK: Available slots
    SK->>LLM: Result + continue reasoning
    LLM-->>SK: Tool call: create_reservation
    SK->>SVC: CreateReservationCommand
    SVC-->>SK: Reservation confirmed
    SK->>LLM: Format response
    LLM-->>SK: Natural language reply
    SK-->>AI: Response
    AI-->>G: "Your table is booked for 7pm!"

Tool mapping:

| Tool call | MediatR handler |
|---|---|
| check_availability | CheckAvailabilityQuery |
| create_reservation | CreateReservationCommand |
| create_waitlist | CreateWaitlistEntryCommand |
| get_service_overview | GetServiceOverviewQuery |
| get_guest_profile | GetGuestProfileQuery |
| update_guest_profile | UpdateGuestProfileCommand |

Key principle: LLM handles language. The system handles logic. LLM never decides availability — it calls check_availability and reports the result.
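One way the mapping can be wired in Semantic Kernel is an attribute-based plugin that forwards each tool call to MediatR. A sketch; `SlotDto` and the query's fields are illustrative, and the real tool descriptions will differ:

```csharp
using System.ComponentModel;
using MediatR;
using Microsoft.SemanticKernel;

// Illustrative DTO and query shapes.
public record SlotDto(DateTime Start, int TableNumber);
public record CheckAvailabilityQuery(int PartySize, DateOnly Date)
    : IRequest<IReadOnlyList<SlotDto>>;

public class ReservationPlugin(IMediator mediator)
{
    [KernelFunction("check_availability")]
    [Description("Returns available reservation slots for a party size and date.")]
    public Task<IReadOnlyList<SlotDto>> CheckAvailabilityAsync(
        [Description("Number of guests")] int partySize,
        [Description("Requested date")] DateOnly date)
        => mediator.Send(new CheckAvailabilityQuery(partySize, date));
}

// Registered on the kernel at startup:
// kernelBuilder.Plugins.AddFromType<ReservationPlugin>("reservations");
```

The LLM only ever sees the function signature and description; availability itself is computed by the backend, which is exactly the "LLM handles language, system handles logic" split.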


Background Jobs (Hangfire)

| Job | Type | Trigger |
|---|---|---|
| Send SMS confirmation | Fire-and-forget | ReservationConfirmed event |
| Send waitlist offer SMS | Fire-and-forget | WaitlistSlotOffered event |
| Waitlist hold expiry | Delayed (15 min) | WaitlistSlotOffered event |
| No-show cleanup | Recurring (hourly) | Cron |
| Reminder SMS | Delayed (24h before) | ReservationConfirmed event |
| Projection rebuild | Manual | Admin trigger |

All jobs use PostgreSQL storage — no additional infrastructure. Jobs are triggered by domain events via Valkey pub/sub.
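The three job types in the table map onto Hangfire's standard APIs. A sketch; the worker classes and method names are illustrative stand-ins for the real handlers in tablez-notification and friends:

```csharp
using Hangfire;

// Illustrative worker classes.
public class SmsSender      { public void SendConfirmation(Guid reservationId) { /* Twilio call */ } }
public class WaitlistWorker { public void ExpireOffer(Guid waitlistId) { /* append WaitlistOfferExpired */ } }
public class NoShowWorker   { public void MarkNoShows() { /* scan overdue reservations */ } }

public static class JobExamples
{
    public static void Schedule(Guid reservationId, Guid waitlistId)
    {
        // Fire-and-forget: SMS confirmation after ReservationConfirmed.
        BackgroundJob.Enqueue<SmsSender>(s => s.SendConfirmation(reservationId));

        // Delayed: expire the waitlist hold 15 minutes after the offer.
        BackgroundJob.Schedule<WaitlistWorker>(
            w => w.ExpireOffer(waitlistId), TimeSpan.FromMinutes(15));

        // Recurring: hourly no-show cleanup.
        RecurringJob.AddOrUpdate<NoShowWorker>(
            "no-show-cleanup", n => n.MarkNoShows(), Cron.Hourly());
    }
}
```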


Deployment Architecture

Kubernetes

graph TB
    subgraph CF["Cloudflare Edge"]
        tablez["tablez.com"]
        api["api.tablez.com"]
        staff["staff.tablez.com"]
        ws["ws.tablez.com"]
    end

    subgraph K8S["Kubernetes — namespace: tablez"]
        tunnel["cloudflared<br/>(DaemonSet)"]

        subgraph Services["Application Services"]
            GW["api-gateway<br/>2+ replicas"]
            R["reservation<br/>2+ replicas"]
            G["guest<br/>1+ replicas"]
            REST["restaurant<br/>1 replica"]
            AI["ai<br/>2+ replicas"]
            N["notification<br/>1 replica"]
            WEB["web<br/>2+ replicas"]
        end

        subgraph Infra["Infrastructure"]
            PG[(PostgreSQL<br/>Marten event store)]
            VK[(Valkey<br/>cache + pub/sub)]
        end
    end

    tablez --> tunnel
    api --> tunnel
    staff --> tunnel
    ws --> tunnel
    tunnel --> GW
    tunnel --> WEB
    GW --> R
    GW --> G
    GW --> REST
    GW --> AI
    R --> PG
    G --> PG
    R --> VK
    N --> VK

Cloudflare Tunnels (DNS + Ingress)

No exposed ports. No public IPs. No cert-manager. Cloudflare Tunnel runs inside the cluster and routes traffic from Cloudflare's edge.

Active hostnames (managed by Terraform):

| Domain | Target | Purpose | Zero Trust |
|---|---|---|---|
| grafana.invotek.no | grafana.observability:80 | Observability dashboards | Yes (invotekas@gmail.com) |
| vcluster.invotek.no | loft.vcluster-platform:443 | vCluster Platform dashboard | Yes (invotekas@gmail.com) |

Future hostnames (when services are production-ready):

| Domain | Target | Purpose |
|---|---|---|
| tablez.com | tablez-web:3000 | Booking widget |
| api.tablez.com | tablez-api-gateway:8080 | REST API |
| staff.tablez.com | tablez-web:3000 | Staff dashboard |
| ws.tablez.com | tablez-api-gateway:8080 | SignalR |

Works identically on k3s at home and managed Kubernetes in cloud. Tunnel config, DNS, and Zero Trust policies are managed as code via Terraform in tablez-gitops/terraform/.

Terraform (Cloudflare Infrastructure)

DNS records, Cloudflare Tunnel configuration, and Zero Trust Access policies are managed via Terraform — not the Cloudflare dashboard.

tablez-gitops/terraform/
├── versions.tf          # Provider + backend config
├── variables.tf         # Input variables (token, IDs, emails)
├── tunnel.tf            # Tunnel + ingress config + token output
├── dns.tf               # CNAME records
├── access.tf            # Zero Trust Access apps + policies
├── terraform.tfvars     # Local secrets (gitignored)
└── terraform.tfvars.example  # Template for secrets

Adding a new hostname:

  1. Add ingress rule in tunnel.tf
  2. Add CNAME record in dns.tf
  3. (Optional) Add Zero Trust Access app + policy in access.tf
  4. Run terraform plan && terraform apply

Required API token permissions: Account > Cloudflare Tunnel: Edit, Zone > DNS: Edit, Account > Access: Apps and Policies: Edit.

Setup:

cd tablez-gitops/terraform
cp terraform.tfvars.example terraform.tfvars
# Fill in cloudflare_api_token, cloudflare_account_id, cloudflare_zone_id
terraform init
terraform plan
terraform apply
# Deploy tunnel token to k8s:
kubectl create secret generic cloudflared-token -n observability \
  --from-literal=token=$(terraform output -raw tunnel_token)

Deploy Gating

| Condition | Action |
|---|---|
| All tests pass + no DB migration | Auto-deploy to dev/staging/prod |
| DB migration detected in PR | Block deploy, notify Discord, require human approval |
| Production overlay changed | Require PR approval |
| Dev/staging | Always auto-deploy |

Migration detection in CI:

- name: Check for migrations
  run: |
    if git diff HEAD~1 --name-only | grep -q "Migrations/"; then
      echo "REQUIRES_APPROVAL=true" >> $GITHUB_ENV
    fi


GitOps (Flux CD)

Repository Structure

tablez-dev/tablez-gitops/
├── clusters/
│   └── local/
│       └── flux-system/
│           ├── gotk-components.yaml    # Flux controller manifests
│           └── gotk-sync.yaml          # GitRepository + Kustomizations
├── infrastructure/
│   ├── base/
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml              # tablez namespace
│   │   ├── postgres.yaml               # PostgreSQL StatefulSet + Service
│   │   ├── valkey.yaml                 # Valkey Deployment + Service
│   │   ├── arc-system/                 # ARC controller (HelmRelease)
│   │   │   ├── namespace.yaml          # arc-systems + arc-runners namespaces
│   │   │   ├── helmrepository.yaml     # OCI repo for ARC charts
│   │   │   ├── helmrelease.yaml        # ARC controller deployment
│   │   │   └── kustomization.yaml
│   │   ├── arc-runners/                # Runner scale sets (one per repo)
│   │   │   ├── tablez-reservation.yaml # HelmRelease — DinD runner
│   │   │   ├── tablez-guest.yaml
│   │   │   ├── tablez-restaurant.yaml
│   │   │   ├── tablez-notification.yaml
│   │   │   ├── tablez-ai.yaml
│   │   │   ├── tablez-api-gateway.yaml
│   │   │   └── kustomization.yaml
│   │   ├── image-automation/           # Flux image automation
│   │   │   ├── image-repositories.yaml # Scan ghcr.io for new tags
│   │   │   ├── image-policies.yaml     # Select latest main-sha-timestamp tag
│   │   │   ├── image-update-automation.yaml # Commit tag updates to gitops
│   │   │   └── kustomization.yaml
│   │   └── observability/              # LGTM stack (OpenTelemetry)
│   │       ├── namespace.yaml          # observability namespace
│   │       ├── helmrepositories.yaml   # prometheus-community, grafana, open-telemetry
│   │       ├── otel-collector.yaml     # Central telemetry pipeline
│   │       ├── prometheus.yaml         # Metrics (kube-prometheus-stack)
│   │       ├── tempo.yaml              # Traces
│   │       ├── loki.yaml               # Logs
│   │       ├── grafana.yaml            # Dashboards
│   │       ├── cloudflared.yaml       # Cloudflare Tunnel connector
│   │       └── kustomization.yaml
│   └── overlays/
│       └── local/
│           └── kustomization.yaml
├── apps/
│   ├── base/
│   │   ├── reservation/                # Deployment + Service + health checks
│   │   ├── guest/
│   │   ├── restaurant/
│   │   ├── notification/
│   │   ├── ai/
│   │   └── api-gateway/
│   └── overlays/
│       └── local/
│           ├── kustomization.yaml      # References all 6 services
│           ├── reservation/
│           ├── guest/
│           ├── restaurant/
│           ├── notification/
│           ├── ai/
│           └── api-gateway/
├── terraform/                         # Cloudflare infrastructure (not managed by Flux)
│   ├── versions.tf                    # Provider + backend config
│   ├── variables.tf                   # Input variables
│   ├── tunnel.tf                      # Cloudflare Tunnel + ingress
│   ├── dns.tf                         # CNAME records
│   ├── access.tf                      # Zero Trust Access policies
│   └── terraform.tfvars.example       # Template for secrets
└── README.md

Flux reconciles infrastructure first (PostgreSQL, Valkey, ARC controller + runner scale sets, observability stack), then apps (all 6 service deployments). Terraform manages Cloudflare resources (tunnel, DNS, Zero Trust) separately. Everything is self-contained — move to a new cluster by installing Flux, pointing at this repo, creating secrets, and running terraform apply.

Deployment Flow

flowchart LR
    PR["PR merged to main"] --> ARC["ARC runner<br/>(self-hosted, DinD)"]
    ARC --> GHCR["ghcr.io<br/>main-sha-timestamp tag"]
    GHCR --> IR["Flux Image Reflector<br/>scans every 5m"]
    IR --> IA["Flux Image Automation<br/>commits tag update"]
    IA --> FK["Flux Kustomize Controller<br/>reconciles gitops repo"]
    FK --> K8S["Kubernetes<br/>rolling update"]

How it works:

  1. Code merges to main → ARC runner builds and pushes image tagged main-<sha7>-<unix_timestamp>
  2. Flux Image Reflector scans ghcr.io/tablez-dev/* every 5 minutes, detects the new tag
  3. Image Policy selects the tag with the highest timestamp (most recent build)
  4. Image Update Automation commits the new tag to tablez-gitops deployment manifests (via # {"$imagepolicy": ...} setter markers)
  5. Flux Kustomize Controller reconciles and triggers a rolling update

Tag format: main-<sha7>-<unix_timestamp> (e.g., main-a1b2c3d-1773128998). Pure SHA tags are not sortable — the timestamp suffix allows Flux to determine ordering.
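The timestamp-based selection can be expressed in a Flux ImagePolicy, where a named regex group extracts the timestamp and a numerical policy picks the largest. A sketch for one service (resource names and namespace are assumptions):

```yaml
# Sketch: select the newest main-<sha7>-<timestamp> tag by numerical order
# on the extracted timestamp suffix.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: tablez-reservation
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: tablez-reservation
  filterTags:
    pattern: '^main-[a-f0-9]+-(?P<ts>[0-9]+)$'
    extract: '$ts'
  policy:
    numerical:
      order: asc   # highest extracted value wins
```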

ARC runners: DinD sidecar defined manually (not containerMode: dind) to pass --dns=8.8.8.8 to dockerd. All workflows use network: host on docker/build-push-action because BuildKit's bridge network has broken DNS in k3s DinD (see LOCAL-DEV.md section 8).

Helm vs Kustomize

| What | Tool | Why |
|---|---|---|
| All tablez services | Kustomize | Simple, no templating overhead |
| PostgreSQL | Kustomize (raw manifest) | StatefulSet with PVC, simple enough without Helm |
| Valkey | Kustomize (raw manifest) | Single Deployment + Service |
| ARC controller | Helm (via Flux HelmRelease) | Official chart, CRDs managed by Helm |
| ARC runner scale sets | Helm (via Flux HelmRelease) | One HelmRelease per repo, manual DinD sidecar with --dns flags |
| Prometheus | Helm (kube-prometheus-stack) | CRDs, ServiceMonitors, complex config |
| Tempo | Helm (grafana/tempo) | Official chart, storage config |
| Loki | Helm (grafana/loki) | Official chart, single-binary mode |
| Grafana | Helm (grafana/grafana) | Data sources, dashboards as values |
| OTel Collector | Helm (open-telemetry) | Pipeline config, receiver/exporter setup |
| Cloudflared | Kustomize | Simple DaemonSet |

Observability (OpenTelemetry + LGTM Stack)

Full observability from day one. All telemetry flows through the OpenTelemetry Collector, which routes to purpose-built backends.

Architecture

flowchart LR
    subgraph Services["Tablez Services (OTLP)"]
        R["reservation"]
        G["guest"]
        REST["restaurant"]
        N["notification"]
        AI["ai"]
        GW["api-gateway"]
    end

    subgraph Collector["OTel Collector"]
        OC["opentelemetry-collector<br/>Deployment"]
    end

    subgraph Backends["LGTM Stack"]
        P["Prometheus<br/>(metrics)"]
        T["Tempo<br/>(traces)"]
        L["Loki<br/>(logs)"]
        GR["Grafana<br/>(dashboards)"]
    end

    R & G & REST & N & AI & GW -->|OTLP/gRPC| OC
    OC -->|remote write| P
    OC -->|OTLP| T
    OC -->|OTLP/HTTP| L
    GR --> P & T & L

Stack Components

| Component | Purpose | Deployment | Retention |
|---|---|---|---|
| OTel Collector | Central telemetry pipeline — receives, batches, routes | Deployment (1 replica) | N/A (pass-through) |
| Prometheus | Metrics storage + PromQL queries | kube-prometheus-stack (Helm) | 7 days |
| Tempo | Distributed trace storage | Single-binary (Helm) | 72 hours |
| Loki | Log aggregation (label-indexed) | Single-binary (Helm) | 7 days |
| Grafana | Unified dashboards with trace↔log↔metric correlation | Standalone (Helm) | Persistent |

All deployed as Flux HelmReleases in observability namespace inside the vcluster. GitOps source: tablez-gitops/infrastructure/base/observability/.

.NET Instrumentation

Shared project Tablez.Observability (in tablez-contracts repo) provides one-line setup:

// Program.cs — two lines for full observability
builder.Services.AddTablezObservability("Reservation");
builder.Logging.AddTablezLogging();

// Optional: MediatR tracing (wraps every command/query in a span)
builder.Services.AddTransient(typeof(IPipelineBehavior<,>), typeof(MediatRTracingBehavior<,>));

// Optional: Valkey/Redis instrumentation
builder.Services.AddTablezRedisInstrumentation();

Instrumentation Coverage

| Component | Method | What You See |
|---|---|---|
| ASP.NET Core | Auto (built-in) | HTTP request spans, latency metrics |
| HttpClient | Auto (built-in) | Outbound HTTP call spans |
| SignalR | Auto (.NET 9+) | Hub method invocation spans |
| Semantic Kernel | Auto (native) | LLM call spans, token usage |
| MediatR | MediatRTracingBehavior | Command/query spans with type info |
| Marten | MartenTracing helpers | Event append, aggregate load, query spans |
| Valkey | AddTablezRedisInstrumentation() | Redis command spans |
| Hangfire | Manual spans | Background job execution spans |
| .NET Runtime | RuntimeInstrumentation | GC, threadpool, allocation metrics |

End-to-End Trace Propagation

Every request gets a single TraceId that follows it across all services. This is the most important observability requirement — you can search by TraceId in Grafana and see the full journey.

Within a service: Automatic. The OTel SDK propagates trace context through ASP.NET Core → MediatR → Marten → Valkey. Logs emitted in a traced context automatically include TraceId and SpanId.

HTTP (API Gateway → Backend services): Automatic. HttpClient instrumentation injects the W3C traceparent header. The receiving service's ASP.NET Core instrumentation extracts it.

Valkey pub/sub (Service → Service events): Manual — Valkey pub/sub does not propagate trace context. We solve this with TracedEventEnvelope, which wraps every domain event with the W3C trace context:

sequenceDiagram
    participant R as Reservation Service
    participant VK as Valkey pub/sub
    participant N as Notification Service
    participant G as Guest Service

    Note over R: TraceId: abc-123
    R->>R: TracedEventEnvelope.Wrap("ReservationConfirmed", event)
    Note over R: Envelope includes traceParent: 00-abc-123-...
    R->>VK: PUBLISH reservation.confirmed {envelope}
    VK->>N: {envelope with traceParent}
    VK->>G: {envelope with traceParent}
    N->>N: envelope.StartConsumerActivity("notification")
    Note over N: Continues TraceId: abc-123
    G->>G: envelope.StartConsumerActivity("guest")
    Note over G: Continues TraceId: abc-123

// Publisher (reservation service)
var envelope = TracedEventEnvelope.Wrap("ReservationConfirmed", domainEvent);
await redis.PublishAsync("reservation.confirmed", envelope.Serialize());

// Consumer (notification service)
var envelope = TracedEventEnvelope.Deserialize(message);
using var activity = envelope.StartConsumerActivity("tablez-notification");
var evt = envelope.GetPayload<ReservationConfirmed>();
// All spans created here share the original TraceId
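A sketch of what `TracedEventEnvelope` can look like using only `System.Diagnostics` and `System.Text.Json` (the real implementation lives in Tablez.Observability; field names here are illustrative):

```csharp
using System.Diagnostics;
using System.Text.Json;

public record TracedEventEnvelope(string EventType, string TraceParent, string Payload)
{
    // Captures the current W3C traceparent (00-<traceid>-<spanid>-<flags>)
    // alongside the serialized event.
    public static TracedEventEnvelope Wrap<T>(string eventType, T payload) =>
        new(eventType,
            Activity.Current?.Id ?? string.Empty,
            JsonSerializer.Serialize(payload));

    public string Serialize() => JsonSerializer.Serialize(this);

    public static TracedEventEnvelope Deserialize(string json) =>
        JsonSerializer.Deserialize<TracedEventEnvelope>(json)!;

    public T GetPayload<T>() => JsonSerializer.Deserialize<T>(Payload)!;

    // Starts a consumer span parented to the publisher's span,
    // so the original TraceId continues across the pub/sub hop.
    public Activity? StartConsumerActivity(string consumerName)
    {
        var activity = new Activity($"{consumerName} consume {EventType}");
        if (!string.IsNullOrEmpty(TraceParent))
            activity.SetParentId(TraceParent);
        return activity.Start();
    }
}
```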

Result: In Grafana Tempo, searching for a single TraceId shows the complete flow: HTTP request → MediatR command → Marten event append → Valkey publish → notification SMS send → guest profile update.

Custom Instrumentation Example

// Marten event store tracing
using var activity = MartenTracing.StartAppendEvents("Reservation", reservationId);
session.Events.Append(reservationId, new ReservationConfirmed { ... });
await session.SaveChangesAsync();

// Custom business metric
var meter = TablezTelemetry.CreateMeter("Reservation");
var bookingCounter = meter.CreateCounter<long>("reservations.created");
bookingCounter.Add(1, new KeyValuePair<string, object?>("channel", "web"));

Grafana Access

| Environment | URL | Auth |
|---|---|---|
| Remote | https://grafana.invotek.no | Cloudflare Zero Trust (invotekas@gmail.com) |
| Local | `kubectl port-forward -n observability svc/grafana 3000:80` | admin / tablez-local |

Pre-configured data sources with trace-to-log correlation: click a trace span in Tempo → jump to related logs in Loki. Service map auto-generated from trace data.

Cloud Migration Path

When moving to managed cloud, only the Collector exporter config changes — zero application code changes:

# OTel Collector config — add cloud exporter alongside self-hosted
exporters:
  otlp/tempo:
    endpoint: tempo.observability:4317      # Keep self-hosted
  googlecloud:                               # Add cloud
    project: "tablez-prod"

service:
  pipelines:
    traces:
      exporters: [otlp/tempo, googlecloud]  # Dual-export during migration

Cloud Migration Strategy

Built to run on k3s today, portable to managed cloud with startup credits.

| Component | Now (bootstrap) | With cloud credits |
|---|---|---|
| Kubernetes | k3s (self-hosted) | AKS / GKE (managed) |
| PostgreSQL | Bitnami on k3s | Azure Database / Cloud SQL |
| Valkey | Bitnami on k3s | Azure Cache / Memorystore |
| DNS/Ingress | Cloudflare Tunnel | Same (unchanged) |
| Container registry | ghcr.io | Same or ACR/GCR |
| GitOps | Flux | Same (unchanged) |
| Observability | OTel + LGTM (self-hosted) | Same or swap exporter to cloud-native |

Migration day:

  1. Provision managed Kubernetes + managed PostgreSQL + Valkey (Terraform)
  2. flux bootstrap to new cluster
  3. Update overlays/production/ with new connection strings
  4. Push to gitops repo → Flux deploys everything
  5. Switch Cloudflare Tunnel to new cluster
  6. Done. No code changes.

Cloud credits to target:

| Program | Credits | Path |
|---|---|---|
| Microsoft for Startups | $150k Azure | Apply directly |
| Google for Startups | $100k GCP | Via StartupLab (already in contact) |
| AWS Activate | $100k | Via accelerator or VC |

MCP Surface (AI-Native API)

Tablez exposes an MCP server so external AI agents can book tables directly. This is the competitive moat — no other platform offers this.

External AI Agent (ChatGPT, Claude, Siri, Google Assistant)
  → MCP Protocol
    → Tablez MCP Server
      → Same MediatR commands/queries as internal AI agent

The MCP server is a thin wrapper around the same mediator pipeline. One codebase serves both internal AI channels and external AI agents.

See /mcp-api-surface skill for implementation pattern.


Phase 1 MVP Scope

| Include | Exclude (Phase 2) |
|---|---|
| Restaurant config + users | Floor plan canvas editor |
| Web booking form | AI phone agent |
| Staff dashboard (list view) | AI email agent |
| Basic table management | Google Reserve |
| Guest database (manual) | LLM guest enrichment |
| SMS confirmations | No-show fee processing |
| Waitlist (manual) | MCP server for external agents |
| AI chat widget | Ticketed events |
| Availability engine | Dynamic pricing |
| Reservation lifecycle | Multi-language AI |
| Event sourcing from day 1 | Projection rebuild tooling |

Open Decisions

| Decision | Options | Leaning |
|---|---|---|
| Frontend | Blazor vs React vs Next.js | TBD — depends on team |
| LLM provider | Claude vs GPT-4o vs Gemini | Claude (best tool calling) |
| Kubernetes | Managed (AKS/GKE) vs k3s | k3s now, managed with credits |
| Voice AI | Pipecat + Deepgram + ElevenLabs vs managed (Vapi) | TBD |
| Monitoring | ~~OpenTelemetry + Grafana vs cloud-native~~ | Decided: OpenTelemetry + LGTM stack |
| NuGet feed | GitHub Packages vs self-hosted | GitHub Packages |

References

  • Tablez Spec v1.2 (Tabelz AS)
  • projects/tablez/ANALYSIS.md — Gap analysis
  • projects/tablez/COMPETITIVE-LANDSCAPE.md — Market research
  • Juval Löwy — "Righting Software" (IDesign method)
  • .claude/skills/idesign-architecture/ — IDesign reference
  • Marten documentation — https://martendb.io