Tablez — Technical Architecture¶
Version: 0.3 Date: March 2026 Author: Stig-Johnny Stoebakk / Claude-3
Design Principles¶
- Clean Architecture — Dependencies point inward. Domain has no knowledge of infrastructure.
- IDesign Method (Juval Löwy / "Righting Software") — 4-layer decomposition: Clients → Managers → Engines → Resource Access. Layers ordered by volatility.
- Mediator Pattern — All cross-cutting communication goes through MediatR. No direct service-to-service calls.
- CQRS + Event Sourcing — Commands append events. Queries read from projections. Full audit trail.
- Domain-Driven Design — Aggregates, value objects, domain events. The domain layer is pure C# with no framework dependencies.
- Event-Driven Architecture — Services communicate through domain events, not direct calls.
Tech Stack¶
| Layer | Technology | Why |
|---|---|---|
| Runtime | .NET 10 / ASP.NET Core | Team expertise, performance, ecosystem |
| API | Minimal APIs + MediatR | Clean routing, mediator pattern |
| Realtime | SignalR | Floor view, live table status |
| Database | PostgreSQL | Proven, JSON support, row-level locking |
| Event Store | Marten (on PostgreSQL) | Event sourcing + document store, no extra infra |
| ORM | EF Core 10 (migrations only) | Schema migrations for non-event-sourced tables |
| Cache | Valkey (Redis fork, BSD) | No vendor lock-in, pub/sub for events + cache invalidation |
| Background Jobs | Hangfire (PostgreSQL storage) | Queues, delayed jobs, recurring tasks, dashboard |
| State Machine | Stateless (NuGet) | Reservation lifecycle, waitlist flows |
| AI/LLM | Semantic Kernel + Claude API | Tool calling, function calling for AI channels |
| SMS | Twilio | Confirmations, waitlist notifications, 2FA |
| Payment | Stripe | No-show fees, ticketed events |
| Auth | ASP.NET Identity + JWT | Restaurant staff auth |
| Hosting | Kubernetes (managed or k3s) | Scalable, cloud-portable |
| CI/CD | GitHub Actions + ARC (self-hosted) | Build, test, push images on k3s runners |
| GitOps | Flux CD | Lightweight, pure GitOps, image automation, Discord alerts |
| Cluster UI | vCluster Platform (free tier) | Web dashboard at vcluster.invotek.no, Cloudflare Zero Trust |
| DNS/Ingress | Cloudflare Tunnels | No exposed ports, DDoS protection, free TLS |
| Infra as Code | Terraform (Cloudflare provider) | Tunnel, DNS, Zero Trust Access as code |
| Secrets | Bitwarden (personal) | Password management, shared via Bitwarden Send |
| Git Hosting | GitHub (tablez-dev org) | Separate from personal repos |
Volatility-Based Decomposition (IDesign)¶
Services ordered by change frequency. Top layers change often and auto-deploy. Bottom layers change rarely and require gates.
block-beta
columns 1
block:high["HIGH VOLATILITY — auto-deploy"]
web["tablez-web (dashboard)"] gateway["tablez-api-gateway (routes)"]
end
block:freq["FREQUENT CHANGES"]
ai["tablez-ai (prompts, tools)"]
end
block:mod["MODERATE CHANGES"]
reservation["tablez-reservation"] guest["tablez-guest"] notification["tablez-notification"]
end
block:low["LOW VOLATILITY"]
restaurant["tablez-restaurant"] contracts["tablez-contracts"]
end
block:gate["HUMAN GATE"]
migration["tablez-migration (DB schema)"]
end
style high fill:#4CAF50,color:#fff
style freq fill:#8BC34A,color:#fff
style mod fill:#FFC107,color:#000
style low fill:#FF9800,color:#fff
style gate fill:#F44336,color:#fff
Repository Structure¶
Separate GitHub organization: tablez-dev/. Multi-repo for AI agent productivity — each repo is a bounded context an agent can own completely.
| Repo | Purpose | Auto-deploy |
|---|---|---|
| tablez-dev/tablez-contracts | Shared DTOs, events, interfaces → NuGet | Yes |
| tablez-dev/tablez-api-gateway | YARP API gateway, routing, auth | Yes |
| tablez-dev/tablez-reservation | Core booking engine + event store | Yes |
| tablez-dev/tablez-guest | Guest CRM, profiles | Yes |
| tablez-dev/tablez-restaurant | Restaurant config, floor plan, schedule | Yes |
| tablez-dev/tablez-ai | LLM gateway, Semantic Kernel, tool API | Yes |
| tablez-dev/tablez-notification | SMS, email, push (Hangfire workers) | Yes |
| tablez-dev/tablez-web | Staff dashboard frontend (Next.js) | Yes |
| tablez-dev/tablez-migration | EF Core + Marten schema migrations | Human gate |
| tablez-dev/tablez-gitops | Flux manifests, overlays, notifications | Human gate on prod |
| tablez-dev/tablez-docs | Specs, architecture, ADRs | N/A |
Shared types: tablez-contracts contains two projects: Tablez.Contracts (DTOs, events, interfaces) and Tablez.Observability (shared OpenTelemetry setup). All services reference contracts via ProjectReference using multi-repo Docker builds — CI checks out tablez-contracts alongside the service repo using a CONTRACTS_TOKEN org secret. See LOCAL-DEV.md section 8 for details.
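As an illustration of that multi-repo checkout, a CI step might look like the following (step names and the checkout path are assumptions; the authoritative setup is in LOCAL-DEV.md section 8):

```yaml
# Illustrative CI snippet: check out tablez-contracts next to the service repo
# so ProjectReference paths resolve inside the Docker build context.
- name: Checkout service repo
  uses: actions/checkout@v4

- name: Checkout shared contracts
  uses: actions/checkout@v4
  with:
    repository: tablez-dev/tablez-contracts
    token: ${{ secrets.CONTRACTS_TOKEN }}
    path: tablez-contracts
```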
Architecture Overview (IDesign + Clean Architecture)¶
Each service follows the same internal 4-layer structure:
graph TB
subgraph Clients["Clients Layer"]
REST["REST API<br/>(Minimal APIs)"]
SignalR["SignalR Hub<br/>(Floor View)"]
AIGateway["AI Agent Gateway<br/>(Chat/Phone/Email)"]
end
subgraph Managers["Managers Layer"]
RM["ReservationManager"]
GM["GuestManager"]
WM["WaitlistManager"]
NM["NotificationManager"]
ResM["RestaurantManager"]
end
subgraph Engines["Engines Layer — Pure Logic, No I/O"]
AE["AvailabilityEngine"]
TAE["TableAssignmentEngine"]
PE["PricingEngine"]
WME["WaitlistMatchingEngine"]
SE["ScheduleEngine"]
VE["ValidationEngine"]
end
subgraph ResourceAccess["Resource Access Layer — 1:1 with External Systems"]
ES["EventStore<br/>(Marten)"]
CA["CacheAccessor<br/>(Valkey)"]
SMS["SmsAccessor<br/>(Twilio)"]
PA["PaymentAccessor<br/>(Stripe)"]
LLM["LlmAccessor<br/>(Claude API)"]
end
subgraph External["External Systems"]
PG[(PostgreSQL)]
VK[(Valkey)]
APIs["Twilio / Stripe / Claude"]
end
Clients -->|MediatR| Managers
Managers --> Engines
Managers --> ResourceAccess
ES --> PG
CA --> VK
SMS --> APIs
PA --> APIs
LLM --> APIs
Event Sourcing (Marten)¶
All state changes are stored as immutable events. Current state is derived by replaying events. Marten uses PostgreSQL as the event store — no extra infrastructure.
Domain Events¶
// Reservation aggregate events
ReservationRequested { GuestId, PartySize, DateTime, Channel }
ReservationConfirmed { ReservationId, TableId, ConfirmedBy }
ReservationCancelled { ReservationId, Reason, CancelledBy }
GuestArrived { ReservationId, ArrivedAt }
GuestSeated { ReservationId, TableId, SeatedAt }
GuestCompleted { ReservationId, CompletedAt }
NoShowMarked { ReservationId, MarkedAt }
// Waitlist aggregate events
WaitlistEntryCreated { GuestId, PartySize, TimeWindow }
WaitlistSlotOffered { WaitlistId, ReservationSlot, ExpiresAt }
WaitlistOfferAccepted { WaitlistId }
WaitlistOfferExpired { WaitlistId }
WaitlistOfferDeclined { WaitlistId }
// Guest aggregate events
GuestProfileCreated { GuestId, Name, Phone, Email }
GuestProfileUpdated { GuestId, Field, OldValue, NewValue }
GuestPreferenceAdded { GuestId, Preference }
// Table/floor events
TableStatusChanged { TableId, OldStatus, NewStatus, ChangedBy }
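In code, events like these would typically be immutable C# records that Marten serializes into the stream. A sketch of the first few (property types are assumptions; the list above is the source of truth for names):

```csharp
// Domain events as immutable records — Marten stores these verbatim in the event stream.
public record ReservationRequested(Guid GuestId, int PartySize, DateTimeOffset DateTime, string Channel);
public record ReservationConfirmed(Guid ReservationId, Guid TableId, string ConfirmedBy);
public record ReservationCancelled(Guid ReservationId, string Reason, string CancelledBy);
```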
Write Side (Commands)¶
sequenceDiagram
participant C as Client
participant M as Manager (MediatR)
participant E as Engine
participant ES as EventStore (Marten)
participant PG as PostgreSQL
participant VK as Valkey pub/sub
C->>M: Send(Command)
M->>E: Validate business rules
E-->>M: Valid / Invalid
M->>ES: Append event
ES->>PG: Persist to event stream
ES->>VK: Publish domain event
VK-->>M: Other services react async
M-->>C: Result
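The write side above can be sketched as a MediatR handler in the Manager layer. All type names here are illustrative, not the actual codebase:

```csharp
// Sketch: Manager validates via an Engine (pure logic), then appends to the event store.
public sealed class ConfirmReservationHandler
    : IRequestHandler<ConfirmReservationCommand, Result>
{
    private readonly IDocumentSession _session;   // Marten
    private readonly IValidationEngine _engine;   // Engines layer: no I/O

    public ConfirmReservationHandler(IDocumentSession session, IValidationEngine engine)
        => (_session, _engine) = (session, engine);

    public async Task<Result> Handle(ConfirmReservationCommand cmd, CancellationToken ct)
    {
        var check = _engine.ValidateConfirmation(cmd);   // business rules only
        if (!check.IsValid) return Result.Invalid(check);

        // Append the event; Marten persists it to the stream on SaveChanges
        _session.Events.Append(cmd.ReservationId,
            new ReservationConfirmed(cmd.ReservationId, cmd.TableId, cmd.ConfirmedBy));
        await _session.SaveChangesAsync(ct);
        return Result.Ok();
    }
}
```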
Read Side (Projections)¶
Marten automatically builds read models from events:
flowchart LR
ES["Event Stream"] --> MP["Marten Projection"] --> RM["Read Model<br/>(PostgreSQL table)"]
| Projection | Built from | Used by |
|---|---|---|
| ReservationView | Reservation events | Staff dashboard, availability check |
| FloorView | Table + reservation events | Live floor view (SignalR) |
| GuestHistory | Guest + reservation events | CRM, AI agent context |
| DailyAvailability | Reservation + schedule events | Booking widget, AI agent |
| WaitlistQueue | Waitlist events | Staff dashboard, waitlist management |
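A projection could be declared with Marten's single-stream projection support. A sketch, assuming the event shapes listed earlier and an illustrative `ReservationView` document:

```csharp
// Sketch: Marten applies events in stream order to build the read model.
public sealed class ReservationViewProjection : SingleStreamProjection<ReservationView>
{
    public void Apply(ReservationRequested e, ReservationView view)
    {
        view.PartySize = e.PartySize;
        view.Status = "Requested";
    }

    public void Apply(ReservationConfirmed e, ReservationView view)
    {
        view.TableId = e.TableId;
        view.Status = "Confirmed";
    }
}
```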
Cross-Service Event Flow¶
flowchart LR
R["tablez-reservation"] -->|ReservationConfirmed| VK["Valkey pub/sub"]
VK --> N["tablez-notification<br/>SMS confirmation"]
VK --> G["tablez-guest<br/>update visit count"]
VK --> W["tablez-web<br/>SignalR floor view"]
VK --> AI["tablez-ai<br/>update context"]
No direct service-to-service calls. Services communicate exclusively through domain events.
Benefits¶
| Benefit | Tablez use case |
|---|---|
| Full audit trail | "Who changed this reservation and when?" |
| Temporal queries | "What did the floor look like at 19:30?" |
| Rebuild state | Replay events to debug or recover |
| Event-driven | Services react to events, no coupling |
| Undo/compensation | Cancellation = new event, not DELETE |
| Analytics | Stream events to build dashboards |
| GDPR | Find all events for a guest, redact/delete |
Mediator Flow (MediatR)¶
Every request flows through the mediator pipeline:
sequenceDiagram
participant HTTP as HTTP Request
participant EP as Endpoint (Client)
participant Val as ValidationBehavior
participant Log as LoggingBehavior
participant Cache as CachingBehavior
participant H as Handler (Manager)
participant E as Engine
participant DB as Marten (PostgreSQL)
HTTP->>EP: Request
EP->>Val: MediatR.Send(Command)
Val->>Log: Validated
Log->>Cache: Logged
Cache->>H: Cache miss
H->>E: Business rules
E-->>H: Result
H->>DB: Append events / query
DB-->>H: Persisted / data
H-->>Cache: Result
Cache-->>Log: Cached
Log-->>Val: Logged
Val-->>EP: Response DTO
EP-->>HTTP: HTTP Response
Caching Strategy (Valkey)¶
| Data | Strategy | TTL | Invalidation |
|---|---|---|---|
| Availability slots | Event-driven | None | Invalidate on ReservationConfirmed/Cancelled |
| Restaurant config | Write-through | 1 hour | Invalidate on admin update |
| Floor plan / tables | Write-through | 1 hour | Invalidate on admin update |
| Guest profiles | Cache-aside | 15 min | TTL expiry |
| Service schedule | Cache-aside | 30 min | Invalidate on schedule change |
| Active table status | Write-behind | No TTL | Real-time SignalR updates |
Invalidation mechanism: Domain events published to Valkey pub/sub trigger cache eviction. Same events that drive service communication also drive cache invalidation.
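A minimal sketch of that invalidation path, assuming StackExchange.Redis against Valkey; the channel name, key scheme, and event fields are illustrative:

```csharp
// Sketch: the same pub/sub message that drives service reactions also evicts cache keys.
var subscriber = redis.GetSubscriber();   // StackExchange.Redis client, works with Valkey
await subscriber.SubscribeAsync("reservation.confirmed", (_, message) =>
{
    var evt = JsonSerializer.Deserialize<ReservationConfirmedMessage>(message.ToString());
    // Evict the availability slots affected by this reservation
    // (RestaurantId/Date fields are assumed for illustration)
    cache.KeyDelete($"availability:{evt!.RestaurantId}:{evt.Date:yyyy-MM-dd}");
});
```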
State Machines (Stateless)¶
Reservation Lifecycle¶
stateDiagram-v2
[*] --> Requested
Requested --> Confirmed : confirm
Requested --> Cancelled : cancel
Confirmed --> Arrived : arrive
Confirmed --> Cancelled : cancel
Confirmed --> NoShow : no_show
Arrived --> Seated : seat
Arrived --> Cancelled : cancel
Seated --> Completed : complete
Completed --> [*]
Cancelled --> [*]
NoShow --> [*]
Each state transition appends an event to the Marten event store. The state machine validates transitions; the event store records them.
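Sketched with the Stateless API (the state and trigger enum names are assumptions):

```csharp
// Sketch of the reservation lifecycle using the Stateless NuGet package.
var machine = new StateMachine<ReservationState, Trigger>(ReservationState.Requested);

machine.Configure(ReservationState.Requested)
    .Permit(Trigger.Confirm, ReservationState.Confirmed)
    .Permit(Trigger.Cancel, ReservationState.Cancelled);

machine.Configure(ReservationState.Confirmed)
    .Permit(Trigger.Arrive, ReservationState.Arrived)
    .Permit(Trigger.Cancel, ReservationState.Cancelled)
    .Permit(Trigger.NoShow, ReservationState.NoShow);

machine.Configure(ReservationState.Arrived)
    .Permit(Trigger.Seat, ReservationState.Seated)
    .Permit(Trigger.Cancel, ReservationState.Cancelled);

machine.Configure(ReservationState.Seated)
    .Permit(Trigger.Complete, ReservationState.Completed);

// Invalid transitions throw, so illegal moves (e.g. seating a Requested
// reservation) never reach the event store.
machine.Fire(Trigger.Confirm);   // Requested -> Confirmed; append ReservationConfirmed
```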
Waitlist Lifecycle¶
stateDiagram-v2
[*] --> Queued
Queued --> Offered : slot_available
Queued --> Cancelled : cancel
Offered --> Converted : accept
Offered --> Expired : timeout
Offered --> Declined : decline
Expired --> Queued : requeue
Converted --> [*]
Cancelled --> [*]
Declined --> [*]
Concurrent Booking (Race Condition Prevention)¶
Marten supports optimistic concurrency on event streams:
// Marten appends to the reservation stream with expected version
session.Events.Append(reservationId, expectedVersion, new ReservationConfirmed { ... });
await session.SaveChangesAsync();
// Throws ConcurrencyException if another event was appended first
For availability checks, combine optimistic stream versioning with PostgreSQL advisory locks. No distributed locks needed; PostgreSQL handles serialization.
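A sketch of the advisory-lock approach, assuming Npgsql and an illustrative lock-key scheme derived from restaurant and slot:

```csharp
// Sketch: serialize availability checks per restaurant + slot with a
// transaction-scoped advisory lock, so concurrent bookings for the same
// slot queue up instead of double-booking.
await using var conn = new NpgsqlConnection(connectionString);
await conn.OpenAsync();
await using var tx = await conn.BeginTransactionAsync();

// pg_advisory_xact_lock blocks until acquired and is released at commit/rollback
long lockKey = HashCode.Combine(restaurantId, slotStart);
await using (var cmd = new NpgsqlCommand("SELECT pg_advisory_xact_lock(@key)", conn, tx))
{
    cmd.Parameters.AddWithValue("key", lockKey);
    await cmd.ExecuteNonQueryAsync();
}

// ...check availability and append the reservation event while holding the lock...
await tx.CommitAsync();
```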
AI Agent Architecture (LLM Tool API)¶
Based on Tablez Spec v1.2 Section 11. The AI agent receives minimal static context and fetches everything via tools.
sequenceDiagram
participant G as Guest (phone/chat/email)
participant AI as AI Gateway (tablez-ai)
participant SK as Semantic Kernel
participant LLM as Claude API
participant SVC as Backend Services (MediatR)
G->>AI: Natural language request
AI->>SK: Orchestrate
SK->>LLM: Reason + decide tool calls
LLM-->>SK: Tool call: check_availability
SK->>SVC: CheckAvailabilityQuery
SVC-->>SK: Available slots
SK->>LLM: Result + continue reasoning
LLM-->>SK: Tool call: create_reservation
SK->>SVC: CreateReservationCommand
SVC-->>SK: Reservation confirmed
SK->>LLM: Format response
LLM-->>SK: Natural language reply
SK-->>AI: Response
AI-->>G: "Your table is booked for 7pm!"
Tool mapping:
| Tool call | MediatR handler |
|---|---|
| check_availability | CheckAvailabilityQuery |
| create_reservation | CreateReservationCommand |
| create_waitlist | CreateWaitlistEntryCommand |
| get_service_overview | GetServiceOverviewQuery |
| get_guest_profile | GetGuestProfileQuery |
| update_guest_profile | UpdateGuestProfileCommand |
Key principle: LLM handles language. The system handles logic. LLM never decides availability — it calls check_availability and reports the result.
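A sketch of how one tool could map onto a MediatR query via Semantic Kernel's attribute-based functions (type names, DTOs, and registration details are illustrative):

```csharp
// Sketch: expose a MediatR query as an SK tool. The LLM calls the tool;
// the system computes the answer.
public sealed class ReservationTools
{
    private readonly IMediator _mediator;
    public ReservationTools(IMediator mediator) => _mediator = mediator;

    [KernelFunction("check_availability")]
    [Description("Check available reservation slots for a date and party size.")]
    public Task<IReadOnlyList<SlotDto>> CheckAvailabilityAsync(DateOnly date, int partySize)
        => _mediator.Send(new CheckAvailabilityQuery(date, partySize));
}
```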
Background Jobs (Hangfire)¶
| Job | Type | Trigger |
|---|---|---|
| Send SMS confirmation | Fire-and-forget | ReservationConfirmed event |
| Send waitlist offer SMS | Fire-and-forget | WaitlistSlotOffered event |
| Waitlist hold expiry | Delayed (15 min) | WaitlistSlotOffered event |
| No-show cleanup | Recurring (hourly) | Cron |
| Reminder SMS | Delayed (24h before) | ReservationConfirmed event |
| Projection rebuild | Manual | Admin trigger |
All jobs use PostgreSQL storage — no additional infrastructure. Jobs are triggered by domain events via Valkey pub/sub.
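The three job types in the table map directly onto Hangfire's API. A sketch with assumed job classes:

```csharp
// Fire-and-forget: SMS confirmation on ReservationConfirmed
BackgroundJob.Enqueue<SmsJobs>(j => j.SendConfirmation(evt.ReservationId));

// Delayed: waitlist hold expiry 15 minutes after the offer
BackgroundJob.Schedule<WaitlistJobs>(
    j => j.ExpireOffer(offer.WaitlistId), TimeSpan.FromMinutes(15));

// Recurring: hourly no-show cleanup
RecurringJob.AddOrUpdate<CleanupJobs>(
    "no-show-cleanup", j => j.MarkNoShows(), Cron.Hourly());
```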
Deployment Architecture¶
Kubernetes¶
graph TB
subgraph CF["Cloudflare Edge"]
tablez["tablez.com"]
api["api.tablez.com"]
staff["staff.tablez.com"]
ws["ws.tablez.com"]
end
subgraph K8S["Kubernetes — namespace: tablez"]
tunnel["cloudflared<br/>(DaemonSet)"]
subgraph Services["Application Services"]
GW["api-gateway<br/>2+ replicas"]
R["reservation<br/>2+ replicas"]
G["guest<br/>1+ replicas"]
REST["restaurant<br/>1 replica"]
AI["ai<br/>2+ replicas"]
N["notification<br/>1 replica"]
WEB["web<br/>2+ replicas"]
end
subgraph Infra["Infrastructure"]
PG[(PostgreSQL<br/>Marten event store)]
VK[(Valkey<br/>cache + pub/sub)]
end
end
tablez --> tunnel
api --> tunnel
staff --> tunnel
ws --> tunnel
tunnel --> GW
tunnel --> WEB
GW --> R
GW --> G
GW --> REST
GW --> AI
R --> PG
G --> PG
R --> VK
N --> VK
Cloudflare Tunnels (DNS + Ingress)¶
No exposed ports. No public IPs. No cert-manager. Cloudflare Tunnel runs inside the cluster and routes traffic from Cloudflare's edge.
Active hostnames (managed by Terraform):
| Domain | Target | Purpose | Zero Trust |
|---|---|---|---|
| grafana.invotek.no | grafana.observability:80 | Observability dashboards | Yes (invotekas@gmail.com) |
| vcluster.invotek.no | loft.vcluster-platform:443 | vCluster Platform dashboard | Yes (invotekas@gmail.com) |
Future hostnames (when services are production-ready):
| Domain | Target | Purpose |
|---|---|---|
| tablez.com | tablez-web:3000 | Booking widget |
| api.tablez.com | tablez-api-gateway:8080 | REST API |
| staff.tablez.com | tablez-web:3000 | Staff dashboard |
| ws.tablez.com | tablez-api-gateway:8080 | SignalR |
Works identically on k3s at home and managed Kubernetes in cloud. Tunnel config, DNS, and Zero Trust policies are managed as code via Terraform in tablez-gitops/terraform/.
Terraform (Cloudflare Infrastructure)¶
DNS records, Cloudflare Tunnel configuration, and Zero Trust Access policies are managed via Terraform — not the Cloudflare dashboard.
tablez-gitops/terraform/
├── versions.tf # Provider + backend config
├── variables.tf # Input variables (token, IDs, emails)
├── tunnel.tf # Tunnel + ingress config + token output
├── dns.tf # CNAME records
├── access.tf # Zero Trust Access apps + policies
├── terraform.tfvars # Local secrets (gitignored)
└── terraform.tfvars.example # Template for secrets
Adding a new hostname:
1. Add ingress rule in tunnel.tf
2. Add CNAME record in dns.tf
3. (Optional) Add Zero Trust Access app + policy in access.tf
4. Run terraform plan && terraform apply
Required API token permissions: Account > Cloudflare Tunnel: Edit, Zone > DNS: Edit, Account > Access: Apps and Policies: Edit.
Setup:
cd tablez-gitops/terraform
cp terraform.tfvars.example terraform.tfvars
# Fill in cloudflare_api_token, cloudflare_account_id, cloudflare_zone_id
terraform init
terraform plan
terraform apply
# Deploy tunnel token to k8s:
kubectl create secret generic cloudflared-token -n observability \
--from-literal=token=$(terraform output -raw tunnel_token)
Deploy Gating¶
| Condition | Action |
|---|---|
| All tests pass + no DB migration | Auto-deploy to dev/staging/prod |
| DB migration detected in PR | Block deploy, notify Discord, require human approval |
| Production overlay changed | Require PR approval |
| Dev/staging | Always auto-deploy |
Migration detection in CI:
- name: Check for migrations
run: |
if git diff HEAD~1 --name-only | grep -q "Migrations/"; then
echo "REQUIRES_APPROVAL=true" >> $GITHUB_ENV
fi
GitOps (Flux CD)¶
Repository Structure¶
tablez-dev/tablez-gitops/
├── clusters/
│ └── local/
│ └── flux-system/
│ ├── gotk-components.yaml # Flux controller manifests
│ └── gotk-sync.yaml # GitRepository + Kustomizations
├── infrastructure/
│ ├── base/
│ │ ├── kustomization.yaml
│ │ ├── namespace.yaml # tablez namespace
│ │ ├── postgres.yaml # PostgreSQL StatefulSet + Service
│ │ ├── valkey.yaml # Valkey Deployment + Service
│ │ ├── arc-system/ # ARC controller (HelmRelease)
│ │ │ ├── namespace.yaml # arc-systems + arc-runners namespaces
│ │ │ ├── helmrepository.yaml # OCI repo for ARC charts
│ │ │ ├── helmrelease.yaml # ARC controller deployment
│ │ │ └── kustomization.yaml
│ │ ├── arc-runners/ # Runner scale sets (one per repo)
│ │ │ ├── tablez-reservation.yaml # HelmRelease — DinD runner
│ │ │ ├── tablez-guest.yaml
│ │ │ ├── tablez-restaurant.yaml
│ │ │ ├── tablez-notification.yaml
│ │ │ ├── tablez-ai.yaml
│ │ │ ├── tablez-api-gateway.yaml
│ │ │ └── kustomization.yaml
│ │ ├── image-automation/ # Flux image automation
│ │ │ ├── image-repositories.yaml # Scan ghcr.io for new tags
│ │ │ ├── image-policies.yaml # Select latest main-sha-timestamp tag
│ │ │ ├── image-update-automation.yaml # Commit tag updates to gitops
│ │ │ └── kustomization.yaml
│ │ └── observability/ # LGTM stack (OpenTelemetry)
│ │ ├── namespace.yaml # observability namespace
│ │ ├── helmrepositories.yaml # prometheus-community, grafana, open-telemetry
│ │ ├── otel-collector.yaml # Central telemetry pipeline
│ │ ├── prometheus.yaml # Metrics (kube-prometheus-stack)
│ │ ├── tempo.yaml # Traces
│ │ ├── loki.yaml # Logs
│ │ ├── grafana.yaml # Dashboards
│ │ ├── cloudflared.yaml # Cloudflare Tunnel connector
│ │ └── kustomization.yaml
│ └── overlays/
│ └── local/
│ └── kustomization.yaml
├── apps/
│ ├── base/
│ │ ├── reservation/ # Deployment + Service + health checks
│ │ ├── guest/
│ │ ├── restaurant/
│ │ ├── notification/
│ │ ├── ai/
│ │ └── api-gateway/
│ └── overlays/
│ └── local/
│ ├── kustomization.yaml # References all 6 services
│ ├── reservation/
│ ├── guest/
│ ├── restaurant/
│ ├── notification/
│ ├── ai/
│ └── api-gateway/
├── terraform/ # Cloudflare infrastructure (not managed by Flux)
│ ├── versions.tf # Provider + backend config
│ ├── variables.tf # Input variables
│ ├── tunnel.tf # Cloudflare Tunnel + ingress
│ ├── dns.tf # CNAME records
│ ├── access.tf # Zero Trust Access policies
│ └── terraform.tfvars.example # Template for secrets
└── README.md
Flux reconciles infrastructure first (PostgreSQL, Valkey, ARC controller + runner scale sets, observability stack), then apps (all 6 service deployments). Terraform manages Cloudflare resources (tunnel, DNS, Zero Trust) separately. Everything is self-contained — move to a new cluster by installing Flux, pointing at this repo, creating secrets, and running terraform apply.
Deployment Flow¶
flowchart LR
PR["PR merged to main"] --> ARC["ARC runner<br/>(self-hosted, DinD)"]
ARC --> GHCR["ghcr.io<br/>main-sha-timestamp tag"]
GHCR --> IR["Flux Image Reflector<br/>scans every 5m"]
IR --> IA["Flux Image Automation<br/>commits tag update"]
IA --> FK["Flux Kustomize Controller<br/>reconciles gitops repo"]
FK --> K8S["Kubernetes<br/>rolling update"]
How it works:
1. Code merges to main → ARC runner builds and pushes image tagged main-<sha7>-<unix_timestamp>
2. Flux Image Reflector scans ghcr.io/tablez-dev/* every 5 minutes and detects the new tag
3. Image Policy selects the tag with the highest timestamp (most recent build)
4. Image Update Automation commits the new tag to tablez-gitops deployment manifests (via # {"$imagepolicy": ...} setter markers)
5. Flux Kustomize Controller reconciles and triggers a rolling update
Tag format: main-<sha7>-<unix_timestamp> (e.g., main-a1b2c3d-1773128998). Pure SHA tags are not sortable — the timestamp suffix allows Flux to determine ordering.
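The matching Flux ImagePolicy extracts the timestamp from the tag and orders numerically. An illustrative manifest (names and apiVersion are assumptions based on Flux's image automation docs):

```yaml
# Illustrative ImagePolicy: pick the main-<sha7>-<ts> tag with the highest timestamp.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: tablez-reservation
spec:
  imageRepositoryRef:
    name: tablez-reservation
  filterTags:
    pattern: '^main-[a-f0-9]{7}-(?P<ts>[0-9]+)$'
    extract: '$ts'
  policy:
    numerical:
      order: asc
```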
ARC runners: DinD sidecar defined manually (not containerMode: dind) to pass --dns=8.8.8.8 to dockerd. All workflows use network: host on docker/build-push-action because BuildKit's bridge network has broken DNS in k3s DinD (see LOCAL-DEV.md section 8).
Helm vs Kustomize¶
| What | Tool | Why |
|---|---|---|
| All tablez services | Kustomize | Simple, no templating overhead |
| PostgreSQL | Kustomize (raw manifest) | StatefulSet with PVC, simple enough without Helm |
| Valkey | Kustomize (raw manifest) | Single Deployment + Service |
| ARC controller | Helm (via Flux HelmRelease) | Official chart, CRDs managed by Helm |
| ARC runner scale sets | Helm (via Flux HelmRelease) | One HelmRelease per repo, manual DinD sidecar with --dns flags |
| Prometheus | Helm (kube-prometheus-stack) | CRDs, ServiceMonitors, complex config |
| Tempo | Helm (grafana/tempo) | Official chart, storage config |
| Loki | Helm (grafana/loki) | Official chart, single-binary mode |
| Grafana | Helm (grafana/grafana) | Data sources, dashboards as values |
| OTel Collector | Helm (open-telemetry) | Pipeline config, receiver/exporter setup |
| Cloudflared | Kustomize | Simple DaemonSet |
Observability (OpenTelemetry + LGTM Stack)¶
Full observability from day one. All telemetry flows through the OpenTelemetry Collector, which routes to purpose-built backends.
Architecture¶
flowchart LR
subgraph Services["Tablez Services (OTLP)"]
R["reservation"]
G["guest"]
REST["restaurant"]
N["notification"]
AI["ai"]
GW["api-gateway"]
end
subgraph Collector["OTel Collector"]
OC["opentelemetry-collector<br/>Deployment"]
end
subgraph Backends["LGTM Stack"]
P["Prometheus<br/>(metrics)"]
T["Tempo<br/>(traces)"]
L["Loki<br/>(logs)"]
GR["Grafana<br/>(dashboards)"]
end
R & G & REST & N & AI & GW -->|OTLP/gRPC| OC
OC -->|remote write| P
OC -->|OTLP| T
OC -->|OTLP/HTTP| L
GR --> P & T & L
Stack Components¶
| Component | Purpose | Deployment | Retention |
|---|---|---|---|
| OTel Collector | Central telemetry pipeline — receives, batches, routes | Deployment (1 replica) | N/A (pass-through) |
| Prometheus | Metrics storage + PromQL queries | kube-prometheus-stack (Helm) | 7 days |
| Tempo | Distributed trace storage | Single-binary (Helm) | 72 hours |
| Loki | Log aggregation (label-indexed) | Single-binary (Helm) | 7 days |
| Grafana | Unified dashboards with trace↔log↔metric correlation | Standalone (Helm) | Persistent |
All deployed as Flux HelmReleases in observability namespace inside the vcluster. GitOps source: tablez-gitops/infrastructure/base/observability/.
.NET Instrumentation¶
Shared project Tablez.Observability (in tablez-contracts repo) provides one-line setup:
// Program.cs — two lines for full observability
builder.Services.AddTablezObservability("Reservation");
builder.Logging.AddTablezLogging();
// Optional: MediatR tracing (wraps every command/query in a span)
builder.Services.AddTransient(typeof(IPipelineBehavior<,>), typeof(MediatRTracingBehavior<,>));
// Optional: Valkey/Redis instrumentation
builder.Services.AddTablezRedisInstrumentation();
Instrumentation Coverage¶
| Component | Method | What You See |
|---|---|---|
| ASP.NET Core | Auto (built-in) | HTTP request spans, latency metrics |
| HttpClient | Auto (built-in) | Outbound HTTP call spans |
| SignalR | Auto (.NET 9+) | Hub method invocation spans |
| Semantic Kernel | Auto (native) | LLM call spans, token usage |
| MediatR | MediatRTracingBehavior | Command/query spans with type info |
| Marten | MartenTracing helpers | Event append, aggregate load, query spans |
| Valkey | AddTablezRedisInstrumentation() | Redis command spans |
| Hangfire | Manual spans | Background job execution spans |
| .NET Runtime | RuntimeInstrumentation | GC, threadpool, allocation metrics |
End-to-End Trace Propagation¶
Every request gets a single TraceId that follows it across all services. This is the most important observability requirement — you can search by TraceId in Grafana and see the full journey.
Within a service: Automatic. The OTel SDK propagates trace context through ASP.NET Core → MediatR → Marten → Valkey. Logs emitted in a traced context automatically include TraceId and SpanId.
HTTP (API Gateway → Backend services): Automatic. HttpClient instrumentation injects the W3C traceparent header. The receiving service's ASP.NET Core instrumentation extracts it.
Valkey pub/sub (Service → Service events): Manual — Valkey pub/sub does not propagate trace context. We solve this with TracedEventEnvelope, which wraps every domain event with the W3C trace context:
sequenceDiagram
participant R as Reservation Service
participant VK as Valkey pub/sub
participant N as Notification Service
participant G as Guest Service
Note over R: TraceId: abc-123
R->>R: TracedEventEnvelope.Wrap("ReservationConfirmed", event)
Note over R: Envelope includes traceParent: 00-abc-123-...
R->>VK: PUBLISH reservation.confirmed {envelope}
VK->>N: {envelope with traceParent}
VK->>G: {envelope with traceParent}
N->>N: envelope.StartConsumerActivity("notification")
Note over N: Continues TraceId: abc-123
G->>G: envelope.StartConsumerActivity("guest")
Note over G: Continues TraceId: abc-123
// Publisher (reservation service)
var envelope = TracedEventEnvelope.Wrap("ReservationConfirmed", domainEvent);
await redis.PublishAsync("reservation.confirmed", envelope.Serialize());
// Consumer (notification service)
var envelope = TracedEventEnvelope.Deserialize(message);
using var activity = envelope.StartConsumerActivity("tablez-notification");
var evt = envelope.GetPayload<ReservationConfirmed>();
// All spans created here share the original TraceId
Result: In Grafana Tempo, searching for a single TraceId shows the complete flow: HTTP request → MediatR command → Marten event append → Valkey publish → notification SMS send → guest profile update.
Custom Instrumentation Example¶
// Marten event store tracing
using var activity = MartenTracing.StartAppendEvents("Reservation", reservationId);
session.Events.Append(reservationId, new ReservationConfirmed { ... });
await session.SaveChangesAsync();
// Custom business metric
var meter = TablezTelemetry.CreateMeter("Reservation");
var bookingCounter = meter.CreateCounter<long>("reservations.created");
bookingCounter.Add(1, new KeyValuePair<string, object?>("channel", "web"));
Grafana Access¶
| Environment | URL | Auth |
|---|---|---|
| Remote | https://grafana.invotek.no | Cloudflare Zero Trust (invotekas@gmail.com) |
| Local | kubectl port-forward -n observability svc/grafana 3000:80 | admin / tablez-local |
Pre-configured data sources with trace-to-log correlation: click a trace span in Tempo → jump to related logs in Loki. Service map auto-generated from trace data.
Cloud Migration Path¶
When moving to managed cloud, only the Collector exporter config changes — zero application code changes:
# OTel Collector config — add cloud exporter alongside self-hosted
exporters:
otlp/tempo:
endpoint: tempo.observability:4317 # Keep self-hosted
googlecloud: # Add cloud
project: "tablez-prod"
service:
pipelines:
traces:
exporters: [otlp/tempo, googlecloud] # Dual-export during migration
Cloud Migration Strategy¶
Built to run on k3s today, portable to managed cloud with startup credits.
| Component | Now (bootstrap) | With cloud credits |
|---|---|---|
| Kubernetes | k3s (self-hosted) | AKS / GKE (managed) |
| PostgreSQL | Bitnami on k3s | Azure Database / Cloud SQL |
| Valkey | Bitnami on k3s | Azure Cache / Memorystore |
| DNS/Ingress | Cloudflare Tunnel | Same (unchanged) |
| Container registry | ghcr.io | Same or ACR/GCR |
| GitOps | Flux | Same (unchanged) |
| Observability | OTel + LGTM (self-hosted) | Same or swap exporter to cloud-native |
Migration day:
1. Provision managed Kubernetes + managed PostgreSQL + Valkey (Terraform)
2. flux bootstrap to new cluster
3. Update overlays/production/ with new connection strings
4. Push to gitops repo → Flux deploys everything
5. Switch Cloudflare Tunnel to new cluster
6. Done. No code changes.
Cloud credits to target:
| Program | Credits | Path |
|---|---|---|
| Microsoft for Startups | $150k Azure | Apply directly |
| Google for Startups | $100k GCP | Via StartupLab (already in contact) |
| AWS Activate | $100k | Via accelerator or VC |
MCP Surface (AI-Native API)¶
Tablez exposes an MCP server so external AI agents can book tables directly. This is the competitive moat — no other platform offers this.
External AI Agent (ChatGPT, Claude, Siri, Google Assistant)
→ MCP Protocol
→ Tablez MCP Server
→ Same MediatR commands/queries as internal AI agent
The MCP server is a thin wrapper around the same mediator pipeline. One codebase serves both internal AI channels and external AI agents.
See /mcp-api-surface skill for implementation pattern.
Phase 1 MVP Scope¶
| Include | Exclude (Phase 2) |
|---|---|
| Restaurant config + users | Floor plan canvas editor |
| Web booking form | AI phone agent |
| Staff dashboard (list view) | AI email agent |
| Basic table management | Google Reserve |
| Guest database (manual) | LLM guest enrichment |
| SMS confirmations | No-show fee processing |
| Waitlist (manual) | MCP server for external agents |
| AI chat widget | Ticketed events |
| Availability engine | Dynamic pricing |
| Reservation lifecycle | Multi-language AI |
| Event sourcing from day 1 | Projection rebuild tooling |
Open Decisions¶
| Decision | Options | Leaning |
|---|---|---|
| Frontend | Blazor vs React vs Next.js | TBD — depends on team |
| LLM provider | Claude vs GPT-4o vs Gemini | Claude (best tool calling) |
| Kubernetes | Managed (AKS/GKE) vs k3s | k3s now, managed with credits |
| Voice AI | Pipecat + Deepgram + ElevenLabs vs managed (Vapi) | TBD |
| Monitoring | ~~OpenTelemetry + Grafana vs cloud-native~~ | Decided: OpenTelemetry + LGTM stack |
| NuGet feed | GitHub Packages vs self-hosted | GitHub Packages |
References¶
- Tablez Spec v1.2 (Tabelz AS)
- projects/tablez/ANALYSIS.md — Gap analysis
- projects/tablez/COMPETITIVE-LANDSCAPE.md — Market research
- Juval Löwy — "Righting Software" (IDesign method)
- .claude/skills/idesign-architecture/ — IDesign reference
- Marten documentation — https://martendb.io