Tablez — Technical Architecture¶
Version: 0.3 Date: March 2026 Author: Stig-Johnny Stoebakk / Claude-3
Design Principles¶
- Clean Architecture — Dependencies point inward. Domain has no knowledge of infrastructure.
- IDesign Method (Juval Löwy / "Righting Software") — 4-layer decomposition: Clients → Managers → Engines → Resource Access. Layers ordered by volatility.
- Mediator Pattern — All cross-cutting communication goes through MediatR. No direct service-to-service calls.
- CQRS + Event Sourcing — Commands append events. Queries read from projections. Full audit trail.
- Domain-Driven Design — Aggregates, value objects, domain events. The domain layer is pure C# with no framework dependencies.
- Event-Driven Architecture — Services communicate through domain events, not direct calls.
Tech Stack¶
| Layer | Technology | Why |
|---|---|---|
| Runtime | .NET 10 / ASP.NET Core | Team expertise, performance, ecosystem |
| API | Minimal APIs + MediatR | Clean routing, mediator pattern |
| Realtime | SignalR | Floor view, live table status |
| Database | PostgreSQL | Proven, JSON support, row-level locking |
| Event Store | Marten (on PostgreSQL) | Event sourcing + document store, no extra infra |
| ORM | EF Core 10 (migrations only) | Schema migrations for non-event-sourced tables |
| Cache | Valkey (Redis fork, BSD) | No vendor lock-in, pub/sub for events + cache invalidation |
| Background Jobs | Hangfire (PostgreSQL storage) | Queues, delayed jobs, recurring tasks, dashboard |
| State Machine | Stateless (NuGet) | Reservation lifecycle, waitlist flows |
| AI/LLM | Semantic Kernel + Claude API | Tool calling, function calling for AI channels |
| SMS | Twilio | Confirmations, waitlist notifications, 2FA |
| Payment | Stripe | No-show fees, ticketed events |
| Auth | ASP.NET Identity + JWT | Restaurant staff auth |
| Hosting | Kubernetes (managed or k3s) | Scalable, cloud-portable |
| CI/CD | GitHub Actions + ARC (self-hosted) | Build, test, push images on k3s runners |
| GitOps | Flux CD | Lightweight, pure GitOps, image automation, Discord alerts |
| Cluster UI | vCluster Platform (free tier) | Web dashboard at vcluster.invotek.no, Cloudflare Zero Trust |
| DNS/Ingress | Cloudflare Tunnels | No exposed ports, DDoS protection, free TLS |
| Infra as Code | Terraform (Cloudflare provider) | Tunnel, DNS, Zero Trust Access as code |
| Secrets | Bitwarden (personal) | Password management, shared via Bitwarden Send |
| Git Hosting | GitHub (tablez-dev org) | Separate from personal repos |
Volatility-Based Decomposition (IDesign)¶
Services ordered by change frequency. Top layers change often and auto-deploy. Bottom layers change rarely and require gates.
block-beta
columns 1
block:high["HIGH VOLATILITY — auto-deploy"]
web["tablez-web (dashboard)"] gateway["tablez-api-gateway (routes)"]
end
block:freq["FREQUENT CHANGES"]
ai["tablez-ai (prompts, tools)"]
end
block:mod["MODERATE CHANGES"]
reservation["tablez-reservation"] guest["tablez-guest"] notification["tablez-notification"]
end
block:low["LOW VOLATILITY"]
restaurant["tablez-restaurant"] contracts["tablez-contracts"]
end
block:gate["HUMAN GATE"]
migration["tablez-migration (DB schema)"]
end
style high fill:#4CAF50,color:#fff
style freq fill:#8BC34A,color:#fff
style mod fill:#FFC107,color:#000
style low fill:#FF9800,color:#fff
style gate fill:#F44336,color:#fff
Repository Structure¶
Separate GitHub organization: tablez-dev/. Multi-repo for AI agent productivity — each repo is a bounded context an agent can own completely.
| Repo | Purpose | Auto-deploy |
|---|---|---|
| tablez-dev/tablez-contracts | Shared DTOs, events, interfaces → NuGet | Yes |
| tablez-dev/tablez-api-gateway | YARP API gateway, routing, auth | Yes |
| tablez-dev/tablez-reservation | Core booking engine + event store | Yes |
| tablez-dev/tablez-guest | Guest CRM, profiles | Yes |
| tablez-dev/tablez-restaurant | Restaurant config, floor plan, schedule | Yes |
| tablez-dev/tablez-ai | LLM gateway, Semantic Kernel, tool API | Yes |
| tablez-dev/tablez-notification | SMS, email, push (Hangfire workers) | Yes |
| tablez-dev/tablez-web | Staff dashboard frontend (Next.js) | Yes |
| tablez-dev/tablez-migration | EF Core + Marten schema migrations | Human gate |
| tablez-dev/tablez-gitops | Flux manifests, overlays, notifications | Human gate on prod |
| tablez-dev/tablez-docs | Specs, architecture, ADRs | N/A |
Shared types: tablez-contracts contains two projects: Tablez.Contracts (DTOs, events, interfaces) and Tablez.Observability (shared OpenTelemetry setup). All services reference contracts via ProjectReference using multi-repo Docker builds — CI checks out tablez-contracts alongside the service repo using a CONTRACTS_TOKEN org secret. See LOCAL-DEV.md section 8 for details.
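As an illustration of that multi-repo checkout, a CI step might look like the following (step names and the checkout path are assumptions; the authoritative setup is in LOCAL-DEV.md section 8):

```yaml
# Illustrative CI snippet: check out tablez-contracts next to the service repo
# so ProjectReference paths resolve inside the Docker build context.
- name: Checkout service repo
  uses: actions/checkout@v4

- name: Checkout shared contracts
  uses: actions/checkout@v4
  with:
    repository: tablez-dev/tablez-contracts
    token: ${{ secrets.CONTRACTS_TOKEN }}
    path: tablez-contracts
```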
Architecture Overview (IDesign + Clean Architecture)¶
Each service follows the same internal 4-layer structure:
graph TB
subgraph Clients["Clients Layer"]
REST["REST API<br/>(Minimal APIs)"]
SignalR["SignalR Hub<br/>(Floor View)"]
AIGateway["AI Agent Gateway<br/>(Chat/Phone/Email)"]
end
subgraph Managers["Managers Layer"]
RM["ReservationManager"]
GM["GuestManager"]
WM["WaitlistManager"]
NM["NotificationManager"]
ResM["RestaurantManager"]
end
subgraph Engines["Engines Layer — Pure Logic, No I/O"]
AE["AvailabilityEngine"]
TAE["TableAssignmentEngine"]
PE["PricingEngine"]
WME["WaitlistMatchingEngine"]
SE["ScheduleEngine"]
VE["ValidationEngine"]
end
subgraph ResourceAccess["Resource Access Layer — 1:1 with External Systems"]
ES["EventStore<br/>(Marten)"]
CA["CacheAccessor<br/>(Valkey)"]
SMS["SmsAccessor<br/>(Twilio)"]
PA["PaymentAccessor<br/>(Stripe)"]
LLM["LlmAccessor<br/>(Claude API)"]
end
subgraph External["External Systems"]
PG[(PostgreSQL)]
VK[(Valkey)]
APIs["Twilio / Stripe / Claude"]
end
Clients -->|MediatR| Managers
Managers --> Engines
Managers --> ResourceAccess
ES --> PG
CA --> VK
SMS --> APIs
PA --> APIs
LLM --> APIs
Event Sourcing (Marten)¶
All state changes are stored as immutable events. Current state is derived by replaying events. Marten uses PostgreSQL as the event store — no extra infrastructure.
Domain Events¶
// Reservation aggregate events
ReservationRequested { GuestId, PartySize, DateTime, Channel }
ReservationConfirmed { ReservationId, TableId, ConfirmedBy }
ReservationCancelled { ReservationId, Reason, CancelledBy }
GuestArrived { ReservationId, ArrivedAt }
GuestSeated { ReservationId, TableId, SeatedAt }
GuestCompleted { ReservationId, CompletedAt }
NoShowMarked { ReservationId, MarkedAt }
// Waitlist aggregate events
WaitlistEntryCreated { GuestId, PartySize, TimeWindow }
WaitlistSlotOffered { WaitlistId, ReservationSlot, ExpiresAt }
WaitlistOfferAccepted { WaitlistId }
WaitlistOfferExpired { WaitlistId }
WaitlistOfferDeclined { WaitlistId }
// Guest aggregate events
GuestProfileCreated { GuestId, Name, Phone, Email }
GuestProfileUpdated { GuestId, Field, OldValue, NewValue }
GuestPreferenceAdded { GuestId, Preference }
// Table/floor events
TableStatusChanged { TableId, OldStatus, NewStatus, ChangedBy }
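In code, events like these would typically be immutable C# records that Marten serializes into the stream. A sketch of the first few (property types are assumptions; the list above is the source of truth for names):

```csharp
// Domain events as immutable records — Marten stores these verbatim in the event stream.
public record ReservationRequested(Guid GuestId, int PartySize, DateTimeOffset DateTime, string Channel);
public record ReservationConfirmed(Guid ReservationId, Guid TableId, string ConfirmedBy);
public record ReservationCancelled(Guid ReservationId, string Reason, string CancelledBy);
```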
Write Side (Commands)¶
sequenceDiagram
participant C as Client
participant M as Manager (MediatR)
participant E as Engine
participant ES as EventStore (Marten)
participant PG as PostgreSQL
participant VK as Valkey pub/sub
C->>M: Send(Command)
M->>E: Validate business rules
E-->>M: Valid / Invalid
M->>ES: Append event
ES->>PG: Persist to event stream
ES->>VK: Publish domain event
VK-->>M: Other services react async
M-->>C: Result
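The write side above can be sketched as a MediatR handler in the Manager layer. All type names here are illustrative, not the actual codebase:

```csharp
// Sketch: Manager validates via an Engine (pure logic), then appends to the event store.
public sealed class ConfirmReservationHandler
    : IRequestHandler<ConfirmReservationCommand, Result>
{
    private readonly IDocumentSession _session;   // Marten
    private readonly IValidationEngine _engine;   // Engines layer: no I/O

    public ConfirmReservationHandler(IDocumentSession session, IValidationEngine engine)
        => (_session, _engine) = (session, engine);

    public async Task<Result> Handle(ConfirmReservationCommand cmd, CancellationToken ct)
    {
        var check = _engine.ValidateConfirmation(cmd);   // business rules only
        if (!check.IsValid) return Result.Invalid(check);

        // Append the event; Marten persists it to the stream on SaveChanges
        _session.Events.Append(cmd.ReservationId,
            new ReservationConfirmed(cmd.ReservationId, cmd.TableId, cmd.ConfirmedBy));
        await _session.SaveChangesAsync(ct);
        return Result.Ok();
    }
}
```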
Read Side (Projections)¶
Marten automatically builds read models from events:
flowchart LR
ES["Event Stream"] --> MP["Marten Projection"] --> RM["Read Model<br/>(PostgreSQL table)"]
| Projection | Built from | Used by |
|---|---|---|
| ReservationView | Reservation events | Staff dashboard, availability check |
| FloorView | Table + reservation events | Live floor view (SignalR) |
| GuestHistory | Guest + reservation events | CRM, AI agent context |
| DailyAvailability | Reservation + schedule events | Booking widget, AI agent |
| WaitlistQueue | Waitlist events | Staff dashboard, waitlist management |
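A projection could be declared with Marten's single-stream projection support. A sketch, assuming the event shapes listed earlier and an illustrative `ReservationView` document:

```csharp
// Sketch: Marten applies events in stream order to build the read model.
public sealed class ReservationViewProjection : SingleStreamProjection<ReservationView>
{
    public void Apply(ReservationRequested e, ReservationView view)
    {
        view.PartySize = e.PartySize;
        view.Status = "Requested";
    }

    public void Apply(ReservationConfirmed e, ReservationView view)
    {
        view.TableId = e.TableId;
        view.Status = "Confirmed";
    }
}
```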
Cross-Service Event Flow¶
flowchart LR
R["tablez-reservation"] -->|ReservationConfirmed| VK["Valkey pub/sub"]
VK --> N["tablez-notification<br/>SMS confirmation"]
VK --> G["tablez-guest<br/>update visit count"]
VK --> W["tablez-web<br/>SignalR floor view"]
VK --> AI["tablez-ai<br/>update context"]
No direct service-to-service calls. Services communicate exclusively through domain events.
Benefits¶
| Benefit | Tablez use case |
|---|---|
| Full audit trail | "Who changed this reservation and when?" |
| Temporal queries | "What did the floor look like at 19:30?" |
| Rebuild state | Replay events to debug or recover |
| Event-driven | Services react to events, no coupling |
| Undo/compensation | Cancellation = new event, not DELETE |
| Analytics | Stream events to build dashboards |
| GDPR | Find all events for a guest, redact/delete |
Mediator Flow (MediatR)¶
Every request flows through the mediator pipeline:
sequenceDiagram
participant HTTP as HTTP Request
participant EP as Endpoint (Client)
participant Val as ValidationBehavior
participant Log as LoggingBehavior
participant Cache as CachingBehavior
participant H as Handler (Manager)
participant E as Engine
participant DB as Marten (PostgreSQL)
HTTP->>EP: Request
EP->>Val: MediatR.Send(Command)
Val->>Log: Validated
Log->>Cache: Logged
Cache->>H: Cache miss
H->>E: Business rules
E-->>H: Result
H->>DB: Append events / query
DB-->>H: Persisted / data
H-->>Cache: Result
Cache-->>Log: Cached
Log-->>Val: Logged
Val-->>EP: Response DTO
EP-->>HTTP: HTTP Response
Caching Strategy (Valkey)¶
| Data | Strategy | TTL | Invalidation |
|---|---|---|---|
| Availability slots | Event-driven | None | Invalidate on ReservationConfirmed/Cancelled |
| Restaurant config | Write-through | 1 hour | Invalidate on admin update |
| Floor plan / tables | Write-through | 1 hour | Invalidate on admin update |
| Guest profiles | Cache-aside | 15 min | TTL expiry |
| Service schedule | Cache-aside | 30 min | Invalidate on schedule change |
| Active table status | Write-behind | No TTL | Real-time SignalR updates |
Invalidation mechanism: Domain events published to Valkey pub/sub trigger cache eviction. Same events that drive service communication also drive cache invalidation.
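A minimal sketch of that invalidation path, assuming StackExchange.Redis against Valkey; the channel name, key scheme, and event fields are illustrative:

```csharp
// Sketch: the same pub/sub message that drives service reactions also evicts cache keys.
var subscriber = redis.GetSubscriber();   // StackExchange.Redis client, works with Valkey
await subscriber.SubscribeAsync("reservation.confirmed", (_, message) =>
{
    var evt = JsonSerializer.Deserialize<ReservationConfirmedMessage>(message.ToString());
    // Evict the availability slots affected by this reservation
    // (RestaurantId/Date fields are assumed for illustration)
    cache.KeyDelete($"availability:{evt!.RestaurantId}:{evt.Date:yyyy-MM-dd}");
});
```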
State Machines (Stateless)¶
Reservation Lifecycle¶
stateDiagram-v2
[*] --> Requested
Requested --> Confirmed : confirm
Requested --> Cancelled : cancel
Confirmed --> Arrived : arrive
Confirmed --> Cancelled : cancel
Confirmed --> NoShow : no_show
Arrived --> Seated : seat
Arrived --> Cancelled : cancel
Seated --> Completed : complete
Completed --> [*]
Cancelled --> [*]
NoShow --> [*]
Each state transition appends an event to the Marten event store. The state machine validates transitions; the event store records them.
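Sketched with the Stateless API (the state and trigger enum names are assumptions):

```csharp
// Sketch of the reservation lifecycle using the Stateless NuGet package.
var machine = new StateMachine<ReservationState, Trigger>(ReservationState.Requested);

machine.Configure(ReservationState.Requested)
    .Permit(Trigger.Confirm, ReservationState.Confirmed)
    .Permit(Trigger.Cancel, ReservationState.Cancelled);

machine.Configure(ReservationState.Confirmed)
    .Permit(Trigger.Arrive, ReservationState.Arrived)
    .Permit(Trigger.Cancel, ReservationState.Cancelled)
    .Permit(Trigger.NoShow, ReservationState.NoShow);

machine.Configure(ReservationState.Arrived)
    .Permit(Trigger.Seat, ReservationState.Seated)
    .Permit(Trigger.Cancel, ReservationState.Cancelled);

machine.Configure(ReservationState.Seated)
    .Permit(Trigger.Complete, ReservationState.Completed);

// Invalid transitions throw, so illegal moves (e.g. seating a Requested
// reservation) never reach the event store.
machine.Fire(Trigger.Confirm);   // Requested -> Confirmed; append ReservationConfirmed
```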
Waitlist Lifecycle¶
stateDiagram-v2
[*] --> Queued
Queued --> Offered : slot_available
Queued --> Cancelled : cancel
Offered --> Converted : accept
Offered --> Expired : timeout
Offered --> Declined : decline
Expired --> Queued : requeue
Converted --> [*]
Cancelled --> [*]
Declined --> [*]
Concurrent Booking (Race Condition Prevention)¶
Marten supports optimistic concurrency on event streams:
// Marten appends to the reservation stream with expected version
session.Events.Append(reservationId, expectedVersion, new ReservationConfirmed { ... });
await session.SaveChangesAsync();
// Throws ConcurrencyException if another event was appended first
For availability checks, combine optimistic stream versioning with PostgreSQL advisory locks. No distributed locks needed; PostgreSQL handles serialization.
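A sketch of the advisory-lock approach, assuming Npgsql and an illustrative lock-key scheme derived from restaurant and slot:

```csharp
// Sketch: serialize availability checks per restaurant + slot with a
// transaction-scoped advisory lock, so concurrent bookings for the same
// slot queue up instead of double-booking.
await using var conn = new NpgsqlConnection(connectionString);
await conn.OpenAsync();
await using var tx = await conn.BeginTransactionAsync();

// pg_advisory_xact_lock blocks until acquired and is released at commit/rollback
long lockKey = HashCode.Combine(restaurantId, slotStart);
await using (var cmd = new NpgsqlCommand("SELECT pg_advisory_xact_lock(@key)", conn, tx))
{
    cmd.Parameters.AddWithValue("key", lockKey);
    await cmd.ExecuteNonQueryAsync();
}

// ...check availability and append the reservation event while holding the lock...
await tx.CommitAsync();
```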
AI Agent Architecture (LLM Tool API)¶
Based on Tablez Spec v1.2 Section 11. The AI agent receives minimal static context and fetches everything via tools.
sequenceDiagram
participant G as Guest (phone/chat/email)
participant AI as AI Gateway (tablez-ai)
participant SK as Semantic Kernel
participant LLM as Claude API
participant SVC as Backend Services (MediatR)
G->>AI: Natural language request
AI->>SK: Orchestrate
SK->>LLM: Reason + decide tool calls
LLM-->>SK: Tool call: check_availability
SK->>SVC: CheckAvailabilityQuery
SVC-->>SK: Available slots
SK->>LLM: Result + continue reasoning
LLM-->>SK: Tool call: create_reservation
SK->>SVC: CreateReservationCommand
SVC-->>SK: Reservation confirmed
SK->>LLM: Format response
LLM-->>SK: Natural language reply
SK-->>AI: Response
AI-->>G: "Your table is booked for 7pm!"
Tool mapping:
| Tool call | MediatR handler |
|---|---|
| check_availability | CheckAvailabilityQuery |
| create_reservation | CreateReservationCommand |
| create_waitlist | CreateWaitlistEntryCommand |
| get_service_overview | GetServiceOverviewQuery |
| get_guest_profile | GetGuestProfileQuery |
| update_guest_profile | UpdateGuestProfileCommand |
Key principle: LLM handles language. The system handles logic. LLM never decides availability — it calls check_availability and reports the result.
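A sketch of how one tool could map onto a MediatR query via Semantic Kernel's attribute-based functions (type names, DTOs, and registration details are illustrative):

```csharp
// Sketch: expose a MediatR query as an SK tool. The LLM calls the tool;
// the system computes the answer.
public sealed class ReservationTools
{
    private readonly IMediator _mediator;
    public ReservationTools(IMediator mediator) => _mediator = mediator;

    [KernelFunction("check_availability")]
    [Description("Check available reservation slots for a date and party size.")]
    public Task<IReadOnlyList<SlotDto>> CheckAvailabilityAsync(DateOnly date, int partySize)
        => _mediator.Send(new CheckAvailabilityQuery(date, partySize));
}
```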
Background Jobs (Hangfire)¶
| Job | Type | Trigger |
|---|---|---|
| Send SMS confirmation | Fire-and-forget | ReservationConfirmed event |
| Send waitlist offer SMS | Fire-and-forget | WaitlistSlotOffered event |
| Waitlist hold expiry | Delayed (15 min) | WaitlistSlotOffered event |
| No-show cleanup | Recurring (hourly) | Cron |
| Reminder SMS | Delayed (24h before) | ReservationConfirmed event |
| Projection rebuild | Manual | Admin trigger |
All jobs use PostgreSQL storage — no additional infrastructure. Jobs are triggered by domain events via Valkey pub/sub.
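The three job types in the table map directly onto Hangfire's API. A sketch with assumed job classes:

```csharp
// Fire-and-forget: SMS confirmation on ReservationConfirmed
BackgroundJob.Enqueue<SmsJobs>(j => j.SendConfirmation(evt.ReservationId));

// Delayed: waitlist hold expiry 15 minutes after the offer
BackgroundJob.Schedule<WaitlistJobs>(
    j => j.ExpireOffer(offer.WaitlistId), TimeSpan.FromMinutes(15));

// Recurring: hourly no-show cleanup
RecurringJob.AddOrUpdate<CleanupJobs>(
    "no-show-cleanup", j => j.MarkNoShows(), Cron.Hourly());
```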
Deployment Architecture¶
Kubernetes¶
graph TB
subgraph CF["Cloudflare Edge"]
tablez["tablez.com"]
api["api.tablez.com"]
staff["staff.tablez.com"]
ws["ws.tablez.com"]
end
subgraph K8S["Kubernetes — namespace: tablez"]
tunnel["cloudflared<br/>(DaemonSet)"]
subgraph Services["Application Services"]
GW["api-gateway<br/>2+ replicas"]
R["reservation<br/>2+ replicas"]
G["guest<br/>1+ replicas"]
REST["restaurant<br/>1 replica"]
AI["ai<br/>2+ replicas"]
N["notification<br/>1 replica"]
WEB["web<br/>2+ replicas"]
end
subgraph Infra["Infrastructure"]
PG[(PostgreSQL<br/>Marten event store)]
VK[(Valkey<br/>cache + pub/sub)]
end
end
tablez --> tunnel
api --> tunnel
staff --> tunnel
ws --> tunnel
tunnel --> GW
tunnel --> WEB
GW --> R
GW --> G
GW --> REST
GW --> AI
R --> PG
G --> PG
R --> VK
N --> VK
Cloudflare Tunnels (DNS + Ingress)¶
No exposed ports. No public IPs. No cert-manager. Cloudflare Tunnel runs inside the cluster and routes traffic from Cloudflare's edge.
Active hostnames (managed by Terraform):
| Domain | Target | Purpose | Zero Trust |
|---|---|---|---|
| grafana.invotek.no | grafana.observability:80 | Observability dashboards | Yes (invotekas@gmail.com) |
| vcluster.invotek.no | loft.vcluster-platform:443 | vCluster Platform dashboard | Yes (invotekas@gmail.com) |
Future hostnames (when services are production-ready):
| Domain | Target | Purpose |
|---|---|---|
| tablez.com | tablez-web:3000 | Booking widget |
| api.tablez.com | tablez-api-gateway:8080 | REST API |
| staff.tablez.com | tablez-web:3000 | Staff dashboard |
| ws.tablez.com | tablez-api-gateway:8080 | SignalR |
Works identically on k3s at home and managed Kubernetes in cloud. Tunnel config, DNS, and Zero Trust policies are managed as code via Terraform in tablez-gitops/terraform/.
Terraform (Cloudflare Infrastructure)¶
DNS records, Cloudflare Tunnel configuration, and Zero Trust Access policies are managed via Terraform — not the Cloudflare dashboard.
tablez-gitops/terraform/
├── versions.tf # Provider + backend config
├── variables.tf # Input variables (token, IDs, emails)
├── tunnel.tf # Tunnel + ingress config + token output
├── dns.tf # CNAME records
├── access.tf # Zero Trust Access apps + policies
├── terraform.tfvars # Local secrets (gitignored)
└── terraform.tfvars.example # Template for secrets
Adding a new hostname:
1. Add ingress rule in tunnel.tf
2. Add CNAME record in dns.tf
3. (Optional) Add Zero Trust Access app + policy in access.tf
4. Run terraform plan && terraform apply
Required API token permissions: Account > Cloudflare Tunnel: Edit, Zone > DNS: Edit, Account > Access: Apps and Policies: Edit.
Setup:
cd tablez-gitops/terraform
cp terraform.tfvars.example terraform.tfvars
# Fill in cloudflare_api_token, cloudflare_account_id, cloudflare_zone_id
terraform init
terraform plan
terraform apply
# Deploy tunnel token to k8s:
kubectl create secret generic cloudflared-token -n observability \
--from-literal=token=$(terraform output -raw tunnel_token)
Deploy Gating¶
| Condition | Action |
|---|---|
| All tests pass + no DB migration | Auto-deploy to dev/staging/prod |
| DB migration detected in PR | Block deploy, notify Discord, require human approval |
| Production overlay changed | Require PR approval |
| Dev/staging | Always auto-deploy |
Migration detection in CI:
- name: Check for migrations
run: |
if git diff HEAD~1 --name-only | grep -q "Migrations/"; then
echo "REQUIRES_APPROVAL=true" >> $GITHUB_ENV
fi
GitOps (Flux CD)¶
Repository Structure¶
tablez-dev/tablez-gitops/
├── clusters/
│ └── local/
│ └── flux-system/
│ ├── gotk-components.yaml # Flux controller manifests
│ └── gotk-sync.yaml # GitRepository + Kustomizations
├── infrastructure/
│ ├── base/
│ │ ├── kustomization.yaml
│ │ ├── namespace.yaml # tablez namespace
│ │ ├── postgres.yaml # PostgreSQL StatefulSet + Service
│ │ ├── valkey.yaml # Valkey Deployment + Service
│ │ ├── arc-system/ # ARC controller (HelmRelease)
│ │ │ ├── namespace.yaml # arc-systems + arc-runners namespaces
│ │ │ ├── helmrepository.yaml # OCI repo for ARC charts
│ │ │ ├── helmrelease.yaml # ARC controller deployment
│ │ │ └── kustomization.yaml
│ │ ├── arc-runners/ # Runner scale sets (one per repo)
│ │ │ ├── tablez-reservation.yaml # HelmRelease — DinD runner
│ │ │ ├── tablez-guest.yaml
│ │ │ ├── tablez-restaurant.yaml
│ │ │ ├── tablez-notification.yaml
│ │ │ ├── tablez-ai.yaml
│ │ │ ├── tablez-api-gateway.yaml
│ │ │ └── kustomization.yaml
│ │ ├── image-automation/ # Flux image automation
│ │ │ ├── image-repositories.yaml # Scan ghcr.io for new tags
│ │ │ ├── image-policies.yaml # Select latest main-sha-timestamp tag
│ │ │ ├── image-update-automation.yaml # Commit tag updates to gitops
│ │ │ └── kustomization.yaml
│ │ └── observability/ # LGTM stack (OpenTelemetry)
│ │ ├── namespace.yaml # observability namespace
│ │ ├── helmrepositories.yaml # prometheus-community, grafana, open-telemetry
│ │ ├── otel-collector.yaml # Central telemetry pipeline
│ │ ├── prometheus.yaml # Metrics (kube-prometheus-stack)
│ │ ├── tempo.yaml # Traces
│ │ ├── loki.yaml # Logs
│ │ ├── grafana.yaml # Dashboards
│ │ ├── cloudflared.yaml # Cloudflare Tunnel connector
│ │ └── kustomization.yaml
│ └── overlays/
│ └── local/
│ └── kustomization.yaml
├── apps/
│ ├── base/
│ │ ├── reservation/ # Deployment + Service + health checks
│ │ ├── guest/
│ │ ├── restaurant/
│ │ ├── notification/
│ │ ├── ai/
│ │ └── api-gateway/
│ └── overlays/
│ └── local/
│ ├── kustomization.yaml # References all 6 services
│ ├── reservation/
│ ├── guest/
│ ├── restaurant/
│ ├── notification/
│ ├── ai/
│ └── api-gateway/
├── terraform/ # Cloudflare infrastructure (not managed by Flux)
│ ├── versions.tf # Provider + backend config
│ ├── variables.tf # Input variables
│ ├── tunnel.tf # Cloudflare Tunnel + ingress
│ ├── dns.tf # CNAME records
│ ├── access.tf # Zero Trust Access policies
│ └── terraform.tfvars.example # Template for secrets
└── README.md
Flux reconciles infrastructure first (PostgreSQL, Valkey, ARC controller + runner scale sets, observability stack), then apps (all 6 service deployments). Terraform manages Cloudflare resources (tunnel, DNS, Zero Trust) separately. Everything is self-contained — move to a new cluster by installing Flux, pointing at this repo, creating secrets, and running terraform apply.
Deployment Flow¶
flowchart LR
PR["PR merged to main"] --> ARC["ARC runner<br/>(self-hosted, DinD)"]
ARC --> GHCR["ghcr.io<br/>main-sha-timestamp tag"]
GHCR --> IR["Flux Image Reflector<br/>scans every 5m"]
IR --> IA["Flux Image Automation<br/>commits tag update"]
IA --> FK["Flux Kustomize Controller<br/>reconciles gitops repo"]
FK --> K8S["Kubernetes<br/>rolling update"]
How it works:
1. Code merges to main → ARC runner builds and pushes image tagged main-<sha7>-<unix_timestamp>
2. Flux Image Reflector scans ghcr.io/tablez-dev/* every 5 minutes and detects the new tag
3. Image Policy selects the tag with the highest timestamp (most recent build)
4. Image Update Automation commits the new tag to tablez-gitops deployment manifests (via # {"$imagepolicy": ...} setter markers)
5. Flux Kustomize Controller reconciles and triggers a rolling update
Tag format: main-<sha7>-<unix_timestamp> (e.g., main-a1b2c3d-1773128998). Pure SHA tags are not sortable — the timestamp suffix allows Flux to determine ordering.
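The matching Flux ImagePolicy extracts the timestamp from the tag and orders numerically. An illustrative manifest (names and apiVersion are assumptions based on Flux's image automation docs):

```yaml
# Illustrative ImagePolicy: pick the main-<sha7>-<ts> tag with the highest timestamp.
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: tablez-reservation
spec:
  imageRepositoryRef:
    name: tablez-reservation
  filterTags:
    pattern: '^main-[a-f0-9]{7}-(?P<ts>[0-9]+)$'
    extract: '$ts'
  policy:
    numerical:
      order: asc
```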
ARC runners: DinD sidecar defined manually (not containerMode: dind) to pass --dns=8.8.8.8 to dockerd. All workflows use network: host on docker/build-push-action because BuildKit's bridge network has broken DNS in k3s DinD (see LOCAL-DEV.md section 8).
Helm vs Kustomize¶
| What | Tool | Why |
|---|---|---|
| All tablez services | Kustomize | Simple, no templating overhead |
| PostgreSQL | Kustomize (raw manifest) | StatefulSet with PVC, simple enough without Helm |
| Valkey | Kustomize (raw manifest) | Single Deployment + Service |
| ARC controller | Helm (via Flux HelmRelease) | Official chart, CRDs managed by Helm |
| ARC runner scale sets | Helm (via Flux HelmRelease) | One HelmRelease per repo, manual DinD sidecar with --dns flags |
| Prometheus | Helm (kube-prometheus-stack) | CRDs, ServiceMonitors, complex config |
| Tempo | Helm (grafana/tempo) | Official chart, storage config |
| Loki | Helm (grafana/loki) | Official chart, single-binary mode |
| Grafana | Helm (grafana/grafana) | Data sources, dashboards as values |
| OTel Collector | Helm (open-telemetry) | Pipeline config, receiver/exporter setup |
| Cloudflared | Kustomize | Simple DaemonSet |
Observability (OpenTelemetry + LGTM Stack)¶
Full observability from day one. All telemetry flows through the OpenTelemetry Collector, which routes to purpose-built backends.
Architecture¶
flowchart LR
subgraph Services["Tablez Services (OTLP)"]
R["reservation"]
G["guest"]
REST["restaurant"]
N["notification"]
AI["ai"]
GW["api-gateway"]
end
subgraph Collector["OTel Collector"]
OC["opentelemetry-collector<br/>Deployment"]
end
subgraph Backends["LGTM Stack"]
P["Prometheus<br/>(metrics)"]
T["Tempo<br/>(traces)"]
L["Loki<br/>(logs)"]
GR["Grafana<br/>(dashboards)"]
end
R & G & REST & N & AI & GW -->|OTLP/gRPC| OC
OC -->|remote write| P
OC -->|OTLP| T
OC -->|OTLP/HTTP| L
GR --> P & T & L
Stack Components¶
| Component | Purpose | Deployment | Retention |
|---|---|---|---|
| OTel Collector | Central telemetry pipeline — receives, batches, routes | Deployment (1 replica) | N/A (pass-through) |
| Prometheus | Metrics storage + PromQL queries | kube-prometheus-stack (Helm) | 7 days |
| Tempo | Distributed trace storage | Single-binary (Helm) | 72 hours |
| Loki | Log aggregation (label-indexed) | Single-binary (Helm) | 7 days |
| Grafana | Unified dashboards with trace↔log↔metric correlation | Standalone (Helm) | Persistent |
All deployed as Flux HelmReleases in observability namespace inside the vcluster. GitOps source: tablez-gitops/infrastructure/base/observability/.
.NET Instrumentation¶
Shared project Tablez.Observability (in tablez-contracts repo) provides one-line setup:
// Program.cs — two lines for full observability
builder.Services.AddTablezObservability("Reservation");
builder.Logging.AddTablezLogging();
// Optional: MediatR tracing (wraps every command/query in a span)
builder.Services.AddTransient(typeof(IPipelineBehavior<,>), typeof(MediatRTracingBehavior<,>));
// Optional: Valkey/Redis instrumentation
builder.Services.AddTablezRedisInstrumentation();
Instrumentation Coverage¶
| Component | Method | What You See |
|---|---|---|
| ASP.NET Core | Auto (built-in) | HTTP request spans, latency metrics |
| HttpClient | Auto (built-in) | Outbound HTTP call spans |
| SignalR | Auto (.NET 9+) | Hub method invocation spans |
| Semantic Kernel | Auto (native) | LLM call spans, token usage |
| MediatR | MediatRTracingBehavior | Command/query spans with type info |
| Marten | MartenTracing helpers | Event append, aggregate load, query spans |
| Valkey | AddTablezRedisInstrumentation() | Redis command spans |
| Hangfire | Manual spans | Background job execution spans |
| .NET Runtime | RuntimeInstrumentation | GC, threadpool, allocation metrics |
End-to-End Trace Propagation¶
Every request gets a single TraceId that follows it across all services. This is the most important observability requirement — you can search by TraceId in Grafana and see the full journey.
Within a service: Automatic. The OTel SDK propagates trace context through ASP.NET Core → MediatR → Marten → Valkey. Logs emitted in a traced context automatically include TraceId and SpanId.
HTTP (API Gateway → Backend services): Automatic. HttpClient instrumentation injects the W3C traceparent header. The receiving service's ASP.NET Core instrumentation extracts it.
Valkey pub/sub (Service → Service events): Manual — Valkey pub/sub does not propagate trace context. We solve this with TracedEventEnvelope, which wraps every domain event with the W3C trace context:
sequenceDiagram
participant R as Reservation Service
participant VK as Valkey pub/sub
participant N as Notification Service
participant G as Guest Service
Note over R: TraceId: abc-123
R->>R: TracedEventEnvelope.Wrap("ReservationConfirmed", event)
Note over R: Envelope includes traceParent: 00-abc-123-...
R->>VK: PUBLISH reservation.confirmed {envelope}
VK->>N: {envelope with traceParent}
VK->>G: {envelope with traceParent}
N->>N: envelope.StartConsumerActivity("notification")
Note over N: Continues TraceId: abc-123
G->>G: envelope.StartConsumerActivity("guest")
Note over G: Continues TraceId: abc-123
// Publisher (reservation service)
var envelope = TracedEventEnvelope.Wrap("ReservationConfirmed", domainEvent);
await redis.PublishAsync("reservation.confirmed", envelope.Serialize());
// Consumer (notification service)
var envelope = TracedEventEnvelope.Deserialize(message);
using var activity = envelope.StartConsumerActivity("tablez-notification");
var evt = envelope.GetPayload<ReservationConfirmed>();
// All spans created here share the original TraceId
Result: In Grafana Tempo, searching for a single TraceId shows the complete flow: HTTP request → MediatR command → Marten event append → Valkey publish → notification SMS send → guest profile update.
Custom Instrumentation Example¶
// Marten event store tracing
using var activity = MartenTracing.StartAppendEvents("Reservation", reservationId);
session.Events.Append(reservationId, new ReservationConfirmed { ... });
await session.SaveChangesAsync();
// Custom business metric
var meter = TablezTelemetry.CreateMeter("Reservation");
var bookingCounter = meter.CreateCounter<long>("reservations.created");
bookingCounter.Add(1, new KeyValuePair<string, object?>("channel", "web"));
Grafana Access¶
| Environment | URL | Auth |
|---|---|---|
| Remote | https://grafana.invotek.no | Cloudflare Zero Trust (invotekas@gmail.com) |
| Local | kubectl port-forward -n observability svc/grafana 3000:80 | admin / tablez-local |
Pre-configured data sources with trace-to-log correlation: click a trace span in Tempo → jump to related logs in Loki. Service map auto-generated from trace data.
Cloud Migration Path¶
When moving to managed cloud, only the Collector exporter config changes — zero application code changes:
# OTel Collector config — add cloud exporter alongside self-hosted
exporters:
otlp/tempo:
endpoint: tempo.observability:4317 # Keep self-hosted
googlecloud: # Add cloud
project: "tablez-prod"
service:
pipelines:
traces:
exporters: [otlp/tempo, googlecloud] # Dual-export during migration
Cloud Migration Strategy¶
Built to run on k3s today, portable to managed cloud with startup credits.
| Component | Now (bootstrap) | With cloud credits |
|---|---|---|
| Kubernetes | k3s (self-hosted) | AKS / GKE (managed) |
| PostgreSQL | Bitnami on k3s | Azure Database / Cloud SQL |
| Valkey | Bitnami on k3s | Azure Cache / Memorystore |
| DNS/Ingress | Cloudflare Tunnel | Same (unchanged) |
| Container registry | ghcr.io | Same or ACR/GCR |
| GitOps | Flux | Same (unchanged) |
| Observability | OTel + LGTM (self-hosted) | Same or swap exporter to cloud-native |
Migration day:
1. Provision managed Kubernetes + managed PostgreSQL + Valkey (Terraform)
2. flux bootstrap to new cluster
3. Update overlays/production/ with new connection strings
4. Push to gitops repo → Flux deploys everything
5. Switch Cloudflare Tunnel to new cluster
6. Done. No code changes.
Cloud credits to target:
| Program | Credits | Path |
|---|---|---|
| Microsoft for Startups | $150k Azure | Apply directly |
| Google for Startups | $100k GCP | Via StartupLab (already in contact) |
| AWS Activate | $100k | Via accelerator or VC |
MCP Surface (AI-Native API)¶
Tablez exposes an MCP server so external AI agents can book tables directly. This is the competitive moat — no other platform offers this.
External AI Agent (ChatGPT, Claude, Siri, Google Assistant)
→ MCP Protocol
→ Tablez MCP Server
→ Same MediatR commands/queries as internal AI agent
The MCP server is a thin wrapper around the same mediator pipeline. One codebase serves both internal AI channels and external AI agents.
See /mcp-api-surface skill for implementation pattern.
Phase 1 MVP Scope¶
| Include | Exclude (Phase 2) |
|---|---|
| Restaurant config + users | Floor plan canvas editor |
| Web booking form | AI phone agent |
| Staff dashboard (list view) | AI email agent |
| Basic table management | Google Reserve |
| Guest database (manual) | LLM guest enrichment |
| SMS confirmations | No-show fee processing |
| Waitlist (manual) | MCP server for external agents |
| AI chat widget | Ticketed events |
| Availability engine | Dynamic pricing |
| Reservation lifecycle | Multi-language AI |
| Event sourcing from day 1 | Projection rebuild tooling |
Open Decisions¶
| Decision | Options | Leaning |
|---|---|---|
| Frontend | Blazor vs React vs Next.js | TBD — depends on team |
| LLM provider | Claude vs GPT-4o vs Gemini | Claude (best tool calling) |
| Kubernetes | Managed (AKS/GKE) vs k3s | k3s now, managed with credits |
| Voice AI | Pipecat + Deepgram + ElevenLabs vs managed (Vapi) | TBD |
| Monitoring | ~~OpenTelemetry + Grafana vs cloud-native~~ | Decided: OpenTelemetry + LGTM stack |
| NuGet feed | GitHub Packages vs self-hosted | GitHub Packages |
References¶
- Tablez Spec v1.2 (Tabelz AS)
- projects/tablez/ANALYSIS.md — Gap analysis
- projects/tablez/COMPETITIVE-LANDSCAPE.md — Market research
- Juval Löwy — "Righting Software" (IDesign method)
- .claude/skills/idesign-architecture/ — IDesign reference
- Marten documentation — https://martendb.io