Update CLAUDE.md with downtime domain, add scalability assessment doc

Add missing downtime tracker documentation to CLAUDE.md (blueprint,
routes, models, frontend components, hooks, design decisions, work
hours settings). Add docs/scalability-assessment.md with full analysis
targeting ~3K users / 500 concurrent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 14:33:17 -07:00
parent d5ade68cc4
commit 27cbfdd519
2 changed files with 144 additions and 4 deletions


@@ -38,12 +38,13 @@ frontend/ React + TypeScript
users/ User-specific components (columns, form)
licenses/ License-specific components
certs/ Cert-specific components (cert-form with passphrase reveal)
+downtime/ Downtime components (downtime-columns with work-hours classification, downtime-form)
dashboard/ Dashboard widgets (stats-overview with sponsorship metrics, expiring-items, recent-activity)
feedback/ Feedback modal (screenshot attach with auto-scaling)
layout/ App shell, sidebar (with admin nav), topbar, protected-route
hooks/ React Query hooks (use-projects, use-users, use-project-users,
-use-project-members, use-licenses, use-certs, use-feedback,
-use-settings, use-keycloak-groups, use-keycloak-sync)
+use-project-members, use-licenses, use-certs, use-downtime,
+use-feedback, use-settings, use-keycloak-groups, use-keycloak-sync)
contexts/ Auth context provider (is_admin flag, permissions helpers, OIDC logout)
lib/ API client (axios), utils, constants
@@ -81,7 +82,7 @@ cd frontend && npm run build # Production build (also runs tsc)
## Architecture Conventions
### Backend
-- Flask blueprints: one per domain (auth, projects, users, licenses, certs, dashboard, feedback, settings)
+- Flask blueprints: one per domain (auth, projects, users, licenses, certs, dashboard, downtime, feedback, settings)
- All API routes prefixed with `/api/`
- All list endpoints return wrapped objects: `{"projects": [...]}`, `{"licenses": [...]}`, `{"feedback": [...]}`, etc.
- Single-item endpoints return: `{"project": {...}}`, `{"feedback": {...}}`, etc.
@@ -104,7 +105,9 @@ cd frontend && npm run build # Production build (also runs tsc)
- Cert private keys and passphrases encrypted at rest with Fernet; cert PEM stored unencrypted (public data)
- License and cert status computed via hybrid properties, not stored columns. License status: archived → pending (no purchase date or future purchase date) → perpetual → expired → expiring_soon → active
- Feedback screenshots stored as JSON array of base64 strings in the database
-- Settings stored as key-value pairs via the Setting model (expiry thresholds, notification config, sponsorship settings, etc.)
+- Downtime Tracker API: `/api/downtime` (list/create/update/delete). Fields: application, start_time, end_time, cause, lessons_learned, resolution, enclave (IL5/IL6 comma-separated), scope (disabled/limited), planned (boolean). Search spans application, cause, lessons_learned, resolution, submitted_by.
+- Settings stored as key-value pairs via the Setting model (expiry thresholds, notification config, sponsorship, work hours, etc.)
+- Work hours settings: `work_hours_start`, `work_hours_end`, `work_hours_timezone` — used by downtime tracker to classify events as during/after work hours
- Email notifications via SMTP (configurable per-alert recipients, disabled when SMTP_HOST not set)
### Frontend
@@ -126,6 +129,8 @@ cd frontend && npm run build # Production build (also runs tsc)
- Members tab on project detail shows KC-based members with sponsor status and action buttons (sponsor/release/remove)
- Cert form includes passphrase reveal button (fetches decrypted passphrase on demand)
- Dashboard stats overview includes sponsorship metrics (sponsored count, unsponsored count with warning badge)
+- Downtime Tracker page with filterable table (application, enclave, scope, planned), work-hours classification column (during/after based on configurable work hours settings), and inline create/edit/delete
- Admin settings page uses unified `SettingRow` layout across grouped cards (Thresholds & Scheduling, Project & User Defaults, Security)
### Naming
- Backend: snake_case for Python, kebab-case for URL paths
@@ -153,6 +158,7 @@ cd frontend && npm run build # Production build (also runs tsc)
- **Email notifications** — SMTP-based with per-alert recipient configuration. Supports notifications for license/cert expiry, new feedback, user changes, and sponsorship release. Disabled when `SMTP_HOST` is not set.
- **Cert encryption** — Fernet key persists to `.fernet_key` file in dev, env var in production. Changing the key makes existing encrypted data unrecoverable.
- **Cert passphrase storage** — PKCS12 import passphrases optionally stored encrypted (Fernet) in `passphrase_encrypted` column. Retrieved via separate endpoint for security. PKCS12 export still without passphrase (explicit product decision).
+- **Downtime Tracker** — Logs application outage events with start/end times, cause, resolution, and lessons learned. Enclave field supports multi-select (IL5, IL6) stored as comma-separated string, serialized as array in API. Scope (disabled/limited) and planned (boolean) classify the nature of outages. Work-hours classification computed client-side using `date-fns-tz` against configurable work hours settings.
- **Backend replicas must stay at 1** while using SQLite (no concurrent writes). Scale freely after Postgres migration.
## Stubs (Not Yet Implemented)


@@ -0,0 +1,134 @@
# Scalability Assessment: ~3K Users, 500 Concurrent
*Date: 2026-03-19*
## Blockers (Must fix before scaling)
### 1. SQLite — Cannot Scale Past 1 Replica
SQLite's single-writer limitation caps the deployment at 1 pod; with 2 gunicorn workers that is **max 2 concurrent requests**. At 500 concurrent users, request queueing will be catastrophic. **Postgres migration is prerequisite #1.**
### 2. Filesystem Sessions — Users Lose Login on Pod Switch
`SESSION_TYPE = "filesystem"` in `config.py:39`. Each pod has its own ephemeral disk. Load balancer sends user to Pod B → session doesn't exist → forced re-login. Even with 1 replica, a pod restart loses all sessions.
**Fix:** Switch to Redis-backed sessions (`SESSION_TYPE = "redis"`).
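A minimal `config.py` sketch of the fix; `REDIS_URL` is an assumed env var name, not something from the codebase:

```python
# config.py sketch -- REDIS_URL and the default URL are assumptions
import os
import redis

SESSION_TYPE = "redis"  # replaces "filesystem"
SESSION_REDIS = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))
```

Every pod then shares one session store, so pod switches and restarts no longer force re-logins.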
### 3. No Pagination — All List Endpoints Return Everything
Every list endpoint (`/projects`, `/users`, `/licenses`, `/certs`, `/feedback`, `/downtime`) does `query.all()` with no offset/limit. With thousands of records, these will time out or OOM. Frontend tables are only client-side paginated, so the entire dataset is loaded into browser memory.
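A sketch of the server-side half as a pure helper so the clamping logic is testable; `pagination_params` and the bounds are assumptions, not existing code:

```python
# Hypothetical helper (not existing code): clamp client paging inputs.
MAX_PER_PAGE = 100  # assumed cap so a client cannot request the whole table

def pagination_params(args, default_per_page=25):
    """Clamp user-supplied ?page= / ?per_page= values to safe bounds.
    `args` is any dict-like, e.g. Flask's request.args."""
    def to_int(value, fallback):
        try:
            return int(value)
        except (TypeError, ValueError):
            return fallback

    page = max(to_int(args.get("page"), 1), 1)
    per_page = min(max(to_int(args.get("per_page"), default_per_page), 1), MAX_PER_PAGE)
    return page, per_page
```

The endpoint then applies `query.offset((page - 1) * per_page).limit(per_page)` and returns a total count alongside the wrapped list so the frontend tables can switch to server-side paging.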
### 4. Gunicorn: 2 Workers, No Async
`backend/Dockerfile` hardcodes `--workers 2 --timeout 120`. Two sync workers means two concurrent requests per pod. A single slow Keycloak call or SMTP send blocks 50% of capacity.
**Fix:** `--workers 8 --worker-class gevent` (or at least `(2 x CPU) + 1` sync workers).
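The same fix expressed as a `gunicorn.conf.py`, which gunicorn reads natively as Python; the numbers are the assessment's suggestions, not measured values:

```python
# gunicorn.conf.py -- replaces the hardcoded "--workers 2 --timeout 120"
workers = 8              # up from 2; or (2 * CPU) + 1 for sync workers
worker_class = "gevent"  # cooperative I/O: a slow KC/SMTP call yields instead of blocking
timeout = 120            # unchanged
```

Note that the gevent worker class requires the gevent extra to be installed (`gunicorn[gevent]`).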
---
## High Severity
### 5. N+1 Keycloak Calls in List Project Members
`GET /api/projects/<key>/members` makes **2 KC API calls per member** (get_user + get_user_sessions). A project with 100 members = **~214 KC HTTP calls**. This endpoint will time out at scale.
**Fix:** Batch-fetch user data; cache sponsor attributes locally; persist last_login to DB (partially started).
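A hedged sketch of the batch read using Keycloak's bulk group-members admin endpoint; `kc_get` stands in for whatever request helper `keycloak.py` exposes (name assumed):

```python
# Sketch: 1 bulk call per 100 members instead of 2 calls per member.
def list_member_representations(kc_get, realm, group_id, page_size=100):
    """Page through Keycloak's bulk group-members admin endpoint and return
    every member's user representation in ~(members / page_size) calls."""
    members, first = [], 0
    while True:
        batch = kc_get(
            f"/admin/realms/{realm}/groups/{group_id}/members",
            params={"first": first, "max": page_size},
        )
        members.extend(batch)
        if len(batch) < page_size:  # short page means we've seen everything
            return members
        first += page_size
```

Per-member session lookups would still cost extra calls; persisting last_login to the DB (as the fix suggests) removes those from the request path entirely.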
### 6. Synchronous Email Sending
`send_email()` blocks the request handler for 1-10s per SMTP call. Called during sponsor release and member removal — user-facing latency.
**Fix:** Queue emails the same way KC writes are queued (DB-backed queue + worker).
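An in-process illustration of the decoupling only; the fix proper is a DB-backed queue like the KC write queue, which survives restarts and works across pods. All names here are assumptions:

```python
import queue
import threading

email_q: "queue.Queue[dict]" = queue.Queue()

def queue_email(to, subject, body):
    """Request path: an O(1) enqueue instead of a 1-10s SMTP round-trip."""
    email_q.put({"to": to, "subject": subject, "body": body})

def email_worker(send_fn, stop: threading.Event):
    """Background drain loop; `send_fn` stands in for the existing send_email()."""
    while not stop.is_set() or not email_q.empty():
        try:
            msg = email_q.get(timeout=0.1)
        except queue.Empty:
            continue
        send_fn(msg)
        email_q.task_done()
```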
### 7. No SQLAlchemy Connection Pooling Config
No `pool_size`, `pool_recycle`, or `pool_pre_ping` set. After Postgres migration, connection exhaustion is likely under load.
**Fix in `config.py`:**
```python
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 20,
    "pool_recycle": 3600,
    "pool_pre_ping": True,
}
```
### 8. No HTTP Connection Pooling to Keycloak
`keycloak.py` uses raw `requests.request()` per call — no connection reuse. Under load, this creates/tears down TCP connections constantly.
**Fix:** Use a module-level `requests.Session` with `HTTPAdapter(pool_maxsize=20)`.
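A sketch of that fix; the factory name is an assumption, the `requests` calls are standard:

```python
import requests
from requests.adapters import HTTPAdapter

def make_kc_session(pool_maxsize=20):
    """One shared Session so TCP/TLS connections to Keycloak are reused
    instead of being re-established on every call."""
    s = requests.Session()
    adapter = HTTPAdapter(pool_connections=4, pool_maxsize=pool_maxsize)
    s.mount("https://", adapter)
    s.mount("http://", adapter)
    return s

# Module-level, as the fix suggests; keycloak.py would then call
# KC_SESSION.request(...) instead of requests.request(...).
KC_SESSION = make_kc_session()
```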
---
## Medium Severity
### 9. Settings Read on Every Access
`Setting.get(key)` hits DB every call. Dashboard stats alone calls it 4+ times per request. No in-process cache.
**Fix:** Request-scoped or TTL-based LRU cache for settings.
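A TTL-cache sketch; the names and the 30-second TTL are assumptions, and `fetch` stands in for the DB read done by `Setting.get(key)`:

```python
import time

_settings_cache: dict = {}
SETTINGS_TTL_SECONDS = 30  # assumption; settings change rarely

def cached_setting(key, fetch, now=time.monotonic):
    """Serve a setting from memory for SETTINGS_TTL_SECONDS, falling back
    to `fetch` (the DB read) on miss or expiry."""
    hit = _settings_cache.get(key)
    if hit is not None and now() - hit[1] < SETTINGS_TTL_SECONDS:
        return hit[0]
    value = fetch(key)
    _settings_cache[key] = (value, now())
    return value
```

A short TTL keeps admin setting changes visible within seconds while eliminating the repeated per-request reads.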
### 10. Feedback Screenshots as Base64 in DB
Screenshots stored as base64 JSON arrays in a TEXT column. List endpoint returns full screenshots for all records — multi-MB responses.
**Fix:** Exclude screenshots from list; return only on individual GET. Long-term: move to S3/object storage.
### 11. Startup Blocks on KC Sync
`run.py` calls `sync_users_from_keycloak()` synchronously before accepting requests. With 1000+ KC users, this is 1000+ API calls taking 30s+. Kubernetes readiness probes may kill the pod.
**Fix:** Move startup sync to background thread; serve requests immediately.
### 12. Audit Log — Unbounded, Unindexed
No retention cleanup (default: keep forever). `details` column searched via `contains()` = full table scan. Will degrade as audit entries grow.
### 13. License Files as BLOBs in DB
`LicenseFile.file_data` stored as `LargeBinary`. 50MB files loaded entirely into memory for download. No streaming.
### 14. Helm Resource Limits Too Low
Backend: 100m CPU / 256Mi RAM request. This will get CPU-throttled immediately under load.
### 15. Nginx Missing Gzip and Rate Limiting
No compression configured. No `limit_req`. Frontend responses sent uncompressed.
### 16. React Query — No staleTime
All hooks use default `staleTime: 0`. Every component mount triggers a refetch. At 500 concurrent users, this multiplies backend load.
---
## What's Already Good
- **KC queue architecture** — All KC writes are queued to a background worker with retry, batching, and multi-pod safety (`SKIP LOCKED`). This is production-grade.
- **Permission caching** — Only 1 KC call at login; permissions cached in session for the entire session lifetime.
- **Request-path KC calls minimized** — Writes never happen inline. Only reads for group/member management.
- **Security headers, CORS, auth decorators** — Solid foundation.
---
## Priority Roadmap for 500 Concurrent Users
| Priority | Item | Effort |
|----------|------|--------|
| **P0** | Migrate to Postgres | Medium (change `DATABASE_URL`, add pooling config, test) |
| **P0** | Redis sessions | Small (add `flask-session[redis]`, set `SESSION_TYPE="redis"`) |
| **P0** | Increase gunicorn workers + add gevent | Small (Dockerfile change) |
| **P1** | Server-side pagination on all list endpoints | Medium (backend + frontend changes) |
| **P1** | Fix N+1 in list_project_members | Medium (batch KC reads or cache locally) |
| **P1** | Async email via queue | Small (reuse existing KC queue pattern) |
| **P2** | SQLAlchemy pool config | Small |
| **P2** | KC HTTP connection pooling | Small |
| **P2** | Settings cache | Small |
| **P2** | Helm resource limits bump | Small |
| **P3** | Audit log indexing + retention | Small |
| **P3** | Move screenshots/files to object storage | Medium |
| **P3** | Frontend staleTime + code splitting | Small |
The P0 items are hard blockers — the app simply cannot serve 500 concurrent users without Postgres, Redis sessions, and more workers. P1 items will cause degraded performance at scale. P2/P3 are optimizations that matter under sustained load.