Update CLAUDE.md with downtime domain, add scalability assessment doc

Add missing downtime tracker documentation to CLAUDE.md (blueprint,
routes, models, frontend components, hooks, design decisions, work
hours settings). Add docs/scalability-assessment.md with full analysis
targeting ~3K users / 500 concurrent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 14:33:17 -07:00
parent d5ade68cc4
commit 27cbfdd519
2 changed files with 144 additions and 4 deletions


@@ -38,12 +38,13 @@ frontend/ React + TypeScript
users/ User-specific components (columns, form)
licenses/ License-specific components
certs/ Cert-specific components (cert-form with passphrase reveal)
+downtime/ Downtime components (downtime-columns with work-hours classification, downtime-form)
dashboard/ Dashboard widgets (stats-overview with sponsorship metrics, expiring-items, recent-activity)
feedback/ Feedback modal (screenshot attach with auto-scaling)
layout/ App shell, sidebar (with admin nav), topbar, protected-route
hooks/ React Query hooks (use-projects, use-users, use-project-users,
-use-project-members, use-licenses, use-certs, use-feedback,
-use-settings, use-keycloak-groups, use-keycloak-sync)
+use-project-members, use-licenses, use-certs, use-downtime,
+use-feedback, use-settings, use-keycloak-groups, use-keycloak-sync)
contexts/ Auth context provider (is_admin flag, permissions helpers, OIDC logout)
lib/ API client (axios), utils, constants
@@ -81,7 +82,7 @@ cd frontend && npm run build # Production build (also runs tsc)
## Architecture Conventions
### Backend
-- Flask blueprints: one per domain (auth, projects, users, licenses, certs, dashboard, feedback, settings)
+- Flask blueprints: one per domain (auth, projects, users, licenses, certs, dashboard, downtime, feedback, settings)
- All API routes prefixed with `/api/`
- All list endpoints return wrapped objects: `{"projects": [...]}`, `{"licenses": [...]}`, `{"feedback": [...]}`, etc.
- Single-item endpoints return: `{"project": {...}}`, `{"feedback": {...}}`, etc.
@@ -104,7 +105,9 @@ cd frontend && npm run build # Production build (also runs tsc)
- Cert private keys and passphrases encrypted at rest with Fernet; cert PEM stored unencrypted (public data)
- License and cert status computed via hybrid properties, not stored columns. License status: archived → pending (no purchase date or future purchase date) → perpetual → expired → expiring_soon → active
- Feedback screenshots stored as JSON array of base64 strings in the database
-- Settings stored as key-value pairs via the Setting model (expiry thresholds, notification config, sponsorship settings, etc.)
+- Downtime Tracker API: `/api/downtime` (list/create/update/delete). Fields: application, start_time, end_time, cause, lessons_learned, resolution, enclave (IL5/IL6 comma-separated), scope (disabled/limited), planned (boolean). Search spans application, cause, lessons_learned, resolution, submitted_by.
+- Settings stored as key-value pairs via the Setting model (expiry thresholds, notification config, sponsorship, work hours, etc.)
+- Work hours settings: `work_hours_start`, `work_hours_end`, `work_hours_timezone` — used by downtime tracker to classify events as during/after work hours
- Email notifications via SMTP (configurable per-alert recipients, disabled when SMTP_HOST not set)
### Frontend
@@ -126,6 +129,8 @@ cd frontend && npm run build # Production build (also runs tsc)
- Members tab on project detail shows KC-based members with sponsor status and action buttons (sponsor/release/remove)
- Cert form includes passphrase reveal button (fetches decrypted passphrase on demand)
- Dashboard stats overview includes sponsorship metrics (sponsored count, unsponsored count with warning badge)
+- Downtime Tracker page with filterable table (application, enclave, scope, planned), work-hours classification column (during/after based on configurable work hours settings), and inline create/edit/delete
- Admin settings page uses unified `SettingRow` layout across grouped cards (Thresholds & Scheduling, Project & User Defaults, Security)
### Naming
- Backend: snake_case for Python, kebab-case for URL paths
@@ -153,6 +158,7 @@ cd frontend && npm run build # Production build (also runs tsc)
- **Email notifications** — SMTP-based with per-alert recipient configuration. Supports notifications for license/cert expiry, new feedback, user changes, and sponsorship release. Disabled when `SMTP_HOST` is not set.
- **Cert encryption** — Fernet key persists to `.fernet_key` file in dev, env var in production. Changing the key makes existing encrypted data unrecoverable.
- **Cert passphrase storage** — PKCS12 import passphrases optionally stored encrypted (Fernet) in `passphrase_encrypted` column. Retrieved via separate endpoint for security. PKCS12 export still without passphrase (explicit product decision).
+- **Downtime Tracker** — Logs application outage events with start/end times, cause, resolution, and lessons learned. Enclave field supports multi-select (IL5, IL6) stored as comma-separated string, serialized as array in API. Scope (disabled/limited) and planned (boolean) classify the nature of outages. Work-hours classification computed client-side using `date-fns-tz` against configurable work hours settings.
- **Backend replicas must stay at 1** while using SQLite (no concurrent writes). Scale freely after Postgres migration.
## Stubs (Not Yet Implemented)


@@ -0,0 +1,134 @@
# Scalability Assessment: ~3K Users, 500 Concurrent
*Date: 2026-03-19*
## Blockers (Must fix before scaling)
### 1. SQLite — Cannot Scale Past 1 Replica
SQLite's single-writer limitation caps the deployment at 1 pod; with 2 gunicorn workers that is **max 2 concurrent requests**. At 500 concurrent users, request queueing will be catastrophic. **Postgres migration is prerequisite #1.**
### 2. Filesystem Sessions — Users Lose Login on Pod Switch
`SESSION_TYPE = "filesystem"` in `config.py:39`. Each pod has its own ephemeral disk. Load balancer sends user to Pod B → session doesn't exist → forced re-login. Even with 1 replica, a pod restart loses all sessions.
**Fix:** Switch to Redis-backed sessions (`SESSION_TYPE = "redis"`).
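A minimal `config.py` sketch of the fix; `REDIS_URL` is an assumed env var name, not something from the codebase:

```python
# config.py sketch -- REDIS_URL and the default URL are assumptions
import os
import redis

SESSION_TYPE = "redis"  # replaces "filesystem"
SESSION_REDIS = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))
```

Every pod then shares one session store, so pod switches and restarts no longer force re-logins.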
### 3. No Pagination — All List Endpoints Return Everything
Every list endpoint (`/projects`, `/users`, `/licenses`, `/certs`, `/feedback`, `/downtime`) does `query.all()` with no offset/limit. With thousands of records, these will time out or OOM. Frontend tables are only client-side paginated, so the entire dataset is loaded into browser memory.
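A sketch of the server-side half as a pure helper so the clamping logic is testable; `pagination_params` and the bounds are assumptions, not existing code:

```python
# Hypothetical helper (not existing code): clamp client paging inputs.
MAX_PER_PAGE = 100  # assumed cap so a client cannot request the whole table

def pagination_params(args, default_per_page=25):
    """Clamp user-supplied ?page= / ?per_page= values to safe bounds.
    `args` is any dict-like, e.g. Flask's request.args."""
    def to_int(value, fallback):
        try:
            return int(value)
        except (TypeError, ValueError):
            return fallback

    page = max(to_int(args.get("page"), 1), 1)
    per_page = min(max(to_int(args.get("per_page"), default_per_page), 1), MAX_PER_PAGE)
    return page, per_page
```

The endpoint then applies `query.offset((page - 1) * per_page).limit(per_page)` and returns a total count alongside the wrapped list so the frontend tables can switch to server-side paging.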
### 4. Gunicorn: 2 Workers, No Async
`backend/Dockerfile` hardcodes `--workers 2 --timeout 120`. Two sync workers means two concurrent requests per pod. A single slow Keycloak call or SMTP send blocks 50% of capacity.
**Fix:** `--workers 8 --worker-class gevent` (or at least `(2 x CPU) + 1` sync workers).
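The same fix expressed as a `gunicorn.conf.py`, which gunicorn reads natively as Python; the numbers are the assessment's suggestions, not measured values:

```python
# gunicorn.conf.py -- replaces the hardcoded "--workers 2 --timeout 120"
workers = 8              # up from 2; or (2 * CPU) + 1 for sync workers
worker_class = "gevent"  # cooperative I/O: a slow KC/SMTP call yields instead of blocking
timeout = 120            # unchanged
```

Note that the gevent worker class requires the gevent extra to be installed (`gunicorn[gevent]`).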
---
## High Severity
### 5. N+1 Keycloak Calls in List Project Members
`GET /api/projects/<key>/members` makes **2 KC API calls per member** (get_user + get_user_sessions). A project with 100 members = **~214 KC HTTP calls**. This endpoint will time out at scale.
**Fix:** Batch-fetch user data; cache sponsor attributes locally; persist last_login to DB (partially started).
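A hedged sketch of the batch read using Keycloak's bulk group-members admin endpoint; `kc_get` stands in for whatever request helper `keycloak.py` exposes (name assumed):

```python
# Sketch: 1 bulk call per 100 members instead of 2 calls per member.
def list_member_representations(kc_get, realm, group_id, page_size=100):
    """Page through Keycloak's bulk group-members admin endpoint and return
    every member's user representation in ~(members / page_size) calls."""
    members, first = [], 0
    while True:
        batch = kc_get(
            f"/admin/realms/{realm}/groups/{group_id}/members",
            params={"first": first, "max": page_size},
        )
        members.extend(batch)
        if len(batch) < page_size:  # short page means we've seen everything
            return members
        first += page_size
```

Per-member session lookups would still cost extra calls; persisting last_login to the DB (as the fix suggests) removes those from the request path entirely.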
### 6. Synchronous Email Sending
`send_email()` blocks the request handler for 1-10s per SMTP call. Called during sponsor release and member removal — user-facing latency.
**Fix:** Queue emails the same way KC writes are queued (DB-backed queue + worker).
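An in-process illustration of the decoupling only; the fix proper is a DB-backed queue like the KC write queue, which survives restarts and works across pods. All names here are assumptions:

```python
import queue
import threading

email_q: "queue.Queue[dict]" = queue.Queue()

def queue_email(to, subject, body):
    """Request path: an O(1) enqueue instead of a 1-10s SMTP round-trip."""
    email_q.put({"to": to, "subject": subject, "body": body})

def email_worker(send_fn, stop: threading.Event):
    """Background drain loop; `send_fn` stands in for the existing send_email()."""
    while not stop.is_set() or not email_q.empty():
        try:
            msg = email_q.get(timeout=0.1)
        except queue.Empty:
            continue
        send_fn(msg)
        email_q.task_done()
```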
### 7. No SQLAlchemy Connection Pooling Config
No `pool_size`, `pool_recycle`, or `pool_pre_ping` set. After Postgres migration, connection exhaustion is likely under load.
**Fix in `config.py`:**
```python
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 20,
    "pool_recycle": 3600,
    "pool_pre_ping": True,
}
```
### 8. No HTTP Connection Pooling to Keycloak
`keycloak.py` uses raw `requests.request()` per call — no connection reuse. Under load, this creates/tears down TCP connections constantly.
**Fix:** Use a module-level `requests.Session` with `HTTPAdapter(pool_maxsize=20)`.
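A sketch of that fix; the factory name is an assumption, the `requests` calls are standard:

```python
import requests
from requests.adapters import HTTPAdapter

def make_kc_session(pool_maxsize=20):
    """One shared Session so TCP/TLS connections to Keycloak are reused
    instead of being re-established on every call."""
    s = requests.Session()
    adapter = HTTPAdapter(pool_connections=4, pool_maxsize=pool_maxsize)
    s.mount("https://", adapter)
    s.mount("http://", adapter)
    return s

# Module-level, as the fix suggests; keycloak.py would then call
# KC_SESSION.request(...) instead of requests.request(...).
KC_SESSION = make_kc_session()
```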
---
## Medium Severity
### 9. Settings Read on Every Access
`Setting.get(key)` hits DB every call. Dashboard stats alone calls it 4+ times per request. No in-process cache.
**Fix:** Request-scoped or TTL-based LRU cache for settings.
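A TTL-cache sketch; the names and the 30-second TTL are assumptions, and `fetch` stands in for the DB read done by `Setting.get(key)`:

```python
import time

_settings_cache: dict = {}
SETTINGS_TTL_SECONDS = 30  # assumption; settings change rarely

def cached_setting(key, fetch, now=time.monotonic):
    """Serve a setting from memory for SETTINGS_TTL_SECONDS, falling back
    to `fetch` (the DB read) on miss or expiry."""
    hit = _settings_cache.get(key)
    if hit is not None and now() - hit[1] < SETTINGS_TTL_SECONDS:
        return hit[0]
    value = fetch(key)
    _settings_cache[key] = (value, now())
    return value
```

A short TTL keeps admin setting changes visible within seconds while eliminating the repeated per-request reads.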
### 10. Feedback Screenshots as Base64 in DB
Screenshots stored as base64 JSON arrays in a TEXT column. List endpoint returns full screenshots for all records — multi-MB responses.
**Fix:** Exclude screenshots from list; return only on individual GET. Long-term: move to S3/object storage.
### 11. Startup Blocks on KC Sync
`run.py` calls `sync_users_from_keycloak()` synchronously before accepting requests. With 1000+ KC users, this is 1000+ API calls taking 30s+. Kubernetes readiness probes may kill the pod.
**Fix:** Move startup sync to background thread; serve requests immediately.
### 12. Audit Log — Unbounded, Unindexed
No retention cleanup (default: keep forever). `details` column searched via `contains()` = full table scan. Will degrade as audit entries grow.
### 13. License Files as BLOBs in DB
`LicenseFile.file_data` stored as `LargeBinary`. 50MB files loaded entirely into memory for download. No streaming.
### 14. Helm Resource Limits Too Low
Backend: 100m CPU / 256Mi RAM request. This will get CPU-throttled immediately under load.
### 15. Nginx Missing Gzip and Rate Limiting
No compression configured. No `limit_req`. Frontend responses sent uncompressed.
### 16. React Query — No staleTime
All hooks use default `staleTime: 0`. Every component mount triggers a refetch. At 500 concurrent users, this multiplies backend load.
---
## What's Already Good
- **KC queue architecture** — All KC writes are queued to a background worker with retry, batching, and multi-pod safety (`SKIP LOCKED`). This is production-grade.
- **Permission caching** — Only 1 KC call at login; permissions cached in session for the entire session lifetime.
- **Request-path KC calls minimized** — Writes never happen inline. Only reads for group/member management.
- **Security headers, CORS, auth decorators** — Solid foundation.
---
## Priority Roadmap for 500 Concurrent Users
| Priority | Item | Effort |
|----------|------|--------|
| **P0** | Migrate to Postgres | Medium (change `DATABASE_URL`, add pooling config, test) |
| **P0** | Redis sessions | Small (add `flask-session[redis]`, set `SESSION_TYPE="redis"`) |
| **P0** | Increase gunicorn workers + add gevent | Small (Dockerfile change) |
| **P1** | Server-side pagination on all list endpoints | Medium (backend + frontend changes) |
| **P1** | Fix N+1 in list_project_members | Medium (batch KC reads or cache locally) |
| **P1** | Async email via queue | Small (reuse existing KC queue pattern) |
| **P2** | SQLAlchemy pool config | Small |
| **P2** | KC HTTP connection pooling | Small |
| **P2** | Settings cache | Small |
| **P2** | Helm resource limits bump | Small |
| **P3** | Audit log indexing + retention | Small |
| **P3** | Move screenshots/files to object storage | Medium |
| **P3** | Frontend staleTime + code splitting | Small |
The P0 items are hard blockers — the app simply cannot serve 500 concurrent users without Postgres, Redis sessions, and more workers. P1 items will cause degraded performance at scale. P2/P3 are optimizations that matter under sustained load.