Update CLAUDE.md with downtime domain, add scalability assessment doc
Add missing downtime tracker documentation to CLAUDE.md (blueprint, routes, models, frontend components, hooks, design decisions, work hours settings). Add docs/scalability-assessment.md with full analysis targeting ~3K users / 500 concurrent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
14  CLAUDE.md
@@ -38,12 +38,13 @@ frontend/ React + TypeScript
   users/      User-specific components (columns, form)
   licenses/   License-specific components
   certs/      Cert-specific components (cert-form with passphrase reveal)
+  downtime/   Downtime components (downtime-columns with work-hours classification, downtime-form)
   dashboard/  Dashboard widgets (stats-overview with sponsorship metrics, expiring-items, recent-activity)
   feedback/   Feedback modal (screenshot attach with auto-scaling)
   layout/     App shell, sidebar (with admin nav), topbar, protected-route
 hooks/        React Query hooks (use-projects, use-users, use-project-users,
-              use-project-members, use-licenses, use-certs, use-feedback,
-              use-settings, use-keycloak-groups, use-keycloak-sync)
+              use-project-members, use-licenses, use-certs, use-downtime,
+              use-feedback, use-settings, use-keycloak-groups, use-keycloak-sync)
 contexts/     Auth context provider (is_admin flag, permissions helpers, OIDC logout)
 lib/          API client (axios), utils, constants
 
@@ -81,7 +82,7 @@ cd frontend && npm run build # Production build (also runs tsc)
 ## Architecture Conventions
 
 ### Backend
-- Flask blueprints: one per domain (auth, projects, users, licenses, certs, dashboard, feedback, settings)
+- Flask blueprints: one per domain (auth, projects, users, licenses, certs, dashboard, downtime, feedback, settings)
 - All API routes prefixed with `/api/`
 - All list endpoints return wrapped objects: `{"projects": [...]}`, `{"licenses": [...]}`, `{"feedback": [...]}`, etc.
 - Single-item endpoints return: `{"project": {...}}`, `{"feedback": {...}}`, etc.
@@ -104,7 +105,9 @@ cd frontend && npm run build # Production build (also runs tsc)
 - Cert private keys and passphrases encrypted at rest with Fernet; cert PEM stored unencrypted (public data)
 - License and cert status computed via hybrid properties, not stored columns. License status: archived → pending (no purchase date or future purchase date) → perpetual → expired → expiring_soon → active
 - Feedback screenshots stored as JSON array of base64 strings in the database
-- Settings stored as key-value pairs via the Setting model (expiry thresholds, notification config, sponsorship settings, etc.)
+- Downtime Tracker API: `/api/downtime` (list/create/update/delete). Fields: application, start_time, end_time, cause, lessons_learned, resolution, enclave (IL5/IL6 comma-separated), scope (disabled/limited), planned (boolean). Search spans application, cause, lessons_learned, resolution, submitted_by.
+- Settings stored as key-value pairs via the Setting model (expiry thresholds, notification config, sponsorship, work hours, etc.)
+- Work hours settings: `work_hours_start`, `work_hours_end`, `work_hours_timezone` — used by downtime tracker to classify events as during/after work hours
 - Email notifications via SMTP (configurable per-alert recipients, disabled when SMTP_HOST not set)
 
 ### Frontend
@@ -126,6 +129,8 @@ cd frontend && npm run build # Production build (also runs tsc)
 - Members tab on project detail shows KC-based members with sponsor status and action buttons (sponsor/release/remove)
 - Cert form includes passphrase reveal button (fetches decrypted passphrase on demand)
 - Dashboard stats overview includes sponsorship metrics (sponsored count, unsponsored count with warning badge)
+- Downtime Tracker page with filterable table (application, enclave, scope, planned), work-hours classification column (during/after based on configurable work hours settings), and inline create/edit/delete
+- Admin settings page uses unified `SettingRow` layout across grouped cards (Thresholds & Scheduling, Project & User Defaults, Security)
 
 ### Naming
 - Backend: snake_case for Python, kebab-case for URL paths
@@ -153,6 +158,7 @@ cd frontend && npm run build # Production build (also runs tsc)
 - **Email notifications** — SMTP-based with per-alert recipient configuration. Supports notifications for license/cert expiry, new feedback, user changes, and sponsorship release. Disabled when `SMTP_HOST` is not set.
 - **Cert encryption** — Fernet key persists to `.fernet_key` file in dev, env var in production. Changing the key makes existing encrypted data unrecoverable.
 - **Cert passphrase storage** — PKCS12 import passphrases optionally stored encrypted (Fernet) in `passphrase_encrypted` column. Retrieved via separate endpoint for security. PKCS12 export is still done without a passphrase (explicit product decision).
+- **Downtime Tracker** — Logs application outage events with start/end times, cause, resolution, and lessons learned. Enclave field supports multi-select (IL5, IL6) stored as comma-separated string, serialized as array in API. Scope (disabled/limited) and planned (boolean) classify the nature of outages. Work-hours classification computed client-side using `date-fns-tz` against configurable work hours settings.
 - **Backend replicas must stay at 1** while using SQLite (no concurrent writes). Scale freely after Postgres migration.
 
 ## Stubs (Not Yet Implemented)
134  docs/scalability-assessment.md  (new file)
@@ -0,0 +1,134 @@
# Scalability Assessment: ~3K Users, 500 Concurrent

*Date: 2026-03-19*

## Blockers (Must fix before scaling)

### 1. SQLite — Cannot Scale Past 1 Replica

Single-writer limitation means 1 pod, 2 gunicorn workers = **max 2 concurrent requests**. At 500 concurrent users, request queueing will be catastrophic. **Postgres migration is prerequisite #1.**

### 2. Filesystem Sessions — Users Lose Login on Pod Switch

`SESSION_TYPE = "filesystem"` in `config.py:39`. Each pod has its own ephemeral disk. Load balancer sends user to Pod B → session doesn't exist → forced re-login. Even with 1 replica, a pod restart loses all sessions.

**Fix:** Switch to Redis-backed sessions (`SESSION_TYPE = "redis"`).
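
A minimal sketch of that change, assuming the `flask-session` and `redis` packages; `REDIS_URL` is an assumed environment variable, not an existing config key:

```python
# config.py sketch (hypothetical names) — sessions live in Redis, so any
# pod can validate any user's session and restarts lose nothing.
import os

import redis

SESSION_TYPE = "redis"  # replaces SESSION_TYPE = "filesystem"
SESSION_REDIS = redis.from_url(
    os.environ.get("REDIS_URL", "redis://localhost:6379/0")
)
SESSION_PERMANENT = False
```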

### 3. No Pagination — All List Endpoints Return Everything

Every list endpoint (`/projects`, `/users`, `/licenses`, `/certs`, `/feedback`, `/downtime`) does `query.all()` with no offset/limit. With thousands of records, these will time out or OOM. Frontend tables are client-side paginated, meaning the entire dataset is loaded into browser memory.
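
One way to retrofit this — a hypothetical helper (names illustrative, not the project's actual API) that clamps the query params before applying `.limit()`/`.offset()` in SQLAlchemy:

```python
# Sketch: clamp page/per_page query params to safe bounds so a client
# can never request an unbounded result set.
def paginate_params(args, default_per_page=50, max_per_page=200):
    """Return (page, per_page) clamped to safe bounds."""
    try:
        page = max(int(args.get("page", 1)), 1)
    except (TypeError, ValueError):
        page = 1
    try:
        per_page = int(args.get("per_page", default_per_page))
    except (TypeError, ValueError):
        per_page = default_per_page
    per_page = min(max(per_page, 1), max_per_page)
    return page, per_page

# The existing wrapped-object responses could then carry paging metadata,
# e.g. {"licenses": [...], "page": 2, "per_page": 50, "total": 1234}
```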

### 4. Gunicorn: 2 Workers, No Async

`backend/Dockerfile` hardcodes `--workers 2 --timeout 120`. Two sync workers mean two concurrent requests per pod. A single slow Keycloak call or SMTP send blocks 50% of capacity.

**Fix:** `--workers 8 --worker-class gevent` (or at least `(2 x CPU) + 1` sync workers).
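
gunicorn can also read these settings from a Python config file instead of hardcoded Dockerfile flags — a sketch, assuming a `gunicorn.conf.py` added next to the Dockerfile (gevent requires the `gevent` package in the image):

```python
# gunicorn.conf.py sketch — worker count derived from available CPUs
# rather than hardcoded at 2.
import multiprocessing

workers = (2 * multiprocessing.cpu_count()) + 1  # rule of thumb for sync workers
worker_class = "gevent"  # async workers; drop this line to stay sync
timeout = 120
```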

---

## High Severity

### 5. N+1 Keycloak Calls in List Project Members

`GET /api/projects/<key>/members` makes **2 KC API calls per member** (get_user + get_user_sessions). A project with 100 members = **~214 KC HTTP calls**. This endpoint will time out at scale.

**Fix:** Batch-fetch user data; cache sponsor attributes locally; persist last_login to the DB (partially started).
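
The batching idea in outline — `kc_list_group_members` is a hypothetical wrapper around Keycloak's bulk "get group members" admin endpoint, not an existing function in this codebase:

```python
# Sketch: one bulk fetch for the whole group instead of 2 KC calls per
# member. Field names mirror typical Keycloak user records.
def list_project_members(group_id, kc_list_group_members):
    users = kc_list_group_members(group_id)  # 1 call, not 2 per member
    return [
        {
            "id": u["id"],
            "username": u.get("username"),
            "sponsor": (u.get("attributes") or {}).get("sponsor"),
        }
        for u in users
    ]
```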

### 6. Synchronous Email Sending

`send_email()` blocks the request handler for 1-10s per SMTP call. Called during sponsor release and member removal — user-facing latency.

**Fix:** Queue emails the same way KC writes are queued (DB-backed queue + worker).
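
The outbox pattern, sketched with `sqlite3` purely for illustration — the real implementation would reuse the existing KC queue tables and worker, and all table/column names here are hypothetical:

```python
# Sketch: request handlers insert a row; a background worker delivers it.
import json
import sqlite3

def init_outbox(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS email_outbox "
        "(id INTEGER PRIMARY KEY, payload TEXT NOT NULL, sent INTEGER DEFAULT 0)"
    )

def enqueue_email(conn, to, subject, body):
    # No SMTP latency on the request path — just one INSERT.
    conn.execute(
        "INSERT INTO email_outbox (payload) VALUES (?)",
        (json.dumps({"to": to, "subject": subject, "body": body}),),
    )
    conn.commit()

def drain_outbox(conn, send_fn):
    # Worker loop body: deliver unsent rows, then mark them done.
    for row_id, payload in conn.execute(
        "SELECT id, payload FROM email_outbox WHERE sent = 0"
    ).fetchall():
        send_fn(json.loads(payload))
        conn.execute("UPDATE email_outbox SET sent = 1 WHERE id = ?", (row_id,))
    conn.commit()
```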

### 7. No SQLAlchemy Connection Pooling Config

No `pool_size`, `pool_recycle`, or `pool_pre_ping` set. After the Postgres migration, connection exhaustion is likely under load.

**Fix in `config.py`:**

```python
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 20,
    "pool_recycle": 3600,
    "pool_pre_ping": True,
}
```

### 8. No HTTP Connection Pooling to Keycloak

`keycloak.py` uses raw `requests.request()` per call — no connection reuse. Under load, this creates and tears down TCP connections constantly.

**Fix:** Use a module-level `requests.Session` with `HTTPAdapter(pool_maxsize=20)`.

---

## Medium Severity

### 9. Settings Read on Every Access

`Setting.get(key)` hits the DB on every call. Dashboard stats alone calls it 4+ times per request. No in-process cache.

**Fix:** Request-scoped or TTL-based LRU cache for settings.
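
The TTL variant fits in a few lines — a sketch where the real `Setting.get` lookup is passed in as `loader` (all names here are illustrative):

```python
# Sketch: tiny TTL cache in front of the settings lookup.
import time

_cache = {}
_TTL_SECONDS = 60.0

def cached_setting(key, loader):
    """Return a cached value, calling loader() only after the TTL lapses."""
    now = time.monotonic()
    entry = _cache.get(key)
    if entry is not None and now - entry[1] < _TTL_SECONDS:
        return entry[0]
    value = loader()
    _cache[key] = (value, now)
    return value
```

The 60s staleness window is fine for settings like work hours or expiry thresholds, which change rarely.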

### 10. Feedback Screenshots as Base64 in DB

Screenshots stored as base64 JSON arrays in a TEXT column. The list endpoint returns full screenshots for all records — multi-MB responses.

**Fix:** Exclude screenshots from the list response; return them only on individual GET. Long-term: move to S3/object storage.
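
The exclusion is a one-flag serializer change — a sketch with illustrative field names, not the actual model:

```python
# Sketch: only the single-item GET pays the base64 cost.
def feedback_to_dict(fb, include_screenshots=False):
    data = {"id": fb["id"], "message": fb["message"]}
    if include_screenshots:
        data["screenshots"] = fb.get("screenshots", [])
    return data
```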

### 11. Startup Blocks on KC Sync

`run.py` calls `sync_users_from_keycloak()` synchronously before accepting requests. With 1000+ KC users, this is 1000+ API calls taking 30s+. Kubernetes readiness probes may kill the pod.

**Fix:** Move the startup sync to a background thread; serve requests immediately.
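
A minimal sketch — `sync_fn` stands in for `sync_users_from_keycloak`, and the function name is hypothetical:

```python
# Sketch: run the one-time KC sync off the request path so the WSGI
# server (and readiness probe) is up immediately.
import threading

def start_background_sync(sync_fn):
    t = threading.Thread(target=sync_fn, name="kc-startup-sync", daemon=True)
    t.start()
    return t
```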

### 12. Audit Log — Unbounded, Unindexed

No retention cleanup (default: keep forever). The `details` column is searched via `contains()` = full table scan. Will degrade as audit entries grow.

### 13. License Files as BLOBs in DB

`LicenseFile.file_data` stored as `LargeBinary`. 50MB files are loaded entirely into memory for download. No streaming.

### 14. Helm Resource Limits Too Low

Backend: 100m CPU / 256Mi RAM request. This will get CPU-throttled immediately under load.

### 15. Nginx Missing Gzip and Rate Limiting

No compression configured. No `limit_req`. Frontend responses are sent uncompressed.

### 16. React Query — No staleTime

All hooks use the default `staleTime: 0`. Every component mount triggers a refetch. At 500 concurrent users, this multiplies backend load.

---

## What's Already Good

- **KC queue architecture** — All KC writes are queued to a background worker with retry, batching, and multi-pod safety (`SKIP LOCKED`). This is production-grade.
- **Permission caching** — Only 1 KC call at login; permissions are cached in the session for its entire lifetime.
- **Request-path KC calls minimized** — Writes never happen inline; only reads for group/member management.
- **Security headers, CORS, auth decorators** — Solid foundation.

---

## Priority Roadmap for 500 Concurrent Users

| Priority | Item | Effort |
|----------|------|--------|
| **P0** | Migrate to Postgres | Medium (change `DATABASE_URL`, add pooling config, test) |
| **P0** | Redis sessions | Small (add `flask-session[redis]`, set `SESSION_TYPE="redis"`) |
| **P0** | Increase gunicorn workers + add gevent | Small (Dockerfile change) |
| **P1** | Server-side pagination on all list endpoints | Medium (backend + frontend changes) |
| **P1** | Fix N+1 in list_project_members | Medium (batch KC reads or cache locally) |
| **P1** | Async email via queue | Small (reuse existing KC queue pattern) |
| **P2** | SQLAlchemy pool config | Small |
| **P2** | KC HTTP connection pooling | Small |
| **P2** | Settings cache | Small |
| **P2** | Helm resource limits bump | Small |
| **P3** | Audit log indexing + retention | Small |
| **P3** | Move screenshots/files to object storage | Medium |
| **P3** | Frontend staleTime + code splitting | Small |

The P0 items are hard blockers — the app literally cannot serve 500 concurrent users without Postgres, Redis sessions, and more workers. P1 items will cause degraded performance at scale. P2/P3 are optimizations that matter at sustained load.