Algodesks — Multi-Tenant SaaS for Algorithmic Trading
Production multi-tenant SaaS that unifies the algorithmic trading loop from research to live execution. Users discover or auto-generate strategies, walk-forward-validate them, and promote winning portfolios to a real exchange with one click — same domain code path in research and production, no silent re-implementation seam.
- Role: AI Architect & Lead Engineer
- Stack: Python 3.13 · FastAPI · Next.js 15 · PostgreSQL · Redis · Docker · Railway
- Period: 2026 to present


The problem
Retail and prosumer traders who want to research, validate, and run algorithmic strategies face a fragmented toolchain: backtesting lives in one notebook, optimization in another, live execution somewhere else entirely. Every step re-implements the same primitives (risk sizing, signal logic, fee gates, trailing stops), and the gap between "looks good in backtest" and "survives live trading" is where most strategies silently die. Lookahead bias, overfitting, and execution drift turn promising research into losing trades.
The idea
Algodesks unifies the full research-to-production loop in a single product. A user discovers strategies (or has the system generate them autonomously via AutoBuild — a discovery → data-coverage check → walk-forward optimization → portfolio-assembly pipeline), validates them on historical OHLCV data, and promotes a winning portfolio to live trading on a real exchange with one click. The same domain code path runs inside the backtest engine, the optimizer, and the live runner — so what wins in research is what executes in production. No re-implementation seam, no silent divergence.

Architectural decisions
Clean / hexagonal architecture
The domain layer holds immutable value objects (TrendFilterConfig, TrailingStopConfig, EntryQualityPreset, BodyRatioConfig, …) and entities (Backtest, Portfolio, LiveSession, AutoBuildJob). Every external integration sits behind a narrow Protocol: IUserRepository, IOptimizationRunner, IDataAutoFetcher, IPortfolioBuilder, IEventRepository. Production wiring uses Postgres / Redis / subprocess adapters; tests inject in-memory fakes.
Why: swapping the backtest engine from StubEngine to the legacy subprocess engine was a one-line DI change. Multi-tenancy was retrofitted by threading user_id through one constructor, not by rewriting business logic. The cost of strict layering pays back the first time you need to change anything.
Multi-tenant from day zero
Every database row carries user_id; every Redis key is namespaced {resource}:{user_id}:{id}. Tenant scoping is enforced at the repository boundary, not in the routes — so a future endpoint that forgets to pass user_id won't leak across tenants because the repo simply returns nothing. Per-tenant Fernet encryption protects exchange API credentials. Cross-tenant isolation has its own e2e test suite.
Why: retrofitting tenancy at the route layer is how data-leak CVEs get written. Putting the filter one layer deeper makes the default behaviour secure.
Autonomous-pipeline orchestration (AutoBuild)
AutoBuild is the AI/agentic core of the system. Given a constraint set, it autonomously:
- Discovers candidate instruments and ranks them by Bybit liquidity.
- Preflights data coverage on disk; auto-fetches missing ranges from upstream with timeout + classified failure modes (`timeout`, `no_data`, `unsupported_symbol`, `exception`).
- Optimises each viable symbol via parameter-grid search with walk-forward fit/test split (70/30 default): only candidates whose fit-window winners survive the held-out test window get accepted.
- Assembles the accepted legs into a balanced portfolio, ready to promote to live trading.
The whole pipeline is cooperatively cancellable: a single flag, checked at every loop iteration, propagates down to a parallel SIGTERM of in-flight subprocesses with a 10-second grace window.
Why: lookahead bias and overfitting are structural failure modes in this domain. The validation has to be the architecture, not an afterthought.

Event-driven progress UX
Long-running jobs emit typed events over WebSocket (/ws/autobuild/{job_id}, /ws/events) with HTTP polling as a fallback for dropped connections. Event payloads carry rich diagnostic data — bar counts, on-disk size, wall-clock duration, classified error kinds — so the user sees what happened and why, not just a green check / red X.
Why: users watching a 30-minute optimization need something on screen. A frozen modal kills trust faster than a slow job.
Engineering decisions
Type discipline at every seam
- Pydantic v2 schemas at the HTTP edge, dataclasses + value objects in the domain, SQLAlchemy 2 models in persistence. No `dict[str, Any]` travels between layers.
- TypeScript on the frontend with strict discriminated unions for WebSocket event variants.
- Schema and entity converters live in dedicated modules, so wire-format changes don't ripple into the domain.
Defensive runtime
Every external call (exchange API, backtest subprocess, upstream OHLCV fetcher, OAuth provider) is wrapped with:
- Per-call timeout (`asyncio.wait_for`) so a hung upstream can't anchor the entire job.
- Classified failure modes with upstream-message capture, so the UI can group failures into actionable buckets.
- Defense-in-depth catches at orchestrator level — a misbehaving implementer can't crash the run loop.
A stuck-job sweeper requeues orchestrator runs after pod restarts. Quota gates, token-bucket rate limiting, and Sentry on every uncaught exception round out the production hardening.
Security posture
- Exchange credentials are Fernet-encrypted at rest with a `SECRET_KEY` provisioned per environment. Rotation is manual-ops by design: rotating mid-trade would silently brick live sessions, which is a far worse outcome than the rotation friction.
- Google OIDC for auth, short-lived JWT sessions, HTTP-only secure cookies.
- Audit log on every admin write; security headers via middleware; CORS / CSP correctly scoped.
Test pyramid that actually pyramids
- 2,000+ unit tests on pure domain logic — no I/O, sub-second runtime, fast feedback on every commit.
- Integration tests against real Postgres + Redis in CI.
- Programmatic e2e tests driving the live API exactly as a real user would.
Every cross-tenant bug, every lookahead-bias regression, every silent numeric overflow was caught by a test before it shipped to users.
Pragmatic over dogmatic
- Bundled the legacy backtest engine into the production Docker image rather than splitting it into a microservice. Why: the legacy engine has years of validation behind its trading math; service-boundary purity is a worse trade than correctness.
- Chose Railway over Kubernetes for a team of one. Time-to-deploy beats theoretical scale that doesn't exist yet, and Railway's primitives (volumes, environments, autodeploy) cover everything I actually need today.
- Bypassed Cloudflare's 100 MB body limit by routing large CSV uploads to the Railway-direct backend URL: a pragmatic workaround, documented in the ops runbook.



Outcomes
Production multi-tenant SaaS live at algodesks.com and api.algodesks.com. AutoBuild can autonomously construct a 50-leg portfolio across 500+ instruments in tens of minutes — a workflow that would take a quant analyst days by hand. Validated research artefacts (dynamic-whitelist hybrid strategy, entry-quality filters, trend filters, trailing stops) are baked into the product as configurable, sweepable axes.
What I personally owned
Domain modelling, architectural decisions, infra topology, security review, the multi-tenant retrofit, the AutoBuild design + implementation, walk-forward optimiser, live-trading runner, frontend architecture, and the production deploy pipeline. End-to-end ownership from problem framing to a paying-customer-ready product — making the technical decisions, building the code, validating with real strategies, and shipping it.
Stack snapshot
| Layer | Choices |
|---|---|
| Backend | Python 3.13, FastAPI, SQLAlchemy 2 + Alembic, Pydantic v2, asyncio, ccxt, yfinance |
| Frontend | Next.js 15 (App Router, route groups, AuthGuard), React 18, Tailwind CSS, shadcn/ui, TanStack Query |
| Data | PostgreSQL (transactional), Redis (Streams + KV), file-based OHLCV cache on Railway Volume |
| Auth & Security | Google OIDC, JWT sessions, Fernet symmetric encryption, RBAC permissions, audit log |
| Infra | Docker multi-stage builds with uv, Railway, Cloudflare proxy + DNS |
| Observability | Structured logging, Sentry, custom event store with /ws/events activity feed |