# Cost Optimization & Multi-Model Playbook
Muxa’s proxy layer sits between IDE clients and upstream providers, which means every request flows through the same optimization pipeline. This document explains how the pieces interact, how to enable them, and what to expect as you accumulate traffic.
## Optimization Stack

| Layer | What it does | When it helps |
|---|---|---|
| Prompt Cache (`MUXA_PROMPT_CACHE_*`) | Exact request → response reuse; 0 ms turnaround | Repeated Codex/Continue prompts, regressions, CI bots |
| Semantic Cache (`MUXA_SEMANTIC_CACHE_*` + embeddings provider) | Vector-similarity reuse with tolerance | Follow-up prompts that paraphrase previous asks |
| Memory Store (`MUXA_MEMORY_*`) | Extracts reusable snippets and appends them as system context | Long-lived sessions, collaboration between IDEs |
| Headroom Compression (`MUXA_HEADROOM_*`) | Compresses chat history, removes redundancy | Large conversations, “o1-style” debugging sessions |
| Hybrid Routing (`MUXA_ROUTING_STRATEGY=hybrid`) | Routes complex prompts/tool calls to the fallback provider | Expensive requests, vendor-specific tools |
## How Requests Flow

- Ingress – `/v1/messages` (Anthropic) or `/v1/responses` (OpenAI) lands on the proxy.
- Cache checks – Prompt cache first, then semantic cache (if configured). Cache hits short-circuit the request.
- Memory injection – When enabled, the memory store injects top-K snippets before routing.
- Routing & provider execution – `executeWithRouting` scores complexity and selects the configured provider (single or hybrid).
- Headroom compression – For long conversations, the compression engine rewrites the prompt before calling upstream.
- Persistence & metrics – Token usage is recorded for the `/api/tokens/*` endpoints and `/metrics`.
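The flow above can be sketched as a simple handler chain. This is an illustrative model only: the layer and function names (`promptCache`, `route`, `recordTokens`, etc.) are hypothetical stand-ins, not Muxa’s actual internals.

```javascript
// Illustrative sketch of the request pipeline order. All handler names
// are hypothetical — Muxa's real implementation differs.
function handleRequest(req, layers) {
  // 1–2. Cache checks: a hit short-circuits everything downstream.
  const cached = layers.promptCache(req) ?? layers.semanticCache(req);
  if (cached) return { response: cached, cacheHit: true };

  // 3. Memory injection: prepend top-K snippets as system context.
  const enriched = layers.memory ? layers.memory(req) : req;

  // 4. Routing picks a provider; 5. compression rewrites long prompts.
  const provider = layers.route(enriched);
  const compact = layers.compress(enriched);

  // 6. Upstream call, then token-usage persistence for the metrics endpoints.
  const response = provider(compact);
  layers.recordTokens(compact, response);
  return { response, cacheHit: false };
}
```

The key property to notice is that a cache hit returns before routing, compression, or any upstream call happens.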
## Enabling the Features

```shell
# Common toggles (usually added to .env)
export MUXA_PROMPT_CACHE_ENABLED=true
export MUXA_PROMPT_CACHE_TTL_MS=120000
export MUXA_SEMANTIC_CACHE_ENABLED=true
export MUXA_SEMANTIC_CACHE_THRESHOLD=0.9
export OLLAMA_BASE_URL=http://localhost:11434   # or OPENAI/OPENROUTER embeddings
export OLLAMA_EMBEDDINGS_MODEL=nomic-embed-text
export MUXA_MEMORY_ENABLED=true
export MUXA_MEMORY_TOPK=3
export MUXA_HEADROOM_ENABLED=true
export MUXA_HEADROOM_MODE=optimize              # audit | optimize
export MUXA_ROUTING_STRATEGY=hybrid             # single | hybrid
export MUXA_FALLBACK_PROVIDER=openai            # e.g., openai, openrouter
```
Restart the server (`npm start`, or your Docker container) after updating the environment. Headroom starts a sidecar automatically; if it crashes, `POST /headroom/restart` revives it.
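To make the semantic-cache threshold concrete: `MUXA_SEMANTIC_CACHE_THRESHOLD=0.9` means a cached answer is reused when two prompt embeddings are at least that similar. A minimal sketch of that decision, assuming cosine similarity over embedding vectors (the standard measure; whether Muxa uses exactly this internally is an assumption):

```javascript
// Cosine similarity of two embedding vectors: dot product divided by
// the product of their magnitudes. Ranges from -1 (opposite) to 1 (identical).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Reuse the cached response only when similarity clears the threshold.
function shouldReuse(queryVec, cachedVec, threshold = 0.9) {
  return cosineSimilarity(queryVec, cachedVec) >= threshold;
}
```

A higher threshold means fewer but safer reuses; 0.9 is a conservative starting point for paraphrased follow-up prompts.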
## When You’ll See Results
| Feature | Time-to-value | Notes |
|---|---|---|
| Prompt Cache | Immediate after a second identical prompt | Ideal for “run tests”, “summarize file” loops |
| Semantic Cache | After ~10 prompts (enough vectors to compare) | Requires an embeddings provider; watch `/metrics/semantic-cache` |
| Memory Store | After at least one answer deemed “surprising” (by length/threshold) | Use `/api/tokens/stats` to confirm extra context is added |
| Headroom Compression | Visible once conversations exceed ~1,500 tokens | `/metrics/compression` reports saved tokens/dollars |
| Hybrid Routing | As soon as `complexityScore` crosses thresholds | `/routing/stats` lists the last 25 decisions |
Actual savings depend entirely on workload, prompt mix, and provider pricing. Use `/metrics` plus upstream billing to validate improvements in your own environment.
Typical pattern:
- Day 0 – Prompt cache and hybrid routing reduce spike costs immediately.
- Day 1–2 – Semantic cache kicks in as embeddings accumulate; memory store injects helpful reminders.
- Week 1 – Compression/Headroom reports measurable reductions (10–40% of prompt tokens in long sessions).
## Multi-Model & Provider Strategy

- Single routing (`MUXA_ROUTING_STRATEGY=single`) – All traffic goes to `MUXA_PRIMARY_PROVIDER`. Use when latency and deterministic behavior matter most.
- Hybrid routing – Requests are scored by prompt length + tool usage: `complexityScore <= 5` → primary provider (often cheaper/faster); `complexityScore > 5` or tool-heavy payloads → fallback provider.
- Fallback semantics – If the primary provider errors, the fallback is used regardless of strategy.
- Model aliases & fallbacks – `OPENAI_MODEL_ALIASES` and `OPENAI_MODEL_FALLBACKS` let you map IDE-only names (e.g., `gpt-5.3-codex`) to real upstream SKUs without editing client configs.
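A rough sketch of the hybrid decision, using the documented cutoff (`complexityScore <= 5` → primary, otherwise → fallback). The scoring formula itself is an assumption for illustration; only the threshold comparison comes from the docs above.

```javascript
// Hypothetical complexity scoring: long prompts and tool-heavy payloads
// raise the score. Muxa's actual formula may differ — only the
// "score <= 5 → primary provider" cutoff is documented.
function complexityScore(req) {
  const promptTokens = Math.ceil((req.prompt ?? "").length / 4); // rough token estimate
  let score = promptTokens / 500;          // long prompts raise the score slowly
  score += (req.tools?.length ?? 0) * 2;   // each tool raises it quickly
  return score;
}

function pickProvider(req, { primary, fallback }) {
  return complexityScore(req) <= 5 ? primary : fallback;
}
```

Under this sketch a short chat prompt stays on the cheap primary, while a 20k-character prompt or a request carrying three tools tips over to the fallback.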
## Example Layout

```shell
MUXA_PRIMARY_PROVIDER=openrouter
OPENROUTER_API_KEY=sk-...
OPENROUTER_MODEL=gpt-4o-mini
MUXA_FALLBACK_PROVIDER=openai
OPENAI_API_KEY=sk-live...
OPENAI_MODEL=gpt-4o-mini
OPENAI_MODEL_FALLBACKS=gpt-4o-mini-2024-07-18,gpt-5.2
```
Cursor or Codex can request `gpt-5.3-codex`; the alias map in `src/config/index.js` transparently reroutes it to the configured provider + fallback chain.
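The alias-then-fallback resolution can be sketched as below. The helper and data shape are assumptions mirroring the env vars above, not the code in `src/config/index.js`.

```javascript
// Hypothetical resolution: rewrite an IDE-only name via the alias map,
// then append that model's fallback chain in order.
function resolveModels(requested, aliases, fallbacks) {
  const primary = aliases[requested] ?? requested;
  return [primary, ...(fallbacks[primary] ?? [])];
}

// Mirrors the example layout: OPENAI_MODEL_ALIASES / OPENAI_MODEL_FALLBACKS.
const aliases = { "gpt-5.3-codex": "gpt-4o-mini" };
const fallbacks = { "gpt-4o-mini": ["gpt-4o-mini-2024-07-18", "gpt-5.2"] };
```

A request for `gpt-5.3-codex` thus resolves to `gpt-4o-mini` first, falling back through the pinned snapshot and `gpt-5.2` if upstream errors occur; unknown names pass through unchanged.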
## Observability & Verification

- `/dashboard` – Summaries, routing log, token ratios, and a manual refresh button.
- `/metrics/compression` – Headroom stats: `totalSavings`, `compressedRequests`.
- `/metrics/semantic-cache` – Hit/miss counts (0 until embeddings are configured).
- `/api/tokens/stats` – Aggregate input/output tokens for ROI calculations.
- `/routing/stats` – Last 25 routing decisions, providers involved, latency.
- `/metrics/prometheus` – Scrape-friendly totals for dashboards/alerts.
Recommended workflow:
- Enable logging with `MUXA_LOG_RESPONSES=true` when onboarding a new client (Codex, Continue, Claude Code) to ensure requests hit the proxy.
- Drive a realistic workload for a day.
- Capture snapshots of `/metrics`, `/metrics/compression`, `/metrics/semantic-cache`, and `/routing/stats`.
- Compare against baseline provider invoices to quantify token savings.
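For the comparison step, a small helper like the following can turn two `/api/tokens/stats` snapshots into a savings figure. The field names (`inputTokens`, `outputTokens`) are assumptions about the response shape; adapt them to what your instance actually returns.

```javascript
// Compare token totals from a baseline period and an optimized period
// and report absolute and percentage savings. Field names are assumed.
function tokenSavings(baseline, optimized) {
  const base = baseline.inputTokens + baseline.outputTokens;
  const opt = optimized.inputTokens + optimized.outputTokens;
  return {
    savedTokens: base - opt,
    savedPercent: (100 * (base - opt)) / base,
  };
}
```

Multiply `savedTokens` by your provider’s per-token price to estimate dollar savings, then sanity-check against the actual invoice.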
## Best Practices

- Warm caches intentionally – Seed semantic/prompt caches by replaying regression suites or “common tasks” via the Codex CLI.
- Name sessions – Supply `session_id`/`user` fields when possible so memory and token stats attribute usage cleanly.
- Monitor fallback cost – If fallback traffic stays high, tune the `MUXA_ROUTING_STRATEGY` thresholds or adjust providers.
- Disable noisy logs in production – Use `MUXA_LOG_RESPONSES` only for debugging; rely on `/metrics` for steady-state observability.
- Document provider contracts – When multiple teams share the proxy, agree on which IDEs use which models to avoid surprise bills.
## Troubleshooting

- Cache miss rate remains 100% – Verify `MUXA_PROMPT_CACHE_ENABLED=true` and that clients aren’t randomizing prompts (some IDEs append timestamps).
- Semantic cache never initializes – Ensure the embeddings provider env vars are set; check `/metrics/semantic-cache` for `configured: false`.
- Compression savings = 0 – The Headroom sidecar may be disabled; hit `/headroom/status` and restart via `POST /headroom/restart`.
- Fallback provider never used – Hybrid routing only triggers when the score crosses thresholds; confirm `MUXA_ROUTING_STRATEGY=hybrid`.
With the optimization stack enabled and observed, Muxa becomes a centralized “brain” for IDE assistants: switch providers once, gain savings everywhere, and monitor costs in one place.