Performance · PERF002 · 2026-03-17 · 5 min read

See Where Every Millisecond Goes in Your LLM Calls

[Figure: HTML waterfall report saved to .stablestack/timing/rag_pipeline_1710702847.html]

Wall-Time Waterfall: sequential phases that add to response time. Total wall-time: 3.85s.
Categories: LLM, SQL/DB, Stream, Guardrail, Other.

  request_setup      @ 0ms       3ms    0.1%
  auth_check         @ 3ms      15ms    0.4%
  query_embedding    @ 18ms    180ms    4.7%
  ┃ parallel queries ↓
  vector_search      @ 198ms   120ms    3.1%
  user_history       @ 198ms    85ms    2.2%
  policy_lookup      @ 198ms    95ms    2.5%
  context_assembly   @ 318ms    12ms    0.3%
  llm_call           @ 330ms   2.78s   72.2%
  guardrail_check    @ 3.11s   218ms    5.7%
  citation_lookup    @ 3.11s    45ms    1.2%
  token_streaming    @ 3.33s   510ms   13.2%
  cache_write        @ 3.84s    12ms    0.3%

  Sum of phases: 4.08s (105.8%)

LLM API calls are the slowest part of most AI applications. A single call to GPT-4 or Claude can take 1-10 seconds. But when your endpoint takes 8 seconds to respond, how much of that is prompt building? How much is the actual API call? How much is parsing the response? Without instrumentation, you're guessing.

The problem

AI coding assistants tend to build LLM pipelines as monolithic functions. A single function constructs the prompt, calls the API, parses the response, maybe retries on failure, and the whole thing reads as one opaque block. When it's slow, you don't know which part is slow.

We hit this ourselves building AI features at scale. A summarization endpoint was taking 12 seconds. Was it the prompt template rendering? The API call? The JSON parsing? We added print timestamps everywhere, did the mental math, removed the prints, repeated. It was painful enough that we built a proper solution.

One command to install

$ stablestack add-llm-timing
Created stablestack_timer.py
Added .stablestack/timing/ to .gitignore
Created .claude/commands/llm-timing.md

ChatTimer is a zero-dependency Python module that gives you phase-level timing for any LLM operation. The single command above installs the module, gitignores the report directory, and adds the Claude Code slash command.

How it works

from stablestack_timer import ChatTimer

timer = ChatTimer("rag_pipeline")
with timer.measure("query_embedding", category="llm", metadata={"model": "text-embedding-3-small"}):
    embedding = embed(query)
with timer.measure("vector_search", category="db", metadata={"index": "documents"}):
    docs = search(embedding)
with timer.measure("llm_call", category="llm", metadata={"model": "gpt-4", "max_tokens": 2048}):
    response = client.chat.completions.create(
        model="gpt-4", messages=build_messages(query, docs)
    )
with timer.measure("guardrail_check", category="guardrail"):
    validate(response)
with timer.measure("token_streaming", category="stream"):
    stream_to_client(response)
timer.report()

Wrap each phase of your LLM operation in a context manager, as in the example above. ChatTimer records each phase's start and end times relative to the operation start, and you can tag phases with categories and attach metadata like model names or token counts.
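The recording mechanism itself is simple. Here is a minimal sketch of the idea behind a phase-timing context manager; MiniTimer is a hypothetical stand-in for illustration, not ChatTimer's actual implementation:

```python
import time
from contextlib import contextmanager

class MiniTimer:
    """Toy phase timer sketching the idea behind ChatTimer.measure()."""

    def __init__(self, name):
        self.name = name
        self.start = time.perf_counter()
        # Each entry: (label, category, offset_s, duration_s, metadata)
        self.phases = []

    @contextmanager
    def measure(self, label, category="other", metadata=None):
        begin = time.perf_counter()
        try:
            yield
        finally:
            end = time.perf_counter()
            self.phases.append(
                (label, category, begin - self.start, end - begin, metadata or {})
            )

timer = MiniTimer("demo")
with timer.measure("sleep_phase", category="other"):
    time.sleep(0.05)  # stands in for real work like an API call
```

Because the finally block always fires, a phase gets timed even when the code inside it raises, which is exactly when you most want the numbers.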

Phase categories

Every phase can be tagged with a category — LLM, SQL/DB, Stream, Guardrail, Context, or Other. Categories drive color-coded dots in the console output, colored bars in the HTML waterfall, and a stacked overview bar that shows you the time distribution at a glance. When most of your time is cyan (LLM), you know where to optimize. When blue (DB) dominates, your bottleneck is data access, not the model.
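The stacked overview bar boils down to per-category aggregation. A sketch of the same computation, assuming a simple (label, category, duration_ms) tuple layout rather than ChatTimer's real internal format:

```python
from collections import defaultdict

def category_share(phases):
    """Each category's share of total summed phase time, in percent.

    phases: iterable of (label, category, duration_ms) tuples.
    """
    totals = defaultdict(float)
    for _label, category, duration_ms in phases:
        totals[category] += duration_ms
    grand = sum(totals.values())
    return {cat: round(100 * ms / grand, 1) for cat, ms in totals.items()}

shares = category_share([
    ("query_embedding", "llm", 180),
    ("vector_search", "db", 120),
    ("llm_call", "llm", 2780),
    ("guardrail_check", "guardrail", 218),
    ("token_streaming", "stream", 510),
])
# With these numbers the llm category dominates at roughly 78%.
```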

Console waterfall

LLM Timing: rag_pipeline (4190ms)
  ● query_embedding   @ 0ms     █░░░░░░░░░░░░░░░   180ms   4.3%
  ● vector_search     @ 180ms   █░░░░░░░░░░░░░░░   137ms   3.3%
  ● llm_call          @ 317ms   ██████████░░░░░░  2780ms  66.3%  ⚠ bottleneck
  ● guardrail_check   @ 3.10s   █░░░░░░░░░░░░░░░   218ms   5.2%
  ● token_streaming   @ 3.32s   ██░░░░░░░░░░░░░░   510ms  12.2%
  Sum of phases: 3825ms (91.3%)

timer.report() prints the ASCII waterfall shown above, with category-colored dots, start offsets showing when each phase began, and a parallelism summary. Bottlenecks over 1 second get flagged automatically.

Parallel phase detection

When you run phases concurrently (e.g. a guardrail check alongside citation lookup), ChatTimer detects the overlap. The "sum of phases" line shows the total duration of all phases added together. If it exceeds 100% of wall time, you have parallelism — and the report tells you exactly how much. This is the metric that tells you whether your pipeline is actually utilizing concurrency or just running everything sequentially.
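The overlap arithmetic is easy to reproduce. A sketch using an assumed (start_ms, duration_ms) layout, with three retrieval queries launched at the same offset:

```python
def parallelism_ratio(phases):
    """Sum of phase durations divided by wall time.

    phases: list of (start_ms, duration_ms) pairs (an assumed layout,
    not ChatTimer's). A ratio above 1.0 means phases overlapped in
    time, i.e. the pipeline actually ran work concurrently.
    """
    starts = [s for s, _ in phases]
    ends = [s + d for s, d in phases]
    wall = max(ends) - min(starts)
    total = sum(d for _, d in phases)
    return total / wall

# Three queries all start at 198ms and overlap almost entirely:
ratio = parallelism_ratio([(0, 180), (198, 120), (198, 85), (198, 95), (318, 12)])
```

Here the phases sum to 492ms against 330ms of wall time, a ratio of about 1.49, which is the same "over 100%" signal the report surfaces.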

HTML waterfall visualization

timer.render_html_report() generates a self-contained HTML waterfall with a stacked overview bar showing time distribution by category, a category legend, phase timeline bars positioned by start offset for parallel visualization, metadata tooltips on hover, and a bottleneck analysis section. The dark theme matches StableStack branding and the HTML has zero external dependencies.

Reports are saved to .stablestack/timing/ (already gitignored) so they don't clutter your project. Generate them during development, share them in pull requests, or collect them in CI to track latency regressions over time.

Metadata on phases

with timer.measure("llm_call", category="llm", metadata={
    "model": "gpt-4",
    "max_tokens": 2048,
    "temperature": 0.7,
    "prompt_tokens": 1847,
    "completion_tokens": 512
}):
    response = client.chat.completions.create(...)

Attach arbitrary key/value metadata to any phase — model name, token count, database query, cache hit/miss. Metadata appears as hover tooltips in the HTML waterfall report, giving you full context without cluttering the visualization. This is especially useful for comparing runs: did latency spike because you switched models, or because the prompt got longer?
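That run-to-run comparison is mostly a metadata diff. A small illustrative helper, not part of ChatTimer:

```python
def diff_metadata(old, new):
    """Keys whose values differ between two runs' phase metadata,
    mapped to (old_value, new_value) pairs."""
    keys = set(old) | set(new)
    return {k: (old.get(k), new.get(k))
            for k in sorted(keys) if old.get(k) != new.get(k)}

changed = diff_metadata(
    {"model": "gpt-4", "prompt_tokens": 1847, "completion_tokens": 512},
    {"model": "gpt-4", "prompt_tokens": 3210, "completion_tokens": 512},
)
# Only prompt_tokens changed, so a latency spike between these two
# runs points at prompt growth, not a model switch.
```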

PERF002: automatic detection

StableStack's PERF002 checker scans your codebase for LLM API calls that aren't wrapped in timing instrumentation. It detects OpenAI (.chat.completions.create), Anthropic (.messages.create), and LangChain (.invoke/.ainvoke) patterns. If a call is inside a ChatTimer.measure() block, a @timed decorator, or has a noqa comment, it's suppressed.

Run stablestack . and PERF002 tells you exactly which calls are uninstrumented. Use the /llm-timing slash command in Claude Code and it instruments them automatically — separate phases for prompt building, API call, and response parsing.

Why we made this free

LLM latency is the single biggest complaint from users of AI-powered applications. If you can see where the time goes, you can fix it — switch to a faster model for simple tasks, cache repeated prompts, parallelize independent calls, or optimize your prompt templates. ChatTimer makes that visibility automatic.

Like our dashboard and notifications, this is free with every StableStack install. No license key, no signup. One command, instant latency visibility.

PERF002 is available with a StableStack license.

pip install stablestack