You shipped an LLM-powered feature. Users love it. Then the invoice arrives. Nobody can explain where $4,000 in API costs went last Tuesday.

LLMs are black boxes in production. You can’t see how many tokens each request burns, which model is slower, or why a batch job at 3 AM quietly retried thousands of failed completions and doubled your daily spend. Traditional APM tools are starting to add LLM support, though coverage and pricing vary; some bundle it in, others charge extra. Dedicated LLM observability platforms offer deeper insight out of the box, though many require a proprietary SDK or proxy that ties your instrumentation to a single vendor.

This guide is for teams that already run Elastic (or are evaluating it) and want to add LLM observability without adopting a separate vendor. If you’re using a different backend, the OpenTelemetry instrumentation still applies. Only the exporter configuration changes.

In this post, I’ll walk you through building a full LLM monitoring stack using open standards: OpenTelemetry, OpenLLMetry, and Elastic APM. By the end, you’ll have cost tracking, latency metrics, error correlation, and multi-model comparison running in Kibana, with vendor-neutral telemetry (OpenTelemetry + OTLP) and a swappable instrumentation layer (OpenLLMetry is one option among several).


1. OpenTelemetry in 60 Seconds

Before we talk about LLMs, let’s ground ourselves in the observability standard that makes all of this possible.

OpenTelemetry Basics — Traces, Spans, Exporters, Collectors

OpenTelemetry (OTel) has four building blocks you need to know:

  • Traces capture the full journey of a request through your system, from the HTTP endpoint down to the database query.
  • Spans are individual operations within a trace. Each span has a name, duration, status, and arbitrary key-value attributes.
  • Exporters ship your trace data out of the application, typically via the OTLP protocol.
  • Collectors receive, process, and route telemetry data to your backend of choice (Elastic, Jaeger, Datadog, or anything that speaks OTLP).

Why does OTel matter for LLM apps? Because it’s vendor-neutral and composable. You instrument once and send data anywhere. When your observability needs change (and they will), you swap the backend, not the instrumentation code.


2. Enter OpenLLMetry

OpenTelemetry handles generic telemetry. LLM calls have unique attributes (model names, token counts, prompt content, system identifiers) that standard OTel instrumentation doesn’t capture. That’s where OpenLLMetry comes in.

How OpenTelemetry concepts map to OpenLLMetry

OpenLLMetry is Traceloop’s open-source instrumentation layer built on top of OpenTelemetry. It maps cleanly to OTel’s concepts:

OTel Concept | OpenLLMetry Equivalent
------------ | ----------------------
Trace | @workflow decorator
Span | @task decorator
Attributes | Auto-captured gen_ai.* fields
Exporter | Same OTLP exporter — unchanged

When you decorate a function with @task, OpenLLMetry automatically captures gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.system, and more. No manual attribute setting required. The @workflow decorator creates a top-level span that groups related @task spans into a single trace hierarchy, giving you end-to-end visibility into multi-step LLM operations.

Some gen_ai.* attribute names are evolving as the semantic conventions mature. Check the latest spec for current names.


3. Before vs After: The Code

Compare LLM instrumentation with raw OpenTelemetry versus OpenLLMetry.

Before vs After — manual instrumentation vs OpenLLMetry auto-instrumentation

The Hard Way: Manual OpenTelemetry

First, the setup boilerplate. About a dozen lines before you write any business logic:

from opentelemetry import trace
from opentelemetry.trace import StatusCode  # used to set span status below
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(
    endpoint="http://otel-collector:4318/v1/traces"
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("recipe-service")

Then, every single LLM call function needs ~25 lines of manual instrumentation:

def call_openai(prompt, model="gpt-4"):
    with tracer.start_as_current_span("call_openai") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", 0.7)
        span.set_attribute("gen_ai.request.max_tokens", 2000)

        try:
            response = openai_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7, max_tokens=2000
            )
            span.set_attribute("gen_ai.response.model", response.model)
            span.set_attribute("gen_ai.usage.input_tokens",
                               response.usage.prompt_tokens)
            span.set_attribute("gen_ai.usage.output_tokens",
                               response.usage.completion_tokens)
            span.set_status(StatusCode.OK)
            return response.choices[0].message.content
        except Exception as e:
            span.set_status(StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise

That’s five imports, manual provider/processor/exporter wiring, manual span creation, manual attribute setting for every gen_ai field, manual response capture, and manual error handling. Repeat this for every LLM function in your codebase.

The Easy Way: OpenLLMetry

Setup is trivial: two imports, one init call.

import os
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task

Traceloop.init(
    app_name="recipe-generator-service",
    api_endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT",
                           "http://otel-collector:4318"),
    disable_batch=False
)

Production safety: prompts, PII, and trace size. OpenLLMetry can log prompts and completions into span attributes. That’s useful for debugging, but it’s also a way to leak sensitive data into your observability backend. Disable content tracing in production unless you explicitly need it: export TRACELOOP_TRACE_CONTENT=false. With content tracing disabled, OpenLLMetry still records metadata (model name, token counts, latency) without capturing prompts or responses. A few more rules of thumb:

  • Don’t log raw customer identifiers; hash them or use a surrogate ID.
  • Treat traces as a data store where retention and access control matter.
  • If you must log content, redact aggressively.

See Traceloop’s privacy docs for selective per-workflow controls.

The business logic stays clean. Just add decorators:

@task(name="call_openai")
def call_openai(prompt, model="gpt-4", temperature=0.7):
    """All gen_ai.* attributes are captured automatically"""
    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an expert chef..."},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,
        max_tokens=2000
    )
    return {
        "recipe": response.choices[0].message.content,
        "model": response.model,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens
        }
    }

@workflow(name="recipe_generation_workflow")
def generate_recipe(provider, dish_name, cuisine_type="Italian", servings=4):
    prompt = generate_recipe_prompt(dish_name, cuisine_type, servings)
    result = call_openai(prompt)  # provider-based routing elided for brevity
    return result

The key takeaway: zero manual span or attribute management. OpenLLMetry intercepts the OpenAI and Anthropic client libraries, captures all the gen_ai.* attributes automatically, and your functions contain nothing but business logic.

What about OTel’s own auto-instrumentation? The opentelemetry-instrumentation-openai package also auto-captures gen_ai.* attributes without manual span code. The tradeoff is that it gives you auto-instrumentation for LLM calls but not the @workflow/@task decorator model for grouping business logic into named trace hierarchies. OpenLLMetry adds that layer on top, which is why I chose it for this project. If you only need LLM call telemetry without workflow grouping, the native OTel instrumentation is a lighter option.
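For a sense of the lighter option, this is roughly what the native route looks like. A sketch only: the import path follows the standard OTel instrumentor convention, so verify it against the exact package and version you install.

# Native OTel auto-instrumentation for OpenAI calls (no workflow/task grouping).
from opentelemetry.instrumentation.openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument()

# From here on, chat.completions.create() calls emit spans with gen_ai.* attributes,
# but grouping them into named workflow traces is left to you (e.g., manual spans).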


4. System Architecture

The full pipeline from your Flask app to a Kibana dashboard.

System Architecture — Flask to OpenLLMetry to OTel Collector to APM to ES to Kibana

The stack has six components, all running in Docker Compose on a single bridge network:

  1. Flask App: Your Python application, instrumented with OpenLLMetry decorators.
  2. OpenLLMetry SDK: Auto-instruments OpenAI and Anthropic client libraries, captures gen_ai.* attributes, exports via OTLP/HTTP.
  3. OpenTelemetry Collector: Receives OTLP/HTTP on port 4318, batches spans, applies memory limits, and routes to APM Server via OTLP/gRPC.
  4. APM Server: Natively ingests OTLP trace data and indexes it into Elasticsearch.
  5. Elasticsearch: Stores all trace data (see the field mapping note below for how gen_ai.* attributes are indexed).
  6. Kibana: Provides the APM UI with service maps, trace waterfalls, and custom dashboards.

Field mapping reality (Elastic APM + OTel): Only a subset of OpenTelemetry attributes are mapped to first-class Elastic fields. Unmapped attributes are stored under labels.* (strings) or numeric_labels.* (numbers), with dots replaced by underscores. For example: gen_ai.request.model → labels.gen_ai_request_model, and gen_ai.usage.input_tokens → numeric_labels.gen_ai_usage_input_tokens. See Elastic’s OTel attributes docs for the full mapping table.

The data flow is straightforward: your app sends OTLP/HTTP to the Collector, which forwards OTLP/gRPC to APM Server, which writes to Elasticsearch. Kibana reads from Elasticsearch. No custom adapters, no proprietary protocols.

I considered sending OTLP directly from the app to APM Server, skipping the Collector entirely. It works for a single service, but the Collector gives you a buffer for backpressure, a place to add processors later (attribute filtering, sampling), and decouples your app from the backend topology. For anything beyond a demo, it’s worth the extra container.

The Docker Compose overview:

services:
  elasticsearch:    # v8.11.0 — single node, 512MB heap
  kibana:           # v8.11.0 — connected to ES
  otel-collector:   # contrib v0.91.0 — OTLP receiver + APM exporter
  apm-server:       # v8.11.0 — OTLP to Elastic APM format
  flask-app:        # Python 3.11 — OpenLLMetry instrumented

networks:
  observability:
    driver: bridge

See the full docker-compose.yml and otel-collector-config.yaml in the companion repo.


5. Building a Multi-Agent Workflow

The real test for an APM stack is multi-agent systems where multiple models collaborate, run in parallel, retry on failure, and feed each other’s outputs. That’s the scenario I built to put this stack through its paces.

Multi-Agent Workflow — 4 AI agents: Coordinator, Chef, Sommelier, Nutritionist

The Restaurant Menu Designer orchestrates 4 AI agents to create a complete fine-dining menu:

Agent | Model | Role
----- | ----- | ----
Menu Coordinator | GPT (OpenAI) | Strategic planning — designs the course structure
Executive Chef | Claude (Anthropic) | Creative — generates detailed recipes for each course
Nutritionist | Claude Haiku (Anthropic) | Analytical — reviews nutritional compliance, approves or requests changes
Sommelier | GPT (OpenAI) | Expert pairing — matches wines to each course

The workflow runs in 5 phases:

  1. Coordinator plans the menu structure (sequential)
  2. Parallel research: Chef creates recipes, Nutritionist researches dietary guidelines, Sommelier develops a pairing strategy (concurrent via ThreadPoolExecutor)
  3. Recipe refinement: Nutritionist reviews each recipe, Chef iterates based on feedback (nested workflow with retry logic, max 3 iterations)
  4. Wine pairing: Sommelier pairs each course (includes automatic retry on incomplete results)
  5. Final assembly: combine everything into the complete menu
@workflow(name="restaurant_menu_design_workflow")
def design_restaurant_menu(cuisine="Italian"):
    # Phase 1: Coordinator plans (GPT)
    menu_plan = coordinator_plan_menu_structure(cuisine)

    # Phase 2: Parallel execution (Chef + Nutritionist + Sommelier)
    research = parallel_agent_research(menu_plan)

    # Phase 3: Refinement loop (Claude ↔ Claude Haiku)
    refined = refine_recipes_with_feedback(research["recipes"])

    # Phase 4: Wine pairing (GPT, with retry)
    wines = pair_wines_with_courses(refined)

    # Phase 5: Final assembly (helper name illustrative)
    return assemble_final_menu(menu_plan, refined, wines)

This creates a deeply nested trace with 15-20 spans: parallel execution branches, cross-model calls (GPT for planning, Claude for creativity), retry attempts visible as iteration counters, and the full agent-to-agent data flow. It’s the perfect stress test for any APM system, and it renders beautifully in Kibana’s waterfall view.

Trace waterfall in Kibana APM — nested spans showing parallel agent execution and cross-model calls

Context propagation across threads is not automatic in Python. If you run tasks in a thread pool, you need threading context propagation so spans created in worker threads remain parented to the originating workflow span.

Traceloop/OpenLLMetry enables OpenTelemetry’s threading context propagation so the active trace context follows work scheduled onto threads. The goal is to propagate context, not to generate extra telemetry.
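If you were wiring the thread pool yourself, explicit propagation looks roughly like this. A sketch under assumptions: the three agent functions are placeholders for the workflow above, and OpenLLMetry’s threading instrumentation makes this manual step unnecessary.

from concurrent.futures import ThreadPoolExecutor
from opentelemetry import context as otel_context

def with_current_context(fn, *args):
    """Capture the caller's trace context and re-attach it inside the worker thread."""
    ctx = otel_context.get_current()
    def wrapper():
        token = otel_context.attach(ctx)
        try:
            return fn(*args)
        finally:
            otel_context.detach(token)
    return wrapper

def parallel_agent_research(menu_plan):
    # Submit wrapped callables so worker-thread spans stay parented to the workflow span.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(with_current_context(chef_create_recipes, menu_plan)),
            pool.submit(with_current_context(nutritionist_research_guidelines, menu_plan)),
            pool.submit(with_current_context(sommelier_develop_strategy, menu_plan)),
        ]
        return [f.result() for f in futures]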

This was the most time-consuming debugging problem in the entire project. The symptoms are subtle: everything works, traces appear, but parallel tasks show up as orphaned root spans instead of children of the workflow. If you see parallel tasks rendered as separate traces in Kibana, check context propagation before anything else.

Full implementation: app/app.py


6. What You Can Monitor

Once traces are flowing into Elastic APM, this is what you get out of the box.

What We Can Monitor — Token usage, latency, cost, errors, multi-model comparison, traces, prompts, alerts

Eight capabilities, each powered by the gen_ai.* attributes that OpenLLMetry captures automatically:

Capability | What You See | Why It Matters
---------- | ------------ | --------------
Token Usage | Input/output tokens per call, per model | Optimize prompts, catch token inflation
Latency | Response time (avg, P95) per model and endpoint | SLA monitoring, provider comparison
Cost Tracking | Dollar cost per call, per workflow, per model | Budget control, cost allocation
Error Detection | Failed LLM calls with stack traces and retry counts | Reliability monitoring, root cause analysis
Multi-Model Comparison | Side-by-side metrics across GPT, Claude, etc. | Informed model selection
Trace Correlation | Full request path from HTTP endpoint to LLM call | Debug complex multi-agent workflows
Prompt Logging | System/user prompts and completions stored in span attributes | Audit trail, prompt debugging
Alerts | Kibana alerting rules on any metric | Token budget alerts, latency spikes, error rate thresholds

In the Kibana APM UI, you can explore these through the service map (see your app’s dependencies on LLM providers), trace waterfall (drill into individual requests), and span metadata (inspect every gen_ai.* attribute on each LLM call).

Sampling warning: If you sample traces (e.g., 10%), dashboards built from trace data will underestimate tokens and cost unless you compensate. For cost governance, prefer always-on LLM spans or emit token/cost counters as separate OTel metrics alongside traces.

A note on streaming: The examples in this post use non-streaming completions. OpenLLMetry supports streaming, but the behavior differs. Token counts are accumulated incrementally and the span closes when the stream finishes, which can affect latency measurements. If your app streams responses, verify instrumentation behavior with your specific provider SDK version before relying on the metrics.

Traces vs. metrics for cost tracking. This post uses trace-based cost attribution, which gives you per-request granularity. At scale, you may also want to emit OTel metrics (e.g., a gen_ai.cost.total_usd counter and a gen_ai.latency histogram) alongside traces. Metrics are pre-aggregated, cheaper to store, and aren’t affected by trace sampling. A common production pattern is: always-on metrics for dashboards and alerts, sampled traces for debugging specific requests.
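A minimal sketch of that metrics side-channel using the OTel metrics API; the instrument names mirror the span attributes but are my own choice, not a semantic-convention standard, and a MeterProvider with an OTLP exporter must be configured for anything to leave the process.

from opentelemetry import metrics

meter = metrics.get_meter("llm.cost")

# Always-on instruments: unaffected by trace sampling and cheap to store.
cost_counter = meter.create_counter(
    "gen_ai.cost.total_usd", unit="usd", description="Estimated LLM spend")
latency_histogram = meter.create_histogram(
    "gen_ai.latency", unit="ms", description="LLM call latency")

def record_llm_call(model, cost_usd, latency_ms):
    attrs = {"gen_ai.request.model": model}
    cost_counter.add(cost_usd, attrs)
    latency_histogram.record(latency_ms, attrs)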


7. The Dashboard

All of this data is great in the APM trace view, but for day-to-day monitoring you want a dashboard. I built an 8-panel Kibana Lens dashboard that gives you the full LLM observability picture at a glance.

LLM Observability Dashboard — 8 panels showing cost, tokens, latency, and error metrics across models

The dashboard is organized in three rows:

Row 1, Cost Analysis:

  • Cost Distribution by Model: Donut chart showing what percentage of your total spend goes to each model
  • Cost per Call by Model: Metric tiles showing average cost per LLM call (e.g., GPT-4 $0.025 vs Claude $0.012)
  • Cost over Time: Line chart tracking spending trends per model

Row 2, Token Usage:

  • Token Usage by Model: Stacked bar showing total input + output tokens per model
  • Input vs Output Token Ratio: Average input vs output tokens per call. Helps you spot verbose prompts.
  • Total Tokens by Model: Compare with cost distribution to identify the most token-efficient models

Row 3, Latency & Reliability:

  • Response Latency by Model: Average and P95 latency per model (sourced from span.duration.us)
  • Error Rates by Model: Success vs failure outcomes per model

The dashboard is defined as Kibana saved objects in NDJSON format. To import it:

curl -X POST "http://localhost:5601/api/saved_objects/_import?overwrite=true" \
  -u elastic:changeme \
  -H "kbn-xsrf: true" \
  --form file=@kibana/llm-observability-dashboards.ndjson

Dashboard definition and setup script: kibana/

How to build or modify panels (field names that matter)

Because unmapped OTel attributes land under labels.* (strings) or numeric_labels.* (numbers), your token/cost fields will look like:

  • numeric_labels.gen_ai_usage_input_tokens
  • numeric_labels.gen_ai_usage_output_tokens
  • numeric_labels.gen_ai_cost_total_usd
  • labels.gen_ai_request_model (or labels.gen_ai_response_model)

Elastic stores span duration as span.duration.us.

Example Lens formulas:

  • Total tokens = sum(numeric_labels.gen_ai_usage_input_tokens) + sum(numeric_labels.gen_ai_usage_output_tokens)
  • Avg cost per call = sum(numeric_labels.gen_ai_cost_total_usd) / count()
  • P95 latency = percentile(span.duration.us, 95)

If your field names differ, open any LLM span in APM -> Metadata and copy the exact field names from there.
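If you’d rather sanity-check the numbers outside Lens, a quick aggregation against the APM data streams works too. A sketch using the Elasticsearch Python client; the traces-apm* index pattern and the elastic/changeme credentials match this demo’s defaults, so adjust for your cluster.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "changeme"))

# Sum estimated LLM cost per model over the last 24 hours.
resp = es.search(
    index="traces-apm*",
    size=0,
    query={"bool": {"filter": [
        {"exists": {"field": "numeric_labels.gen_ai_cost_total_usd"}},
        {"range": {"@timestamp": {"gte": "now-24h"}}},
    ]}},
    aggs={"by_model": {
        "terms": {"field": "labels.gen_ai_request_model"},
        "aggs": {"cost": {"sum": {"field": "numeric_labels.gen_ai_cost_total_usd"}}},
    }},
)
for bucket in resp["aggregations"]["by_model"]["buckets"]:
    print(bucket["key"], round(bucket["cost"]["value"], 4))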


8. Going Further: Automatic Cost Tracking

The gap that motivated the most interesting piece of engineering in this project: OpenLLMetry captures tokens but not dollar cost. It knows you used 500 input tokens and 1200 output tokens on claude-sonnet-4-5-20250929, but it doesn’t know what that costs.

The Architecture

LLM Cost Injection Architecture — how cost data flows from LiteLLM pricing through the span exporter

The solution is a custom CostEnrichingSpanExporter that wraps the real OTLP exporter, intercepting the export pipeline to inject cost attributes into LLM spans before they’re sent to the backend. It works by mutating span._attributes in place, which is not a public API and is therefore upgrade-fragile. I’m using it here to keep the demo small; for production, prefer emitting token/cost metrics via a separate OTel meter or enriching attributes before spans become read-only. If you do use this approach, pin your opentelemetry-sdk version and test after upgrades.

How It Works

Cost-Enriching Span Exporter — Decorator pattern intercepting LLM spans

The CostEnrichingSpanExporter implements the SpanExporter interface and wraps the original exporter:

class CostEnrichingSpanExporter(SpanExporter):
    def __init__(self, wrapped_exporter, pricing_db):
        self.wrapped_exporter = wrapped_exporter
        self.pricing_db = pricing_db

    def export(self, spans):
        for span in spans:
            if self._is_llm_span(span):       # Check for gen_ai.system attribute
                self._enrich_with_cost(span)    # Calculate and inject cost
        return self.wrapped_exporter.export(spans)  # Forward to real exporter

    def _is_llm_span(self, span):
        return 'gen_ai.system' in (span.attributes or {})

    def _enrich_with_cost(self, span):
        attrs = dict(span.attributes or {})
        model = attrs.get('gen_ai.response.model') or attrs.get('gen_ai.request.model')
        input_tokens = attrs.get('gen_ai.usage.input_tokens', 0)
        output_tokens = attrs.get('gen_ai.usage.output_tokens', 0)

        cost = self.pricing_db.get_cost(model, input_tokens, output_tokens)
        span._attributes.update(cost)  # Inject gen_ai.cost.* attributes

When export() is called by the BatchSpanProcessor, the wrapper:

  1. Filters for LLM spans (those with gen_ai.system attribute)
  2. Extracts model name and token counts from existing span attributes
  3. Looks up per-token pricing from the database
  4. Calculates gen_ai.cost.input_usd, gen_ai.cost.output_usd, and gen_ai.cost.total_usd
  5. Injects the cost attributes into the span
  6. Forwards everything to the wrapped exporter

Non-LLM spans pass through untouched. Negligible overhead.
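For orientation, this is how the wrapper would slot into a hand-wired SDK setup. A sketch reusing the boilerplate from Section 3 and the pricing class described below; the demo doesn’t do this, it patches the pipeline that Traceloop.init already built, which is covered in the next subsections.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Wrap the real OTLP exporter so exported LLM spans gain gen_ai.cost.* attributes.
enriching_exporter = CostEnrichingSpanExporter(
    wrapped_exporter=OTLPSpanExporter(endpoint="http://otel-collector:4318/v1/traces"),
    pricing_db=LiteLLMPricingDatabase(),
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(enriching_exporter))
trace.set_tracer_provider(provider)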

Cost math vs invoice reality: Token-based estimates can differ from provider billing due to system prompts, tool calls, cached tokens, rounding, tiered pricing, or provider-side adjustments. Treat this as an allocation and monitoring signal, not a perfect invoice replica.

Production alternative (OTel SpanProcessor): Instead of mutating spans inside the exporter, hook the span lifecycle with a custom SpanProcessor. Its on_end hook receives the finished (read-only) span, which is a clean place to read the token attributes and emit cost as a metric:

from opentelemetry import metrics
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor

class CostEnrichingSpanProcessor(SpanProcessor):
    def __init__(self, pricing_db):
        self.pricing_db = pricing_db
        # Cost is emitted as a metric because the finished span is read-only in on_end.
        self.cost_counter = metrics.get_meter("llm.cost").create_counter(
            "gen_ai.cost.total_usd", unit="usd")

    def on_end(self, span: ReadableSpan) -> None:
        if 'gen_ai.system' in span.attributes:
            cost = self.pricing_db.get_cost(
                span.attributes.get('gen_ai.response.model'),
                span.attributes.get('gen_ai.usage.input_tokens', 0),
                span.attributes.get('gen_ai.usage.output_tokens', 0),
            )
            self.cost_counter.add(cost['gen_ai.cost.total_usd'], {
                'model': span.attributes.get('gen_ai.response.model', 'unknown')
            })

This avoids private API dependencies and works with any OTel backend. The demo repo uses the exporter approach for simplicity. See app/llm_cost_injector.py for the full implementation.

The Pricing Database

LiteLLM Pricing Database — hundreds of models with per-token pricing

Where do we get pricing data for hundreds of models? From LiteLLM’s open-source pricing database. It’s a JSON file on GitHub with per-token pricing for every major provider: OpenAI, Anthropic, Google, Mistral, Cohere, and more.

The LiteLLMPricingDatabase class:

  • Syncs from GitHub on first startup
  • Caches locally to avoid network calls on subsequent starts
  • Auto-refreshes when the cache is older than 24 hours
  • Fuzzy matches model names: gpt-4o-2024-08-06 resolves to gpt-4o, provider prefixes like openai/gpt-4o are stripped automatically

I initially considered maintaining my own pricing JSON, but model pricing changes frequently enough that it would be stale within weeks. LiteLLM’s database is community-maintained and covers providers I haven’t even tested yet. The tradeoff is a GitHub dependency at startup, which the local caching mitigates.
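To make the mechanics concrete, here is a stripped-down sketch of such a class. The URL points at LiteLLM’s actual pricing file; the cache location, refresh logic, and fuzzy matching are simplified compared to the repo’s implementation.

import json
import time
import urllib.request
from pathlib import Path

PRICING_URL = ("https://raw.githubusercontent.com/BerriAI/litellm/main/"
               "model_prices_and_context_window.json")
CACHE_PATH = Path("litellm_pricing_cache.json")

class LiteLLMPricingDatabase:
    def __init__(self, max_age_hours=24):
        self.max_age_seconds = max_age_hours * 3600
        self.prices = self._load()

    def _load(self):
        # Refresh the local cache when it is missing or older than max_age.
        if not CACHE_PATH.exists() or time.time() - CACHE_PATH.stat().st_mtime > self.max_age_seconds:
            urllib.request.urlretrieve(PRICING_URL, CACHE_PATH)
        return json.loads(CACHE_PATH.read_text())

    def _resolve(self, model):
        # Strip provider prefixes (openai/gpt-4o -> gpt-4o), then longest-prefix match.
        name = (model or "").split("/")[-1]
        if name in self.prices:
            return name
        candidates = [k for k in self.prices if name.startswith(k)]
        return max(candidates, key=len) if candidates else None

    def get_cost(self, model, input_tokens, output_tokens):
        entry = self.prices.get(self._resolve(model)) or {}
        input_usd = (input_tokens or 0) * entry.get("input_cost_per_token", 0)
        output_usd = (output_tokens or 0) * entry.get("output_cost_per_token", 0)
        return {
            "gen_ai.cost.input_usd": input_usd,
            "gen_ai.cost.output_usd": output_usd,
            "gen_ai.cost.total_usd": input_usd + output_usd,
        }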

Wiring It Up

Exporter Wrapping Strategy — strategies for different OTel SDK versions

The tricky part is finding and wrapping the exporter inside Traceloop’s OpenTelemetry configuration. Different versions of the SDK organize their span processors differently, so the inject_llm_cost_tracking() function tries multiple strategies:

  1. Direct exporter wrapping: Find the BatchSpanProcessor, extract its exporter, wrap it with CostEnrichingSpanExporter, create a new processor
  2. Composite processor traversal: If Traceloop uses a composite processor with multiple children, iterate and wrap each BatchSpanProcessor
  3. Attribute-based discovery: Check _active_span_processor, _span_processors, and other internal attributes
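As a rough illustration, strategy 1 boils down to something like this. Treat it strictly as a sketch: the underscored attribute names are SDK internals that move between releases, which is exactly why the real inject_llm_cost_tracking() falls back through multiple strategies and why the SpanProcessor route above is the safer production choice.

from opentelemetry import trace
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def wrap_active_exporters(pricing_db):
    """Strategy 1 sketch: wrap the exporter of any BatchSpanProcessor on the global provider."""
    provider = trace.get_tracer_provider()
    active = getattr(provider, "_active_span_processor", None)        # SDK internal
    processors = getattr(active, "_span_processors", None) or ([active] if active else [])
    for proc in processors:
        exporter = getattr(proc, "span_exporter", None)               # SDK internal
        if isinstance(proc, BatchSpanProcessor) and exporter is not None:
            proc.span_exporter = CostEnrichingSpanExporter(exporter, pricing_db)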

The bootstrap is simple. Two lines after Traceloop.init():

from traceloop.sdk import Traceloop
from llm_cost_injector import inject_llm_cost_tracking

Traceloop.init(app_name="recipe-generator-service", ...)
inject_llm_cost_tracking()  # Wraps the exporter, loads pricing

After this, every LLM span automatically includes cost data. The attributes appear in Elastic APM as:

  • numeric_labels.gen_ai_cost_total_usd
  • numeric_labels.gen_ai_cost_input_usd
  • numeric_labels.gen_ai_cost_output_usd
  • labels.gen_ai_cost_provider
  • labels.gen_ai_cost_model_resolved

Full source: app/llm_cost_injector.py


9. Getting Started

Everything below is in the companion repo. The repo README has the most up-to-date quickstart and troubleshooting steps.

Prerequisites

  • Docker and Docker Compose
  • Python 3.11+
  • 8GB RAM minimum (Elasticsearch needs headroom)
  • OpenAI API key
  • Anthropic API key

Quick Start

# 1. Clone the repo
git clone https://github.com/maheshbabugorantla/llm-observability-with-elasticapm.git
cd llm-observability-with-elasticapm

# 2. Configure API keys
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY and ANTHROPIC_API_KEY

# 3. Start the full stack
docker compose up --build -d

# 4. Wait for services to be healthy
# Elasticsearch, Kibana, OTel Collector, APM Server, Flask app
# (takes ~60-90 seconds for full startup)

Generate Test Data

# Flask runs on port 5001 (5000 is reserved by AirPlay on macOS)

# Single recipe generation (OpenAI)
curl -X POST http://localhost:5001/recipe/generate \
  -H "Content-Type: application/json" \
  -d '{"provider": "openai", "dish_name": "Spaghetti Carbonara", "cuisine_type": "Italian", "servings": 4}'

# Compare providers (OpenAI vs Claude, same recipe)
curl -X POST http://localhost:5001/recipe/compare \
  -H "Content-Type: application/json" \
  -d '{"dish_name": "Pad Thai", "cuisine_type": "Thai", "servings": 2}'

# Multi-agent menu design (the full 4-agent workflow)
curl -X POST http://localhost:5001/menu/design \
  -H "Content-Type: application/json" \
  -d '{"cuisine": "Italian", "menu_type": "fine_dining", "courses": 3, "dietary_requirements": ["vegetarian_option"], "budget": "premium", "season": "spring", "occasion": "romantic_dinner"}'

Verify in Kibana

  1. Open http://localhost:5601 (login: elastic / changeme)
  2. Navigate to Observability > APM > Services. You should see recipe-generator-service
  3. Click into a transaction to see the trace waterfall with nested spans
  4. Click on any LLM span and check the Metadata tab for gen_ai.* attributes
  5. Import the dashboard: curl -X POST "http://localhost:5601/api/saved_objects/_import?overwrite=true" -u elastic:changeme -H "kbn-xsrf: true" --form file=@kibana/llm-observability-dashboards.ndjson

Conclusion

The entire stack runs locally in Docker Compose for development. The same architecture (OTel Collector, APM Server, Elasticsearch) scales to production with managed Elastic Cloud. Because every component speaks OTLP, you can swap the backend without touching your application code.

The cost enrichment layer is the most opinionated piece. The demo uses a span exporter wrapper that mutates private attributes, which is fine for local development; for production, move to a SpanProcessor or emit cost as a separate OTel metric (see Section 8).

Check out the full source code on GitHub. Star it if you find it useful.