LangSmith vs Langfuse vs Helicone: AI Agent Observability in Production (2026)

Helicone went into maintenance mode after Mintlify acquired it in March 2026. Langfuse joined ClickHouse. Here is how I picked an LLM observability platform across our six AI products in production — and which one I would skip.

By Fanny Engriana · May 2, 2026 · 10 min read

I switched our BizChat Revenue Assistant from a homegrown logging table to a real LLM observability stack in late February 2026, right after one of our smaller clients reported a $312 OpenAI bill on a day we expected $40. The runaway loop had been firing for six hours before anyone noticed. The week after that, I rolled out the same observability layer across SmartExam AI Generator and DocSumm AI Summarizer. Three platforms made the shortlist: LangSmith, Langfuse, and Helicone.

Things look very different now than they did when I started that evaluation. Helicone was acquired by Mintlify on March 3, 2026 and is now in maintenance-only mode. Langfuse was acquired by ClickHouse in January 2026 but remains actively developed. LangSmith continues to be LangChain's commercial flagship. If you're picking an observability platform in May 2026 or later, the landscape isn't what blog posts from 2025 said it was.

This is the breakdown I wish I'd had: real pricing, real overhead numbers, and the production-stack details that determine whether the tool will still be a fit two quarters from now.

Why Agent Observability Stopped Being Optional in 2026

When I built our first AI-powered product (DocSumm) in early 2025, "observability" meant a Laravel logger writing prompt + response pairs to a MySQL table. That stopped working the moment we shipped multi-step agents. A single user request to ContentForge AI Studio now fans out into 6-12 LLM calls — outline draft, section drafts, fact verification, image prompt generation, SEO check, final polish. When something goes wrong, the question isn't "what was the prompt?" It's "which of the twelve calls regressed, why, and against which model version?"

Three failure modes I've personally hit in production make traditional APM useless:

Tool-call retry loops. An agent gets a malformed JSON response from a tool, retries, gets the same malformed response, retries again. Without span-level tracing this looks like one slow request, not 47 wasted calls.
Silent prompt regression on framework upgrade. We bumped LangChain minor versions on a Friday afternoon (lesson: don't), and a system prompt template stopped interpolating one variable. Output quality dropped about 30% but no exception fired.
Cost spikes from runaway loops. The $312 day mentioned above. With proper trace-level cost attribution we'd have caught it within 15 minutes via an alert.
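The retry-loop case above is easy to flag once you have span-level data. Here is a minimal sketch, assuming each span records its trace ID, the tool name, and the serialized arguments. The `Span` shape and the `max_repeats` threshold are illustrative assumptions, not any platform's schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str
    tool: str        # name of the tool the agent called
    arguments: str   # serialized arguments, e.g. a JSON string


def find_retry_loops(spans, max_repeats=3):
    """Return {(trace_id, tool, arguments): count} for identical calls
    repeated more than max_repeats times within one trace."""
    counts = Counter((s.trace_id, s.tool, s.arguments) for s in spans)
    return {key: n for key, n in counts.items() if n > max_repeats}


# One trace stuck retrying the same call 47 times, one healthy trace.
spans = [Span("t1", "fetch_invoice", '{"id": 7}')] * 47
spans.append(Span("t2", "search", '{"q": "tax"}'))

loops = find_retry_loops(spans)
print(loops)
```

Grouping on identical arguments is the point: a legitimate agent may call the same tool many times, but rarely with exactly the same input.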

Across the seven aggregator sites we run for daily content imports (each averaging 100-200 LLM-touched records per day), per-record cost attribution alone has saved us roughly 22-28% on monthly token spend by surfacing which workflows actually need GPT-5.4 versus which ones run fine on Haiku 4.5.
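Per-record cost attribution boils down to tagging every LLM call with the workflow that triggered it and summing token cost per tag. This is a rough sketch of that aggregation. The model names and per-token prices are placeholders I made up for illustration, so swap in your provider's real rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens: (input, output).
PRICE_PER_MTOK = {
    "large-model": (2.50, 10.00),
    "small-model": (0.25, 1.25),
}


def call_cost(model, input_tokens, output_tokens):
    """Cost in USD of a single LLM call."""
    in_price, out_price = PRICE_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def cost_by_workflow(records):
    """Sum call cost per workflow tag."""
    totals = defaultdict(float)
    for r in records:
        totals[r["workflow"]] += call_cost(
            r["model"], r["input_tokens"], r["output_tokens"]
        )
    return dict(totals)


records = [
    {"workflow": "import-dedupe", "model": "large-model",
     "input_tokens": 1_000_000, "output_tokens": 100_000},
    {"workflow": "seo-check", "model": "small-model",
     "input_tokens": 1_000_000, "output_tokens": 100_000},
]
totals = cost_by_workflow(records)
print(totals)
```

Once the totals are split by workflow, the question of which ones justify the larger model becomes a spreadsheet exercise instead of a guess.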

LangSmith — The LangChain-Native Pick

LangSmith is built by the LangChain team, and that lineage shows in every screen. If your stack already runs on LangChain, LangGraph, or LangChain.js, LangSmith is the deepest integration available — node-by-node state diffs in graph execution, full agent execution trees, replay-against-new-model-version, and a prompt playground that pulls live traces as starting points.

What I liked

Effectively zero overhead. When I instrumented our LangGraph orchestrator on ContentForge, end-to-end p95 latency moved by less than a millisecond. For a customer-facing agent this matters more than dashboard polish.
Setup time of about 30 minutes. One env variable (LANGCHAIN_TRACING_V2=true), one API key, and every chain in the project starts auto-tracing.
Replay against new models. Before promoting Sonnet 4.6 to production we replayed the prior week's traces against it and caught two regressions in tool-calling JSON formatting that would have hit us on Monday.
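For reference, the setup step amounts to something like the sketch below. `LANGCHAIN_TRACING_V2` is the variable named above. `LANGCHAIN_API_KEY` is the key variable as I understand it, and newer SDK releases may prefer `LANGSMITH_`-prefixed names, so check the current docs. The helper function itself is hypothetical.

```python
import os


def enable_langsmith_tracing(api_key):
    """Set the tracing env vars; call this before any chain is constructed."""
    if not api_key:
        raise ValueError("a LangSmith API key is required")
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    # Assumed variable name; verify against the current LangSmith docs.
    os.environ["LANGCHAIN_API_KEY"] = api_key


# Placeholder key for illustration only.
enable_langsmith_tracing("ls-placeholder-key")
```

In practice you would set these in the deployment environment rather than in code. The function form just makes the missing-key failure explicit.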

What stung

Per-seat pricing. Plus is $39 per seat per month with no read-only tier. A seven-person team that wants observability access from a junior engineer to the product manager is staring at $273/month before any usage charges. That assumes everyone needs full access — there is no cheaper viewer-only role.
Free tier is genuinely small. 5,000 traces per month, 14-day retention, single seat. A real production agent doing 200 user requests a day with 8 spans each produces 1,600 spans a day; if every span counts against the quota, it's gone in under four days.
Self-hosting requires Enterprise. No free self-hostable option exists. If your compliance team needs data residency, you're negotiating Enterprise contracts that typically start at $2,000-5,000 per month.
Lock-in pressure. The integration shines when you stay inside the LangChain ecosystem. Move to Pydantic AI or Google ADK and a lot of the magic goes away.
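The seat and quota arithmetic above is worth running with your own numbers. Here is a small sketch using the figures from this section. Whether the quota is consumed per span or per request-level trace is an assumption you should verify for your plan, so the function takes it as a parameter.

```python
def days_until_quota_exhausted(monthly_quota, requests_per_day, units_per_request):
    """Days of traffic a monthly quota covers."""
    return monthly_quota / (requests_per_day * units_per_request)


def seat_cost(seats, price_per_seat):
    """Monthly seat charge before any usage fees."""
    return seats * price_per_seat


# 200 requests/day, 8 spans each, every span counted against the quota.
print(days_until_quota_exhausted(5_000, 200, 8))   # 3.125
# Same traffic if only one unit is counted per request.
print(days_until_quota_exhausted(5_000, 200, 1))   # 25.0
# Seven seats on Plus at $39 each.
print(seat_cost(7, 39))                            # 273
```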

I keep LangSmith on my recommendation list for one specific shape of team: small engineering org (three or fewer engineers needing access), heavy LangChain investment, no self-hosting requirement. Outside that profile the per-seat math gets brutal.

Langfuse — The Open-Source Default

Langfuse is the platform I ended up running across BizChat, SmartExam, and ContentForge. Three reasons: open-source MIT license, framework-agnostic SDK, and a free tier that doesn't make you feel like you're being nickeled-and-dimed before the first invoice.

The pricing reality

Langfuse Cloud's Hobby tier gives you 50,000 events per month at no cost, 30-day retention, and unlimited users. That's 10x LangSmith's free quota with roughly double the retention, though the two count different units (events versus traces). Above the free tier, plans run $29/month (Core), $199/month (Pro), and $2,499/month (Enterprise), all with unlimited users. Overage is $8 per 100,000 additional units across all paid tiers.

The math becomes obvious at team scale. A seven-person team logging one million events monthly pays around $98 on Langfuse Core versus $1,500-2,800 on LangSmith Plus depending on retention configuration. We're in this zone with our combined products and the difference is real money — money that goes back into model spend and image API costs instead.
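To sanity-check the Langfuse side of that comparison, here is a rough cost function built from the list prices above. The `included_units` default of 100,000 is my assumption (the plan's included volume isn't stated here), and so is prorated overage billing. With those assumptions the result lands in the same ballpark as the figure quoted above.

```python
def langfuse_monthly_cost(events, base_fee=29.0, included_units=100_000,
                          overage_per_100k=8.0):
    """Estimated monthly bill in USD; seat count never enters the formula."""
    extra_units = max(0, events - included_units)  # assumed prorated overage
    return base_fee + extra_units / 100_000 * overage_per_100k


print(langfuse_monthly_cost(1_000_000))  # 101.0
print(langfuse_monthly_cost(50_000))     # 29.0
```

The useful property is what is missing from the signature: team size. Adding a fourth or a tenth viewer does not change the bill.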

The self-hosting angle

This is what closed the deal for me on two of our enterprise client deployments. The Langfuse core product is MIT-licensed with no usage limits, no license keys, no telemetry phone-home. We run it inside the client's VPC on a single VPS with PostgreSQL and ClickHouse. Total monthly infra cost: roughly $48 on a Hostinger VPS for one client, $73 on DigitalOcean for the other. No per-seat fees ever.

The acquisition by ClickHouse in January 2026 worried me at first, but roughly four months in, the open-source repo is still actively maintained, breaking changes have been minimal, and the only visible change is that ClickHouse is now the default analytical store (it already was, in practice). Pricing has held steady.

What stung

Higher overhead than LangSmith. One independent benchmark on a multi-step travel-planning workflow measured 15% latency overhead with Langfuse versus near-zero with LangSmith. On our SmartExam exam-generation flow, I measured a 4-7% bump on cold starts and ~2% steady-state. That's acceptable for backend agents, but watch it carefully on sub-second user-facing flows.
UI polish lags LangSmith. The trace explorer is functional but you'll miss LangSmith's graph visualizations if you've used both.
Integration setup takes longer for non-LangChain stacks. Plan a half-day for proper instrumentation if you're not on LangChain or LiteLLM. The OpenTelemetry path works but I had to write three custom span attributes to capture our domain context properly.
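Overhead numbers like the ones above are only meaningful for your own workload, so it pays to measure them yourself. Below is a minimal harness sketch using only the standard library: time the same function with and without instrumentation and compare p95 latency. The function names are hypothetical and the sample data is synthetic.

```python
import statistics
import time


def time_calls(fn, runs=50):
    """Wall-clock duration in seconds of each of `runs` calls to fn."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def p95(samples):
    """95th percentile: the last of the 19 cut points for n=20."""
    return statistics.quantiles(samples, n=20)[-1]


def overhead_pct(baseline, instrumented):
    """Relative p95 latency increase, in percent."""
    base = p95(baseline)
    return (p95(instrumented) - base) / base * 100


# Synthetic samples: a flat 1.00 s baseline vs 1.02 s instrumented.
baseline = [1.00] * 100
instrumented = [1.02] * 100
print(round(overhead_pct(baseline, instrumented), 2))  # 2.0
```

Feed `overhead_pct` the output of two `time_calls` runs against a staging endpoint, and take cold-start and steady-state samples separately since they behave differently.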

Helicone — Why I'd Hesitate Today

Helicone's pitch was always the simplest in the field: change one line of code, point your OpenAI/Anthropic SDK at the Helicone proxy URL, and you have observability. Sub-5ms overhead in their Rust gateway, P50 latency cost of about 2ms, built-in caching that reportedly cuts costs 20-30% with 30-50% production hit rates. I used it on a side project in late 2025 and it delivered exactly that.

Then on March 3, 2026, Mintlify announced the acquisition. The official statement places Helicone in "maintenance mode" — security updates and bug fixes only, no new features. For a tool you're betting your production observability stack on, that changes the math significantly.

Pricing snapshot (still current)

Free tier: 10,000 requests/month, no credit card. 7-day log retention.
Pro: $20/seat/month or $79/month flat (depending on plan flavor); unlimited seats on the higher Pro tier; logs purged after 30 days unless on Enterprise.
Caching: Built into Pro. The cost savings are real on workflows with repeated prompts (think classification, scoring, content moderation).

What stung — pre-acquisition concerns that now matter more

Proxy approach means an extra hop. Every LLM call routes through Helicone's edge before reaching OpenAI/Anthropic. Sub-5ms is fine until it isn't: an outage on Helicone's side cascades to your agent.
Retention cliffs. 7-day free, 30-day Pro. We had a customer regression report 41 days after the fact and the trace was already gone.
Maintenance mode means no new framework support. If LangGraph 2.0, Pydantic AI 1.0, or Google ADK ship breaking changes, don't expect Helicone to add first-class support.
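One way to blunt the extra-hop risk is to fall back to the provider directly when the proxy is unreachable. The sketch below shows the idea with a generic `send` callable. The URLs are placeholders, and a real HTTP client will raise its own exception types that you would need to catch instead of the built-ins used here.

```python
PROXY_URL = "https://proxy.example.invalid/v1"      # placeholder
DIRECT_URL = "https://provider.example.invalid/v1"  # placeholder


def call_with_fallback(send, proxy_url, direct_url, payload):
    """Try the proxy first; on a connection-level failure go direct.

    Returns (response, route) so callers can log which path was used.
    """
    try:
        return send(proxy_url, payload), "proxy"
    except (ConnectionError, TimeoutError):
        return send(direct_url, payload), "direct"


def fake_send(base_url, payload):
    """Stand-in transport that simulates a proxy outage."""
    if "proxy" in base_url:
        raise ConnectionError("proxy unreachable")
    return {"echo": payload}


response, route = call_with_fallback(
    fake_send, PROXY_URL, DIRECT_URL, {"prompt": "hi"}
)
print(route)
```

The trade-off is that calls taking the direct route are neither traced nor cached, so log the route and alert on it.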

For a brand-new project today, I would not pick Helicone as the primary observability layer. As a caching proxy in front of cost-sensitive workflows, it can still earn its $20/seat — but pair it with Langfuse or LangSmith for the actual tracing.

Side-by-Side Comparison Table

Dimension	LangSmith	Langfuse	Helicone
Free tier	5K traces, 14-day retention, 1 seat	50K events, 30-day retention, unlimited seats	10K requests, 7-day retention, unlimited seats
Paid entry	$39/seat/month (Plus)	$29/month (Core, unlimited seats)	$20/seat or $79/month flat
Self-host (free)	Not available	Yes — MIT license	Yes — open-source (maintenance mode)
Latency overhead	~0 ms	2-15% depending on workload	~2 ms (Rust proxy)
Best framework fit	LangChain, LangGraph	Any (LangChain, LlamaIndex, OpenAI, Anthropic, custom)	Any (proxy-based)
Setup time	~30 min if on LangChain	30 min to half a day depending on stack	~15 min (one URL change)
2026 acquisition status	Independent (LangChain)	Acquired by ClickHouse (Jan 2026), still active	Acquired by Mintlify (Mar 2026), maintenance only
Replay against new models	Native, polished	Available, less polished	Limited
Built-in caching	No	No	Yes

Production Decision Matrix — How I'd Pick Today

The honest answer depends on three questions, and the order matters.

Question 1: Is data residency or self-hosting a hard requirement?

If yes, you have one option: Langfuse, self-hosted. LangSmith only offers self-hosting on Enterprise contracts. Helicone's self-hosted version exists but is now in maintenance mode — fine for a year, risky for a 3-year stack decision.

Question 2: Is your stack 80%+ LangChain or LangGraph?

If yes and you have three or fewer engineers needing seat access, LangSmith Plus is justifiable: the framework-native debugging genuinely saves hours per regression. If your team is larger, the per-seat cost outpaces the integration benefit; switch to Langfuse Cloud and accept the slightly less polished trace UI.

Question 3: Are you cost-sensitive on repeat-prompt workflows?

If yes (think bulk classification, content moderation, structured-output agents), pair your tracing tool with Helicone as a caching proxy only. The 30-50% cache hit rate on duplicate-prompt workloads is real. Don't make it your primary tracer given the maintenance-mode situation, but it earns its keep as a cost-reduction layer.
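The three questions can be written down as a small function, which makes the ordering explicit. This is only a sketch of the reasoning in this section. The parameter names are mine, and the seat threshold follows the "three or fewer" profile from the LangSmith section.

```python
def pick_platform(needs_self_hosting, langchain_share, seats_needed,
                  repeat_prompt_heavy):
    """Return (tracer, optional caching layer), asking the questions in order."""
    if needs_self_hosting:                       # Question 1 wins outright
        tracer = "Langfuse (self-hosted)"
    elif langchain_share >= 0.8 and seats_needed <= 3:   # Question 2
        tracer = "LangSmith Plus"
    else:
        tracer = "Langfuse Cloud"
    # Question 3 only ever adds a cache; it never changes the tracer.
    cache = "Helicone (caching proxy only)" if repeat_prompt_heavy else None
    return tracer, cache


print(pick_platform(False, 0.9, 2, False))
print(pick_platform(True, 0.9, 2, True))
print(pick_platform(False, 0.9, 7, True))
```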

What I actually run

Across the WarungDigi production stack today: Langfuse Cloud Core ($29/month) for our shared observability across BizChat, SmartExam, ContentForge, DocSumm, and ServiceBot. Self-hosted Langfuse on the two enterprise client VPCs where compliance demanded it. No Helicone, no LangSmith. The decision came down to per-seat economics, the open-source licensing path, and the framework flexibility — we have agents on LangChain, Pydantic AI, and bare OpenAI SDK calls all flowing into one tracing UI.

Common Mistakes I See Teams Make

Three patterns repeatedly cost teams real money or weeks of debugging time:

Picking the framework-native tool because the docs are easier, then outgrowing the seat pricing. The free tiers seduce you. The bills land at month four when the team scales.
Skipping observability until production traffic doubles. By the time you need traces, you have no historical baseline to compare against. Instrument from day one even if you're only logging to a free tier.
Not setting cost alerts. Every platform supports them. Almost no one configures them on day one. The $312 bill I mentioned at the top of this piece is exactly the kind of incident a $0.01 alert prevents.
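If your platform's built-in alerts are not configured yet, even a crude rolling-window check catches a runaway loop early. Here is a minimal sketch. The class name, the 15-minute window, and the $5 threshold are all illustrative choices, not any platform's feature.

```python
from collections import deque


class SpendAlert:
    """Flags when spend inside a sliding time window exceeds a threshold."""

    def __init__(self, window_seconds, threshold_usd):
        self.window_seconds = window_seconds
        self.threshold_usd = threshold_usd
        self._events = deque()   # (timestamp, cost) pairs, oldest first
        self._total = 0.0

    def record(self, timestamp, cost_usd):
        """Add one call's cost; return True if the window is over budget."""
        self._events.append((timestamp, cost_usd))
        self._total += cost_usd
        cutoff = timestamp - self.window_seconds
        # Drop calls that have aged out of the window.
        while self._events and self._events[0][0] <= cutoff:
            _, old_cost = self._events.popleft()
            self._total -= old_cost
        return self._total > self.threshold_usd


alert = SpendAlert(window_seconds=900, threshold_usd=5.0)
# A loop burning $1 per minute trips the alert on the sixth call.
fired = [alert.record(minute * 60, 1.0) for minute in range(6)]
print(fired)  # [False, False, False, False, False, True]
```

For scale: $312 over six hours is about $13 per 15 minutes, while a normal $40 day averages well under $0.50 per window, so a threshold between the two leaves plenty of headroom.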

Frequently Asked Questions

Can I migrate between these platforms later?

Yes, but the cost varies. Langfuse uses OpenTelemetry-compatible spans, which makes migration to or from any OTel-compliant tool relatively painless. LangSmith's trace format is more proprietary; expect to re-instrument. Helicone's proxy approach means migration is just a URL change — that's its biggest virtue.

Do these tools work with non-LLM agent workflows (vector DB queries, web scraping, etc.)?

Langfuse and LangSmith both support arbitrary span instrumentation — you can trace a Pinecone query, a Playwright scrape, or a webhook handler in the same trace as your LLM calls. Helicone is purely an LLM proxy and won't capture non-LLM steps.

What about Arize Phoenix, Braintrust, AgentOps?

All three are credible. Arize Phoenix is open-source and strong on evals. Braintrust is excellent if your team prioritizes prompt experimentation. AgentOps focuses on agent-specific metrics. I narrowed to the LangSmith/Langfuse/Helicone trio because they have the largest community, the most production case studies, and the clearest pricing — but if your team's needs lean evals-first or experimentation-first, evaluate the alternatives.

Is there a free path that doesn't end in a sales call?

Self-hosted Langfuse. MIT license, no telemetry, no quota. Pay for the VPS only. We run a small instance for a 4-engineer client at $19/month all-in.

What happens if Mintlify shuts down Helicone entirely?

The open-source code remains under its current license, so self-hosters keep working. The hosted service has no announced sunset date as of May 2026. Plan migration anyway — banking your production tracing on a maintenance-mode product is a known risk you take with eyes open.

Where I'd Start in May 2026

If I were rebuilding our stack from scratch today, the path would be: Langfuse Cloud free tier from week one (50K events covers a real prototype), upgrade to Core at $29/month when you cross the threshold, and evaluate self-hosting only when compliance or cost forces it. Add Helicone as a caching proxy only after you've measured your repeat-prompt rate and confirmed it's above 25%. Skip LangSmith unless you're a LangChain shop with three or fewer engineers needing access.
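Measuring the repeat-prompt rate does not need any tooling beyond your existing prompt logs. Below is a rough sketch that hashes each prompt and counts how many requests repeat one seen earlier. The whitespace normalization is my assumption, and a real exact-match cache may be stricter about what counts as identical.

```python
import hashlib


def normalize(prompt):
    """Collapse whitespace so trivial formatting differences still match."""
    return " ".join(prompt.split())


def repeat_prompt_rate(prompts):
    """Fraction of requests whose normalized prompt was already seen."""
    seen = set()
    repeats = 0
    for prompt in prompts:
        digest = hashlib.sha256(normalize(prompt).encode("utf-8")).hexdigest()
        if digest in seen:
            repeats += 1
        else:
            seen.add(digest)
    return repeats / len(prompts) if prompts else 0.0


log = [
    "Classify: refund request",
    "Classify:  refund request",   # same prompt, extra space
    "Classify: shipping delay",
    "Classify: refund request",
]
rate = repeat_prompt_rate(log)
print(rate)  # 0.5
```

Treat the result as an upper bound on the cache hit rate, since cache expiry and differing model parameters will lower what you actually see.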

The bigger lesson from this round of evaluation: acquisitions matter. Two of the three top observability platforms changed hands inside one quarter. The boring infrastructure choice — open source, MIT-licensed, self-hostable, framework-agnostic — turned out to be the most resilient one. That's not always the answer in this space, but in May 2026, for AI agent observability, it is.
