OpenTelemetry (OTel) for LLM Observability
Explore the challenges of LLM observability and the current state of using OpenTelemetry (OTel) for standardized instrumentation.
Introduction to OpenTelemetry
OpenTelemetry is an open-source observability framework designed to handle the instrumentation of applications for collecting traces, metrics, and logs. It helps developers monitor and troubleshoot complex systems by providing standardized tools and practices for data collection and analysis.
OpenTelemetry supports various exporters and backends, making it flexible and adaptable to different environments. By using OpenTelemetry, applications can achieve better visibility into their operations, aiding in root cause analysis and performance optimization.
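As a minimal sketch of what this looks like in code, the following Python snippet (assuming the `opentelemetry-api` and `opentelemetry-sdk` packages are installed) sets up a tracer, wraps a unit of work in a span, and attaches attributes; the span and attribute names are placeholders:

```python
# Minimal OpenTelemetry tracing setup; finished spans are printed to stdout.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-app")

# Wrap a unit of work in a span and attach attributes for later analysis.
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("app.user_id", "user-123")  # placeholder attribute
    # ... application logic ...
```

In production, the console exporter would be replaced by an exporter that ships spans to an observability backend.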
Goal of this post
This post is a high-level overview of the challenges of LLM observability and the current state of using OpenTelemetry (OTel) for LLMOps.
OTel is geared towards general observability, and traces are a great standardized way to capture LLM application data (we have recorded a webinar on this). While we are excited about OTel and the roadmap towards adopting it across LLMOps tools, many teams still prefer non-OTel LLMOps tools. This post explores why this is the case and how OTel can address these challenges in the future.
Example trace of our public demo
Outline
- Overview of LLM Application Observability
  - Unique Challenges
  - Comparison with Traditional Observability
  - Experimentation vs. Production Monitoring
- OpenTelemetry (OTel) for LLM Observability
  - Current State
  - My Personal View
1. Overview of LLM Application Observability
LLM Application Observability refers to the ability to monitor and understand how Large Language Model applications function, especially focusing on aspects like performance, reliability, and user interactions. This involves collecting and analyzing data such as traces, metrics, and logs to troubleshoot issues and optimize the application.
Unique Challenges
LLM applications present distinct challenges compared to traditional software systems. Evaluating the quality of LLM outputs is inherently complex due to their non-deterministic nature. Metrics like cost, latency, and quality must be balanced and cannot be purely derived from traces as they are in traditional applications.
Additionally, the interactive and context-sensitive nature of LLM tasks often requires real-time monitoring and rapid adaptation. Addressing these challenges demands robust tools and frameworks that can handle the dynamic and evolving nature of LLM applications.
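For illustration, an application might wrap each model call in a span and record cost and latency signals at runtime, while quality is attached later. This is a sketch assuming the tracer setup shown above; `call_llm` and all attribute names are placeholders:

```python
import time
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

def call_llm(prompt: str) -> dict:
    # Placeholder for a real model call; returns text plus token usage.
    return {"text": "...", "input_tokens": 42, "output_tokens": 128}

# Record latency and token usage on the span at runtime.
with tracer.start_as_current_span("llm-generation") as span:
    start = time.time()
    result = call_llm("Summarize this document ...")
    span.set_attribute("llm.latency_ms", int((time.time() - start) * 1000))
    span.set_attribute("llm.input_tokens", result["input_tokens"])
    span.set_attribute("llm.output_tokens", result["output_tokens"])
    # Quality cannot be derived here; it is evaluated ex-post (evaluations,
    # user feedback, annotations) and linked back to this trace.
```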
Comparison with Traditional Observability
Traditional observability focuses on identifying exceptions and compliance with expected behaviors. LLM observability, however, requires monitoring dynamic and stochastic outputs, making it harder to standardize and interpret.
| | Observability | LLM Observability |
| --- | --- | --- |
| Async instrumentation (not in critical path) | ✅ | ✅ |
| Spans / traces (as core abstractions) | ✅ | ✅ |
| Metrics | ✅ | ✅ |
| Exceptions | At runtime | Ex-post (evaluations, annotations, user feedback, …) |
| Main use cases | Alerts, metrics, aggregated performance breakdowns | Debug single traces, build datasets for application benchmarking/testing, monitor hallucinations/evals |
| Users | Ops | MLE, SWE, data scientists, non-technical |
| Focus | Holistic system | What’s critical for the LLM application |
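The “ex-post” row is the key difference in practice: instead of raising exceptions at runtime, scores are attached to traces after the fact. Here is a conceptual sketch, where the evaluation record is a plain dict standing in for whatever the observability backend stores:

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("llm-generation") as span:
    output = "..."  # placeholder model output
    # Keep the trace ID so an evaluation can reference this execution later.
    trace_id = format(span.get_span_context().trace_id, "032x")

# Later, e.g. in a batch evaluation job or once user feedback arrives:
evaluation = {
    "trace_id": trace_id,
    "name": "hallucination",
    "score": 0.0,  # produced by an LLM-as-a-judge or a human annotator
}
```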
Experimentation vs. Production Monitoring
In development, experimentation with various models and configurations is crucial. Developers iterate on different approaches to fine-tune model behavior, optimize performance metrics, and explore new functionalities.
Production monitoring, however, shifts the focus to real-time performance tracking. It involves constant vigilance to ensure the application runs smoothly, identifying any latency issues, tracking costs, and integrating user interactions and feedback to continuously improve the application. Both phases are essential, but they have distinct objectives and methodologies geared towards pushing the boundaries of what the LLM can achieve and ensuring it operates reliably in real-world scenarios.
| Development | Production |
| --- | --- |
| Debug step-by-step, especially when using frameworks | Monitor: cost / latency / quality |
| Run experiments on datasets and evaluations | Debug issues identified in prod based on user feedback, evaluations, and manual annotations |
| Document and share experiments | Cluster user intents |
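To make the development side concrete, a hypothetical experiment over a small dataset might look like the following sketch, where `run_app` and `evaluate` are placeholders for the application entry point and an evaluation function:

```python
# Hypothetical development-time experiment: run the application over a small
# dataset and score each output. `run_app` and `evaluate` are placeholders.
dataset = [
    {"input": "What is OpenTelemetry?", "expected": "An observability framework"},
    {"input": "What does OTLP stand for?", "expected": "OpenTelemetry Protocol"},
]

def run_app(question: str) -> str:
    return "..."  # placeholder: call the LLM application

def evaluate(output: str, expected: str) -> float:
    # Placeholder scoring, e.g. exact match, similarity, or an LLM-as-a-judge.
    return float(expected.lower() in output.lower())

results = [evaluate(run_app(item["input"]), item["expected"]) for item in dataset]
print(f"Experiment score: {sum(results) / len(results):.2f}")
```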
2. OpenTelemetry (OTel) for LLM Observability
Current State
The OpenTelemetry Special Interest Group (SIG) on “Generative AI Observability” is pushing for standardized semantic conventions for LLM/GenAI applications and for instrumentation libraries covering the most popular model vendors and frameworks. Learn more about the SIG in its project doc and meeting notes.
Deliverables of the working group (as of Oct 14, 2024) include:
Immediate term:
- Ship OTel instrumentation libraries for OpenAI (or any other GenAI client) in Python and JS following existing conventions
Middle term:
- Ship OpenTelemetry (or native) instrumentations for popular GenAI client libraries in Python and JS covering chat calls
- Evolve GenAI semantic conventions to cover other popular GenAI operations such as embeddings, image or audio generation
As a result, we should have feature parity with the instrumentations of existing GenAI Observability vendors for a set of client instrumentation libraries that all vendors can depend upon.
Long term:
- Implement instrumentations for GenAI orchestrators and GenAI frameworks for popular libraries in different languages
- Evolve GenAI and other relevant conventions (DB) to cover complex multi-step scenarios such as RAG
- Propose mature instrumentations to upstream libraries/frameworks
Currently, there’s a mix of progress and ongoing challenges. Significant issues include dealing with large traces, diverse LLM schema implementations (often biased towards OpenAI), and capturing evaluations and annotations. Many OTel-based LLM instrumentation libraries don’t strictly adhere to evolving conventions, resulting in vendor-specific solutions.
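To make this concrete, manually instrumenting a chat call along the lines of the draft GenAI semantic conventions could look roughly like the sketch below; the `gen_ai.*` attribute names reflect the evolving conventions and may still change, and the model names and token counts are placeholders:

```python
from opentelemetry import trace

tracer = trace.get_tracer("genai-instrumentation")

# Span name and gen_ai.* attributes loosely follow the draft conventions.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... perform the model call here ...
    span.set_attribute("gen_ai.usage.input_tokens", 120)   # placeholder value
    span.set_attribute("gen_ai.usage.output_tokens", 350)  # placeholder value
```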
My Personal View
Despite the challenges, I’m excited about OTel instrumentation in the mid-term. The real value lies in its standardized data model, which enables seamless workflow integration across frameworks and platforms and increases interoperability across vendors; this interoperability is the main reason OTel is interesting. Currently, we maintain countless integrations with popular models, frameworks, and languages, but cannot support the long tail due to capacity constraints. Standardizing on OTel will allow the ecosystem to crowdsource instrumentation efforts, benefiting everyone and letting LLMOps vendors focus on core features rather than maintaining numerous integrations. These developments are essential for consistent and reliable observability across diverse LLM frameworks and platforms.
We are committed to OTel and are happy to contribute to the SIG. We will continue to maintain our integrations and SDKs and are currently exploring adding an OTel collector to allow for integrations with OTel-based instrumentation libraries.
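For applications that already emit OTel data, exporting spans to a backend that accepts OTLP could look like the following sketch (using the `opentelemetry-exporter-otlp-proto-http` package; the endpoint URL and auth header are placeholders, consult the backend’s documentation for the actual values):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Send spans via OTLP/HTTP to any collector or backend that accepts OTLP.
exporter = OTLPSpanExporter(
    endpoint="https://collector.example.com/v1/traces",  # placeholder endpoint
    headers={"Authorization": "Bearer <token>"},         # placeholder auth
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```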
If you are interested in contributing to our OTel efforts, join the GitHub Discussion thread.
Get Started
If you want to get started with tracing your AI applications with Langfuse today, check out our quickstart guide on how to use Langfuse with multiple LLM building frameworks like Langchain or LlamaIndex.
If you are curious about why traces are a good fit for LLM observability, check out our webinar on the topic.