OpenTelemetry (OTel) for LLM Observability
Explore the challenges of LLM observability and the current state of using OpenTelemetry (OTel) for standardized instrumentation.
Introduction to OpenTelemetry
OpenTelemetry is an open-source observability framework designed to handle the instrumentation of applications for collecting traces, metrics, and logs. It helps developers monitor and troubleshoot complex systems by providing standardized tools and practices for data collection and analysis.
OpenTelemetry supports various exporters and backends, making it flexible and adaptable to different environments. By using OpenTelemetry, applications can achieve better visibility into their operations, aiding in root cause analysis and performance optimization.
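As a minimal sketch of what this looks like in code, the following Python snippet (assuming the `opentelemetry-api` and `opentelemetry-sdk` packages are installed) sets up a tracer, wraps a unit of work in a span, and attaches attributes; the span and attribute names are placeholders:

```python
# Minimal OpenTelemetry tracing setup; finished spans are printed to stdout.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-app")

# Wrap a unit of work in a span and attach attributes for later analysis.
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("app.user_id", "user-123")  # placeholder attribute
    # ... application logic ...
```

In production, the console exporter would be replaced by an exporter that ships spans to an observability backend.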
Goal of this post
This post is a high-level overview of the challenges of LLM observability and the current state of using OpenTelemetry (OTel) for LLMOps.
OTel is geared towards general observability, and traces are a great standardized way to capture LLM application data (we have recorded a webinar on this). While we are excited about OTel and the roadmap towards adopting it across LLMOps tools, many teams still prefer non-OTel LLMOps tools. This post explores why this is the case and how OTel can address these challenges in the future.
Example trace of our public demo
Outline
- Overview of LLM Application Observability
  - Unique Challenges
  - Comparison with Traditional Observability
  - Experimentation vs. Production Monitoring
- OpenTelemetry (OTel) for LLM Observability
  - Current State
  - My Personal View
1. Overview of LLM Application Observability
LLM Application Observability refers to the ability to monitor and understand how Large Language Model applications function, especially focusing on aspects like performance, reliability, and user interactions. This involves collecting and analyzing data such as traces, metrics, and logs to troubleshoot issues and optimize the application.
Unique Challenges
LLM applications present distinct challenges compared to traditional software systems. Evaluating the quality of LLM outputs is inherently complex due to their non-deterministic nature. Metrics like cost, latency, and quality must be balanced and cannot be purely derived from traces as they are in traditional applications.
Additionally, the interactive and context-sensitive nature of LLM tasks often requires real-time monitoring and rapid adaptation. Addressing these challenges demands robust tools and frameworks that can handle the dynamic and evolving nature of LLM applications.
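For illustration, an application might wrap each model call in a span and record cost and latency signals at runtime, while quality is attached later. This is a sketch assuming the tracer setup shown above; `call_llm` and all attribute names are placeholders:

```python
import time
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

def call_llm(prompt: str) -> dict:
    # Placeholder for a real model call; returns text plus token usage.
    return {"text": "...", "input_tokens": 42, "output_tokens": 128}

# Record latency and token usage on the span at runtime.
with tracer.start_as_current_span("llm-generation") as span:
    start = time.time()
    result = call_llm("Summarize this document ...")
    span.set_attribute("llm.latency_ms", int((time.time() - start) * 1000))
    span.set_attribute("llm.input_tokens", result["input_tokens"])
    span.set_attribute("llm.output_tokens", result["output_tokens"])
    # Quality cannot be derived here; it is evaluated ex-post (evaluations,
    # user feedback, annotations) and linked back to this trace.
```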
Comparison with Traditional Observability
Traditional observability focuses on identifying exceptions and compliance with expected behaviors. LLM observability, however, requires monitoring dynamic and stochastic outputs, making it harder to standardize and interpret.
| | Observability | LLM Observability |
| --- | --- | --- |
| Async instrumentation (not in critical path) | ✅ | ✅ |
| Spans / traces (as core abstractions) | ✅ | ✅ |
| Metrics | ✅ | ✅ |
| Exceptions | At runtime | Ex-post (evaluations, annotations, user feedback, …) |
| Main use cases | Alerts, metrics, aggregated performance breakdowns | Debug single traces, build datasets for application benchmarking/testing, monitor hallucinations/evals |
| Users | Ops | MLE, SWE, data scientists, non-technical |
| Focus | Holistic system | What’s critical for the LLM application |
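The “ex-post” row is the key difference in practice: instead of raising exceptions at runtime, scores are attached to traces after the fact. Here is a conceptual sketch, where the evaluation record is a plain dict standing in for whatever the observability backend stores:

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("llm-generation") as span:
    output = "..."  # placeholder model output
    # Keep the trace ID so an evaluation can reference this execution later.
    trace_id = format(span.get_span_context().trace_id, "032x")

# Later, e.g. in a batch evaluation job or once user feedback arrives:
evaluation = {
    "trace_id": trace_id,
    "name": "hallucination",
    "score": 0.0,  # produced by an LLM-as-a-judge or a human annotator
}
```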
Experimentation vs. Production Monitoring
In development, experimentation with various models and configurations is crucial. Developers iterate on different approaches to fine-tune model behavior, optimize performance metrics, and explore new functionalities.
Production monitoring, however, shifts the focus to real-time performance tracking. It involves constant vigilance to ensure the application runs smoothly, identifying any latency issues, tracking costs, and integrating user interactions and feedback to continuously improve the application. Both phases are essential, but they have distinct objectives and methodologies geared towards pushing the boundaries of what the LLM can achieve and ensuring it operates reliably in real-world scenarios.
| Development | Production |
| --- | --- |
| Debug step-by-step, especially when using frameworks | Monitor: cost / latency / quality |
| Run experiments on datasets and evaluations | Debug issues identified in prod based on user feedback, evaluations, and manual annotations |
| Document and share experiments | Cluster user intents |
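To make the development side concrete, a hypothetical experiment over a small dataset might look like the following sketch, where `run_app` and `evaluate` are placeholders for the application entry point and an evaluation function:

```python
# Hypothetical development-time experiment: run the application over a small
# dataset and score each output. `run_app` and `evaluate` are placeholders.
dataset = [
    {"input": "What is OpenTelemetry?", "expected": "An observability framework"},
    {"input": "What does OTLP stand for?", "expected": "OpenTelemetry Protocol"},
]

def run_app(question: str) -> str:
    return "..."  # placeholder: call the LLM application

def evaluate(output: str, expected: str) -> float:
    # Placeholder scoring, e.g. exact match, similarity, or an LLM-as-a-judge.
    return float(expected.lower() in output.lower())

results = [evaluate(run_app(item["input"]), item["expected"]) for item in dataset]
print(f"Experiment score: {sum(results) / len(results):.2f}")
```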
2. OpenTelemetry (OTel) for LLM Observability
Current State
The OpenTelemetry Special Interest Group (SIG) on “Generative AI Observability” is pushing for standardized semantic conventions for LLM/GenAI applications and for instrumentation libraries covering the most popular model vendors and frameworks. Learn more about the SIG in its project doc and meeting notes.
Deliverables of the working group (as of Oct 14, 2024) include:
Immediate term:
- Ship OTel instrumentation libraries for OpenAI (or any other GenAI client) in Python and JS following existing conventions
Middle term:
- Ship OpenTelemetry (or native) instrumentations for popular GenAI client libraries in Python and JS covering chat calls
- Evolve GenAI semantic conventions to cover other popular GenAI operations such as embeddings, image or audio generation
As a result, we should have feature parity with the instrumentations of existing GenAI Observability vendors for a set of client instrumentation libraries that all vendors can depend upon.
Long term:
- Implement instrumentations for GenAI orchestrators and GenAI frameworks for popular libraries in different languages
- Evolve GenAI and other relevant conventions (DB) to cover complex multi-step scenarios such as RAG
- Propose mature instrumentations to upstream libraries/frameworks
Currently, there’s a mix of progress and ongoing challenges. Significant issues include dealing with large traces, diverse LLM schema implementations (often biased towards OpenAI), and capturing evaluations and annotations. Many OTel-based LLM instrumentation libraries don’t strictly adhere to evolving conventions, resulting in vendor-specific solutions.
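To make this concrete, manually instrumenting a chat call along the lines of the draft GenAI semantic conventions could look roughly like the sketch below; the `gen_ai.*` attribute names reflect the evolving conventions and may still change, and the model names and token counts are placeholders:

```python
from opentelemetry import trace

tracer = trace.get_tracer("genai-instrumentation")

# Span name and gen_ai.* attributes loosely follow the draft conventions.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... perform the model call here ...
    span.set_attribute("gen_ai.usage.input_tokens", 120)   # placeholder value
    span.set_attribute("gen_ai.usage.output_tokens", 350)  # placeholder value
```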
My Personal View
Despite the challenges, I’m excited about OTel instrumentation in the mid-term. The real value lies in its standardized data model, which enables seamless workflow integration across frameworks and platforms and increases interoperability across vendors; this interoperability is the main reason OTel is interesting. Currently, we maintain countless integrations with popular models, frameworks, and languages, but cannot support the long tail due to capacity constraints. Standardizing on OTel will allow the ecosystem to crowdsource instrumentation efforts, benefiting everyone and letting LLMOps vendors focus on core features rather than maintaining numerous integrations. These developments are essential for consistent and reliable observability across diverse LLM frameworks and platforms.
We are committed to OTel and are happy to contribute to the SIG. We will continue to maintain our integrations and SDKs and are currently exploring adding an OTel collector to allow for integrations with OTel-based instrumentation libraries.
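For applications that already emit OTel data, exporting spans to a backend that accepts OTLP could look like the following sketch (using the `opentelemetry-exporter-otlp-proto-http` package; the endpoint URL and auth header are placeholders, consult the backend’s documentation for the actual values):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Send spans via OTLP/HTTP to any collector or backend that accepts OTLP.
exporter = OTLPSpanExporter(
    endpoint="https://collector.example.com/v1/traces",  # placeholder endpoint
    headers={"Authorization": "Bearer <token>"},         # placeholder auth
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```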
If you are interested in contributing to our OTel efforts, join the GitHub Discussion thread.
Get Started
If you want to get started with tracing your AI applications with Langfuse today, check out our quickstart guide on how to use Langfuse with multiple LLM building frameworks like Langchain or LlamaIndex.
If you are curious about why traces are a good fit for LLM observability, check out our webinar on the topic.