The founders of Traversal.ai explain in detail how they combined AI, causal machine learning (Causal ML), and reinforcement learning (RL) for root cause analysis of problems arising in complex enterprise environments. They describe the limitations of traditional observability tools and demonstrate how their AI goes about actually finding "the cause of a problem" amid vast amounts of data. Traversal focuses not on simple automation but on uncovering actual "cause and effect," making engineering smarter and more proactive.
1. Founder Introductions and the Birth of Traversal
The video opens with the hosts of the 'Latent Space' podcast and the Traversal founders. Alessio (founder of Kernel Labs) and swyx (founder of Smol AI) greet the audience, while Traversal co-founders Anish and Raaz introduce themselves.
Anish holds a PhD in computer science from MIT and says he "was deeply immersed in AI and machine learning and wanted to continue researching." He focused particularly on the field of causal machine learning, emphasizing that "it's important to know that correlation is not causation. We need AI systems that can capture causal relationships from data."
"How can AI identify 'cause' and 'effect' from data?"
With a deep interest in reinforcement learning as well, he explains that "reinforcement learning is a method for effectively exploring vast search spaces."
Raaz grew up in India and completed his PhD at Berkeley, where he became interested in causal analysis while working at a wireless systems monitoring startup. He recalls meeting Anish at MIT and together contemplating "how to turn theory into a company."
2. The Traversal Idea -- Timing, Market, and a Unique Approach
What drew Traversal to this space was the sense that "this was almost the first clear problem that naturally emerged with the arrival of AI." Although there were already many AI-powered SRE (Site Reliability Engineering) startups, the Traversal team took a "first-principles approach" and built from the ground up.
"In astronomically distributed systems, truly exploring the vast data across logs, metrics, traces, code, configuration files, and even Slack messages -- this is closer to one giant 'search problem' than what people typically think of as an AI application."
Traversal focused not on simple automation of existing market problems but on combining:
- Causal machine learning (technology for discovering true causes)
- Reinforcement learning (technology for intelligently narrowing the search space)
- Adaptability of LLMs and AI agents
The main pain point in the market was that existing simple log/monitoring tools made it "difficult to find the true root cause among countless 'fake needles.'"
"This isn't 'finding a needle in a haystack' -- it's 'finding the real needle among fake needles in a haystack.'"
3. How Traversal Actually Works -- Demo and Walkthrough
Traversal is used in real scenarios at large enterprises (e.g., DigitalOcean) dealing with enormous data: thousands of microservices, billions of log entries, millions of time-series data points, and thousands of code repositories.
Input and Flow
- When a user or system provides the "approximate time the problem started" and a "situation description," Traversal begins querying related logs, metrics, and service lists.
- It can auto-detect alert channels from Slack or allow manual investigation from a dashboard.
Context Building
- Rather than requiring users to feed in all the information manually, "Traversal generates the important context on its own."
- "If we ask the user to input too much information, Traversal loses its value, doesn't it? We need to accumulate context on our own."
- Since it's impossible to feed millions of tokens of data directly into an LLM, Traversal goes through multiple stages of "searching, summarizing, and connecting."
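The "searching, summarizing, and connecting" stages can be pictured with a minimal Python sketch. This is not Traversal's implementation: the function names, the whitespace-based token count, and the keyword filter standing in for an LLM summarizer are all assumptions made for illustration.

```python
from typing import Callable, List


def chunk_lines(lines: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily group lines so each chunk stays near max_tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for line in lines:
        cost = len(line.split())
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append(current)
    return chunks


def keyword_summarize(chunk: List[str]) -> str:
    """Hypothetical stand-in for an LLM call: keep lines that look like errors."""
    hits = [l for l in chunk if "ERROR" in l or "timeout" in l.lower()]
    return " | ".join(hits)


def reduce_context(
    lines: List[str],
    max_tokens: int,
    summarize: Callable[[List[str]], str] = keyword_summarize,
    max_rounds: int = 5,
) -> List[str]:
    """Summarize chunk by chunk until the remainder fits in a single chunk.

    Stops after max_rounds even if the data is still over budget, so a
    summarizer that fails to shrink its input cannot loop forever.
    """
    for _ in range(max_rounds):
        chunks = chunk_lines(lines, max_tokens)
        if len(chunks) <= 1:
            break
        lines = [s for s in (summarize(c) for c in chunks) if s]
    return lines


logs = (
    ["INFO ok request served"] * 50
    + ["ERROR db timeout on orders-db"]
    + ["INFO ok request served"] * 50
)
print(reduce_context(logs, max_tokens=40))
# prints ['ERROR db timeout on orders-db']
```

In a real system the summarizer would be a model call and each round would also link related findings across sources; the sketch only shows why staged reduction keeps the final context small.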
Step-by-Step Search and Reasoning
- From data collected across multiple sources like logs and metrics, Traversal progressively narrows the data scope by combining LLMs and statistical algorithms.
- It dynamically applies statistical tests (such as time-series change detection) while combining them with LLM understanding of semantics (code relationships, metric meanings, etc.) to narrow evidence down to "true root cause candidates."
- "The agent we built weaves together statistics (tests) and semantic understanding in real time, retaining only the most relevant context."
Results and Connection to Real Work
- Traversal explores systems through a read-only approach, but once sufficient confidence is built, it recommends "automated actions" (such as rolling back a specific commit or executing scripts).
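The idea of staying read-only and only recommending an action once confidence is high enough can be sketched as a simple gate. This is a hypothetical illustration, not Traversal's logic: the `Finding` type, the 0.0 to 1.0 confidence score, and the 0.8 threshold are all assumed values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    root_cause: str
    confidence: float                       # hypothetical score in [0.0, 1.0]
    suggested_action: Optional[str] = None  # e.g. "roll back commit abc123"


def recommend(finding: Finding, threshold: float = 0.8) -> str:
    """Return a recommendation for a human to review; nothing is executed here."""
    if finding.suggested_action and finding.confidence >= threshold:
        return f"RECOMMEND: {finding.suggested_action} (cause: {finding.root_cause})"
    return f"INVESTIGATE FURTHER: {finding.root_cause}"


print(recommend(Finding("bad config push", 0.92, "roll back commit abc123")))
# prints RECOMMEND: roll back commit abc123 (cause: bad config push)
print(recommend(Finding("possible disk pressure", 0.41, "restart node")))
# prints INVESTIGATE FURTHER: possible disk pressure
```

Keeping the function free of side effects mirrors the read-only stance described above: the output is a suggestion, and any execution would be a separate, explicitly approved step.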
4. Traversal's Agentic Architecture and AI Model Selection
The "agentic architecture" that Traversal emphasizes is fundamentally different from conventional "predefined procedure automation." Existing runbook products work well in situations where the algorithm is already defined (playbook exists), but Traversal targets the domain of "real-time exploration of complex, compound failure scenarios that no one has experienced before."
"This domain is a fully agentic problem. The AI must proactively solve problems in a state with no prior knowledge and no predefined paths."
AI Model Selection and Experimentation
- They are experimenting with various models including GPT-3, GPT-4, Gemini, and Claude, and they find that models that excel at "reasoning and tool-calling capabilities" have an advantage in enterprise environments.
- "This isn't a simple MCP-ReAct agent -- it requires reasoning ability capable of 'complex exploration.'"
- While OpenAI's models (o3, o4) were strong on infrastructure and tool use, they note that the Claude family performs better at tool calling and at "unstucking," that is, recovering from wrong paths.
5. Business Model and Market Strategy
Traversal approaches the market with an outcome-based model that differentiates it from existing observability tool companies (Datadog, Splunk, etc.).
"We're not trying to store and retain data for you -- we're a company that sells the outcome of problem 'resolution' itself. We don't care about where the data comes from."
- Rather than simple storage billing (based on data volume), they price based on "infrastructure complexity, number of investigations," and aim to move toward fully outcome-based pricing (automated recovery, etc.) over the long term.
Staged Evolution of Self-Healing
- Currently, problems that Traversal's AI can resolve on its own account for about 10-20% of the total (simple incidents, L1/L2 engineer level).
- They project that "within 6 months to 1 year, AI will be able to 'suggest and execute fixes' for more complex problems."
- For more futuristic scenarios like full codebase re-architecture, they realistically note that "at least 2-3 more years are needed."
6. Experimentation, Evaluation, and the Challenges of Real-World Adoption
In the process of delivering Traversal to enterprise customers, they candidly discuss the difficulty of "building trust in the face of real production variables and complexity, even when performance looks great in staging (simulated environments)."
"Everyone believes their environment is the most complex in the world. So no matter how well performance looks in staging, the real contract only happens when you prove it in production."
Internally, they invest significant resources in building sophisticated evaluation pipelines and benchmarks. They emphasize the importance of system evaluation and testing to such a degree that they consider it part of their core IP (intellectual property).
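The source does not describe how these evaluation pipelines are built. As one plausible, assumed shape, a benchmark could replay labeled past incidents and measure how often the true root cause appears among the top-k candidates. The `top_k_accuracy` function and the toy ranker below are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def top_k_accuracy(
    cases: Sequence[Tuple[str, str]],
    ranker: Callable[[str], List[str]],
    k: int = 3,
) -> float:
    """Fraction of incidents whose labeled root cause is in the ranker's top k."""
    if not cases:
        raise ValueError("need at least one labeled incident")
    hits = sum(1 for incident, truth in cases if truth in ranker(incident)[:k])
    return hits / len(cases)


# Toy data: incident id -> ranked root cause candidates (made up for the example).
rankings = {
    "inc-1": ["bad deploy", "db failover", "dns"],
    "inc-2": ["dns", "db failover", "bad deploy"],
    "inc-3": ["cert expiry", "dns", "bad deploy"],
    "inc-4": ["bad deploy", "cert expiry", "db failover"],
}
labeled = [
    ("inc-1", "bad deploy"),
    ("inc-2", "db failover"),
    ("inc-3", "cert expiry"),
    ("inc-4", "dns"),
]


def toy_ranker(incident: str) -> List[str]:
    return rankings[incident]


print(top_k_accuracy(labeled, toy_ranker, k=1))  # prints 0.5
print(top_k_accuracy(labeled, toy_ranker, k=3))  # prints 0.75
```

Tracking a metric like this across model or prompt changes is one way a team could check for regressions before a change reaches a customer's production environment.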
7. Final Thoughts -- Organizational Culture, Hiring, and Future Vision
Traversal.ai is currently hiring actively in New York across areas including AI infrastructure, data evaluation, and system design.
In closing, Traversal wraps up with the aspiration to "build a culture of showing more. We still talk (tell) a lot, but we want people to experience that the real product works well."
"Show more, tell less. We believe that's the way to earn trust in this market."
Conclusion
Traversal sits at the point where AI turns "vast data into insightful root cause analysis" that drives real change. Beyond simple automation, this is the frontline of a transformation in which "agentic" AI makes enterprise systems smarter, faster, and more operationally effective. The Traversal case shows how strong theory connects with real-world experience and points to where AI in the enterprise is heading.
"The essence of AI evolution lies not in merely 'observing,' but in truly 'understanding' and 'acting.'"
