The founders of Traversal.ai give a detailed account of how they combine AI, causal machine learning (causal ML), and reinforcement learning (RL) to find the root causes of problems in complex enterprise environments. They discuss the limitations of traditional observability tools and introduce Traversal's approach to how AI actually locates "the cause of a problem" within vast data environments. Traversal focuses not on simple automation but on uncovering actual cause and effect, making engineering smarter and more proactive.


1. Founder Introductions and the Origins of Traversal

The episode opens with the hosts of the 'Latent Space' podcast and the Traversal founders. Alessio (founder of Kernel Labs) and swyx (founder of Smol AI) greet the audience, followed by self-introductions from Traversal co-founders Anish and Raaz.

Anish received his PhD in computer science from MIT and says he "became deeply immersed in AI and machine learning and wanted to keep researching." He focused especially on the field of "causal machine learning" and emphasizes that "it's important to know that correlation is not causation. We need AI systems that can capture causal relationships from data."

"How can AI determine 'cause' and 'effect' from data?"

With a deep interest in reinforcement learning as well, he explains that "reinforcement learning is a method for effectively exploring large search spaces."
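The correlation-versus-causation point can be made concrete with a toy simulation (my illustration, not from the talk): a hidden common cause, here a deployment event, drives both error-log volume and latency, so the two series are strongly correlated even though neither causes the other.

```python
import random

random.seed(0)

# Hypothetical confounder: a deployment event drives BOTH error volume and
# latency. The two metrics correlate strongly, yet neither causes the other.
n = 1000
deploy = [random.random() < 0.3 for _ in range(n)]  # hidden common cause
errors = [10 + (50 if d else 0) + random.gauss(0, 5) for d in deploy]
latency = [100 + (200 if d else 0) + random.gauss(0, 20) for d in deploy]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient over two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# High correlation despite no direct causal link between errors and latency.
print(pearson(errors, latency))
```

A purely correlational system would flag either metric as "the cause" of the other; a causal one has to recover the deployment event instead.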

Raaz grew up in India and did his PhD at Berkeley, where he became interested in causal analysis while interning at a startup dealing with wireless system monitoring. He recalls meeting Anish at MIT and together pondering "how to turn theory into a company."


2. The Traversal Idea -- Timing, Market, and a Unique Approach

The problem Traversal zeroed in on was, as they put it, "almost naturally the first clear problem that emerged with the arrival of AI." Although many AI-based SRE (Site Reliability Engineering) startups already existed, the Traversal team chose to rebuild the approach from first principles.

"In massive distributed systems, truly exploring the vast data of logs, metrics, traces, code, configuration files, and even Slack messages is essentially one giant 'search problem' -- exactly the kind of problem AI is built for."

Rather than simple automation, Traversal focused on combining:

  • Causal machine learning (technology for discovering true causes)
  • Reinforcement learning (technology for intelligently narrowing the search space)
  • The adaptability of LLMs and other AI agents

The key 'pain point' in the market was that simple log/monitoring tools couldn't "find the real cause of a problem among numerous 'fake needles.'"

"This isn't 'finding a needle in a haystack' -- it's 'finding the real needle among fake needles in a haystack.'"


3. How Traversal Actually Works: A Demo

Real-world use cases for Traversal involve large enterprises (e.g., DigitalOcean) handling thousands of microservices, billions of log entries, millions of time-series data points, and thousands of code repositories.

Input and Flow

  • When a user or system inputs "the approximate time the problem started" and "a description of the situation," Traversal begins querying related logs, metrics, and service lists.
  • It can automatically detect alert channels on Slack or allow manual investigation from a dashboard.

Context Building

  • Users don't need to manually input all the information -- "Traversal automatically generates important context."
  • "If we required users to input too much information, what would be the value of Traversal? We need to accumulate context on our own."
  • Since feeding millions of tokens of data directly into an LLM is impossible, Traversal goes through multiple stages of 'searching, summarizing, and connecting.'
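The multi-stage 'search, summarize, connect' idea can be sketched as hierarchical summarization (a minimal illustration under my own assumptions, not Traversal's code): raw telemetry far exceeds any LLM context window, so records are packed into chunks that fit a token budget, each chunk is reduced to a short summary, and the summaries are merged recursively. The `summarize` function here is a stand-in for an LLM call.

```python
MAX_TOKENS = 4000  # assumed per-call context budget

def summarize(text: str) -> str:
    """Stand-in for an LLM summarization call; keeps the first line per chunk."""
    return text.splitlines()[0][:120]

def build_context(records: list[str]) -> str:
    # Stage 1: pack records into chunks under the budget (~4 chars per token).
    chunks, current = [], ""
    for rec in records:
        if current and len(current) + len(rec) > MAX_TOKENS * 4:
            chunks.append(current)
            current = ""
        current += rec + "\n"
    if current:
        chunks.append(current)

    # Stage 2: summarize each chunk, then recurse until one summary remains.
    summaries = [summarize(c) for c in chunks]
    if len(summaries) == 1:
        return summaries[0]
    return build_context(summaries)

# Hypothetical log stream: thousands of lines reduced to one piece of context.
logs = [f"2025-06-01T12:{i % 60:02d} svc=checkout err=timeout id={i}"
        for i in range(5000)]
print(build_context(logs))
```

In a real pipeline each `summarize` call would be an LLM request and the merge step would preserve timestamps and service names rather than dropping lines, but the shape is the same: compress until the evidence fits in one context window.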

Step-by-Step Search and Reasoning

  • From data sets collected across multiple sources like logs and metrics, Traversal progressively narrows the data scope by combining LLMs with statistical algorithms.
  • It dynamically applies statistical tests (such as time-series change detection) while combining them with LLMs that understand meaning (code relationships, metric semantics, etc.) to narrow evidence down to 'true cause candidates.'
  • "The agent we built weaves together statistics (tests) and semantic understanding in real time, progressively keeping only the most relevant context."
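Pairing a statistical test with semantic filtering can be sketched as follows (a hedged illustration of the idea, not Traversal's algorithm; the incident text, metric names, and the keyword-overlap stand-in for LLM judgment are all hypothetical): a crude mean-shift test scores each metric for a change at the incident time, while a relevance check discards metrics unrelated to the reported symptom.

```python
def mean_shift_score(series: list[float], t: int) -> float:
    """Absolute difference of means before/after index t -- a crude change test."""
    before, after = series[:t], series[t:]
    if not before or not after:
        return 0.0
    return abs(sum(after) / len(after) - sum(before) / len(before))

def semantically_relevant(name: str, incident: str) -> bool:
    """Stand-in for an LLM judgment: keep metrics sharing terms with the report."""
    return any(tok in incident.lower() for tok in name.lower().split("."))

incident = "checkout latency spiked around 12:40"
metrics = {
    "checkout.latency_ms": [100.0] * 40 + [900.0] * 20,  # shifts at incident time
    "billing.cpu_pct":     [55.0] * 60,                  # flat and unrelated
    "checkout.error_rate": [0.01] * 40 + [0.20] * 20,    # shifts, same service
}

# Rank candidates: statistical evidence of change, gated by semantic relevance.
candidates = sorted(
    ((mean_shift_score(vals, 40), name) for name, vals in metrics.items()
     if semantically_relevant(name, incident)),
    reverse=True,
)
print(candidates)  # checkout.* metrics ranked; billing.cpu_pct filtered out
```

The actual system would use proper change-point tests and an LLM that understands code and metric semantics, but the interplay is the one described above: statistics narrows by evidence, semantics narrows by meaning.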

Results and Connection to Actual Work

  • Traversal explores systems with a read-only approach, but once sufficient trust is established, it can recommend 'automated actions' (e.g., rolling back a specific commit, executing scripts).

4. Traversal's Agentic Architecture and AI Model Selection

The 'agentic architecture' that Traversal emphasizes is fundamentally different from conventional 'pre-defined procedure automation.' Existing runbook products work well in situations where algorithms are already defined (when playbooks exist), but Traversal's target domain is "real-time exploration of complex failure situations that no one has experienced before."

"This domain is a completely agentic problem. The AI must proactively solve problems with no prior knowledge and no predefined path."

AI Model Selection and Experimentation

  • They are experimenting with a range of models, including OpenAI's GPT and o-series models, Gemini, and Claude, and find that models with strong reasoning and tool-calling capabilities are advantageous in enterprise environments.
  • "We don't need a simple ReAct-style agent wired to MCP tools -- we need reasoning capabilities that can handle complex exploration."
  • OpenAI's o3 and o4 models offered strong infrastructure and tool utilization, but Claude models performed well at tool calling and "unstucking" -- getting off wrong paths.

5. Business Model and Market Strategy

Traversal approaches the market with an outcome-based model that differentiates from existing observability tool companies (Datadog, Splunk, etc.).

"We're not trying to store and manage data -- we're a company that sells the outcome of problem 'resolution' itself. We don't care about data sources."

  • Rather than simple storage-based pricing (by data volume), they price based on "infrastructure complexity and number of investigations," with a long-term goal of moving to fully outcome-based pricing (including automated remediation).

The Staged Evolution of Self-Healing

  • Currently, problems that Traversal's AI can resolve on its own account for about 10--20% of the total (simple incidents at the L1/L2 engineer level).
  • They expect that within 6 months to 1 year, the AI will be able to "propose and execute fixes" for more complex problems.
  • Fully futuristic scenarios like complete codebase restructuring are realistically estimated to require "at least 2--3 more years."

6. Experimentation, Evaluation, and the Challenges of Real-World Adoption

In the process of deploying Traversal to enterprise customers, they acknowledge that "performance looks great in staging (simulated environments), but building trust in the face of real production variables and complexity is difficult."

"Everyone believes their environment is the most complex in the world. So no matter how well performance looks in staging, contracts only happen after verification in production."

Internally, they invest significant resources in building sophisticated evaluation pipelines and benchmarks. They emphasize the importance of system evaluation/testing to the point of considering it core IP.


7. Conclusion -- Organizational Culture, Hiring, and Future Vision

Traversal.ai is currently hiring actively in New York across multiple areas including AI infrastructure, data evaluation, and system design. In closing, Traversal expresses the aspiration to "create a culture of showing more. We still talk too much -- we want to let people experience the product actually working well."

"Show more, tell less. We believe that's how you earn trust in this market."


Conclusion

Traversal sits at the center of where AI turns vast data into root-cause insight that produces real change. Beyond simple automation, "agentic" AI is making enterprise systems smarter, faster, and easier to operate. The Traversal case shows how strong theory and real-world experience connect, pointing to the future of AI and enterprise operations. "The essence of AI's evolution lies not in simply 'observing,' but in truly 'understanding' and 'acting.'"
