The co-founders of Traversal.ai introduce their approach to incident troubleshooting and root cause analysis in complex enterprise systems using AI and causal machine learning. They point out the limitations of existing observability tools and current AI applications, emphasizing that the enormous volume of fragmented telemetry in modern microservice architectures (logs, metrics, traces, code, Slack messages, and more) turns effective troubleshooting into a massive search problem. Traversal's core innovation is an agent architecture that dynamically combines LLM semantic understanding with statistical analysis of time-series data to narrow down potential root causes, distinguishing true causes from mere correlations. Their product aims to move software maintenance away from a reactive "firefighting" model and the "hero engineer" problem, providing reliable AI-driven insights even when there is no clear playbook, and shifting toward a more proactive and intelligent process.
1. The Founders' Backgrounds and the Story Behind Traversal
The Latent Space podcast, hosted by Alessio from Kernel Labs and swyx from Smol AI, features Traversal co-founders Anish and Raz. Anish is from Singapore and has spent 15 years in the U.S. tech industry, earning a PhD in Computer Science from MIT. His research focused primarily on Causal Machine Learning and Reinforcement Learning.
"It was about the idea that correlation is not causation—how AI systems can discover causal relationships in data, and how reinforcement learning effectively searches broad spaces."
He had originally wanted to spend his career in AI research and was a professor at Columbia University, but in January 2024 he witnessed the explosive growth in LLMs and AI agents and decided to pursue a new challenge at a tech startup. Anish co-founded Traversal with MIT alumni Raz, Raj, and Ahmed. Ahmed, who came from Citadel Securities, had deep experience thinking about troubleshooting and incident response in environments where system uptime is critical, and it was he who first proposed the troubleshooting domain to the team.
Raz grew up in India and also built his career in AI research. He entered UC Berkeley for his PhD in 2015 and became interested in causal machine learning through a startup internship dealing with observability in wireless systems. He was captivated by analyzing the root causes of Wi-Fi issues—whether the problem was the device or the LAN—and spent his PhD focusing on that area. He met Anish during a postdoctoral fellowship at MIT, discovered they shared the same research interests, and together they founded Traversal when they both started faculty positions in New York.
"When I say correlation is not causation, I joke: 'I have a PhD, so I get to say this. You have to listen.'" 🤣
Initially the two founders met every day at a WeWork without a clear idea, and it was at the intersection of causal machine learning, reinforcement learning, and AI agents that they realized troubleshooting was the perfect problem to solve. They described it as a "richly interesting problem" from an AI standpoint.
2. Troubleshooting in Complex Enterprise Environments: Traversal's Unique Approach and Market Timing
When many AI SRE (Site Reliability Engineering) startups emerged in early 2024, Traversal focused on the space of "AI-driven incident response." Anish initially didn't even know the term SRE, but approached the problem from "first principles." They explored several ideas—A/B testing for LLMs, dynamic prompt tuning, product analytics for LLMs (something like Amplitude)—before finding their opportunity in the core engineering domain.
The main reasons this problem attracted them were:
- A long-standing challenge: A deeply entrenched problem that had gone unsolved for 30–40 years. 🕰️
- A massive market: Troubleshooting in the complex systems of enterprise environments represented an enormous market.
Traversal believed they had a proprietary advantage in tackling this problem. The core of their approach lies in distinguishing correlation from causation. When a problem occurs, such as a spike in system latency, many other metrics tend to spike simultaneously. The critical question is which metrics are merely symptoms of the problem and which represent the true root cause.
"The problem is that thousands of other spikes happen at the same time. You need to figure out which spikes are symptoms of the actual problem, which are correlations caused by something else, and which are the true root cause."
This is likened to "finding a needle in a haystack"—except that you also have to find it among countless "fake needles." Traversal's expertise in Causal ML and Reinforcement Learning was precisely optimized for this kind of search and filtering.
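To make the "fake needles" problem concrete, here is a minimal sketch (not Traversal's actual method) of one way to filter correlated spikes: detect when each metric first deviates from its trailing baseline, then keep only candidates whose spike begins at or before the symptom's, since a "cause" that fires after the symptom can be ruled out. Temporal precedence is only a crude necessary condition for causation, not proof of it.

```python
from statistics import fmean, pstdev

def spike_time(series, window=10, z_thresh=4.0):
    """Index of the first point that deviates strongly from its trailing
    baseline, or None if the series never spikes."""
    for t in range(window, len(series)):
        baseline = series[t - window:t]
        mu, sigma = fmean(baseline), max(pstdev(baseline), 1e-6)
        if abs(series[t] - mu) / sigma > z_thresh:
            return t
    return None

def rank_candidates(symptom, candidates):
    """Keep only metrics whose spike starts at or before the symptom's,
    ranked by lead time (earliest onset first)."""
    t_sym = spike_time(symptom)
    scored = []
    for name, series in candidates.items():
        t = spike_time(series)
        if t_sym is not None and t is not None and t <= t_sym:
            scored.append((t_sym - t, name))
    return [name for _, name in sorted(scored, reverse=True)]
```

Against synthetic series where latency jumps at t=50, a database-CPU metric that jumps at t=45 survives the filter, while a queue-depth metric that jumps at t=55 is discarded as a downstream symptom.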
Enterprise environments deal with petabytes of data—logs, metrics, traces, code, configuration files, Slack messages, and more. It is simply not possible to feed all that data into an LLM's context at once and expect it to solve the problem. Traversal therefore saw an agentic system that explores data sequentially and adaptively as essential. This agent system autonomously queries diverse systems such as Elastic, Grafana, ServiceNow, and Datadog to gather information.
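A minimal sketch of what such an agentic loop could look like, with stubbed connectors standing in for real systems like Elastic or Datadog. The tool names, stub outputs, and plan here are hypothetical; in Traversal's actual system an LLM chooses each next hop from the evidence gathered so far.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """A read-only data source the investigating agent may query."""
    name: str
    description: str
    run: Callable[[str], str]

# Stub connectors; real ones would call Elastic, Grafana, Datadog, etc.
TOOLBOX = {
    t.name: t for t in [
        Tool("search_logs", "full-text search over recent logs",
             lambda q: f"3 ERROR lines matching {q!r}"),
        Tool("query_metric", "fetch a metric time-series",
             lambda q: f"series for {q!r}: [..]"),
        Tool("recent_deploys", "deployments in the incident window",
             lambda q: "checkout-service v2.41 deployed at 14:02 UTC"),
    ]
}

def investigate(question, plan, toolbox=TOOLBOX):
    """Execute a sequence of (tool, query) hops, accumulating evidence.
    A static plan keeps the sketch self-contained; a real agent would
    decide the next hop adaptively from the evidence so far."""
    evidence = [f"question: {question}"]
    for tool_name, query in plan:
        evidence.append(f"{tool_name}: {toolbox[tool_name].run(query)}")
    return evidence
```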
They also anticipated that as AI-powered coding assistants proliferate, understanding how code actually behaves would become increasingly difficult—meaning software maintenance could devolve into pure "firefighting" or QA work. Traversal positioned itself not on software development but on software maintenance, aiming to align with the direction of AI-driven development and offer a solution that enables proactive troubleshooting. 🚀
3. How Traversal Works: Inputs, Context Building, and AI-Driven Troubleshooting
Traversal begins when a user provides an approximate time when the incident started and a brief description of the problem—for example, "Region selection was unavailable during bucket creation on the evening of April 22."
3.1. Automatic Triggers and Manual Investigation 🤖
Traversal is primarily triggered automatically from incident channels (e.g., Slack) or alert channels. Once notified, users can navigate to the Traversal UI to conduct deeper investigation. Unlike Slack—which only surfaces "leads" (clues)—the UI lets users review all supporting evidence behind the AI's answers and take follow-up actions.
The UI can also be used proactively, even when no active incident is occurring, to perform a "system health check" or to "figure out what's going on" when, for example, a particular system feels sluggish.
3.2. Intelligent Context Building 🧠
When the user describes a problem, Traversal enters the critical phase of "context building." This takes a few minutes and intelligently draws on the following elements to surface relevant information:
- Who is asking: identity and timing of the query.
- Traversal's knowledge base: its own understanding of the system architecture.
- Live data queries: real-time queries of logs, metrics, traces, and similar sources.
Using all of this, Traversal performs "sequential hopping"—moving from service to service, index to index—to identify related services, understand how they connect, and build up context. This process is AI-driven, and users can accelerate the investigation by providing initial information.
"Traversal analyzes the question to determine what context to retrieve. In the background it draws on who is asking, when they're asking, its own knowledge base, and its understanding of the system architecture."
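The "sequential hopping" idea can be sketched as a bounded walk over a service dependency graph. The graph below is invented for illustration; in practice it would be inferred from traces and Traversal's knowledge base of the architecture.

```python
from collections import deque

# Invented dependency graph keyed by service name.
DEPS = {
    "bucket-api": ["region-service", "auth-service"],
    "region-service": ["config-store"],
    "auth-service": [],
    "config-store": [],
}

def hop_context(start, deps, max_hops=2):
    """Breadth-first 'sequential hopping': collect every service within
    max_hops of the one named in the incident description."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        svc, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in deps.get(svc, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return sorted(seen)
```

Bounding the hop count is what keeps context building tractable: the walk stops before it pulls in the entire architecture.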
3.3. Handling Massive Data and Managing LLM Context 📊
Traversal processes hundreds of millions of tokens' worth of data: hundreds of millions of logs, tens of thousands of time-series, hundreds of pull requests and deployments, thousands of dashboards. None of this can fit into a single LLM's context window at once.
Traversal's agent architecture handles this through tool orchestration and memory management, ensuring that the LLM or orchestrator agent receives only the right information at the right moment to maintain useful context without being overloaded. The process continuously "cuts down" data and narrows the relevant information.
"We looked at hundreds of millions of logs, tens of thousands of time-series, hundreds of pull request deployments, thousands of dashboards. The token count for all this data runs into the millions. But if you dump all of it into an LLM or orchestrator agent at once, it gets overwhelmed. So we use an agent architecture to manage just the right pieces and reach an answer quickly."
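One crude way to "cut down" data so the orchestrator never sees more than it can use is to score evidence snippets and pack the highest-value ones into a fixed token budget. This is a toy sketch; the scoring heuristic and token estimate are illustrative, not Traversal's.

```python
def token_len(text):
    # Crude estimate: roughly one token per whitespace-separated word.
    return len(text.split())

def error_score(line):
    """Toy relevance heuristic: count incident-flavored keywords."""
    signals = ("error", "timeout", "exception", "5xx")
    return sum(line.lower().count(w) for w in signals)

def fit_to_budget(snippets, budget, score):
    """Greedily keep the highest-scoring snippets that fit the context
    budget; everything else stays behind a tool call the agent can make
    later if a lead points back to it."""
    kept, used = [], 0
    for s in sorted(snippets, key=score, reverse=True):
        cost = token_len(s)
        if used + cost <= budget:
            kept.append(s)
            used += cost
    return kept
```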
3.4. Combining Causal ML with Statistical Analysis ➕
Traversal's core differentiator goes beyond simply gathering data and querying an LLM. Anish explains that the process of narrowing down billions of potential symptoms to a handful of actual causes is itself a "massive data processing pipeline." AI agents and LLMs operate throughout this pipeline to reduce the volume of relevant information.
Critically, Traversal uses proprietary statistical tests to process time-series data. Because LLMs are poor at handling time-series directly, Traversal's AI agents act like data scientists, dynamically deciding which combination of statistical tests to apply in order to filter the search space. At the same time they leverage semantic information from logs, metrics, and traces. Raz describes this as a "semantics meets statistics" framework.
"LLMs are really poor at handling time-series data. That's exactly when you need good statistics. These statistical tests are like a toolbox the AI agent can access. Like a data scientist, it dynamically decides which statistical tests to run to filter the search space."
In this way, Traversal goes beyond mere correlation to present "root cause candidates" backed by factual symptoms observed in the data. This is a complex process that cannot be easily achieved with simple LLM calls or a basic ReAct agent.
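As one illustrative entry in the kind of statistical toolbox described above (assumed here; Traversal's actual tests are proprietary), a permutation test for a mean shift between the pre-incident and in-incident windows of a metric:

```python
import random

def mean_shift_pvalue(before, during, n_perm=2000, seed=0):
    """Permutation test for a mean shift between two windows of a metric:
    shuffle the pooled samples and count how often a random split shows a
    shift at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(during) / len(during) - sum(before) / len(before))
    pooled = list(before) + list(during)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(before)], pooled[len(before):]
        if abs(sum(b) / len(b) - sum(a) / len(a)) >= observed:
            hits += 1
    return hits / n_perm
```

A small p-value flags the metric as genuinely disturbed during the incident window, making it worth keeping in the shrinking candidate set; a large p-value lets the agent discard it.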
4. Self-Healing and Business Model: Traversal's Evolution
Traversal is moving beyond finding the root cause of a problem toward enabling the system to "self-heal."
4.1. A Gradual Approach to Self-Healing 🛠️
Initially, for security reasons, Traversal only receives read-only API access to customer systems, since customers generally have reservations about installing agents in their infrastructure. Once a long-term relationship builds trust, conversations about "system self-healing" begin.
Self-healing at Traversal works as follows:
- Simple actions: If the problem is highly localized, a straightforward action such as reverting a specific commit may be proposed.
- Leveraging existing automation scripts: Many large enterprises maintain thousands of automation scripts to heal their systems (e.g., restarting specific pods, auto-scaling certain parts of infrastructure). The challenge is knowing which script to run. Once Traversal has precisely identified the root cause, it uses the LLM's semantic understanding to connect or match the appropriate healing script and propose its execution.
"People don't want to hand over full write access to do anything to their systems. So a whitelisted command set is needed, and this typically takes the form of existing scripts the customers already have."
Traversal operates within a pre-approved (whitelisted) set of commands, ensuring self-healing is performed without requiring full write access to customer systems.
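A sketch of how such a whitelisted action set might be enforced, with an invented playbook and a naive lexical matcher standing in for the LLM's semantic matching of root cause to script:

```python
# Hypothetical remediation registry: every runnable action is declared up
# front, so the agent can only ever propose pre-approved commands.
PLAYBOOK = {
    "restart-checkout-pods": "kubectl rollout restart deploy/checkout",
    "scale-search-workers": "kubectl scale deploy/search --replicas=8",
    "revert-last-config": "git revert --no-edit HEAD -- config/",
}

def propose_action(root_cause, playbook=PLAYBOOK):
    """Naive word-overlap matching as a stand-in for the LLM step that
    pairs a diagnosed root cause with an existing automation script."""
    words = set(root_cause.lower().replace("-", " ").split())
    best, best_overlap = None, 0
    for name in playbook:
        overlap = len(words & set(name.split("-")))
        if overlap > best_overlap:
            best, best_overlap = name, overlap
    return best  # a proposal only; execution still needs approval

def execute(name, playbook=PLAYBOOK):
    """Whitelist enforcement: refuse anything outside the playbook."""
    if name not in playbook:
        raise PermissionError(f"{name!r} is not a pre-approved action")
    return playbook[name]  # in reality: run the script, not return it
```

The point of the design is that the matcher can be as clever as you like, but the executor only ever runs commands the customer approved in advance.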
4.2. Business Model and Market Competition 💰
Traversal's business model differentiates itself from existing observability tool vendors. Those vendors typically charge based on the volume of data stored and have little incentive to provide insights into data stored elsewhere. In enterprise environments, however, it is common to run 7–8 fragmented tools simultaneously—Splunk, Dynatrace, Datadog, and others. Resolving a problem requires iterating across all of them to piece together what happened.
Traversal saw this fragmentation of systems and teams as an opportunity.
"We're not trying to sell data storage volume—we're selling the outcome of the investigation itself. We're reasonably agnostic about where the data lives. We're like the Switzerland of observability."
Rather than storage-based billing, Traversal pursues a model based on the outcome of the investigation. Currently, pricing combines two factors:
- Size of the search space: complexity of the infrastructure (number of hosts, containers, etc.).
- Number of investigations conducted.
Unlike customer support use cases, Traversal's technology is not yet at the stage of guaranteeing fully autonomous self-healing outcomes 100% of the time. Today it provides "very good leads." As a result, customers resist purely outcome-based pricing. As the technology matures, Traversal plans to transition to full outcome-based pricing. For now, pricing blends infrastructure scale and number of investigations, with more weight on infrastructure scale.
For cases where full end-to-end self-healing is already possible, Traversal is already attempting to sell on an outcome basis.
4.3. Stages of Self-Healing Development and Future Outlook 🔮
Raz explains that self-healing, like agents themselves, exists on a continuum. Currently, for the easiest 10–20% of problems, Traversal can deliver reliable root cause analysis and self-healing—work that an L1 or L2 engineer could handle.
Self-healing for 30–40% of medium-complexity problems (where a senior engineer's confirmation is still needed) is expected to become feasible within six months to a year.
A fully autonomous agent capable of long-term re-architecture of codebases, however, is expected to take at least two more years. AI continues to surprise, but responding to unpredictable issues in live production systems and fully restructuring code will take more time. Problems solvable by unit testing may be largely addressed within one to two years, but more severe issues still have a long road ahead.
5. AI Agent Development and Model Selection: Traversal's View on GPT-5
Traversal actively tests a wide range of AI models and emphasizes the importance of reasoning models in particular.
5.1. The Importance of Reasoning Models 🧐
Traversal's troubleshooting scenarios require reasoning about root causes while navigating vast numbers of symptoms and complex system architectures, making reasoning models essential.
"Reasoning models are essential for our use case. We need to reason about root causes across many symptoms and complex architectures. A plain flagship model won't cut it."
Traversal has built a flexible architecture that allows models to be swapped easily. This flexibility matters because, given the uncertainty around AI regulation, many enterprise customers find it difficult to bring their own models.
5.2. Evaluating Frontier Models: OpenAI, Gemini, and Anthropic 🧪
Traversal continuously evaluates major models including OpenAI's o-series reasoning models, Google Gemini, and Anthropic's Claude.
- OpenAI (o-series): Historically o3 performed better on reasoning, and well-supported infrastructure allowed for rapid development.
- Google Gemini: Gemini has shown solid performance recently but its infrastructure support lags somewhat, so OpenAI is still used more.
- GPT-5: The verdict on GPT-5 is still unclear.
"GPT-5 starts reasoning even when it shouldn't. We haven't made up our minds yet—final testing is ongoing. We haven't seen overwhelming positive signals so far."
Traversal notes that GPT-5 tends to over-reason in ways that are unnecessary, and continues to evaluate it. They emphasize that because they must bridge "missing information" or "poorly instrumented data" in complex enterprise environments, simple pre-defined workflows are insufficient—reasoning models are indispensable.
5.3. Claude's Tool-Calling Ability 🎯
Anthropic's Claude is widely regarded as excellent at tool calling, and Traversal has confirmed this assessment.
"For the agent portion of our stack, we're moving toward Anthropic. There are signals that Anthropic and Claude are better at tool calling, especially at getting unstuck when the investigation goes down a wrong path."
They shared their experience that when an investigation heads in the wrong direction, Claude is more effective at pivoting back toward the correct path.
5.4. Eval Pipeline and Core IP 📈
Early on, Traversal relied on "vibes" to evaluate models, but now has proper benchmarks and a highly detailed eval pipeline.
"I always tell the company: one of our core IPs is how well we evaluate. I think the best AI companies need to always be at the frontier of what models can do."
Because AI companies must continuously push against model limitations, systems will sometimes work and sometimes not. Evaluation therefore becomes a critical bottleneck requiring heavy investment of time and resources. Traversal emphasizes that eval is core intellectual property, and actively recruits PhD-level talent to support it. 🧑🎓
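For flavor, one metric such an eval pipeline might track (hypothetical; the interview does not describe Traversal's actual metrics) is recall@k over a set of labeled past incidents: how often the true root cause appears among the agent's top-ranked candidates.

```python
def recall_at_k(cases, k=3):
    """Fraction of labeled incidents whose true root cause appears among
    the agent's top-k ranked candidates. Each case is a pair of
    (ranked candidate list, ground-truth root cause)."""
    hits = sum(1 for ranked, truth in cases if truth in ranked[:k])
    return hits / len(cases)
```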
6. Conclusion: "More Showing, Less Telling" and Hiring 🚀
Anish highlighted the phrase "more showing, less telling" as a message for a world where many people repeat the same claims about AI and technology. Traversal's success came not from story and assertion, but from demonstrating the actual utility of a product that resolves complex production incidents.
6.1. The Benchmark Dilemma 🤔
For Traversal, benchmarks are core intellectual property—essential for knowing how well the company is performing and determining the direction forward. But publishing those benchmarks is equivalent to exposing the company's core value externally, creating a "stalemate", as Anish puts it.
6.2. The Challenge of Accelerating PoCs ⚡️
Customers often attempt to evaluate Traversal's performance by running "chaos engineering" exercises in staging environments. But Anish notes that no matter how well Traversal performs in simulated environments, customers ultimately need to test in actual production before they are convinced.
"Customers always say, 'No, no. To really know, we need to test in production.' And you just have to say, 'What?'" 🤷♂️
This is because truly simulating a production environment with sufficient fidelity is essentially equivalent to building another production environment. The most effective approach currently is for customers to provide data from their last ten production incidents, which Traversal then analyzes and presents findings on. Customers tend to believe their environment is "uniquely complex", making it hard to earn their conviction without demonstrating value in real production.
6.3. Hiring at Traversal 🌟
Traversal is actively hiring. Based in New York City, they are looking for people willing to work in or relocate to New York. Anish encouraged anyone who wants to work at the intersection of AI and infrastructure to reach out.
"We are actively hiring right now. Our entire team is in New York City. If you're in New York or willing to move there, and you want to work at a great company at the intersection of AI and infrastructure, please reach out."
