Analysis of OpenAI's recent incident report
OpenAI released a public report on the outage that occurred on December 11th. The report details the causes of the failure and how it was resolved, and there is a lot to learn from it.
- System overload (saturation) - the beginning of the failure
“With thousands of nodes working simultaneously, the Kubernetes API server became overloaded, which brought down the Kubernetes control plane in many of our large clusters.”
- Saturation refers to the state in which demand exceeds the limit of what a system can handle. It is also commonly called overload or resource exhaustion.
- In OpenAI's case, the Kubernetes API server could not keep up with the excess traffic and became saturated, which in turn caused the DNS-based service discovery mechanism to fail.
- Saturation is one of the most common causes of system failure. Beyond OpenAI's case, companies such as Cloudflare, Rogers, and Slack have experienced similar incidents.
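The dynamic that makes saturation so dangerous can be sketched in a few lines. This is a toy single-server model with made-up numbers, not OpenAI's system: once arrivals exceed service capacity, the backlog grows without bound, unless excess work is rejected (load shedding).

```python
# Toy model (illustrative numbers, not OpenAI's system): why saturation is
# different from ordinary high load. When arrivals exceed service capacity,
# the backlog grows every second and never recovers on its own.

def simulate(arrival_rate, service_rate, seconds, max_queue=None):
    """Simulate a single server; returns the backlog length after each second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrival_rate                   # new requests this second
        if max_queue is not None:                 # load shedding: reject the excess
            backlog = min(backlog, max_queue)
        backlog = max(0, backlog - service_rate)  # serve what we can
        history.append(backlog)
    return history

# Below capacity: the backlog stays at zero.
assert simulate(arrival_rate=80, service_rate=100, seconds=60)[-1] == 0

# Saturated: the backlog grows by 50 requests every second, without limit.
assert simulate(arrival_rate=150, service_rate=100, seconds=60)[-1] == 50 * 60

# With load shedding, some requests fail fast but the server stays usable.
assert simulate(arrival_rate=150, service_rate=100, seconds=60, max_queue=500)[-1] <= 500
```

The last case is the trade-off behind most overload defenses: rejecting work early keeps the backlog bounded, so the server continues serving what it can instead of collapsing entirely.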
- The tests passed, but...
"The change was tested in a staging cluster, where no issues were observed. However, the impact was specific to clusters above a certain size, and the DNS cache on each node delayed visible failures long enough for the rollout to continue."
- Some problems occur only in the production environment, even when every test passes. Saturation in particular tends to become apparent only when the system is exposed to full load.
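One way to see why a test can pass at staging scale and still fail in production: if each node's agent adds a fixed amount of load on a central API server, total load grows linearly with cluster size and crosses the capacity threshold only above a certain node count. The numbers below are purely illustrative, not OpenAI's.

```python
# Illustrative model (made-up numbers): per-node load on a central API server
# scales with cluster size, so a capacity limit is crossed only on large clusters.

API_SERVER_CAPACITY_QPS = 5_000   # hypothetical control-plane capacity
LOAD_PER_NODE_QPS = 2             # hypothetical extra load per node from a new agent

def control_plane_overloaded(node_count: int) -> bool:
    return node_count * LOAD_PER_NODE_QPS > API_SERVER_CAPACITY_QPS

assert not control_plane_overloaded(200)    # staging-sized cluster: no issue observed
assert control_plane_overloaded(3_000)      # large production cluster: saturated
```

A staging cluster of 200 nodes sits far below the threshold, so the test honestly reports no problem; the failure mode simply does not exist at that scale.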
- OpenAI evaluated resource usage such as CPU and memory before deploying the new telemetry service, but did not sufficiently consider the load it would place on the Kubernetes API server.
“We monitored service health, but we lacked protocols for monitoring cluster health.”
- Complex, unexpected interactions
“This was the result of multiple systems and processes failing simultaneously and interacting in unexpected ways.”
- In complex systems, failures rarely stem from a single component in isolation; interactions between systems are often what cause problems.
- In OpenAI's case, the new telemetry service put excessive load on the Kubernetes API server, which caused DNS-based service discovery to fail.
“A new telemetry service configuration unexpectedly increased Kubernetes API load on large clusters, overloading the control plane and causing DNS-based service discovery to fail.”
- When the Kubernetes API server misbehaves, already-running workloads can usually keep operating. This time, however, DNS was the link that propagated the problem: once DNS resolution failed, even services already running on Kubernetes were affected.
- DNS caching and delay issues (Impact of a change is spread out over time)
“DNS caching caused a delay between changes being made and the service failing.”
- DNS caching spreads the impact of a change out over time, which can make the problem difficult to diagnose.
- In OpenAI's case, DNS caching prevented the telemetry service deployment from causing immediate problems; the failure only surfaced once the rollout was nearly complete.
“DNS caching made the problem less visible, and it only emerged after the rollout was fully underway.”
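The delaying effect of a DNS cache can be shown with a toy resolver (a hypothetical sketch, not Kubernetes' actual DNS stack): lookups keep succeeding from the cache after the authoritative source has already failed, and the failure only surfaces as cached entries expire.

```python
import time

class CachingResolver:
    """Toy DNS-style cache: serves cached answers until their TTL expires."""
    def __init__(self, upstream, ttl_seconds, clock=time.monotonic):
        self.upstream = upstream        # callable: name -> address (may raise)
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}                 # name -> (address, expiry time)

    def resolve(self, name):
        entry = self.cache.get(name)
        if entry and self.clock() < entry[1]:
            return entry[0]             # cache hit: upstream is never consulted
        address = self.upstream(name)   # cache miss: an upstream failure surfaces here
        self.cache[name] = (address, self.clock() + self.ttl)
        return address

# A simulated clock and toggleable upstream make the example deterministic.
now = [0.0]
upstream_alive = [True]

def upstream(name):
    if not upstream_alive[0]:
        raise RuntimeError("service discovery is down")
    return "10.0.0.1"

resolver = CachingResolver(upstream, ttl_seconds=30, clock=lambda: now[0])

resolver.resolve("api.internal")   # warms the cache while everything is healthy
upstream_alive[0] = False          # the DNS backend fails...

now[0] = 10.0
assert resolver.resolve("api.internal") == "10.0.0.1"   # ...but cached lookups still work

now[0] = 40.0                      # after the TTL expires, the failure finally surfaces
try:
    resolver.resolve("api.internal")
    raise AssertionError("expected the lookup to fail")
except RuntimeError:
    pass
```

The gap between "the backend is down" and "lookups start failing" is exactly the window in which a rollout can continue and look healthy, which is what made the problem less visible in OpenAI's case.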
- Failure mode makes remediation more difficult
“We needed access to the Kubernetes control plane to fix the issue, but the load on the Kubernetes API server made that impossible.”
- When a failure occurs, the tools and systems that operators rely on may themselves be affected, which makes remediation harder.
- Facebook ran into a similar problem during its large-scale outage in 2021: when DNS failed, even internal tools became unusable.
“We identified the problem within minutes and worked concurrently to quickly restore the cluster.”
- OpenAI used three strategies in parallel:
- Scale down cluster size: reduces the load on the Kubernetes API.
- Block network access to the Kubernetes admin APIs: stops new requests, giving the API servers time to recover.
- Scale up the Kubernetes API servers: adds resources to handle pending requests.
“By combining all three, we were able to restore enough control to remove the service that was causing the problem.”
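A back-of-the-envelope model (illustrative numbers, not OpenAI's) shows why the three levers work together: scaling down the cluster and blocking admin-API traffic reduce the arrival rate, scaling up the API servers raises the service rate, and only then does the backlog drain.

```python
# Toy backlog model (made-up numbers): an overloaded API server recovers only
# when the arrival rate drops below the service rate.

def backlog_after(arrival_qps, service_qps, backlog, seconds):
    """Evolve a request backlog under constant arrival and service rates."""
    for _ in range(seconds):
        backlog = max(0, backlog + arrival_qps - service_qps)
    return backlog

# Overloaded: arrivals exceed capacity, so the backlog keeps growing.
assert backlog_after(arrival_qps=8_000, service_qps=5_000,
                     backlog=100_000, seconds=60) > 100_000

# All three levers applied: a smaller cluster and blocked admin traffic cut
# arrivals, extra API servers raise capacity, and the backlog drains to zero.
assert backlog_after(arrival_qps=1_000, service_qps=10_000,
                     backlog=100_000, seconds=60) == 0
```

No single lever is sufficient in this model: cutting arrivals without capacity leaves a huge backlog to work through, and adding capacity while arrivals stay high never lets the server catch up.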
- But responses like these always carry uncertainty: the wrong action can make the situation worse.
- A change intended to improve reliability - an ironic situation
“As part of our work to improve our cluster observability tools and increase reliability across our organization, we deployed a new telemetry service that collects Kubernetes control plane metrics.”
- This is a case where a change made to improve reliability instead caused an outage. It is a common pattern in complex systems: attempts to increase reliability can destabilize the system in unexpected ways.
