When AI Fails: How FAILS Helps Decode Breakdowns in Large Language Models

Introduction
Ever been mid-chat with an AI like ChatGPT, only for it to suddenly stop responding? Or maybe you were anxiously waiting for AI-generated artwork, and DALL·E just wouldn’t load? AI services drive critical applications, but they’re not immune to breakdowns. And when they fail, the consequences can range from mild inconvenience to full-scale business disruption.
Understanding how and why these failures happen is one of the biggest challenges in AI reliability. That’s where FAILS—a tool designed to analyze incidents in Large Language Model (LLM) services—comes in. Created by a team of researchers, FAILS helps uncover the hidden patterns behind AI service failures, empowering engineers, researchers, and everyday users to make sense of AI breakdowns.
Let’s explore what FAILS does and why it’s a game-changer in making AI services more reliable.
The Reality of AI Failures
AI services like ChatGPT, Claude, and Stable Diffusion power chatbots, automate tasks, and even generate art. Behind the scenes, these models rely on massive distributed systems: geographically dispersed clusters of servers processing billions of requests.
Because of this complexity, AI services go down more often than you might think. When they do, companies experience service interruptions, slow response times, or outright crashes. This can lead to frustrated users, financial losses, and even public apologies from top AI firms.
Despite the critical need for reliability, there hasn’t been a truly open, accessible tool to study these failures—until now.
FAILS fills this gap by collecting and analyzing AI incident data, making AI crash trends more transparent than ever.
Meet FAILS: The AI Failure Investigator
FAILS (Framework for Analysis of Incidents on LLM Services) is the first open-source system designed to automatically track, analyze, and visualize AI service failures.
Think of it like a crime scene investigator for AI breakdowns—it pieces together clues from failure incidents, helping us understand what went wrong and how similar failures could be prevented in the future.
What FAILS Does
FAILS provides a toolkit for researchers, engineers, and even general AI enthusiasts to explore AI service failures. It comes with three core superpowers:
- Automated Data Collection: FAILS scrapes failure reports from AI providers, cleans them up, and organizes them neatly.
- Deep Failure Analysis: It runs 17 different types of analysis, looking at patterns like recovery times, frequency of failures, and co-occurring breakdowns across different AI services.
- Interactive Dashboards & AI Insights: Users can see visual reports of AI outages, explore failure trends over time, and even consult an LLM-powered assistant to help interpret the findings.
By making this information more accessible, FAILS helps users diagnose AI service weaknesses and improve system reliability.
How Does FAILS Track AI Breakdowns?
To understand AI failures, you need structured data. FAILS pulls this data from public status pages provided by leading AI platforms like OpenAI, Anthropic, Character.AI, and Stability.AI.
Every time an AI service reports an outage, FAILS logs the incident and tracks it through five lifecycle stages (the sketch after this list shows one way such status data can be fetched):
- Investigating: The AI provider acknowledges an issue.
- Identified: Engineers have figured out what’s wrong.
- Monitoring: A fix has been applied, but they’re making sure it works.
- Resolved: The service is restored.
- Postmortem: A final report explains what happened.
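Under the hood, these status pages typically publish their incident history as structured JSON. Below is a minimal sketch of how such data could be pulled and mapped onto the five stages above. It assumes an Atlassian Statuspage-style endpoint, which several of these providers appear to use; the exact route, field names, and parsing are illustrative assumptions, not FAILS's actual scraper.

```python
# Minimal sketch: pull recent incidents from a Statuspage-style status page.
# The endpoint and field names follow Atlassian Statuspage's public JSON API;
# whether FAILS itself scrapes this exact route is an assumption.
import requests

STAGES = ["investigating", "identified", "monitoring", "resolved", "postmortem"]

def fetch_incidents(status_page_url: str) -> list[dict]:
    """Return incidents from a status page, with a timestamp per lifecycle stage."""
    resp = requests.get(f"{status_page_url}/api/v2/incidents.json", timeout=30)
    resp.raise_for_status()
    incidents = []
    for raw in resp.json().get("incidents", []):
        # Each update carries a status ("investigating", "resolved", ...) and a timestamp.
        stage_times = {
            update["status"]: update["created_at"]
            for update in raw.get("incident_updates", [])
            if update.get("status") in STAGES
        }
        incidents.append({
            "id": raw["id"],
            "title": raw["name"],
            "created_at": raw["created_at"],
            "resolved_at": raw.get("resolved_at"),
            "stages": stage_times,
        })
    return incidents

# Hypothetical usage:
# incidents = fetch_incidents("https://status.openai.com")
```

Each record ends up with a timestamp per stage, which is exactly the raw material the reliability metrics below are computed from.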
FAILS then calculates two key reliability metrics:
- Mean Time to Recovery (MTTR): On average, how long does it take a provider to resolve an incident once it starts?
- Mean Time Between Failures (MTBF): On average, how much time passes between one failure and the next?
By looking at these metrics across different platforms, FAILS highlights which AI services are resilient—and which ones struggle with frequent breakdowns.
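To make those definitions concrete, here is a back-of-the-envelope sketch of how MTTR and MTBF could be computed from incident records like the ones above. These are the standard definitions of the two metrics, not necessarily the exact formulas implemented inside FAILS.

```python
# Back-of-the-envelope MTTR/MTBF over incident records like those returned by
# fetch_incidents() above. Standard definitions, not FAILS's exact implementation.
from datetime import datetime, timedelta

def _parse(ts: str) -> datetime:
    # Status pages typically use ISO 8601 timestamps, e.g. "2024-05-01T12:34:56.000Z".
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def mttr(incidents: list[dict]) -> timedelta:
    """Mean Time to Recovery: average of (resolved_at - created_at) over resolved incidents."""
    durations = [
        _parse(i["resolved_at"]) - _parse(i["created_at"])
        for i in incidents
        if i.get("resolved_at")
    ]
    return sum(durations, timedelta()) / len(durations)  # assumes at least one resolved incident

def mtbf(incidents: list[dict]) -> timedelta:
    """Mean Time Between Failures: average gap between consecutive incident start times."""
    starts = sorted(_parse(i["created_at"]) for i in incidents)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps, timedelta()) / len(gaps)  # assumes at least two incidents
```

A long MTTR signals slow recovery, while a short MTBF signals frequent failures; comparing both across providers is what drives the comparisons below.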
What FAILS Reveals About AI Service Reliability
With FAILS collecting and analyzing real-world AI service failures, some interesting patterns start to emerge.
1. Some AI Services Are More Prone to Breakdowns Than Others
FAILS has already gathered nearly four years of incident reports from OpenAI, Anthropic, and other providers. Analysis shows that some services are much more reliable than others.
For example, Character.AI and Stability.AI experience fewer failures than OpenAI or Anthropic. This might not be because they’re better at reliability—rather, they have fewer users and simpler systems, which means less strain on their infrastructure.
2. Some AI Services Recover Faster Than Others
FAILS also helps rank AI providers by how fast they recover from failures. Stability.AI has shown the fastest recovery times, meaning its engineers tend to fix issues much more quickly than competitors like OpenAI.
This kind of data can be useful for researchers studying AI infrastructure or businesses deciding which AI service to integrate into their products.
3. Interconnected Failures Make Some AI Services Riskier
Sometimes, an outage on one AI service cascades into failures across multiple services. FAILS reveals that Anthropic is more prone to system-wide failures impacting multiple services at the same time, while failures at OpenAI and Stability.AI tend to be isolated to single services.
This suggests that some companies have stronger fault isolation between their services, preventing a failure in one component from bringing down everything else.
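One simple way to surface these co-occurring failures is to look for incidents on different services whose active time windows overlap. The sketch below illustrates that idea; the paper's actual co-occurrence analysis may use a different definition, so treat this purely as an illustration.

```python
# Illustrative sketch: flag pairs of services whose incident windows overlap in time.
# This is one possible definition of "co-occurring" failures, not necessarily the
# one used in the FAILS analyses.
from datetime import datetime
from itertools import combinations

def overlapping_incident_pairs(
    incidents_by_service: dict[str, list[tuple[datetime, datetime]]],
) -> list[tuple[str, str]]:
    """Return a (service_a, service_b) entry for every pair of overlapping incident windows."""
    pairs = []
    for (svc_a, windows_a), (svc_b, windows_b) in combinations(incidents_by_service.items(), 2):
        for start_a, end_a in windows_a:
            for start_b, end_b in windows_b:
                # Two windows overlap if each one starts before the other ends.
                if start_a < end_b and start_b < end_a:
                    pairs.append((svc_a, svc_b))
    return pairs
```

Providers whose services show up together in many such pairs are the ones where a single root cause tends to ripple outward.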
Next Steps: Making AI Fail Less
FAILS isn’t just for studying past failures—it also has potential for predicting future AI failures. The researchers behind FAILS are working on features like:
- Real-time outage tracking: Predicting when AI services will go down based on early warning signs.
- Comparing provider reports with user complaints: Checking whether AI firms accurately report their failures or downplay issues.
- Enhanced AI summaries: Using LLMs to explain outage reports in simpler terms for non-technical users.
The goal? Help AI providers fix problems faster and create more dependable AI systems.
Key Takeaways
- FAILS is the first open-source tool for tracking AI breakdowns. It scrapes failure reports from AI providers and helps analyze the patterns behind AI failures.
- AI services fail more often than people realize. OpenAI and Anthropic experience more failures than Character.AI or Stability.AI, largely due to their greater scale and system complexity.
- Some providers recover faster from failures than others. Stability.AI shows the fastest recovery times, while Anthropic struggles with multi-service failures.
- FAILS helps users explore AI failures through an interactive dashboard. It visualizes failure trends and even integrates AI-powered insights to help interpret failure data.
- The future of FAILS includes real-time tracking and predictive models. The researchers behind FAILS aim to make AI services more reliable by developing smarter failure detection tools.
Final Thoughts
AI systems are powerful, but they’re not invincible. When they fail, the consequences can range from minor annoyances to major financial losses. By making AI failures more transparent, FAILS is taking an important step toward building more reliable AI systems.
Whether you’re an AI researcher, a system engineer, or just a curious user, FAILS offers valuable insights into the hidden world of AI reliability.
Want to explore AI failure trends yourself? Check out FAILS on GitHub: https://github.com/atlarge-research/FAILS 🚀
If you are looking to improve your prompting skills and haven’t already, check out our free Advanced Prompt Engineering course.
This blog post is based on the research article “FAILS: A Framework for Automated Collection and Analysis of LLM Service Incidents” by Authors: Sándor Battaglini-Fischer, Nishanthi Srinivasan, Bálint László Szarvas, Xiaoyu Chu, Alexandru Iosup. You can find the original article here.