When AI Fails: How FAILS Helps Track and Analyze LLM Service Outages
Introduction
AI-powered tools like ChatGPT, DALL·E, and other large language model (LLM) services have become an integral part of our daily and professional lives. Businesses use them for customer support, creative professionals rely on them for content generation, and developers integrate them into coding assistants. But just like any other technology, LLM services are not immune to failures.
When a major AI tool goes down, it can cause frustration, lost productivity, and even financial losses. Yet, studying and understanding these failures is a challenge—most service providers don’t offer detailed, open-access data on these incidents. Enter FAILS—a groundbreaking open-source framework designed to collect, analyze, and visualize LLM service failures.
This blog post will explore what FAILS does, why it matters, and how it can help researchers, engineers, and everyday users understand and mitigate AI outages.
Why AI Services Fail (and Why It Matters)
AI services operate as complex, large-scale distributed systems, with different components spread across the globe. This complexity makes them prone to failures, which can stem from:
- Infrastructure issues: Data centers, cloud servers, or networks experiencing downtime.
- Software bugs: Problems with the AI models, faulty updates, or system overloads.
- External dependencies: Issues with APIs, third-party dependencies, or cyberattacks.
When an AI service goes down, it can cost businesses revenue, frustrate users, and damage trust—forcing even industry-leading companies to issue public apologies.
Yet, most failure analysis tools available today are either private, enterprise-focused, or provide limited data. General outage tracking services like Downdetector offer user-reported issues but lack detailed insights. This gap in open-access failure analysis motivated researchers to develop FAILS.
Meet FAILS: The Open-Source Solution for AI Outages
FAILS (Framework for Automated Collection and Analysis of LLM Service Incidents) is the first open-source tool designed to collect, analyze, and visualize LLM service failures. It scrapes incident reports from major AI service providers—including OpenAI (ChatGPT, DALL·E), Anthropic (Claude), Character.AI, and Stability.AI—and provides detailed insights into:
- How often failures occur – analyzing the Mean Time Between Failures (MTBF).
- How quickly services recover – measuring the Mean Time to Recovery (MTTR).
- Failure trends and patterns over time – spotting recurring issues.
- Which services are affected together – understanding dependencies and cascading failures.
And the best part? FAILS integrates LLM-powered analysis, making it easier than ever to interpret the data with an AI chatbot that can break down failure insights.
How FAILS Works
1. Automated Data Collection & Cleaning
FAILS scrapes status pages from major LLM providers to collect real-time and historical incident reports. Since service providers use different formats and reporting styles, the tool cleans and standardizes the data, making it useful for comparison and analysis.
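To make that concrete, here is a minimal sketch of what collection and standardization can look like: pull incident reports from a provider's status API and map them onto one shared schema. The endpoints, field names, and schema below are illustrative assumptions, not FAILS' actual implementation.

```python
# Minimal sketch of incident collection and normalization.
# Endpoint paths and field names are assumptions based on typical
# Statuspage-style status APIs, not necessarily what FAILS uses.
from datetime import datetime

import requests

STATUS_PAGES = {  # hypothetical provider -> status API mapping
    "OpenAI": "https://status.openai.com/api/v2/incidents.json",
    "Anthropic": "https://status.anthropic.com/api/v2/incidents.json",
}

def _parse_ts(value: str | None) -> datetime | None:
    # Timestamps arrive in slightly different ISO-8601 flavours; normalize them.
    return datetime.fromisoformat(value.replace("Z", "+00:00")) if value else None

def fetch_incidents(provider: str, url: str) -> list[dict]:
    """Download raw incident reports and map them onto a common schema."""
    raw = requests.get(url, timeout=30).json().get("incidents", [])
    return [
        {
            "provider": provider,
            "title": inc.get("name", "").strip(),
            "impact": inc.get("impact", "unknown"),
            "started_at": _parse_ts(inc.get("created_at")),
            "resolved_at": _parse_ts(inc.get("resolved_at")),
        }
        for inc in raw
    ]

incidents = [i for p, u in STATUS_PAGES.items() for i in fetch_incidents(p, u)]
```

Once every provider's reports live in the same shape, the cross-provider comparisons in the next step become straightforward.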
2. Detailed Failure Analysis
FAILS performs 17 different types of failure analyses, including:
- MTBF (Mean Time Between Failures): The average time between service outages.
- MTTR (Mean Time to Recovery): The average time it takes to fix an issue.
- Co-occurrence of failures: Identifies which services tend to go down together.
For example, FAILS found that Character.AI and Stability.AI tend to have longer times between failures, while OpenAI and Anthropic experience more frequent outages but recover faster.
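To show how these metrics fall out of the cleaned data, here is a small sketch that computes MTTR and MTBF per provider from the normalized records above. The field names follow the same hypothetical schema, and MTBF is taken as the mean gap between the starts of consecutive incidents, which is one common convention; FAILS' own definitions may differ in detail.

```python
# Sketch: MTBF and MTTR per provider from the normalized incident records.
# Field names follow the hypothetical schema above, not FAILS' internals.
from collections import defaultdict

def reliability_metrics(incidents: list[dict]) -> dict[str, dict[str, float]]:
    by_provider = defaultdict(list)
    for inc in incidents:
        if inc["started_at"] and inc["resolved_at"]:
            by_provider[inc["provider"]].append(inc)

    metrics = {}
    for provider, incs in by_provider.items():
        incs.sort(key=lambda i: i["started_at"])
        # MTTR: mean duration from incident start to resolution, in hours.
        repair_hours = [
            (i["resolved_at"] - i["started_at"]).total_seconds() / 3600 for i in incs
        ]
        # MTBF: mean gap between the starts of consecutive incidents, in hours.
        gaps = [
            (b["started_at"] - a["started_at"]).total_seconds() / 3600
            for a, b in zip(incs, incs[1:])
        ]
        metrics[provider] = {
            "MTTR_h": sum(repair_hours) / len(repair_hours),
            "MTBF_h": sum(gaps) / len(gaps) if gaps else float("nan"),
        }
    return metrics
```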
3. AI-Powered Insights
Understanding technical failure metrics can be tricky, so FAILS integrates AI-driven explanations. Users can ask a chatbot questions like:
- “How many incidents happened in the last six months?”
- “Which provider has the shortest recovery time?”
- “What are the most common failure patterns seen in OpenAI services?”
This AI-assisted interaction helps make complex failure data accessible to researchers, engineers, and even non-technical users.
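One plausible way to wire up such a chatbot (not necessarily how FAILS does it) is to compute the statistics locally and hand them to a chat model together with the user's question. The sketch below uses the OpenAI Python SDK; the model name, prompt wording, and function are illustrative assumptions.

```python
# Illustrative sketch of LLM-assisted interpretation: metrics are computed
# locally, then passed to a chat model along with the user's question.
# The prompt, model name, and wiring are assumptions, not FAILS' actual code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_about_failures(question: str, metrics: dict) -> str:
    system = (
        "You are an assistant that explains LLM service reliability metrics. "
        "Answer only from the statistics provided."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model would do here
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Statistics: {metrics}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# e.g. ask_about_failures("Which provider has the shortest recovery time?", metrics)
```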
4. Interactive Data Visualization
FAILS isn’t just about numbers—it generates easy-to-read plots and graphs to visualize failure trends over time.
- Time-series plots highlight how outages have changed historically.
- Heatmaps show service dependency and co-occurrence of failures.
- CDF charts illustrate how long different providers take to recover from failures.
Users can even download these charts for reports and presentations.
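As an illustration of the last chart type, the sketch below draws an empirical CDF of recovery times per provider with matplotlib. It reuses the hypothetical incident schema from earlier; the styling, labels, and output file are assumptions rather than FAILS' actual plotting code.

```python
# Sketch: empirical CDF of time-to-recovery per provider, similar in spirit
# to the CDF charts FAILS produces (exact styling and labels will differ).
import numpy as np
import matplotlib.pyplot as plt

def plot_recovery_cdf(incidents: list[dict]) -> None:
    providers = sorted({i["provider"] for i in incidents})
    for provider in providers:
        hours = np.sort([
            (i["resolved_at"] - i["started_at"]).total_seconds() / 3600
            for i in incidents
            if i["provider"] == provider and i["started_at"] and i["resolved_at"]
        ])
        if hours.size == 0:
            continue
        cdf = np.arange(1, hours.size + 1) / hours.size  # empirical CDF
        plt.step(hours, cdf, where="post", label=provider)
    plt.xlabel("Time to recovery (hours)")
    plt.ylabel("Fraction of incidents resolved")
    plt.legend()
    plt.savefig("recovery_cdf.png", dpi=150)  # exportable chart for reports
```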
Why FAILS Matters
- Empowering Transparency in AI: Many AI companies publicly report failures, but they don’t make it easy to analyze trends. FAILS brings transparency to the reliability of LLM services.
- Helping Businesses & Researchers: Businesses relying on AI tools can use FAILS to choose more reliable services, and researchers can leverage it to improve AI resilience.
- Reducing AI Downtime in the Long Run: By identifying common failure patterns, AI developers can prevent recurring issues and improve service availability.
- Bringing AI & Users Closer: With FAILS’ interactive chatbot, even non-technical users can understand AI failures more easily, bridging the gap between AI research and real-world application.
The Future of FAILS
The team behind FAILS is working on several exciting improvements, including:
- Real-time failure prediction: Using AI to predict downtime before it happens.
- Better AI-powered explanations: Improving LLM-assisted analysis with advanced models.
- User-reported failures: Integrating third-party reports from social media and Downdetector.
With these updates, FAILS could become the go-to tool for AI reliability analysis, helping keep the AI services we depend on more predictable and resilient.
Key Takeaways
- LLM services like ChatGPT and DALL·E occasionally fail, causing frustration and financial losses.
- FAILS is the first open-source tool to systematically collect and analyze AI service failure data.
- It provides detailed insights on outage frequency, recovery times, and failure patterns.
- AI-assisted analysis helps users interpret data easily through an interactive chatbot.
- FAILS’ visual tools allow researchers and businesses to compare AI service reliability over time.
- Future improvements will enhance real-time failure prediction and third-party data integration.
Want to explore FAILS for yourself? Check it out on GitHub.
By making AI failures more transparent, FAILS is paving the way for more reliable and robust AI services in the future.
If you are looking to improve your prompting skills and haven’t already, check out our free Advanced Prompt Engineering course.
This blog post is based on the research article “FAILS: A Framework for Automated Collection and Analysis of LLM Service Incidents” by Authors: Sándor Battaglini-Fischer, Nishanthi Srinivasan, Bálint László Szarvas, Xiaoyu Chu, Alexandru Iosup. You can find the original article here.