**Can AI Accurately Detect Online Hate? A Deep Dive into Open-Source vs. Proprietary Models**

By Stephen Smith · 24 Feb

The internet is a double-edged sword. On one hand, it connects people across the globe, enabling free expression and knowledge-sharing. On the other, it has amplified the spread of extreme speech—content that is offensive, exclusionary, or even incites violence. Social media platforms struggle to filter out harmful content, relying on both human moderators and AI-driven tools. But just how effective is AI at extreme speech classification?

This blog explores fascinating research from Sarthak Mahajan and Nimmi Rangaswamy, which compares different large language models (LLMs)—from open-source alternatives like Llama to proprietary, closed-source giants like GPT-4o—to see which one is better at classifying extreme speech. The findings may surprise you!


Why Extreme Speech Needs AI Moderation

Before diving into the research, let’s define extreme speech. Unlike hate speech in the narrow sense, extreme speech covers a broader range (see the label sketch after this list), including:
– Derogatory speech – offensive language that can be uncivil but may also appear in legitimate protest.
– Dangerous speech – content that could lead to real-world violence.
– Exclusionary speech – subtle forms of discrimination, often expressed as humor to normalize exclusion.
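To make this taxonomy concrete, here is a minimal sketch of how the categories might be encoded as labels for a classifier. The label set and numeric codes are our illustration, not necessarily the exact schema used in the paper:

```python
from enum import Enum

class SpeechLabel(Enum):
    """Hypothetical label set for extreme speech classification."""
    NEUTRAL = 0       # no extreme speech detected
    DEROGATORY = 1    # offensive, uncivil language
    DANGEROUS = 2     # content that could incite real-world violence
    EXCLUSIONARY = 3  # subtle discrimination, often framed as humor
```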

Manually identifying such content is impossible at the scale of social media today. Even human moderators often disagree on what qualifies as extreme speech because of cultural and contextual differences. This is where AI-powered moderation comes in—offering automated, scalable solutions.

However, AI isn’t perfect. Understanding cultural context is tricky, and language models must be trained to recognize complex patterns in speech. The researchers tested how well different types of AI models handle this challenge.


The AI Showdown: Open-Source vs. Proprietary Models

The study compared two types of AI models for extreme speech classification:

1. Open-Source Models (Llama by Meta AI)

  • Transparent and accessible for developers.
  • Can be fine-tuned for specific tasks.
  • Includes models such as Llama 3.1 8B, Llama 3.2 1B, Llama 3.2 3B, and Llama 3.3 70B (where B denotes billions of parameters).

2. Proprietary Models (GPT-4o by OpenAI)

  • Closed-source, meaning internal workings are hidden.
  • Generally more powerful out-of-the-box.
  • Includes GPT-4o and GPT-4o-mini (a lighter version of GPT-4o).

Each model was tested in different settings:
🔹 Zero-shot, where the model received no task-specific training and had to rely on its general knowledge.
🔹 Fine-tuning, where models were trained on labeled examples of extreme speech to improve accuracy.
🔹 Direct Preference Optimization (DPO), which refines a model using pairs of preferred and rejected outputs.
🔹 Ensembling, which combines the predictions of multiple models.


Key Findings: How Each AI Performed

📌 Round 1: Zero-Shot Testing (No Training Given)

Surprisingly, even without task-specific training, the LLMs performed reasonably well, demonstrating their ability to generalize across topics. However:
– Larger models did better than smaller ones (e.g., Llama 3.3 70B beat Llama 3.2 1B).
– GPT-4o outperformed all Llama models in this setting.
– GPT-4o-mini also did well, particularly at detecting dangerous speech.

💡 Takeaway: Bigger models and proprietary models handle zero-shot classification better, likely due to superior training data and architectures.
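To make the zero-shot setting concrete, here is a minimal sketch using OpenAI’s Python client; the prompt wording and label names are illustrative and not taken from the paper:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["neutral", "derogatory", "dangerous", "exclusionary"]

def classify_zero_shot(text: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model to pick a label with no task-specific training."""
    prompt = (
        "Classify the following post into exactly one category: "
        f"{', '.join(LABELS)}.\n\n"
        f"Post: {text}\n\n"
        "Answer with the label only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for a classification task
    )
    return response.choices[0].message.content.strip().lower()
```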

📌 Round 2: Fine-Tuning for Better Accuracy

When models were trained with specific examples of extreme speech, their performance improved significantly:
– Even smaller Llama models became as effective as GPT-4o, proving that fine-tuning can make open-source AI just as powerful.
– Fine-tuning eliminated the performance gap between open and closed models.

💡 Takeaway: Publicly available LLMs can be fine-tuned to rival proprietary models, making them strong alternatives for organizations needing customizable AI moderation.
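For a flavor of what fine-tuning involves, here is a minimal sketch that puts a classification head on a small Llama model with Hugging Face transformers. The dataset file, label count, and hyperparameters are placeholders, not the paper’s setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.2-1B"  # smallest Llama variant in the study
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token

# Hypothetical labeled CSV with "text" and "label" (0-3) columns.
dataset = load_dataset("csv", data_files="extreme_speech_train.csv")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=4, pad_token_id=tokenizer.pad_token_id
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-extreme-speech",
                           per_device_train_batch_size=8,
                           num_train_epochs=3),
    train_dataset=dataset,
    processing_class=tokenizer,  # enables dynamic padding per batch
)
trainer.train()
```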

📌 Round 3: Direct Preference Optimization (DPO)

DPO fine-tunes a model on pairs of preferred and rejected outputs, nudging it toward the preferred responses without training a separate reward model. In this study, however, DPO did not improve accuracy on extreme speech classification.

💡 Takeaway: While useful for preference-based tasks, DPO adds little value for strict classification problems like detecting extreme speech.
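For reference, a DPO run with the trl library might look like the following minimal sketch; the preference pairs are invented for illustration, with the correct label as the “chosen” answer and a plausible misclassification as the “rejected” one:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical preference pairs for label prediction.
pairs = Dataset.from_dict({
    "prompt":   ["Classify: 'They should all be deported.'",
                 "Classify: 'Lovely weather today.'"],
    "chosen":   ["exclusionary", "neutral"],
    "rejected": ["neutral", "dangerous"],
})

trainer = DPOTrainer(
    model=model,  # a frozen reference copy is created internally
    args=DPOConfig(output_dir="dpo-extreme-speech", beta=0.1),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```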

📌 Round 4: Ensembling (Combining Multiple Models)

To see if combining different AIs worked better, researchers tried blending multiple fine-tuned models. However, the improvement was minimal because each model showed similar strengths and weaknesses.

💡 Takeaway: If all models make the same mistakes, combining them won’t help. A better approach would be using AI alongside human moderators.
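A common form of ensembling is a simple majority vote, as in this sketch; each classifier is assumed to be any callable that takes a post and returns a label string (such as the sketches above):

```python
from collections import Counter

def ensemble_vote(text: str, classifiers: list) -> str:
    """Majority vote across models; ties fall back to the first
    (presumably strongest) classifier in the list."""
    votes = [clf(text) for clf in classifiers]
    label, count = Counter(votes).most_common(1)[0]
    return label if count > 1 else votes[0]

# Usage (hypothetical models):
# ensemble_vote(post, [llama_1b_clf, llama_70b_clf, gpt4o_clf])
```

If the models share the same blind spots, as the study found, the vote simply repeats the shared mistake, which is why the gains were minimal.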


How These Findings Impact AI Content Moderation

For companies, regulators, and researchers interested in responsible AI, this study provides several key insights:

🔹 Proprietary AIs aren’t always necessary. Open-source models like Llama can perform just as well when fine-tuned, making them an attractive choice for organizations needing transparency and cost control.

🔹 Fine-tuning is essential. Models trained with real examples perform significantly better than their original versions. This suggests that future AI moderation should incorporate reliable training on diverse real-world data.

🔹 AI alone is not enough. Even the best models showed inconsistencies, just like human moderators. A hybrid approach combining AI and human oversight, as sketched below, is likely the best way to tackle online hate effectively.
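Here is a minimal sketch of what that hybrid routing could look like; the confidence threshold is illustrative, and `classify_with_confidence` stands in for any model that returns a label plus a confidence score:

```python
def moderate(text: str, classify_with_confidence, threshold: float = 0.85) -> dict:
    """Act automatically on confident predictions; queue the rest for humans."""
    label, confidence = classify_with_confidence(text)
    if confidence >= threshold:
        return {"action": "auto", "label": label, "confidence": confidence}
    return {"action": "human_review", "label": label, "confidence": confidence}
```

Raising the threshold shifts more work to human moderators but reduces the chance of an automated mistake; tuning it is a policy decision as much as a technical one.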


Key Takeaways

✅ Extreme speech is a complex challenge that requires AI-powered moderation due to the sheer scale of online content.
✅ Open-source AI (Llama) can match closed-source AI (GPT-4o) when fine-tuned, making it a cost-effective and ethical alternative.
✅ Fine-tuning drastically improves AI performance, proving the importance of training models with real-world examples.
✅ Advanced techniques like Direct Preference Optimization (DPO) didn’t help, showing that refinement methods built for preference tasks don’t automatically transfer to classification.
✅ The best AI solution combines models with human moderation, ensuring nuanced decisions in content filtering.


What’s Next for AI in Content Moderation?

This research highlights exciting advancements, but challenges remain. As society debates the ethics and effectiveness of AI moderation, future AI models must:
– Improve contextual understanding to recognize cultural nuances.
– Minimize false positives and negatives in hate speech detection.
– Be transparent and auditable, especially for AI-driven regulatory decisions.

So, the next time you’re scrolling through social media and notice that extreme speech is vanishing before your eyes, remember: the battle against online hate isn’t just about detecting words; it’s about understanding context, and AI is actively reshaping how we do that.


What do you think—should social media companies rely more on open-source AI for moderation, or do proprietary models still have the edge? Drop your thoughts in the comments! 🚀💬

If you are looking to improve your prompting skills and haven’t already, check out our free Advanced Prompt Engineering course.

This blog post is based on the research article “Extreme Speech Classification in the Era of LLMs: Exploring Open-Source and Proprietary Models” by Sarthak Mahajan and Nimmi Rangaswamy. You can find the original article here.
