Arabic LLM Leaderboard

Introducing the Arabic Broad Leaderboard (ABL) - the ultimate ranking system for Arabic LLMs, providing a comprehensive overview of leaders in the field

SILMA ABL Leaderboard

The Arabic Broad Leaderboard (ABL) is a cutting-edge leaderboard for Arabic LLMs. It features advanced visualizations, in-depth analytics, detailed model performance breakdowns, speed evaluations, and tools for detecting data contamination. It empowers the community to thoroughly assess the strengths of Arabic language models and make informed decisions about which model best suits each task.

Explore Leaderboard

Features

SILMA ABL Deep Dive Tab


Size-Based Leaderboards


Dedicated leaderboard sections allow users to compare models by size, answering questions like: “What’s the top Arabic model under 10B parameters?”

Defined size categories (a minimal code sketch follows the list):

  • Nano: Fewer than 3.5 billion parameters

  • Small: 3.5 to 10 billion parameters

  • Medium: 10 to 35 billion parameters

  • Large: More than 35 billion parameters
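
To make the boundaries concrete, here is a minimal sketch of how such a categorization could be expressed in code. This is an illustration, not the leaderboard's actual implementation; in particular, treating the upper bounds as inclusive (so a 10B model counts as Small) is an assumption.

```python
def size_category(num_params_billions: float) -> str:
    """Map a parameter count (in billions) to an ABL size category.

    Boundary handling is an assumption: upper bounds are treated as
    inclusive, so a model with exactly 10B parameters falls into "Small".
    """
    if num_params_billions < 3.5:
        return "Nano"    # fewer than 3.5 billion parameters
    if num_params_billions <= 10:
        return "Small"   # 3.5 to 10 billion parameters
    if num_params_billions <= 35:
        return "Medium"  # 10 to 35 billion parameters
    return "Large"       # more than 35 billion parameters


print(size_category(7))   # Small
print(size_category(70))  # Large
```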


Skill-Based Leaderboards


Additional sections enable model comparisons based on specific capabilities. For instance, users can identify the best Arabic model for handling long-context inputs.


Visual Comparison


Models can be evaluated side by side using radar charts to visualize their performance across different skills.
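
For readers who want to reproduce this kind of view on their own, below is a minimal radar-chart sketch using matplotlib. The skill names and scores are made-up placeholders, not ABL data, and the plotting style is an assumption rather than the leaderboard's actual rendering.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical skill axes and scores for two models (placeholder data,
# not taken from the ABL).
skills = ["Knowledge", "RAG", "Long Context", "Reasoning", "Safety"]
model_a = [72, 65, 58, 70, 80]
model_b = [68, 71, 63, 66, 75]

# One angle per axis; repeat the first point to close each polygon.
angles = np.linspace(0, 2 * np.pi, len(skills), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
for scores, label in [(model_a, "Model A"), (model_b, "Model B")]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.1)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(skills)
ax.legend(loc="upper right")
plt.show()
```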


Deep Dive


Deep-dive reports offer a focused analysis of a single model, highlighting its strengths and weaknesses. Full output samples are included for transparency.


Speed


Model performance is evaluated in terms of generation speed, measured in words per second. This is calculated by dividing the total words generated during testing by the elapsed testing time in seconds.
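
In code, the metric is simply total words over elapsed time. The sketch below illustrates the calculation; splitting on whitespace is an assumed word-counting rule, not necessarily the exact rule ABL applies.

```python
def words_per_second(outputs: list[str], elapsed_seconds: float) -> float:
    """Generation speed as total words divided by wall-clock time.

    `outputs` holds the texts generated during the test run. Counting
    words via whitespace splitting is an assumption for illustration.
    """
    total_words = sum(len(text.split()) for text in outputs)
    return total_words / elapsed_seconds


# Example: 4500 words generated over a 300-second run -> 15.0 words/sec.
print(words_per_second(["word " * 4500], 300.0))
```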

To ensure consistency, all models on Hugging Face are benchmarked using the same hardware setup: an A100 GPU and a batch size of 1. For models with over 15 billion parameters, multiple GPUs are used.

Speed comparisons are valid only within the same size category. Closed or API-based models are only compared with other API models, as they are evaluated externally.


Contamination Detection


We employ a proprietary method to estimate the likelihood that a model was exposed to test data during training. This results in a contamination score, which is displayed alongside the model's output, marked with a red indicator.
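
The ABL detector itself is proprietary and undisclosed. Purely for illustration of the general concept, one common public approach is measuring n-gram overlap between a model's generations and the benchmark data. The sketch below implements that generic heuristic; it is explicitly not the ABL algorithm, and the n-gram size is an arbitrary choice.

```python
def ngram_overlap(generated: str, benchmark: str, n: int = 8) -> float:
    """Fraction of n-grams in `generated` that also appear in `benchmark`.

    A generic, publicly known contamination heuristic shown only for
    illustration; the ABL's proprietary detector is not disclosed and
    may work entirely differently.
    """
    def ngrams(text: str) -> set:
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    gen = ngrams(generated)
    if not gen:
        return 0.0
    return len(gen & ngrams(benchmark)) / len(gen)
```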

To preserve leaderboard credibility, strict policies are in place to prevent repeated model evaluations. Each organization or account is restricted to one submission per month.

To avoid score manipulation, we keep details of the detection algorithm, contamination threshold, and sub-threshold scores confidential.

Any model showing signs of contamination is immediately removed and reviewed. Additionally, a banning system is enforced to deter misuse.