ABBL: An Advanced Benchmark and Leaderboard for Comprehensive Evaluation of Arabic Language Models

Jul 20, 2025

Abstract

The rapid advancement of Large Language Models (LLMs) necessitates robust and comprehensive evaluation frameworks, particularly for languages with unique complexities such as Arabic. Existing Arabic benchmarks frequently suffer from several deficiencies: narrow skill coverage, vulnerability to test-set contamination, limited accessibility, and inconsistent data quality. This paper introduces the Arabic Broad Benchmark and Leaderboard (ABBL), a platform developed by SILMA.AI. ABBL features a compact, human-validated dataset of 470 questions spanning 22 distinct Arabic language tasks, sampled from 64 diverse sources. It employs a hybrid evaluation methodology that combines customized manual rules with tailored LLM-as-Judge approaches. To ensure comprehensive and fair evaluation, the leaderboard provides advanced analytical visualizations, detailed breakdowns of model skills, integrated speed benchmarks, contamination detection, and dedicated sub-leaderboards for models of different sizes. ABBL aims to give the research and development community the means to rigorously assess Arabic LLMs, supporting informed model selection and driving further advances in Arabic NLP.
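To make the hybrid evaluation concrete, below is a minimal Python sketch of how a scoring harness might route each question to either a deterministic manual rule or an LLM-as-Judge grader. All names here (the `Item` fields, the task-to-scorer mapping, and the `judge_llm` stub) are illustrative assumptions for exposition, not ABBL's actual schema or implementation.

```python
# Hypothetical sketch of a hybrid scoring harness: deterministic rules for
# closed-form tasks, an LLM judge for open-ended generation tasks.
# Field names and task routing are illustrative, not ABBL's actual schema.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    task: str        # e.g. "mcq", "translation", "summarization"
    prediction: str  # model output being evaluated
    reference: str   # gold answer or reference text

def exact_match(item: Item) -> float:
    """Manual rule: normalize whitespace and case, then compare answers."""
    return float(item.prediction.strip().lower() == item.reference.strip().lower())

def judge_llm(item: Item) -> float:
    """Placeholder for an LLM-as-Judge call. A real grader would send the
    question, reference, and prediction to a judge model together with a
    task-specific rubric, then parse a numeric score from its reply."""
    raise NotImplementedError("wire up a judge model here")

# Route each task type to a scoring function (illustrative mapping).
SCORERS: dict[str, Callable[[Item], float]] = {
    "mcq": exact_match,
    "translation": judge_llm,
    "summarization": judge_llm,
}

def score_benchmark(items: list[Item]) -> float:
    """Average score across all items, using the per-task scorer."""
    scores = [SCORERS[item.task](item) for item in items]
    return sum(scores) / len(scores) if scores else 0.0
```

Under a design like this, cheap deterministic rules handle tasks with a single verifiable answer (e.g., multiple choice), while the judge model, prompted with a per-task rubric, scores open-ended outputs such as translation or summarization.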


Paper link: https://demo.silma.ai/papers/ABBL_final.pdf