Introducing SILMA Matryoshka Embedding Model v0.1

Oct 27, 2024

A New Fully-Open Arabic Embedding Model.

At SILMA AI, we’re driven by a singular goal: to bring world-class AI capabilities to the Arabic-speaking world. Today, we’re excited to announce a major milestone in that mission with the release of the SILMA Arabic Matryoshka Embedding Model v0.1—a powerful new tool designed to revolutionize how Arabic text is understood and processed by machines.

This model is built specifically for the Arabic language, addressing its unique complexities such as rich morphology and linguistic ambiguity. Many general-purpose models struggle with these challenges, but the SILMA Arabic Matryoshka Embedding Model v0.1 is designed to handle them effectively. It’s a strong fit for tasks like semantic search, text similarity, and more.

In this post, we’ll explain what makes this model special, why it’s important for Arabic NLP, and how you can start using it to improve your applications.



Why Embedding Models Matter for Arabic Language AI

Embedding models are a key part of modern Natural Language Processing (NLP). They transform words and sentences into numerical representations (called embeddings) that allow machines to understand the meaning and relationships between different pieces of text. This makes embeddings essential for tasks like search engines, text similarity and retrieval augmented generation (RAG).
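As a quick illustration (a minimal sketch assuming the sentence-transformers package and the SILMA model introduced below), each sentence becomes one fixed-size vector, and sentences with similar meanings yield a high cosine similarity:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("silma-ai/silma-embeddding-matryoshka-0.1")

# Two made-up Arabic sentences with similar meaning
embeddings = model.encode([
    "القطة تجلس على السجادة",   # "The cat sits on the rug"
    "هناك قطة فوق السجادة",     # "There is a cat on the rug"
])
print(embeddings.shape)                       # (2, 768): one vector per sentence
print(cos_sim(embeddings[0], embeddings[1]))  # similar meaning -> score near 1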

Embedding models like the SILMA Arabic Matryoshka Embedding Model v0.1 address the challenges of Arabic by offering more accurate, context-aware representations of Arabic text. They capture the subtleties of the language, making tasks like semantic search and text matching far more reliable.

SILMA Embedding Models can convert Arabic texts into representative embedding vectors.

What is the Matryoshka Embedding Model?

The Matryoshka Embedding Model is a new approach to creating more efficient and powerful text embeddings. Named after the Russian Matryoshka dolls that nest inside one another, this method allows embeddings to capture multiple layers of meaning and context within a single representation. It’s a more flexible and efficient way to handle the complexities of language compared to traditional embedding methods.

In traditional embeddings, a fixed-size vector is used to represent words or sentences. While effective, this can sometimes limit the model’s ability to fully capture nuances, especially in languages like Arabic that have rich morphology and diverse dialects. The Matryoshka approach overcomes this by creating multi-scale embeddings that allow different levels of meaning to be captured, similar to how the nested dolls represent layers within layers.

By using the Matryoshka approach, SILMA’s model provides high-quality embeddings without needing massive computational resources, making it accessible and scalable for real-world applications. You can adjust the size of the output embeddings anywhere from 2 up to 768 dimensions to match your accuracy, storage, and latency needs.

With the SILMA Matryoshka Embedding Model, you can truncate the output vector and still retain high-quality embeddings.
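To make the nesting concrete, here is a minimal NumPy sketch (with a random vector standing in for a real embedding) of the truncate-and-renormalize step that the examples later in this post rely on:

import numpy as np

rng = np.random.default_rng(0)
full = rng.random(768)                     # stand-in for a full 768-dim embedding
for dim in (768, 256, 64, 8):
    small = full[:dim]                     # keep only the first `dim` dimensions
    small = small / np.linalg.norm(small)  # unit-normalize; cosine similarity is scale-invariant, but dot-product scoring needs this
    print(dim, small.shape)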

Key Features of SILMA Arabic Matryoshka Embedding Model

The SILMA Arabic Matryoshka Embedding Model v0.1 is designed to offer high performance and flexibility for a variety of Arabic NLP tasks. Here are some of its standout features:

1. Bilingual Support (Arabic & English)

While the model is optimized for Arabic, it also supports English, making it useful in bilingual settings. This allows for seamless handling of mixed Arabic-English text, which is common in many modern applications; a short bilingual search sketch follows this feature list.

2. High Accuracy on Semantic Tasks

The model performs exceptionally well on standard benchmarks for tasks like semantic search, text similarity, and text classification. This is achieved through its advanced multi-scale embedding technique, which captures deeper context and meaning from text, especially in a morphologically rich language like Arabic.

3. Efficient and Lightweight

Despite its powerful capabilities, the SILMA Arabic Matryoshka Embedding Model is designed to be computationally efficient. It offers high-quality embeddings without requiring large computational resources, making it accessible for developers and businesses that need fast and scalable solutions.

4. Adaptable to Real-World Applications

Whether you’re building a search engine, a recommendation system, or a RAG chatbot, the SILMA model is versatile enough to fit many different use cases. It can be easily integrated into various NLP pipelines, offering both accuracy and speed in real-time applications, as the sketch after this feature list illustrates.

5. Easy Integration via Hugging Face

The model is available on Hugging Face, which means developers can quickly integrate it into their projects. With the support of Hugging Face’s APIs, it’s simple to start using the model in Python-based environments.
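As a quick illustration of features 1 and 4 together, here is a minimal semantic-search sketch over a small, made-up mixed Arabic/English corpus. It assumes only the sentence-transformers package (pip install sentence-transformers) and the model ID used throughout this post; the corpus, query, and expected ranking are illustrative, not benchmark results.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("silma-ai/silma-embeddding-matryoshka-0.1")

# A tiny made-up corpus mixing Arabic and English documents
corpus = [
    "كيفية تجديد جواز السفر",      # "How to renew a passport"
    "Best restaurants in Riyadh",
    "خطوات استخراج رخصة قيادة",    # "Steps to obtain a driving license"
]
corpus_embeddings = model.encode(corpus)

query = "I want to renew my passport"  # English query against Arabic documents
scores = cos_sim(model.encode(query), corpus_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))  # the passport document should rank first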

Use Cases for the SILMA Arabic Matryoshka Embedding Model

The SILMA Arabic Matryoshka Embedding Model v0.1 is versatile and highly effective across a range of practical applications. From short text similarity tasks to complex intent recognition, the model delivers high accuracy and flexibility. Below are some common use cases where this model shines, with examples to demonstrate its capabilities.

For example, in everyday applications like search engines or text classification, short-sentence similarity is critical for matching user queries to relevant content. In the example below, the query "الطقس اليوم مشمس" ("The weather is sunny today") is compared to two sentences: one describing similar weather conditions and another describing cloudy weather. Even with a reduced number of dimensions, the model effectively captures the semantic similarity, as the cosine similarity scores show.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import pandas as pd

model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)

query = "الطقس اليوم مشمس"                    # "The weather is sunny today"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"   # "The weather today was sunny and wonderful"
sentence_2 = "الطقس اليوم غائم"                # "The weather today is cloudy"

query_embedding = model.encode(query)

scores = []
for dim in [768, 256, 48, 16, 8]:
    # Truncate each embedding to its first `dim` dimensions (the Matryoshka property)
    q = query_embedding[:dim]
    sent1_score = cos_sim(q, model.encode(sentence_1)[:dim])[0][0].item()
    sent2_score = cos_sim(q, model.encode(sentence_2)[:dim])[0][0].item()
    scores.append({"dim": dim, "sentence_1": sent1_score, "sentence_2": sent2_score})

print(pd.DataFrame(scores))

Another use case is matching user queries to relevant paragraphs or documents. This is particularly useful in applications like customer support, content recommendation, or even voice assistants. In this example, the query "ما هي فوائد ممارسة الرياضة؟" ("What are the benefits of exercising?") is matched against two paragraphs, one relevant and one unrelated. As the scores show, the model correctly identifies the relevant paragraph even at the smallest vector sizes.

query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"  # relevant: regular exercise improves health and fitness
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"        # unrelated: early education develops children's mental skills

scores = []
for dim in [768, 256, 48, 16, 8]:
    # Same truncate-and-compare loop, reusing the `model` loaded above
    q = model.encode(query)[:dim]
    sent1_score = cos_sim(q, model.encode(sentence_1)[:dim])[0][0].item()
    sent2_score = cos_sim(q, model.encode(sentence_2)[:dim])[0][0].item()
    scores.append({"dim": dim, "relevant": sent1_score, "unrelated": sent2_score})
print(pd.DataFrame(scores))

Dataset, Training Details, and Benchmarks

For those interested in the technical aspects behind the SILMA Arabic Matryoshka Embedding Model v0.1, we’ve also published detailed information about the dataset, training process, and benchmarks. You can find all of this on the model’s official page on Hugging Face here.

Whether you’re a developer looking to integrate the model into your application or a researcher interested in its technical foundation, this page provides a wealth of information to deepen your understanding of the SILMA Arabic Matryoshka Embedding Model v0.1.

Model Availability and How to Use It

The SILMA Arabic Matryoshka Embedding Model v0.1 is now available on Hugging Face, making it easy for developers to access and integrate into their projects. You can find the model here, where you’ll also find the full documentation and usage examples.

To start using the model, you can load it directly from Hugging Face’s Hub. Here’s a quick example in Python to help you get started:


from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)

dim = 50  # any size from 2 up to 768

query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"  # "I'd like to book a flight from Dubai to Cairo next Tuesday"
sentence_1 = "حجز رحلة"    # "Book a flight"
sentence_2 = "إلغاء حجز"   # "Cancel a booking"

# Truncate every embedding to the same `dim` dimensions before comparing
query_embedding = model.encode(query)[:dim]

sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].item()
sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].item()

print(sent1_score, sent2_score)  # the booking intent should score higher

Conclusion

The release of the SILMA Arabic Matryoshka Embedding Model v0.1 marks an important step forward in advancing Arabic language AI. By addressing the unique challenges of Arabic with cutting-edge embedding techniques, we’ve created a powerful tool that is both efficient and highly accurate. Whether you're working on semantic search, text similarity, intent recognition, or chatbots, this model offers the flexibility and performance needed to handle a wide range of real-world applications.

We invite developers, researchers, and businesses to explore the model on Hugging Face, integrate it into their projects, and share their feedback. As we continue to improve and expand on this work, your input will be invaluable in shaping future versions of SILMA’s models.

Thank you for being part of our journey to build smarter, more accessible AI solutions for the Arabic-speaking world. We look forward to seeing how you’ll use the SILMA Arabic Matryoshka Embedding Model to push the boundaries of what’s possible in Arabic language technology.