BGE-M3 is a multilingual text embedding model developed by BAAI, distinguished by its Multi-Linguality (supporting 100+ languages), Multi-Functionality (unified dense, multi-vector, and sparse retrieval), and Multi-Granularity (handling inputs from short queries to 8192-token documents). It achieves state-of-the-art retrieval performance across diverse benchmarks while maintaining a single model for multiple retrieval modes.
Inputs
Dense is a set of low-dimensional vectors, one fully populated embedding per input text, derived from the encoder's pooled output.
Sparse is a collection of high-dimensional vectors in which each token of the input is assigned a lexical weight, with most values being zero.
Colbert is a system of contextualized vectors in which every token of the input is represented by its own embedding, enabling ColBERT-style late interaction.
normalize — whether to normalize the computed embeddings.
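The three representations above are scored differently at retrieval time. The following is a minimal sketch of the usual scoring rules (dense dot product, sparse lexical-weight overlap, ColBERT-style MaxSim late interaction); the function names are illustrative, not part of the model's official API:

```python
import numpy as np

def dense_score(q, d):
    # Dense: one vector per text; the score is a dot product
    # (equivalent to cosine similarity when vectors are normalized).
    return float(np.dot(q, d))

def sparse_score(q_weights, d_weights):
    # Sparse: per-token lexical weights; the score sums the products of
    # weights for tokens that appear in both query and document.
    return sum(w * d_weights[tok] for tok, w in q_weights.items()
               if tok in d_weights)

def colbert_score(q_vecs, d_vecs):
    # ColBERT late interaction: each query token vector is matched to its
    # most similar document token vector (MaxSim), then averaged.
    sim = np.asarray(q_vecs) @ np.asarray(d_vecs).T  # (n_query, n_doc)
    return float(sim.max(axis=1).mean())

# Toy example with 3-dimensional vectors.
print(dense_score(np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.5, 0.0])))  # 0.25
print(sparse_score({"cat": 0.8, "sat": 0.3}, {"cat": 0.5}))               # 0.4
```

In practice the three scores can be combined (e.g. as a weighted sum) for hybrid retrieval, which is how the all-modes results below are obtained.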
{
  "input_tokens": 42,
  "embeddings": [
    [0.0, 0.5, 1.0],
    [1.0, 0.5, 0.0]
  ],
  "sparse": [
    [0.0, 0.0, 1.0],
    [0.0, 0.6, 0.0]
  ],
  "colbert": [
    [
      [0.5, 0.1, 1.0],
      [0.3, 0.6, 0.5]
    ],
    [
      [0.3, 0.6, 0.5]
    ]
  ],
  "embedding_jsons": [
    "[0.0, 0.5, 1.0]",
    "[1.0, 0.5, 0.0]"
  ]
}
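A response in the shape shown above can be consumed directly with a JSON parser. This sketch uses only the field names that appear in the example (the exact schema of a given deployment may differ):

```python
import json

# Example response in the shape shown above (field names taken from it).
response = json.loads("""
{
  "input_tokens": 42,
  "embeddings": [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]],
  "sparse": [[0.0, 0.0, 1.0], [0.0, 0.6, 0.0]],
  "colbert": [[[0.5, 0.1, 1.0], [0.3, 0.6, 0.5]], [[0.3, 0.6, 0.5]]],
  "embedding_jsons": ["[0.0, 0.5, 1.0]", "[1.0, 0.5, 0.0]"]
}
""")

# One dense vector per input text.
dense = response["embeddings"]

# "embedding_jsons" carries the same dense vectors as JSON-encoded strings,
# convenient for storing directly in a text column.
assert [json.loads(s) for s in response["embedding_jsons"]] == dense

# The ColBERT output is a list of per-token vectors, so its length varies
# with the number of tokens in each input.
print([len(vecs) for vecs in response["colbert"]])  # [2, 1]
```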
BGE-M3’s capabilities were validated on standard information retrieval benchmarks, demonstrating top-tier performance compared to previous models:
• MIRACL (multilingual retrieval, 18 languages): BGE-M3 achieved the highest average ranking score (nDCG@10 = 70.0 using all modes) across languages, outperforming the best prior multilingual embedder (mE5, ~65.4). Notably, even using only its dense embeddings, BGE-M3 surpassed all baselines on average and in most individual languages.
• MKQA (cross-lingual QA retrieval, 26 languages): On this cross-lingual retrieval task (measured by Recall@100), BGE-M3 attained 75.5% recall, substantially above the strongest baseline (~70.9%). It outperformed OpenAI’s latest text embedding model on this benchmark, confirming its effectiveness in cross-language scenarios.
• Long-document retrieval: BGE-M3 demonstrated strong performance on long-document datasets. On the MLDR test set (a 13-language long-document retrieval benchmark introduced by BAAI), its sparse-retrieval mode scored about 10 nDCG@10 points higher than its dense mode, and a hybrid of dense + sparse gave further gains. Its sparse-vector results were competitive with BM25, highlighting that lexical matching signals were well captured. In the NarrativeQA long-form retrieval evaluation, BGE-M3 likewise showed consistent gains over baseline models, with the performance gap widening as input length increased. These results underscore the model’s ability to handle documents of up to 8192 tokens.
Overall, BGE-M3 delivers superior multilingual retrieval quality. For instance, its dense embeddings alone not only outperform prior dense retrievers like mDPR and mContriever, but even rival a much larger 7B-parameter model (the E5-mistral-7b encoder) on English while exceeding it significantly on other languages. Additionally, BGE-M3’s learned sparse representations outperform traditional BM25 in all tested languages under comparable conditions. The combination of all three retrieval modes (dense + sparse + multi-vector) yields the best results, providing a unified state-of-the-art solution across diverse retrieval tasks.
The model was evaluated on multiple retrieval benchmarks using standard IR metrics and comparisons:
• MIRACL: multilingual ad-hoc retrieval in 18 languages (queries and documents in the same language). Evaluated with Pyserini, using nDCG@10 as the primary metric. This benchmark tests BGE-M3’s multilingual dense and hybrid retrieval performance per language.
• MKQA: a cross-lingual open-domain question answering retrieval dataset. Queries are in various languages and the relevant passages are in English. Evaluation used Recall@100 as the metric, measuring the model’s ability to retrieve relevant English answers for non-English queries.
• Long-document retrieval: including MLDR (a new multilingual long-document retrieval set constructed from Wikipedia, Wudao, and mC4) and NarrativeQA (long narrative QA retrieval in English). Performance is measured by nDCG@10, focusing on the handling of lengthy documents. These tests assess the model’s multi-granularity capability on inputs of thousands of tokens.
• Baselines and comparisons: results are compared against traditional lexical retrieval (BM25) and prior embedding models: mDPR and mContriever (multilingual dense retrievers), mE5 and E5-mistral-7b (the latter built on a much larger 7B encoder), and OpenAI’s text-embedding-3-large. All methods were evaluated under the same conditions (e.g. using the same XLM-R tokenizer for BM25 for fairness), so the methodology highlights BGE-M3’s improvements in a like-for-like comparison.
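The two metrics used throughout the evaluation can be stated compactly. The following is a minimal, self-contained illustration of nDCG@10 and Recall@k (not the Pyserini implementation used in the actual benchmarks):

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    # DCG: graded relevance discounted by log2 of the rank position.
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]))
    # Ideal DCG: the best possible ordering of the known relevant docs.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=100):
    # Fraction of the relevant documents found in the top-k results.
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

relevance = {"d1": 3, "d2": 1}  # hypothetical graded judgments
print(ndcg_at_k(["d1", "d2", "d9"], relevance))        # 1.0 (perfect ordering)
print(recall_at_k(["d1", "d9"], {"d1", "d2"}, k=100))  # 0.5
```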
You can find the model's training corpus here.