intfloat/e5-base-v2 cover image

intfloat/e5-base-v2

Text Embeddings by Weakly-Supervised Contrastive Pre-training. Model has 24 layers and 1024 out dim.

Text Embeddings by Weakly-Supervised Contrastive Pre-training. Model has 24 layers and 1024 out dim.

Public
$0.005 / Mtoken
512
PaperLicense
Web inference not supported yet, please check API tab

E5-base-v2

Text Embeddings by Weakly-Supervised Contrastive Pre-training. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022

This model has 24 layers and the embedding size is 1024.

Training Details

Please refer to our paper at https://arxiv.org/pdf/2212.03533.pdf.

Benchmark Evaluation

Check out unilm/e5 to reproduce evaluation results on the BEIR and MTEB benchmark.

FAQ

1. Do I need to add the prefix "query: " and "passage: " to input texts?

Yes, this is how the model is trained, otherwise you will see a performance degradation.

Here are some rules of thumb:

  • Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA, ad-hoc information retrieval.

  • Use "query: " prefix for symmetric tasks such as semantic similarity, paraphrase retrieval.

  • Use "query: " prefix if you want to use embeddings as features, such as linear probing classification, clustering.

2. Why are my reproduced results slightly different from reported in the model card?

Different versions of transformers and pytorch could cause negligible but non-zero performance differences.

3. Why does the cosine similarity scores distribute around 0.7 to 1.0?

This is a known and expected behavior as we use a low temperature 0.01 for InfoNCE contrastive loss.

For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of the absolute values, so this should not be an issue.

Citation

If you find our paper or models helpful, please consider cite as follows:

@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}

Limitations

This model only works for English texts. Long texts will be truncated to at most 512 tokens.