sentence-transformers/multi-qa-mpnet-base-dot-v1

This sentence-transformers model maps sentences and paragraphs to a 768-dimensional dense vector space and is suited to semantic search. It was trained with a contrastive learning objective on 215 million question-answer pairs from various sources, including WikiAnswers, PAQ, Stack Exchange, MS MARCO, GOOAQ, Amazon QA, Yahoo Answers, Search QA, ELI5, and Natural Questions.

multi-qa-mpnet-base-dot-v1

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, see SBERT.net - Semantic Search.

Technical Details

The following technical details describe how this model must be used:

| Setting | Value |
| --- | --- |
| Dimensions | 768 |
| Produces normalized embeddings | No |
| Pooling method | CLS pooling |
| Suitable score functions | dot-product (e.g. util.dot_score) |
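As a rough illustration of the settings above, the following sketch shows what CLS pooling and dot-product scoring amount to. It uses hypothetical 3-dimensional toy vectors rather than real model output (the model itself produces 768 dimensions), and the helper names `cls_pool` and `dot_score` are defined here for illustration only; they are not the library's implementation.

```python
# Toy illustration of CLS pooling and dot-product scoring.
# All vectors below are made up; no model is loaded.

def cls_pool(token_embeddings):
    """CLS pooling: use the first token's vector as the sentence embedding."""
    return token_embeddings[0]

def dot_score(a, b):
    """Plain dot product; the vectors are NOT normalized first."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical per-token embeddings; the first row plays the role of the CLS token.
query_tokens = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
doc_tokens = {
    "doc_a": [[2.0, 0.0, 1.0], [0.1, 0.9, 0.3]],
    "doc_b": [[0.0, 3.0, 0.0], [1.0, 1.0, 1.0]],
}

query_emb = cls_pool(query_tokens)
scores = {
    name: dot_score(query_emb, cls_pool(tokens))
    for name, tokens in doc_tokens.items()
}

# Rank documents from highest to lowest dot-product score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Because the embeddings are not normalized, vector length contributes to the score, so dot-product and cosine similarity can rank documents differently.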

Background

The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective: given a sentence from a pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in our dataset.

We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project: Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project (7 TPUs v3-8), as well as support from members of Google's Flax, JAX, and Cloud teams on efficient deep learning frameworks.

Intended uses

Our model is intended to be used for semantic search: it encodes queries / questions and text paragraphs in a dense vector space, so that relevant passages can be found for a given query.

Note that there is a limit of 512 word pieces: text longer than that will be truncated. Further note that the model was only trained on input text of up to 250 word pieces, so it might not work well for longer text.
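The two length limits can be made concrete with a small sketch. It assumes the text has already been split into word pieces (the real tokenizer is not reproduced here, and any special tokens it adds would also count toward the limit); the helper name `truncate_word_pieces` is hypothetical.

```python
MAX_WORD_PIECES = 512      # hard limit: longer input is truncated
TRAINED_WORD_PIECES = 250  # longest input length used during training

def truncate_word_pieces(word_pieces, limit=MAX_WORD_PIECES):
    """Return (kept pieces, was_truncated, beyond_training_length)."""
    kept = word_pieces[:limit]
    was_truncated = len(word_pieces) > limit
    # Input longer than the training length is still encoded,
    # but quality may degrade.
    beyond_training_length = len(kept) > TRAINED_WORD_PIECES
    return kept, was_truncated, beyond_training_length

# A made-up 600-piece input: everything past piece 512 is dropped.
long_input = ["tok"] * 600
kept, was_truncated, beyond_training_length = truncate_word_pieces(long_input)
print(len(kept), was_truncated, beyond_training_length)
```

In practice, splitting long documents into shorter passages before encoding avoids both silent truncation and inputs longer than those seen in training.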

Training procedure

The full training script is available in this repository: train_script.py.

Pre-training

We use the pretrained mpnet-base model. Please refer to the model card for more detailed information about the pre-training procedure.

Training

We fine-tune our model on the concatenation of multiple datasets, about 215M (question, answer) pairs in total. Each dataset is sampled with a weighted probability; the weights are detailed in the data_config.json file.
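Weighted sampling across datasets can be sketched as follows. The dataset names and weights here are placeholders for illustration only; the actual weights are those in data_config.json, whose format is not reproduced, and the exact sampling logic in train_script.py may differ.

```python
import random

# Hypothetical weights; the real values live in data_config.json.
dataset_weights = {"dataset_a": 5, "dataset_b": 3, "dataset_c": 1}

def sample_dataset_names(weights, num_draws, seed=0):
    """Pick which dataset to draw from, num_draws times, proportionally to its weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_draws)

schedule = sample_dataset_names(dataset_weights, num_draws=1000)
counts = {name: schedule.count(name) for name in dataset_weights}
print(counts)
```

Datasets with larger weights are drawn more often, which keeps the largest sources from being the only thing that determines how often each one is seen.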

The model was trained with MultipleNegativesRankingLoss using CLS-pooling, dot-product as similarity function, and a scale of 1.
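The idea behind this loss can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the sentence-transformers implementation: each question's paired answer is treated as the positive, every other answer in the batch as a negative, and the loss is the cross-entropy over scaled dot-product scores. The function name and the toy 2-dimensional embeddings are made up for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multiple_negatives_ranking_loss(question_embs, answer_embs, scale=1.0):
    """Mean cross-entropy with in-batch negatives.

    answer_embs[i] is the positive for question_embs[i]; all other
    answers in the batch act as negatives.
    """
    losses = []
    for i, q in enumerate(question_embs):
        logits = [scale * dot(q, a) for a in answer_embs]
        # Numerically stable log-sum-exp over all candidate answers.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch: answers aligned with their questions vs. swapped answers.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_answers = [[3.0, 0.0], [0.0, 3.0]]
swapped_answers = [[0.0, 3.0], [3.0, 0.0]]

loss_aligned = multiple_negatives_ranking_loss(questions, aligned_answers)
loss_swapped = multiple_negatives_ranking_loss(questions, swapped_answers)
print(loss_aligned, loss_swapped)
```

The loss is small when each question scores highest against its own answer and large when another answer in the batch scores higher, which is what pushes paired texts together in the embedding space.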

| Dataset | Description | Number of training tuples |
| --- | --- | --- |
| WikiAnswers | Duplicate question pairs from WikiAnswers | 77,427,422 |
| PAQ | Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
| Stack Exchange | (Title, Body) pairs from all StackExchanges | 25,316,456 |
| Stack Exchange | (Title, Answer) pairs from all StackExchanges | 21,396,559 |
| MS MARCO | Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
| GOOAQ: Open Question Answering with Diverse Answer Types | (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
| Amazon-QA | (Question, Answer) pairs from Amazon product pages | 2,448,839 |
| Yahoo Answers | (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
| Yahoo Answers | (Question, Answer) pairs from Yahoo Answers | 681,164 |
| Yahoo Answers | (Title, Question) pairs from Yahoo Answers | 659,896 |
| SearchQA | (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
| ELI5 | (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
| Stack Exchange | Duplicate questions pairs (titles) | 304,525 |
| Quora Question Triplets | (Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663 |
| Natural Questions (NQ) | (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
| SQuAD2.0 | (Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
| TriviaQA | (Question, Evidence) pairs | 73,346 |
| Total | | 214,988,242 |