A sentence similarity model that can be used for various NLP tasks such as text classification, sentiment analysis, named entity recognition, question answering, and more. It utilizes the CoSENT architecture, which consists of a transformer encoder and a pooling module, to encode input texts into vectors that capture their semantic meaning. The model was trained on the nli_zh dataset and achieved high performance on various benchmark datasets.
A sentence similarity model that can be used for various NLP tasks such as text classification, sentiment analysis, named entity recognition, question answering, and more. It utilizes the CoSENT architecture, which consists of a transformer encoder and a pooling module, to encode input texts into vectors that capture their semantic meaning. The model was trained on the nli_zh dataset and achieved high performance on various benchmark datasets.
whether to normalize the computed embeddings 2
You need to login to use this model
This is a CoSENT(Cosine Sentence) model: shibing624/text2vec-base-chinese.
It maps sentences to a 768 dimensional dense vector space and can be used for tasks like sentence embeddings, text matching or semantic search.
For an automated evaluation of this model, see the Evaluation Benchmark: text2vec
Arch | BaseModel | Model | ATEC | BQ | LCQMC | PAWSX | STS-B | SOHU-dd | SOHU-dc | Avg | QPS |
---|---|---|---|---|---|---|---|---|---|---|---|
Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 55.04 | 20.70 | 35.03 | 23769 |
SBERT | xlm-roberta-base | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 63.01 | 52.28 | 46.46 | 3138 |
Instructor | hfl/chinese-roberta-wwm-ext | moka-ai/m3e-base | 41.27 | 63.81 | 74.87 | 12.20 | 76.96 | 75.83 | 60.55 | 57.93 | 2980 |
CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 70.27 | 50.42 | 51.61 | 3008 |
CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 73.01 | 59.04 | 53.12 | 2092 |
CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-sentence | 43.37 | 61.43 | 73.48 | 38.90 | 78.25 | 70.60 | 53.08 | 59.87 | 3089 |
CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-paraphrase | 44.89 | 63.58 | 74.24 | 40.90 | 78.93 | 76.70 | 63.30 | 63.08 | 3066 |
CoSENT | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | shibing624/text2vec-base-multilingual | 32.39 | 50.33 | 65.64 | 32.56 | 74.45 | 68.88 | 51.17 | 53.67 | 4004 |
说明:
shibing624/text2vec-base-chinese
模型,是用CoSENT方法训练,基于hfl/chinese-macbert-base
在中文STS-B数据训练得到,并在中文STS-B测试集评估达到较好效果,运行examples/training_sup_text_matching_model.py代码可训练模型,模型文件已经上传HF model hub,中文通用语义匹配任务推荐使用shibing624/text2vec-base-chinese-sentence
模型,是用CoSENT方法训练,基于nghuyong/ernie-3.0-base-zh
用人工挑选后的中文STS数据集shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset训练得到,并在中文各NLI测试集评估达到较好效果,运行examples/training_sup_text_matching_model_jsonl_data.py代码可训练模型,模型文件已经上传HF model hub,中文s2s(句子vs句子)语义匹配任务推荐使用shibing624/text2vec-base-chinese-paraphrase
模型,是用CoSENT方法训练,基于nghuyong/ernie-3.0-base-zh
用人工挑选后的中文STS数据集shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset,数据集相对于shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset加入了s2p(sentence to paraphrase)数据,强化了其长文本的表征能力,并在中文各NLI测试集评估达到SOTA,运行examples/training_sup_text_matching_model_jsonl_data.py代码可训练模型,模型文件已经上传HF model hub,中文s2p(句子vs段落)语义匹配任务推荐使用sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
模型是用SBERT训练,是paraphrase-MiniLM-L12-v2
模型的多语言版本,支持中文、英文等w2v-light-tencent-chinese
是腾讯词向量的Word2Vec模型,CPU加载使用,适用于中文字面匹配任务和缺少数据的冷启动情况CoSENT(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
)
Our model is intented to be used as a sentence and short paragraph encoder. Given an input text, it ouptuts a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.
By default, input text longer than 256 word pieces is truncated.
We use the pretrained hfl/chinese-macbert-base
model.
Please refer to the model card for more detailed information about the pre-training procedure.
We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity from each possible sentence pairs from the batch. We then apply the rank loss by comparing with true pairs and false pairs.
This model was trained by text2vec.
If you find this model helpful, feel free to cite:
@software{text2vec,
author = {Xu Ming},
title = {text2vec: A Tool for Text to Vector},
year = {2022},
url = {https://github.com/shibing624/text2vec},
}