
camembert-base

We extract contextual embedding features from CamemBERT, a fill-mask language model, for the task of sentiment analysis. We use the tokenizer's tokenize and encode functions to convert a sentence into a numerical representation, then feed it into the CamemBERT model to obtain contextual embeddings. We extract the embeddings from all 12 self-attention layers and the input embedding layer, giving a 13-layer feature representation for each sentence (one 768-dimensional embedding per layer).

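A minimal sketch of this extraction step, using the Hugging Face transformers library (the example sentence and the mean-pooling over tokens are illustrative assumptions, not a prescribed procedure):

import torch
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base", output_hidden_states=True)
model.eval()

# Hypothetical example sentence.
sentence = "J'aime le camembert !"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of 13 tensors: the input embedding layer plus the
# outputs of the 12 self-attention layers, each of shape
# (batch_size, sequence_length, 768).
hidden_states = outputs.hidden_states
assert len(hidden_states) == 13

# One common choice (an assumption here): mean-pool over tokens in each layer,
# giving a (13, 768) feature matrix for the sentence.
features = torch.stack([layer.mean(dim=1).squeeze(0) for layer in hidden_states])
print(features.shape)  # torch.Size([13, 768])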


Input

A text prompt; it should include exactly one <mask> token.

Output

  • where is my father? (0.09)
  • where is my mother? (0.08)

CamemBERT: a Tasty French Language Model

Table of Contents

  • Model Details
  • Uses
  • Risks, Limitations and Biases
  • Training
  • Evaluation
  • Citation Information

Model Details

  • Model Description: CamemBERT is a state-of-the-art language model for French based on the RoBERTa model. It is now available on Hugging Face in 6 different versions, with varying numbers of parameters, amounts of pretraining data, and pretraining data source domains.
  • Developed by: Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
  • Model Type: Fill-Mask
  • Language(s): French
  • License: MIT
  • Parent Model: RoBERTa base (see the RoBERTa base model card for more information).
  • Resources for more information: the CamemBERT paper (see Citation Information below).

Uses

Direct Use

This model can be used for Fill-Mask tasks.
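
A minimal sketch of fill-mask usage with the Hugging Face transformers pipeline (the example sentence is an illustrative assumption):

from transformers import pipeline

# Load camembert-base as a fill-mask pipeline.
camembert_fill_mask = pipeline("fill-mask", model="camembert-base", tokenizer="camembert-base")

# The prompt should contain exactly one <mask> token.
results = camembert_fill_mask("Le camembert est <mask> :)")

# Each prediction carries the completed sequence and its probability score,
# in the same format as the example output above.
for r in results:
    print(r["sequence"], r["score"])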

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

This model was pretrained on a subcorpus of the OSCAR multilingual corpus. Some of the limitations and risks associated with the OSCAR dataset, which are further detailed in the OSCAR dataset card, include the following:

  • The quality of some OSCAR sub-corpora might be lower than expected, specifically for the lowest-resource languages.
  • Because OSCAR is constructed from Common Crawl, personal and sensitive information might be present.

Training

Training Data

OSCAR or Open Super-large Crawled Aggregated coRpus is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.

Training Procedure

Model                                    #params   Arch.   Training data
camembert-base                           110M      Base    OSCAR (138 GB of text)
camembert/camembert-large                335M      Large   CCNet (135 GB of text)
camembert/camembert-base-ccnet           110M      Base    CCNet (135 GB of text)
camembert/camembert-base-wikipedia-4gb   110M      Base    Wikipedia (4 GB of text)
camembert/camembert-base-oscar-4gb       110M      Base    Subsample of OSCAR (4 GB of text)
camembert/camembert-base-ccnet-4gb       110M      Base    Subsample of CCNet (4 GB of text)
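
Each variant in the table can be loaded under the listed name; a minimal loading sketch (the choice of camembert/camembert-base-ccnet here is arbitrary):

from transformers import CamembertModel, CamembertTokenizer

# Substitute any model name from the table above.
name = "camembert/camembert-base-ccnet"
tokenizer = CamembertTokenizer.from_pretrained(name)
model = CamembertModel.from_pretrained(name)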

Evaluation

The model developers evaluated CamemBERT using four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI).

Citation Information

@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}