Jean-Baptiste/roberta-large-ner-english

We present a fine-tuned RoBERTa model for English named entity recognition that achieves high performance on both formal and informal text. Our approach trains on a simplified version of the conll2003 dataset in which the B-/I- prefixes are removed from the original tags. The resulting model outperforms other models, especially on entities that do not begin with an uppercase letter, and can be used in applications such as email signature detection.

roberta-large-ner-english: a model fine-tuned from roberta-large for the NER task

Introduction

roberta-large-ner-english is an English NER model that was fine-tuned from roberta-large on the conll2003 dataset. The model was validated on email and chat data and outperformed other models on this type of data specifically. In particular, it seems to work better on entities that don't start with an uppercase letter.
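
A minimal usage sketch with the Hugging Face transformers pipeline (the lowercase example sentence is illustrative):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/roberta-large-ner-english")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/roberta-large-ner-english")

# aggregation_strategy="simple" merges sub-word pieces back into whole entity spans
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Lowercase entities on purpose: this model is meant to handle them well
print(nlp("apple was founded in 1976 by steve jobs, steve wozniak and ronald wayne."))
```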

Training data

Training data was classified as follows:

| Abbreviation | Description                |
|--------------|----------------------------|
| O            | Outside of a named entity  |
| MISC         | Miscellaneous entity       |
| PER          | Person's name              |
| ORG          | Organization               |
| LOC          | Location                   |

To simplify the task, the B- and I- prefixes from the original conll2003 tags were removed. I used the train and test splits of the original conll2003 for training and the "validation" split for validation (a sketch of this preprocessing follows the table below). This resulted in a dataset of size:

| Train | Validation |
|-------|------------|
| 17494 | 3250       |
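
The exact preprocessing code is not given; a sketch along these lines, using the Hugging Face datasets library, reproduces the split sizes above:

```python
from datasets import load_dataset, concatenate_datasets

# Assumes the conll2003 dataset as hosted on the Hugging Face Hub
raw = load_dataset("conll2003")

# Train on the original train + test splits; keep "validation" for validation
train = concatenate_datasets([raw["train"], raw["test"]])  # 14041 + 3453 = 17494 sentences
valid = raw["validation"]                                  # 3250 sentences

# Original tags carry B-/I- prefixes (B-PER, I-PER, ...); collapse them
orig_labels = raw["train"].features["ner_tags"].feature.names
simplified = ["O", "PER", "ORG", "LOC", "MISC"]

def strip_prefixes(example):
    example["ner_tags"] = [
        simplified.index(orig_labels[tag].split("-")[-1])
        for tag in example["ner_tags"]
    ]
    return example

train = train.map(strip_prefixes)
valid = valid.map(strip_prefixes)
```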

Model performances

Model performance was computed on the conll2003 validation dataset (scores are computed on token-level predictions):

| entity  | precision | recall | f1     |
|---------|-----------|--------|--------|
| PER     | 0.9914    | 0.9927 | 0.9920 |
| ORG     | 0.9627    | 0.9661 | 0.9644 |
| LOC     | 0.9795    | 0.9862 | 0.9828 |
| MISC    | 0.9292    | 0.9262 | 0.9277 |
| Overall | 0.9740    | 0.9766 | 0.9753 |
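
The aggregation code is not specified; a minimal sketch of per-entity and overall token-level scores, assuming flat label lists and scikit-learn:

```python
from sklearn.metrics import precision_recall_fscore_support

# Flat lists of token-level gold and predicted labels (toy example)
y_true = ["O", "PER", "PER", "O", "ORG", "O", "LOC", "MISC"]
y_pred = ["O", "PER", "O",   "O", "ORG", "O", "LOC", "MISC"]

labels = ["PER", "ORG", "LOC", "MISC"]

# Per-entity scores, as in the table above
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None, zero_division=0
)
for label, pi, ri, fi in zip(labels, p, r, f1):
    print(f"{label}: precision={pi:.4f} recall={ri:.4f} f1={fi:.4f}")

# Micro-average over the entity labels only ("O" excluded) for the overall row
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="micro", zero_division=0
)
print(f"Overall: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```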

On a private dataset (email, chat, informal discussion), computed on word-level predictions:

| entity | precision | recall | f1     |
|--------|-----------|--------|--------|
| PER    | 0.8823    | 0.9116 | 0.8967 |
| ORG    | 0.7694    | 0.7292 | 0.7487 |
| LOC    | 0.8619    | 0.7768 | 0.8171 |

By comparison, on the same private dataset, spaCy (en_core_web_trf-3.2.0) gave:

| entity | precision | recall | f1     |
|--------|-----------|--------|--------|
| PER    | 0.9146    | 0.8287 | 0.8695 |
| ORG    | 0.7655    | 0.6437 | 0.6993 |
| LOC    | 0.8727    | 0.6180 | 0.7236 |
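
For reference, the spaCy baseline predictions can be obtained along these lines (a sketch; the example sentence is illustrative, and spaCy's label set, e.g. PERSON, ORG, GPE/LOC, has to be mapped onto PER/ORG/LOC for a like-for-like comparison):

```python
import spacy

# Requires the transformer pipeline: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")

doc = nlp("Thanks John, I'll loop in the legal team at Acme in London.")
print([(ent.text, ent.label_) for ent in doc.ents])
```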

For those who may be interested, here is a short article on how I used the output of this model to train an LSTM model for signature detection in emails: https://medium.com/@jean-baptiste.polle/lstm-model-for-email-signature-detection-8e990384fefa