
microsoft/beit-base-patch16-224-pt22k-ft22k

BEiT is a Vision Transformer (ViT) pre-trained in a self-supervised fashion (BERT-style masked image modeling) on ImageNet-21k, a dataset of 14 million images spanning 21,841 classes, and then fine-tuned for classification on the same dataset. It achieves state-of-the-art results on several image classification benchmarks. Unlike the original ViT, BEiT uses relative position embeddings and classifies images by mean-pooling the final hidden states of the patch embeddings rather than reading out a single [CLS] token.
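The classification head described above can be sketched in a few lines: average the final hidden states over all patches, then apply a linear classifier. This is an illustrative toy (plain Python lists, made-up shapes), not the actual BEiT implementation:

```python
# Illustrative sketch of mean-pooling classification: average the per-patch
# hidden states, then apply a linear layer to get class logits.
# Plain lists keep it self-contained; the real model uses torch tensors.

def mean_pool(hidden_states):
    """Average over the patch dimension.
    hidden_states: list of num_patches vectors, each of length hidden_dim."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]

def linear(x, weight, bias):
    """y = W @ x + b, with weight given as a list of rows (one per class)."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weight, bias)]

# Toy example: 3 patches, hidden size 2, two output classes.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
logits = linear(pooled, weight=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

The highest logit picks the predicted class, exactly as in the example output further down this page.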

Public
$0.0005/sec

Input

An image file

Output

Maltese dog, Maltese terrier, Maltese (0.92)

Lhasa, Lhasa apso (0.03)
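Predictions in this format can also be reproduced locally with the Hugging Face `transformers` library. The following is a hedged sketch (it assumes `transformers`, `torch`, and `Pillow` are installed, and `image_path` is a hypothetical local file); it is not the Deep Infra hosted API itself:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, labels, k=2):
    """Return the k highest-probability (label, probability) pairs."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def classify(image_path, k=2):
    """Run the BEiT checkpoint on one image and return the top-k labels.
    Requires the transformers, torch, and Pillow packages (heavy download)."""
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    name = "microsoft/beit-base-patch16-224-pt22k-ft22k"
    processor = AutoImageProcessor.from_pretrained(name)
    model = AutoModelForImageClassification.from_pretrained(name)

    inputs = processor(images=Image.open(image_path), return_tensors="pt")
    logits = model(**inputs).logits[0].tolist()
    probs = softmax(logits)
    labels = [model.config.id2label[i] for i in range(len(probs))]
    return top_k(probs, labels, k)
```

For a suitable photo, `classify("dog.jpg")` would return pairs in the same shape as the output above, e.g. a Maltese-dog label with its probability.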

© 2023 Deep Infra. All rights reserved.
