
google/vit-base-patch16-224

The Vision Transformer (ViT) is a transformer encoder model pre-trained on ImageNet-21k and fine-tuned on ImageNet, achieving state-of-the-art results in image classification. The model processes an image as a sequence of fixed-size 16x16 patches and prepends a [CLS] token whose output representation is used for classification. The authors recommend using task-specific fine-tuned versions of the model where available.

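For local experimentation, the same patch-and-classify flow can be reproduced with the Hugging Face transformers library. The sketch below is illustrative rather than authoritative: the class names follow the standard ViT API, and the local file name dog.jpg is a placeholder.

```python
# Minimal local-inference sketch (assumes `pip install transformers torch pillow`).
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

image = Image.open("dog.jpg")  # placeholder path to any RGB image

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# The processor resizes and normalizes the image; the model splits it into
# 16x16 patches, prepends the [CLS] token, and classifies from its output.
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits  # one logit per ImageNet-1k class

predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])  # e.g. "Maltese dog, Maltese terrier, Maltese"
```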

Public
$0.0005/sec

Input

An image file.


Output

Maltese dog, Maltese terrier, Maltese (0.92)

Lhasa, Lhasa apso (0.03)
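The demo above corresponds to a plain HTTP inference call. The following is a hypothetical sketch only: the endpoint path, the multipart field name image, and the DEEPINFRA_API_KEY environment variable are assumptions not confirmed by this page; consult the Deep Infra API documentation for the exact request format.

```python
# Hypothetical request sketch; the URL pattern, field name, and response shape
# are assumptions -- check the Deep Infra API docs before relying on them.
import os
import requests

resp = requests.post(
    "https://api.deepinfra.com/v1/inference/google/vit-base-patch16-224",
    headers={"Authorization": f"Bearer {os.environ['DEEPINFRA_API_KEY']}"},
    files={"image": open("dog.jpg", "rb")},  # placeholder image path
)
resp.raise_for_status()
print(resp.json())  # expected: labels with confidence scores, as in the example output above
```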

 

