
google/vit-base-patch16-384

The Vision Transformer (ViT) model, pre-trained on ImageNet-21k and fine-tuned on ImageNet at a resolution of 384×384, achieves strong results on image classification benchmarks. The model uses a transformer encoder architecture: an image is split into a sequence of fixed-size patches, and a [CLS] token is prepended for classification. The pre-trained model can also be used for downstream tasks, such as extracting image features and training standard classifiers on top of them.

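As a minimal sketch of local inference with this checkpoint (assuming the Hugging Face `transformers`, `torch`, and `Pillow` packages are installed; the file name `dog.jpg` is a hypothetical example input):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

model_id = "google/vit-base-patch16-384"
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

# Load an image and preprocess it into fixed-size patches expected by ViT.
image = Image.open("dog.jpg")  # hypothetical input file
inputs = processor(images=image, return_tensors="pt")

# Forward pass; the classification head sits on the [CLS] token.
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and report the top prediction.
probs = logits.softmax(dim=-1)[0]
top = probs.argmax().item()
print(model.config.id2label[top], round(probs[top].item(), 2))
```

The softmax scores are what the example output below reports next to each label.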

Public
$0.0005/sec

Input

An image file to classify.

Output

Maltese dog, Maltese terrier, Maltese (0.92)

Lhasa, Lhasa apso (0.03)

© 2023 Deep Infra. All rights reserved.
