
google/vit-base-patch16-384

The Vision Transformer (ViT) is a transformer encoder model pre-trained on ImageNet-21k and fine-tuned on ImageNet at 384x384 resolution, achieving state-of-the-art results on image classification benchmarks. Images are presented to the model as a sequence of fixed-size 16x16 patches, with a special [CLS] token prepended whose final hidden state is used for classification. The pre-trained model can also serve as a feature extractor: a standard classifier can be trained on top of its [CLS] representation for downstream tasks.
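As a minimal sketch of classification with this checkpoint via the Hugging Face transformers library (assuming transformers, torch, Pillow, and requests are installed; the COCO image URL is just an illustrative input):

```python
import requests
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# Illustrative input image (a COCO validation image).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-384")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-384")

# The processor resizes/normalizes to 384x384; the model splits the image
# into 16x16 patches internally.
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits

# The checkpoint is fine-tuned on the 1,000 ImageNet classes.
predicted_class = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class])
```

For feature extraction, one hedged approach is to load only the backbone with ViTModel and pool the [CLS] token; transformers will warn that the classification-head weights are unused, which is expected here (reuses `inputs` from the snippet above):

```python
import torch
from transformers import ViTModel

# Backbone only, no classification head.
backbone = ViTModel.from_pretrained("google/vit-base-patch16-384")

with torch.no_grad():
    hidden_states = backbone(**inputs).last_hidden_state  # (1, 577, 768)

# The [CLS] token embedding (position 0) is a common pooled image feature
# on which a standard classifier can be trained.
cls_embedding = hidden_states[:, 0]  # shape (1, 768)
```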

Visibility: Public
Pricing: $0.0005 / sec
Owner: demoapi
Version: be89a4abf1f427fe502d37f261b8b6d6da7894bc
Last updated: 2023-04-29T01:02:50+00:00