The Vision Transformer (ViT) model, pre-trained on ImageNet-21k and fine-tuned on ImageNet, achieves state-of-the-art results on image classification tasks. The model uses a transformer encoder architecture and presents images as a sequence of fixed-size patches, adding a [CLS] token for classification tasks. The pre-trained model can be used for downstream tasks such as extracting features and training standard classifiers.
The Vision Transformer (ViT) model, pre-trained on ImageNet-21k and fine-tuned on ImageNet, achieves state-of-the-art results on image classification tasks. The model uses a transformer encoder architecture and presents images as a sequence of fixed-size patches, adding a [CLS] token for classification tasks. The pre-trained model can be used for downstream tasks such as extracting features and training standard classifiers.
webhook
fileThe webhook to call when inference is done, by default you will get the output in the response of your inference request