google/vit-base-patch16-224

The Vision Transformer (ViT) is a transformer encoder model pre-trained on ImageNet-21k and fine-tuned on ImageNet, achieving state-of-the-art results in image classification. The model presents an image to the transformer as a sequence of fixed-size 16x16 patches and prepends a [CLS] token whose final representation is used for classification. The authors recommend using versions of the model fine-tuned for your specific task.

Public
$0.0005 / sec

HTTP/cURL API

You can use cURL or any other HTTP client to run inference:

curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -F image=@my_image.jpg  \
    'https://api.deepinfra.com/v1/inference/google/vit-base-patch16-224'
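The same request can be made from Python. A minimal sketch using only the standard library (the helper name `classify_image` is illustrative, not part of the API; it assumes `DEEPINFRA_TOKEN` is set in the environment and the image file exists when you call it):

```python
import json
import os
import uuid
import urllib.request

API_URL = "https://api.deepinfra.com/v1/inference/google/vit-base-patch16-224"

def classify_image(path, token=None):
    """POST an image file as multipart/form-data and return the parsed JSON response."""
    token = token or os.environ["DEEPINFRA_TOKEN"]
    boundary = uuid.uuid4().hex  # multipart boundary, mirrors what curl -F does
    with open(path, "rb") as f:
        file_data = f.read()
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="image"; filename="{os.path.basename(path)}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + file_data + f"\r\n--{boundary}--\r\n".encode()
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Called as `classify_image("my_image.jpg")`, this returns the same JSON document shown below.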

which will give you back something similar to:

{
  "results": [
    {
      "label": "Maltese dog, Maltese terrier, Maltese",
      "score": 0.9235488176345825
    },
    {
      "label": "Lhasa, Lhasa apso",
      "score": 0.0298430435359478
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
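Each entry in `results` pairs an ImageNet class label with a confidence score. Picking the top prediction from the parsed response is straightforward; a small sketch using the sample payload above (`top_label` and the `threshold` parameter are illustrative, not part of the API):

```python
# Sample response payload, as returned by the endpoint above.
response = {
    "results": [
        {"label": "Maltese dog, Maltese terrier, Maltese", "score": 0.9235488176345825},
        {"label": "Lhasa, Lhasa apso", "score": 0.0298430435359478},
    ],
}

def top_label(response, threshold=0.5):
    """Return the highest-scoring label, or None if no score clears the threshold."""
    best = max(response["results"], key=lambda r: r["score"])
    return best["label"] if best["score"] >= threshold else None

print(top_label(response))  # Maltese dog, Maltese terrier, Maltese
```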

Input fields

image (string)

The image to classify.

webhook (file)

The webhook to call when inference is done. By default, you get the output in the response of your inference request.
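If you pass a webhook, the result is delivered to your endpoint instead of the inference response. A minimal receiver sketch using the standard library (the handler and the assumption that the delivered payload matches the inline response shown above are illustrative, not documented behavior):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts a JSON inference result POSTed to the webhook URL."""
    received = []  # collected payloads, for demonstration only

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        WebhookHandler.received.append(payload)
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence default per-request logging

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), WebhookHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```

In production you would expose a publicly reachable HTTPS URL rather than a localhost server.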
