sentence-transformers/clip-ViT-B-32

The CLIP model maps text and images to a shared vector space, enabling various applications such as image search, zero-shot image classification, and image clustering. The model can be used easily after installation, and its performance is demonstrated through zero-shot ImageNet validation set accuracy scores. Multilingual versions of the model are also available for 50+ languages.
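
You can also run the model locally through the sentence-transformers library. A minimal sketch, assuming sentence-transformers and Pillow are installed (photo.jpg is a placeholder path):

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model via sentence-transformers
model = SentenceTransformer("clip-ViT-B-32")

# Encode an image and a few candidate captions into the shared vector space
img_emb = model.encode(Image.open("photo.jpg"))  # "photo.jpg" is a placeholder path
text_emb = model.encode(["a photo of a dog", "a photo of a cat"])

# Cosine similarity shows which caption matches the image best
print(util.cos_sim(img_emb, text_emb))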

Public · $0.005 / Mtoken · 512

HTTP/cURL API

You can use cURL or any other HTTP client to run inferences:

curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -F 'inputs=["I like chocolate"]'  \
    'https://api.deepinfra.com/v1/inference/sentence-transformers/clip-ViT-B-32'

which will give you back something similar to:

{
  "embeddings": [
    [
      0.0,
      0.5,
      1.0
    ],
    [
      1.0,
      0.5,
      0.0
    ]
  ],
  "input_tokens": 42,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
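
The same request can be made from Python. A minimal sketch using the requests library, mirroring the form-style cURL call above (the DEEPINFRA_TOKEN environment variable is assumed to hold your API token):

import os
import requests

# files= with a (None, value) tuple sends a multipart form field,
# the same shape as curl's -F option in the example above
resp = requests.post(
    "https://api.deepinfra.com/v1/inference/sentence-transformers/clip-ViT-B-32",
    headers={"Authorization": f"bearer {os.environ['DEEPINFRA_TOKEN']}"},
    files={"inputs": (None, '["I like chocolate"]')},
)
resp.raise_for_status()

result = resp.json()
print(result["embeddings"])    # one embedding vector per input sequence
print(result["input_tokens"])  # number of input tokens processed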

Input fields

inputs (array)

sequences to embed

Default value: []


normalize (boolean)

whether to normalize the computed embeddings

Default value: false


image (string)

image to embed


webhook (file)

The webhook to call when inference is done. By default, the output is returned in the response of your inference request.
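
Putting the fields together, here is a sketch that embeds several sequences with normalize enabled and compares the returned vectors with a plain dot product (passing the boolean as the form value "true" is an assumption; how an image should be supplied in the image field is not shown here):

import os
import requests
import numpy as np

# Request normalized embeddings for several captions; with unit-length
# vectors, a dot product gives cosine similarity directly
resp = requests.post(
    "https://api.deepinfra.com/v1/inference/sentence-transformers/clip-ViT-B-32",
    headers={"Authorization": f"bearer {os.environ['DEEPINFRA_TOKEN']}"},
    files={
        "inputs": (None, '["a photo of a dog", "a photo of a cat", "a diagram"]'),
        "normalize": (None, "true"),  # assumption: booleans are sent as form strings
    },
)
resp.raise_for_status()

emb = np.array(resp.json()["embeddings"])
print(emb @ emb.T)  # pairwise cosine similarities of the normalized embeddings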

Input Schema

Output Schema