sentence-transformers/
$0.005 / 1M tokens
The CLIP model maps text and images to a shared vector space, enabling applications such as image search, zero-shot image classification, and image clustering. The model is straightforward to use after installation, and its performance is demonstrated by zero-shot accuracy on the ImageNet validation set. Multilingual versions for 50+ languages are also available.
Service Tier
The service tier used for processing the request. When set to 'priority', the request will be processed with higher priority.
Normalize
Whether to normalize the computed embeddings.
Dimensions
The number of dimensions in the embedding. If not provided, the model's default is used. If a value larger than the model's default is provided, the embedding is padded with zeros. (Default: empty, 32 ≤ dimensions ≤ 8192)
Custom Instruction
A custom instruction prepended to each input. If empty, no instruction is used. (Default: empty)
Example output (one embedding vector per input):
[[0, 0.5, 1], [1, 0.5, 0]]
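As a rough illustration of how these parameters might be passed to the hosted endpoint, here is a minimal Python sketch. It assumes Deep Infra's OpenAI-compatible embeddings route and request field names (input, dimensions, normalize, service_tier) inferred from the parameter list above; check the official API reference before relying on them. The model id is also an assumption, standing in for the full id shown at the top of this page.

```python
import os
import requests

# Assumed OpenAI-compatible embeddings endpoint; verify against the Deep Infra API docs.
URL = "https://api.deepinfra.com/v1/openai/embeddings"
# Assumed model id (placeholder for the full id shown on this page).
MODEL = "sentence-transformers/clip-ViT-B-32"

payload = {
    "model": MODEL,
    "input": ["Two dogs in the snow", "A cat on a table"],
    # The keys below mirror the parameters documented above; the exact
    # request field names are assumptions, not confirmed by this page.
    "dimensions": 512,          # within the documented 32-8192 range
    "normalize": True,          # return unit-length embeddings
    "service_tier": "priority", # documented value for higher-priority processing
}

resp = requests.post(
    URL,
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['DEEPINFRA_API_KEY']}"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # expect one embedding vector per input string
```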
This is the Image & Text model CLIP, which maps text and images to a shared vector space. For applications of the model, see our documentation: SBERT.net - Image Search
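For example, the shared text and image space can be exercised locally with the sentence_transformers package; this mirrors the standard usage pattern for the CLIP models (the image path and captions below are placeholders):

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")

# Encode an image (placeholder path) and a few candidate captions into the same space.
img_emb = model.encode(Image.open("two_dogs_in_snow.jpg"))
text_emb = model.encode([
    "Two dogs in the snow",
    "A cat on a table",
    "A picture of London at night",
])

# Cosine similarity between the image and each caption; the highest score is
# the best zero-shot match.
print(util.cos_sim(img_emb, text_emb))
```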
The following table shows zero-shot top-1 accuracy on the ImageNet validation set:

| Model | ImageNet Top-1 Accuracy (%) |
|---|---|
| clip-ViT-B-32 | 63.3 |
| clip-ViT-B-16 | 68.1 |
| clip-ViT-L-14 | 75.4 |
For a multilingual version of the CLIP model supporting 50+ languages, have a look at clip-ViT-B-32-multilingual-v1.
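As a rough sketch using the standard sentence-transformers API (the image path and example sentences are placeholders), the multilingual text encoder can be paired with the original CLIP image encoder, since both map into the same vector space:

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Multilingual text encoder: maps text in 50+ languages into the CLIP space.
text_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")
# Original CLIP model, used here for the image side.
img_model = SentenceTransformer("clip-ViT-B-32")

# "photo.jpg" is a placeholder for a local image file.
img_emb = img_model.encode(Image.open("photo.jpg"))
text_emb = text_model.encode([
    "Two dogs playing in the snow",    # English
    "Dos perros jugando en la nieve",  # Spanish
])

# Cosine similarities between the image and each sentence.
print(util.cos_sim(img_emb, text_emb))
```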