The Vision Transformer (ViT) is a transformer encoder model pre-trained on ImageNet-21k and fine-tuned on ImageNet for image classification, where it achieved state-of-the-art results. The model represents each image as a sequence of fixed-size patches and prepends a [CLS] token whose output is used for classification. The authors recommend using a fine-tuned version of the model for a specific task.
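The patch-sequence idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `image_to_patch_sequence`, the 224x224 input size, and the 16x16 patch size are assumptions chosen for illustration, and the sketch leaves out the learned linear projection and position embeddings that a real ViT applies to each patch. The classification token is represented here by a placeholder vector of zeros, whereas in the actual model it is a learned embedding.

```python
import numpy as np


def image_to_patch_sequence(image, patch_size, cls_token):
    """Split an (H, W, C) image into flattened patches and prepend a CLS vector.

    Hypothetical helper for illustration only; a real ViT would also project
    each patch to the hidden size and add position embeddings.
    """
    h, w, c = image.shape
    if h % patch_size != 0 or w % patch_size != 0:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size

    # (grid_h, patch, grid_w, patch, C) -> (grid_h, grid_w, patch, patch, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)

    # Flatten each patch into one vector, in row-major patch order.
    patches = patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

    # Prepend the classification token as the first element of the sequence.
    return np.concatenate([cls_token[None, :], patches], axis=0)


# Assumed sizes: a 224x224 RGB image and 16x16 patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
cls_token = np.zeros(16 * 16 * 3, dtype=np.float32)  # placeholder, learned in practice
seq = image_to_patch_sequence(image, 16, cls_token)
print(seq.shape)  # (197, 768): 14 * 14 patches plus one CLS token
```

Under these assumed sizes the encoder would see a sequence of 197 vectors, and the output at position 0 is the one a classification head would read.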
5dca96d358b3fcb9d53b3d3881eb1ae20b6752d1
2023-03-03T07:20:58+00:00
2ddc9d4e473d7ba52128f0df4723e478fa14fb80
2023-04-29T01:03:41+00:00