YOLOS is a Vision Transformer (ViT) trained for object detection using the DETR loss. Despite its simplicity, a base-sized YOLOS achieves 42 AP on COCO 2017 validation, comparable to DETR and to more complex frameworks such as Faster R-CNN. The model is trained with a bipartite matching loss that compares each predicted class and bounding box to the ground-truth annotations. Fine-tuned on the COCO 2017 object detection dataset (118k annotated images for training, 5k for validation), this checkpoint achieves 36.1 AP on the validation set.
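As a rough sketch of how inference with this model might look, the snippet below uses the Hugging Face `transformers` object-detection API. The `hustvl/yolos-small` checkpoint name, the example image URL, and the 0.9 score threshold are assumptions for illustration, not taken from this card.

```python
# Minimal inference sketch for a YOLOS checkpoint (assumed: hustvl/yolos-small).
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, YolosForObjectDetection

# Example COCO validation image (assumed URL, any RGB image works).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("hustvl/yolos-small")
model = YolosForObjectDetection.from_pretrained("hustvl/yolos-small")

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# Convert raw logits and boxes into detections above a score threshold.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{model.config.id2label[label.item()]}: {score:.2f} at {box.tolist()}")
```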