Top Multimodal Vision Models

Multimodal vision models let you work with images together with information in another modality (e.g. text). Some multimodal vision models answer questions about images; others score how similar an image is to a piece of text, which is useful for classification.
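As a rough illustration of how image-to-text similarity can drive classification, the sketch below scores one image embedding against several text prompt embeddings and picks the closest prompt. It is a minimal sketch, not any particular model's API: the embeddings are made-up toy vectors, the function names are hypothetical, and the temperature of 100 is an assumed scaling factor. In practice the vectors would come from an embedding model such as CLIP.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings, temperature=100.0):
    """Return a probability per text prompt for a single image embedding.

    `temperature` is an assumed scaling factor applied to the similarities
    before the softmax; it is not taken from any specific model.
    """
    labels = list(text_embeddings)
    logits = [
        temperature * cosine_similarity(image_embedding, text_embeddings[label])
        for label in labels
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy 3-dimensional embeddings (hypothetical values for illustration only).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.3],
    "a photo of a car": [0.2, 0.1, 0.9],
}

scores = zero_shot_classify(image_embedding, text_embeddings)
best_label = max(scores, key=scores.get)
print(best_label)
```

Because the class names are ordinary text prompts, changing the set of classes only means changing the prompts; no retraining is involved, which is what "zero-shot" refers to here.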

Deploy select models (e.g. YOLOv8, CLIP) using the Roboflow Hosted API, or on your own hardware using Roboflow Inference.

GroundingDINO

Grounding DINO is a zero-shot object detection model made by combining a Transformer-based DINO detector and grounded pre-training.

Object Detection

Deploy with Roboflow

OpenAI CLIP

CLIP (Contrastive Language-Image Pre-Training) is a multimodal zero-shot image classifier that achieves strong results across a wide range of domains with no fine-tuning. It applies recent advances in large-scale transformers, like those behind GPT-3, to the vision arena.

Deploy with Roboflow

LLaVA-1.5

LLaVA is an open source multimodal language model that you can use for visual question answering. It has limited support for object detection.

Object Detection

Deploy with Roboflow

CogVLM

CogVLM shows strong performance in Visual Question Answering (VQA) and other vision tasks.

Vision-Language

Deploy with Roboflow

QwenVL

Qwen-VL is a Large Multimodal Model (LMM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as inputs, and can output text and bounding boxes. Qwen-VL supports English, Chinese, and multilingual conversation.

Vision-Language

Deploy with Roboflow

MetaCLIP

MetaCLIP is a zero-shot classification and embedding model developed by Meta AI.

Deploy with Roboflow

BakLLaVA

BakLLaVA is an LMM developed by LAION, Ontocord, and Skunkworks AI. BakLLaVA uses a Mistral 7B base augmented with the LLaVA 1.5 architecture.

Vision-Language

Deploy with Roboflow

Kosmos-2

Kosmos-2 is a multimodal language model capable of object detection and grounding text in images.

Object Detection

Deploy with Roboflow

BLIPv2

BLIPv2 is a multimodal model developed by Salesforce Research.

Deploy with Roboflow

GPT-4 with Vision

GPT-4 with Vision is a multimodal language model developed by OpenAI.

Object Detection

Deploy with Roboflow

SigLIP

SigLIP is an image embedding model defined in the "Sigmoid Loss for Language Image Pre-Training" paper.

Deploy with Roboflow

Anthropic Claude 3

Vision-Language

Deploy with Roboflow

Google Gemini

Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind, focused specifically on multimodality.

Vision-Language

Deploy with Roboflow

Visual Question Answering

Image Similarity

Image Captioning

Zero-shot Detection

Real-Time Vision

Image Embedding

LLMs with Vision Capabilities

Multimodal Vision

Foundation Vision