Overview

Paper abstract:

General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable for retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application.‍

‍To address the scarcity of pre-training data, we leverage data scaling which converts heterogeneous annotations into a unified image-caption data format based on Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a 12 × larger pretraining dataset than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, k-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and 8.92% on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. Project website: this https URL

Source.

You can use RemoteCLIP to calculate image embeddings. These embeddings can be used for:

Zero-shot image classification;
Calculating image similarity;
Deduplicating images in a dataset, and more.

Performance

Use This Model

First, install Autodistill and Autodistill RemoteCLIP:


pip install autodistill autodistill-remoteclip

Then, run:


from autodistill_remote_clip import RemoteCLIP
from autodistill.detection import CaptionOntology

# define an ontology to map class names to our RemoteCLIP prompt
# the ontology dictionary has the format {caption: class}
# where caption is the prompt sent to the base model, and class is the label that will
# be saved for that caption in the generated annotations
# then, load the model
base_model = RemoteCLIP(
    ontology=CaptionOntology(
        {
            "airport runway": "runway",
            "countryside": "countryside",
        }
    )
)

predictions = base_model.predict("runway.jpg")

print(predictions)

‍

Label Data Automatically with RemoteCLIP

You can automatically label a dataset using RemoteCLIP with help from Autodistill, an open source package for training computer vision models. You can label a folder of images automatically with only a few lines of code. Below, see our tutorials that demonstrate how to use RemoteCLIP to train a computer vision model.

No items found.