ArborCLIP

The seven taxonomic classes in the Arboretum dataset, with example images of the five most frequent species in each category, including their counts, common names, and scientific names. For clarity, the word "class" as used here is written "taxonomic class" in the rest of the paper to distinguish it from the common ML meaning of the term.

Abstract

We introduce Arboretum, the largest publicly accessible dataset designed to advance AI for biodiversity applications. This dataset, curated from the iNaturalist community science platform and vetted by domain experts to ensure accurate data, includes 134.6 million images, surpassing existing datasets in scale by an order of magnitude. The dataset encompasses image-language paired data for a diverse set of species from birds (Aves), spiders/ticks/mites (Arachnida), insects (Insecta), plants (Plantae), fungi/mushrooms (Fungi), snails (Mollusca), and snakes/lizards (Reptilia), making it a valuable resource for multimodal vision-language AI models for biodiversity assessment and agriculture research. Each image is annotated with scientific names, taxonomic details, and common names, enhancing the robustness of AI model training. We showcase the value of Arboretum by releasing a suite of CLIP models trained using a subset of 40 million captioned images. We introduce several new benchmarks for rigorous assessment and report zero-shot accuracy along with evaluations across life stages, rare species, confounding species, and various levels of the taxonomic hierarchy. We anticipate that Arboretum will spur the development of AI models that can enable a variety of digital tools, ranging from pest control strategies and crop monitoring to worldwide biodiversity assessment and environmental conservation. These advancements are critical for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. Arboretum is publicly available, easily accessible, and ready for immediate use.

Dataset Features

Category Distribution

Distribution of the Arboretum dataset. (a) Size of each taxonomic class in the Arboretum dataset. (b) Species counts for each taxonomic class. (c) The 40 most frequently occurring species.

Treemap diagram of the Arboretum Dataset, starting from Kingdom. The nested boxes represent phyla, (taxonomic) classes, orders, and families. Box size represents the relative number of samples.

Comparison of Arboretum Dataset with existing biodiversity datasets.

Data Preparation

Our GitHub repository includes the pipeline and the installation instructions for the arbor-process package used for data preparation. The metadata can be downloaded from Hugging Face. This procedure generates machine learning-ready image-text pairs from the downloaded metadata in four steps:

Figure: the four-step data preparation pipeline.
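To make the idea of turning metadata into image-text pairs concrete, here is a minimal sketch in plain Python. It is not the arbor-process API: the function names, the column names (`photo_url`, `common_name`, and the seven rank columns), and the caption template are all assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Mapping, Tuple

# Taxonomic ranks from broadest to most specific (assumed column names).
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def build_caption(row: Mapping[str, str]) -> str:
    """Join the available taxonomic ranks and the common name into one caption."""
    taxonomy = " ".join(row[r].strip() for r in RANKS if row.get(r, "").strip())
    common = row.get("common_name", "").strip()
    if common:
        # Hypothetical caption template, not the one used by the authors.
        return f"a photo of {taxonomy} with common name {common}"
    return f"a photo of {taxonomy}"


def make_pairs(rows: Iterable[Mapping[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (image_url, caption) pairs, skipping rows without an image URL."""
    for row in rows:
        url = row.get("photo_url", "").strip()
        if url:
            yield url, build_caption(row)
```

Missing ranks are skipped rather than filled with placeholders, so a record identified only to genus level still yields a usable caption.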

Dataset Benchmarks

ARBORETUM provides several benchmark datasets: ARBORETUM-40M and a set of new benchmarks.

ARBORETUM-40M

ARBORETUM-40M is a subset comprising approximately 40M samples and 33K species from the seven ARBORETUM categories (iNaturalist data prior to January 27, 2024). It contains a minimum of 30 and a maximum of 50,000 samples per species. We conducted semi-global shuffling and divided the data into mini-batches of approximately 50,000 samples each. From these mini-batches, 95% were randomly selected for training and validation, while the remaining 5% were reserved for testing.
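The subsetting and splitting described above can be sketched as follows. This is an illustrative approximation, not the authors' pipeline. It assumes that species with fewer than 30 samples are dropped and that species above 50,000 are randomly subsampled, and it uses a full in-memory shuffle as a stand-in for semi-global shuffling. The function names and the `(sample_id, species)` tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def cap_per_species(samples, min_count=30, max_count=50_000, seed=0):
    """Drop species below min_count samples; subsample species above max_count."""
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for sample_id, species in samples:
        by_species[species].append(sample_id)
    kept = []
    for species, ids in by_species.items():
        if len(ids) < min_count:
            continue  # assumption: under-represented species are excluded
        if len(ids) > max_count:
            ids = rng.sample(ids, max_count)
        kept.extend((i, species) for i in ids)
    return kept


def shard_and_split(samples, shard_size=50_000, test_fraction=0.05, seed=0):
    """Shuffle, cut into fixed-size shards, and hold out a fraction of shards."""
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)  # full shuffle standing in for semi-global shuffling
    shards = [samples[i:i + shard_size] for i in range(0, len(samples), shard_size)]
    # Hold out at least one shard, but never more shards than exist.
    n_test = min(len(shards), max(1, round(len(shards) * test_fraction)))
    test_idx = set(rng.sample(range(len(shards)), n_test))
    train_val = [s for i, s in enumerate(shards) if i not in test_idx]
    test = [s for i, s in enumerate(shards) if i in test_idx]
    return train_val, test
```

Splitting at the shard level rather than the sample level mirrors the description of selecting whole mini-batches for testing. With very few shards, the held-out share will deviate from 5%.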

Training data sources used in ARBORETUM-40M and diversity at different taxonomic levels. We pair taxonomic labels with the images to train the models.

New Benchmarks

From ARBORETUM, we created three new benchmark datasets for fine-grained image classification: Arboretum-Balanced, Arboretum-Unseen, and Arboretum-LifeStages.

(a) Example images from Arboretum-Unseen. (b) Arboretum-LifeStages, with 20 class labels: four life stages (egg, larva, pupa, and adult) for each of five distinct insect species.

ARBORCLIP Model

We use ARBORETUM-40M to train new CLIP-style foundation models (ARBORCLIP) and then evaluate them on zero-shot image classification tasks. We train three variants: a ViT-B/16 backbone initialized from the OpenAI CLIP weights, a ViT-L/14 initialized from the MetaCLIP checkpoint, and a ViT-B/16 initialized from the BioCLIP checkpoint. ARBORCLIP is designed to recognize species across the taxonomic classes covered by ARBORETUM, with the aim of high accuracy in fine-grained species recognition.
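For readers unfamiliar with zero-shot classification in CLIP-style models, the following sketch shows the scoring step only. It assumes image and text embeddings have already been produced by the two encoders, and uses toy vectors in their place. The function names and the temperature value of 100 are illustrative assumptions, not details taken from the ARBORCLIP release.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def zero_shot_predict(image_emb, text_embs, class_names, temperature=100.0):
    """Pick the class whose text embedding is most similar to the image embedding."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        # Cosine similarity, scaled by an assumed temperature.
        logits.append(temperature * sum(a * b for a, b in zip(img, txt)))
    # Numerically stable softmax over the class logits.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs


# Toy usage: each text embedding would come from a caption built from a class name.
label, probs = zero_shot_predict(
    image_emb=[1.0, 0.0],
    text_embs=[[0.9, 0.1], [0.0, 1.0]],
    class_names=["Danaus plexippus", "Apis mellifera"],
)
```

Because the class set is defined only by the text prompts, the same scoring step applies to benchmarks with species that were never seen during training.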

ARBORCLIP performs well on a range of benchmarks. ARBORCLIP-O was pretrained from an OpenAI model checkpoint, ARBORCLIP-B from the BioCLIP checkpoint, and ARBORCLIP-M from a MetaCLIP-cc checkpoint. AU stands for Arboretum-Unseen, using scientific names (n-classes=300). AB stands for Arboretum-Balanced, using scientific names (n-classes=2253). BCR stands for the BioCLIP Rare species benchmark (n-classes=400). F stands for Fungi (n-classes=25), I2 for Insects2 (n-classes=102), B for Birds525 (n-classes=525), LS for the Life Stages benchmark (n-classes=20), and DW for DeepWeeds (n-classes=9).

BibTeX


@misc{yang2024arboretumlargemultimodaldataset,
  title={Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity},
  author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and Andre Nakkab and Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and Nirmal Baishnab and Asheesh K Singh and Arti Singh and Soumik Sarkar and Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
  year={2024},
  eprint={2406.17720},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2406.17720},
}