ArborCLIP

The seven taxonomic classes in the Arboretum dataset, with example images of the five most frequent species in each category, including their counts, common names, and scientific names. For clarity, the word "class" in this sense is written as "taxonomic class" in the rest of the paper to distinguish it from other common ML meanings of the term.

Abstract

We introduce Arboretum, the largest publicly accessible dataset designed to advance AI for biodiversity applications. This dataset, curated from the iNaturalist community science platform and vetted by domain experts to ensure accuracy, includes 134.6 million images, surpassing existing datasets in scale by an order of magnitude. The dataset encompasses image-language paired data for a diverse set of species from birds (Aves), spiders/ticks/mites (Arachnida), insects (Insecta), plants (Plantae), fungi/mushrooms (Fungi), snails (Mollusca), and snakes/lizards (Reptilia), making it a valuable resource for multimodal vision-language AI models for biodiversity assessment and agriculture research. Each image is annotated with scientific names, taxonomic details, and common names, enhancing the robustness of AI model training. We showcase the value of Arboretum by releasing a suite of CLIP models trained on a subset of 40 million captioned images. We introduce several new benchmarks for rigorous assessment, report zero-shot accuracy, and present evaluations across life stages, rare species, confounding species, and various levels of the taxonomic hierarchy. We anticipate that Arboretum will spur the development of AI models that can enable a variety of digital tools, ranging from pest control strategies and crop monitoring to worldwide biodiversity assessment and environmental conservation. These advancements are critical for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. Arboretum is publicly available, easily accessible, and ready for immediate use.

Dataset Features

Category Distribution

Distribution of the Arboretum dataset. (a) Size of each taxonomic class in the Arboretum dataset. (b) Species counts for each taxonomic class. (c) The 40 most frequently occurring species.

Treemap diagram of the Arboretum Dataset, starting from Kingdom. The nested boxes represent phyla, (taxonomic) classes, orders, and families. Box size represents the relative number of samples.

Comparison of Arboretum Dataset with existing biodiversity datasets.

Data Preparation

Our GitHub repository includes the data-preparation pipeline and installation instructions for the arbor-process package. The metadata can be downloaded from Hugging Face. The pipeline generates machine-learning-ready image-text pairs from the downloaded metadata in four steps.

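As a rough illustration of the end product of this pipeline, the snippet below assembles a CLIP-style caption from one metadata row. The field names and caption template here are illustrative assumptions, not the actual arbor-process schema:

```python
def make_caption(row):
    """Build a CLIP-style caption from one metadata row.

    NOTE: field names ("common_name", "scientific_name", taxonomy levels)
    are hypothetical; the real arbor-process schema may differ.
    """
    taxonomy = " ".join(
        row[level]
        for level in ("kingdom", "phylum", "class", "order", "family", "genus", "species")
        if row.get(level)  # skip levels missing from this row
    )
    return f"a photo of {row['common_name']} ({row['scientific_name']}), {taxonomy}."

row = {
    "common_name": "monarch butterfly",
    "scientific_name": "Danaus plexippus",
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "class": "Insecta",
}
print(make_caption(row))
# → a photo of monarch butterfly (Danaus plexippus), Animalia Arthropoda Insecta.
```

Pairing each image with a caption that carries the full taxonomic context is what makes the data usable for contrastive image-language training.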

Dataset Benchmarks

ARBORETUM includes several benchmark datasets: ARBORETUM-40M and a set of new benchmarks.

ARBORETUM-40M

ARBORETUM-40M is a subset comprising approximately 40M samples and 33K species from the seven ARBORETUM categories (iNaturalist observations prior to January 27, 2024). It contains between 30 and 50,000 samples per species. We conducted semi-global shuffling and divided the data into mini-batches of approximately 50,000 samples each. From these mini-batches, 95% were randomly selected for training and validation, while the remaining 5% were reserved for testing.
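One simplified reading of this split procedure can be sketched as follows. This is not the paper's actual implementation; the batch size and test fraction follow the description above, and the batch-level 95/5 assignment stands in for the full semi-global shuffling scheme:

```python
import random

def split_minibatches(sample_ids, batch_size=50_000, test_frac=0.05, seed=0):
    """Shuffle sample ids, cut them into mini-batches of ~batch_size,
    then reserve ~test_frac of the batches for testing.

    Simplified sketch of the split described in the text, not the
    authors' exact procedure.
    """
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    batches = [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
    rng.shuffle(batches)  # randomize which batches land in the test split
    n_test = max(1, round(len(batches) * test_frac))
    return batches[n_test:], batches[:n_test]  # (train/val, test)

trainval, test = split_minibatches(range(1_000_000), batch_size=50_000)
print(len(trainval), len(test))  # → 19 1
```

Splitting at the mini-batch level (rather than per sample) keeps the test set disjoint in large contiguous chunks, which makes the 95/5 partition cheap to materialize at 40M-sample scale.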


Training data sources used in ARBORETUM-40M and diversity at different taxonomy levels. We integrate taxonomic labels into the image captions used to train the models.

New Benchmarks

From ARBORETUM, we created three new benchmark datasets for fine-grained image classification: Arboretum-Balanced, Arboretum-Unseen, and Arboretum-LifeStages.


(a) Example images from Arboretum-Unseen. (b) ARBORETUM-LIFE-STAGES with 20 class labels: four life stages (egg, larva, pupa, and adult) for each of five distinct insect species.

ARBORCLIP Model

We use ARBORETUM-40M to train new CLIP-style foundation models (ARBORCLIP), and then evaluate them on zero-shot image classification tasks. We train three variants: a ViT-B/16 backbone initialized from the OpenAI CLIP weights, a ViT-L/14 from the MetaCLIP checkpoint, and a ViT-B/16 from the BioCLIP checkpoint. ARBORCLIP is designed to recognize and categorize species across all seven taxonomic classes in the dataset, achieving high accuracy in species recognition.
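At inference time, zero-shot classification with a CLIP-style model reduces to nearest-neighbor search in the shared embedding space: embed the image, embed a caption for each candidate label, and pick the label with the highest cosine similarity. The sketch below shows that scoring step with stand-in embeddings (the real pipeline would obtain them from the ARBORCLIP image and text encoders):

```python
import numpy as np

def zero_shot_predict(image_emb, text_embs, labels):
    """Return the label whose text embedding has the highest cosine
    similarity to the image embedding.

    Generic CLIP-style scoring sketch; embeddings here are placeholders,
    not outputs of the actual ARBORCLIP encoders.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = txt @ img  # cosine similarities, one per label
    return labels[int(np.argmax(scores))]

labels = ["Danaus plexippus", "Apis mellifera", "Quercus alba"]
text_embs = np.eye(3)                   # stand-in text embeddings
image_emb = np.array([0.1, 0.9, 0.2])   # stand-in image embedding
print(zero_shot_predict(image_emb, text_embs, labels))  # → Apis mellifera
```

Because the label set is just a list of caption strings, the same model can be evaluated at any level of the taxonomic hierarchy (species, genus, family, and so on) by swapping in different caption text.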

ArborCLIP Model

ARBORCLIP performs well on a range of benchmarks. ARBORCLIP-O was pretrained from an OpenAI model checkpoint, ARBORCLIP-B from the BioCLIP checkpoint, and ARBORCLIP-M from a MetaCLIP-cc checkpoint. Benchmark abbreviations:

- AU: Arboretum-Unseen, using scientific names (n-classes=300)
- AB: Arboretum-Balanced, using scientific names (n-classes=2,253)
- BCR: BioCLIP Rare species benchmark (n-classes=400)
- F: Fungi (n-classes=25)
- I2: Insects2 (n-classes=102)
- B: Birds525 (n-classes=525)
- LS: Life Stages benchmark (n-classes=20)
- DW: DeepWeeds (n-classes=9)

Model Weights

The ARBORCLIP models were developed as an open-source product for the benefit of the AI community. We have released three trained ARBORCLIP models, initialized from the OpenAI CLIP, BioCLIP, and MetaCLIP checkpoints and trained on the ARBORETUM-40M dataset. Please use our Hugging Face model card to access the model checkpoints.

BibTeX


@misc{yang2024arboretumlargemultimodaldataset,
  title={Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity},
  author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and Andre Nakkab and Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and Nirmal Baishnab and Asheesh K Singh and Arti Singh and Soumik Sarkar and Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
  year={2024},
  eprint={2406.17720},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2406.17720},
}