BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity

Iowa State University, New York University, University of Arizona
NeurIPS 2024 Track on Datasets and Benchmarks (Spotlight)

*Equal Contribution

Top Seven Phyla in the BioTrove Dataset. This figure displays the seven most frequently occurring phyla within BioTrove, which is curated to include data exclusively from the three primary kingdoms: Animalia, Plantae and Fungi. For each phylum, the five most common species are shown, including their scientific names, common names, and the number of images per species. The phyla are ordered by species diversity, with the most diverse phylum on the right and the least diverse on the left.

Abstract

We introduce BIOTROVE, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted by domain experts to include only research-grade data, BIOTROVE contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems. We demonstrate the value of BIOTROVE by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BIOTROVE-TRAIN. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves ("birds"), Arachnida ("spiders/ticks/mites"), Insecta ("insects"), Plantae ("plants"), Fungi ("fungi"), Mollusca ("snails"), and Reptilia ("snakes/lizards"). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels. We anticipate that BIOTROVE will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BIOTROVE is publicly available, easily accessible, and ready for immediate use.

Dataset Features

Category Distribution

Distribution of the BioTrove dataset. (a) Size of the top seven phyla in the BioTrove dataset. (b) Species counts for the top seven phyla. (c) The 40 most frequently occurring species in the entire BioTrove dataset.


Treemap diagram of the BioTrove dataset, starting from Kingdom. The nested boxes represent phyla, (taxonomic) classes, orders, and families. Box size represents the relative number of samples.


Comparison of BioTrove dataset with existing biodiversity datasets.

Dataset Benchmarks

BioTrove comprises several benchmark datasets: BioTrove-Train (~40M samples) and a set of new evaluation benchmarks. The full BioTrove dataset contains approximately 162M samples.

BioTrove-Train

BioTrove-Train is a curated subset comprising approximately 40M samples and 33K species from the seven categories Aves, Arachnida, Insecta, Plantae, Fungi, Mollusca, and Reptilia (drawn from iNaturalist observations prior to January 27, 2024). It contains between 30 and 50,000 samples per species. We conducted a semi-global shuffle and divided the data into mini-batches of approximately 50,000 samples each. From these mini-batches, 95% were randomly selected for training and validation, while the remaining 5% were reserved for testing; a minimal sketch of this split is shown below.
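The following Python sketch illustrates the mini-batch split described above. The batch size and the 95/5 ratio follow the text; the function name and the use of in-memory sample identifiers are illustrative assumptions, and a single in-memory shuffle stands in for the semi-global shuffle that would be performed per shard at BioTrove's scale.

    import random

    def split_into_minibatches(sample_ids, batch_size=50_000, test_frac=0.05, seed=0):
        # Shuffle samples, chunk into ~50K mini-batches, and hold out ~5% for testing.
        rng = random.Random(seed)
        ids = list(sample_ids)
        rng.shuffle(ids)                      # stand-in for the semi-global shuffle
        batches = [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
        rng.shuffle(batches)                  # randomize which mini-batches are held out
        n_test = max(1, round(len(batches) * test_frac))
        test = batches[:n_test]               # ~5% of mini-batches reserved for testing
        trainval = batches[n_test:]           # ~95% used for training and validation
        return trainval, test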


Training data sources used in BioTrove-Train and diversity at different taxonomic levels. Taxonomic labels are integrated into the image captions.

New Benchmarks

From BioTrove, we created three new benchmark datasets for fine-grained image classification: BioTrove-Balanced, BioTrove-Unseen, and BioTrove-LifeStages.


(a) Example images from BioTrove-Unseen. (b) BioTrove-LifeStages with 20 class labels: four life stages (egg, larva, pupa, and adult) for each of five distinct insect species.

Data Preparation

Our GitHub repository includes the data-preparation pipeline and installation instructions for the biotrove-process package. The metadata can be downloaded from the HuggingFace dataset cards: BioTrove-Train and BioTrove (main). This procedure generates machine learning-ready image-text pairs from the downloaded metadata in four steps, illustrated in the figure below; a minimal sketch of the final pairing step follows the figure.

Figure: overview of the four-step data preparation pipeline.
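The sketch below shows one way to turn downloaded metadata into image-text pairs. The metadata column names (url, common_name, scientific_name, and the higher taxonomic ranks) and the caption template are illustrative assumptions, not the exact biotrove-process implementation; consult the GitHub repository for the supported workflow.

    import os
    import pandas as pd
    import requests

    def build_caption(row):
        # Illustrative caption template combining common and scientific names with the
        # taxonomic hierarchy; see the paper for the exact caption format used.
        ranks = ("kingdom", "phylum", "class", "order", "family", "genus", "species")
        taxonomy = " ".join(str(row[k]) for k in ranks)
        return f"a photo of {row['common_name']} ({row['scientific_name']}), {taxonomy}"

    def prepare_pairs(metadata_parquet, out_dir, limit=100):
        os.makedirs(out_dir, exist_ok=True)
        df = pd.read_parquet(metadata_parquet)   # metadata from the HuggingFace dataset card
        for i, row in df.head(limit).iterrows():
            image_bytes = requests.get(row["url"], timeout=30).content  # assumed image-URL column
            with open(os.path.join(out_dir, f"{i}.jpg"), "wb") as f:
                f.write(image_bytes)
            with open(os.path.join(out_dir, f"{i}.txt"), "w") as f:
                f.write(build_caption(row))      # paired caption for CLIP-style training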

BioTrove-CLIP Model

We use BioTrove-Train to train new CLIP-style foundation models and then evaluate them on zero-shot image classification tasks. Three backbones were trained: a ViT-B/16 initialized from the OpenAI CLIP weights, a ViT-L/14 initialized from the MetaCLIP checkpoint, and a ViT-B/16 initialized from the BioCLIP checkpoint. The resulting BioTrove-CLIP models recognize and categorize species across the taxa represented in BioTrove-Train, leveraging the scale of the dataset to achieve high zero-shot accuracy in species recognition.
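A minimal zero-shot classification sketch using the open_clip library is shown below. The repository id "hf-hub:BGLab/BioTrove-CLIP", the image file name, and the prompt template are assumptions for illustration; refer to the HuggingFace model card for the released checkpoint names.

    import torch
    import open_clip
    from PIL import Image

    # Hypothetical checkpoint id; replace with the name listed on the model card.
    ckpt = "hf-hub:BGLab/BioTrove-CLIP"
    model, _, preprocess = open_clip.create_model_and_transforms(ckpt)
    tokenizer = open_clip.get_tokenizer(ckpt)

    labels = ["Danaus plexippus", "Apis mellifera", "Quercus alba"]
    text = tokenizer([f"a photo of {name}" for name in labels])
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    print(dict(zip(labels, probs[0].tolist())))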

Benchmark Results

BioTrove-CLIP performs well on various benchmarks. The top three rows are pre-trained checkpoints: OpenAI-B refers to OpenAI's ViT-B-16 model, BioCLIP-B refers to the BioCLIP ViT-B-16 model, and MetaCLIP-L refers to the MetaCLIP-cc ViT-L-14 model. The bottom three rows are BioTrove-CLIP models fine-tuned from different checkpoints: BT-CLIP-O (from OpenAI-B), BT-CLIP-B (from BioCLIP-B), and BT-CLIP-M (from MetaCLIP-L). Benchmark abbreviations: BTU (BioTrove-Unseen, n=300), BTB (BioTrove-Balanced, n=2253), BCR (BioCLIP-Rare, n=400), F (Fungi, n=25), I2 (Insects-2, n=102), B (Birds-525, n=525), LS (Life-Stages, n=20), and DW (DeepWeeds, n=9). 95% confidence intervals (±CI) are included.

Model Weights

The BioTrove-CLIP models were developed as an open-source resource for the benefit of the AI community. We released three BioTrove-CLIP models, trained on the BioTrove-Train dataset from the OpenAI CLIP, BioCLIP, and MetaCLIP checkpoints described above. Please use our HuggingFace model card to access the model checkpoints.

Acknowledgements

This work was supported by the AI Research Institutes program of the NSF and USDA-NIFA under the AI Institute for Resilient Agriculture (AIIRA), Award No. 2021-67021-35329. It was also partly supported by the NSF under CPS Frontier grant CNS-1954556. Additionally, we gratefully acknowledge the support of NYU IT's High Performance Computing (NYU Greene) resources, services, and staff expertise.

BibTeX


        @misc{yang2024BioTrovelargemultimodaldataset,
          title={BioTrove: A Large Multimodal Dataset Enabling AI for Biodiversity},
          author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and
          Andre Nakkab and Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and
          Nirmal Baishnab and Asheesh K Singh and Arti Singh and Soumik Sarkar and
          Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
          year={2024},
          eprint={2406.17720},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2406.17720}
        }