BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity

Iowa State University, New York University, University of Arizona
NeurIPS 2024 Track on Datasets and Benchmarks (Spotlight)

*Equal Contribution

Top Seven Phyla in the BioTrove Dataset. This figure displays the seven most frequently occurring phyla within BioTrove, which is curated to include data exclusively from the three primary kingdoms: Animalia, Plantae and Fungi. For each phylum, the five most common species are shown, including their scientific names, common names, and the number of images per species. The phyla are ordered by species diversity, with the most diverse phylum on the right and the least diverse on the left.

Abstract

We introduce BIOTROVE, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted by domain experts to include only research-grade data, BIOTROVE contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems. We demonstrate the value of BIOTROVE by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BIOTROVE-TRAIN. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves ("birds"), Arachnida ("spiders/ticks/mites"), Insecta ("insects"), Plantae ("plants"), Fungi ("fungi"), Mollusca ("snails"), and Reptilia ("snakes/lizards"). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels. We anticipate that BIOTROVE will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BIOTROVE is publicly available, easily accessible, and ready for immediate use.

Dataset Features

Category Distribution

Distribution of the BioTrove dataset. (a) Size of the top seven phyla in the BioTrove dataset. (b) Species counts for the top seven phyla. (c) The 40 most frequently occurring species in the entire BioTrove dataset.


Treemap diagram of the BioTrove dataset, starting from Kingdom. The nested boxes represent phyla, (taxonomic) classes, orders, and families. Box size represents the relative number of samples.


Comparison of BioTrove dataset with existing biodiversity datasets.

Dataset Benchmarks

BioTrove comprises several benchmark datasets: BioTrove-Train (~40M samples) and a set of new evaluation benchmarks. The full BioTrove dataset contains approximately 162M samples.

BioTrove-Train

BioTrove-Train is a curated subset comprising approximately 40M samples and 33K species from the seven categories Aves, Arachnida, Insecta, Plantae, Fungi, Mollusca, and Reptilia (drawn from iNaturalist observations prior to January 27, 2024). It contains between 30 and 50,000 samples per species. We conducted a semi-global shuffle and divided the data into mini-batches of approximately 50,000 samples each. From these mini-batches, 95% were randomly selected for training and validation, while the remaining 5% were reserved for testing; a minimal sketch of this split is shown below.
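The following Python sketch illustrates the mini-batch split described above. The batch size and the 95/5 ratio follow the text; the function name and the use of in-memory sample identifiers are illustrative assumptions, and a single in-memory shuffle stands in for the semi-global shuffle that would be performed per shard at BioTrove's scale.

    import random

    def split_into_minibatches(sample_ids, batch_size=50_000, test_frac=0.05, seed=0):
        # Shuffle samples, chunk into ~50K mini-batches, and hold out ~5% for testing.
        rng = random.Random(seed)
        ids = list(sample_ids)
        rng.shuffle(ids)                      # stand-in for the semi-global shuffle
        batches = [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
        rng.shuffle(batches)                  # randomize which mini-batches are held out
        n_test = max(1, round(len(batches) * test_frac))
        test = batches[:n_test]               # ~5% of mini-batches reserved for testing
        trainval = batches[n_test:]           # ~95% used for training and validation
        return trainval, test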


Training data sources used in BioTrove-Train and diversity at different taxonomic levels. Taxonomic labels are integrated into the image captions.

New Benchmarks

From BioTrove, we created three new benchmark datasets for fine-grained image classification: BioTrove-Balanced, BioTrove-Unseen, and BioTrove-LifeStages.


(a) Example images from BioTrove-Unseen. (b) BioTrove-LifeStages with 20 class labels: four life stages (egg, larva, pupa, and adult) for each of five distinct insect species.

Data Preparation

Our GitHub repository includes the data-preparation pipeline and installation instructions for the biotrove-process package. The metadata can be downloaded from the HuggingFace dataset cards: BioTrove-Train and BioTrove (main). This procedure generates machine learning-ready image-text pairs from the downloaded metadata in four steps, illustrated in the figure below; a minimal sketch of the final pairing step follows the figure.

Figure: overview of the four-step data preparation pipeline.
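The sketch below shows one way to turn downloaded metadata into image-text pairs. The metadata column names (url, common_name, scientific_name, and the higher taxonomic ranks) and the caption template are illustrative assumptions, not the exact biotrove-process implementation; consult the GitHub repository for the supported workflow.

    import os
    import pandas as pd
    import requests

    def build_caption(row):
        # Illustrative caption template combining common and scientific names with the
        # taxonomic hierarchy; see the paper for the exact caption format used.
        ranks = ("kingdom", "phylum", "class", "order", "family", "genus", "species")
        taxonomy = " ".join(str(row[k]) for k in ranks)
        return f"a photo of {row['common_name']} ({row['scientific_name']}), {taxonomy}"

    def prepare_pairs(metadata_parquet, out_dir, limit=100):
        os.makedirs(out_dir, exist_ok=True)
        df = pd.read_parquet(metadata_parquet)   # metadata from the HuggingFace dataset card
        for i, row in df.head(limit).iterrows():
            image_bytes = requests.get(row["url"], timeout=30).content  # assumed image-URL column
            with open(os.path.join(out_dir, f"{i}.jpg"), "wb") as f:
                f.write(image_bytes)
            with open(os.path.join(out_dir, f"{i}.txt"), "w") as f:
                f.write(build_caption(row))      # paired caption for CLIP-style training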

BioTrove-CLIP Model

We use BioTrove-Train to train new CLIP-style foundation models and then evaluate them on zero-shot image classification tasks. Three backbones were trained: a ViT-B/16 initialized from the OpenAI CLIP weights, a ViT-L/14 initialized from the MetaCLIP checkpoint, and a ViT-B/16 initialized from the BioCLIP checkpoint. The resulting BioTrove-CLIP models recognize and categorize species across the taxa represented in BioTrove-Train, leveraging the scale of the dataset to achieve high zero-shot accuracy in species recognition.
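A minimal zero-shot classification sketch using the open_clip library is shown below. The repository id "hf-hub:BGLab/BioTrove-CLIP", the image file name, and the prompt template are assumptions for illustration; refer to the HuggingFace model card for the released checkpoint names.

    import torch
    import open_clip
    from PIL import Image

    # Hypothetical checkpoint id; replace with the name listed on the model card.
    ckpt = "hf-hub:BGLab/BioTrove-CLIP"
    model, _, preprocess = open_clip.create_model_and_transforms(ckpt)
    tokenizer = open_clip.get_tokenizer(ckpt)

    labels = ["Danaus plexippus", "Apis mellifera", "Quercus alba"]
    text = tokenizer([f"a photo of {name}" for name in labels])
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    print(dict(zip(labels, probs[0].tolist())))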

Benchmark Results

BioTrove-CLIP performs well on various benchmarks. The top three rows are pre-trained checkpoints: OpenAI-B refers to OpenAI's ViT-B-16 model, BioCLIP-B refers to the BioCLIP ViT-B-16 model, and MetaCLIP-L refers to the MetaCLIP-cc ViT-L-14 model. The bottom three rows are BioTrove-CLIP models fine-tuned from different checkpoints: BT-CLIP-O (from OpenAI-B), BT-CLIP-B (from BioCLIP-B), and BT-CLIP-M (from MetaCLIP-L). Benchmark abbreviations: BTU (BioTrove-Unseen, n=300), BTB (BioTrove-Balanced, n=2253), BCR (BioCLIP-Rare, n=400), F (Fungi, n=25), I2 (Insects-2, n=102), B (Birds-525, n=525), LS (Life-Stages, n=20), and DW (DeepWeeds, n=9). 95% confidence intervals (±CI) are included.

Model Weights

The BioTrove-CLIP models were developed as an open-source resource for the benefit of the AI community. We released three BioTrove-CLIP models, trained on the BioTrove-Train dataset from the OpenAI CLIP, BioCLIP, and MetaCLIP checkpoints described above. Please use our HuggingFace model card to access the model checkpoints.

Acknowledgements

This work was supported by the AI Research Institutes program of the NSF and USDA-NIFA under the AI Institute for Resilient Agriculture (AIIRA), Award No. 2021-67021-35329. It was also partly supported by the NSF under CPS Frontier grant CNS-1954556. Additionally, we gratefully acknowledge the support of NYU IT's High Performance Computing (NYU Greene) resources, services, and staff expertise.

BibTeX


        @misc{yang2024BioTrovelargemultimodaldataset,
          title={BioTrove: A Large Multimodal Dataset Enabling AI for Biodiversity},
          author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and
          Andre Nakkab and Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and
          Nirmal Baishnab and Asheesh K Singh and Arti Singh and Soumik Sarkar and
          Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
          year={2024},
          eprint={2406.17720},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2406.17720}
        }