BioTrove comprises the main dataset (~162M samples), the curated training subset BioTrove-Train (40M samples), and several new benchmark datasets.
We introduce BIOTROVE, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted by domain experts to include only research-grade data, BIOTROVE contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems. We demonstrate the value of BIOTROVE by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BIOTROVE-TRAIN. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves ("birds"), Arachnida ("spiders/ticks/mites"), Insecta ("insects"), Plantae ("plants"), Fungi ("fungi"), Mollusca ("snails"), and Reptilia ("snakes/lizards"). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels. We anticipate that BIOTROVE will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BIOTROVE is publicly available, easily accessible, and ready for immediate use.
Distribution of the BioTrove dataset. (a) Size of the top seven phyla in the BioTrove dataset. (b) Species counts for the top seven phyla. (c) The 40 most frequently occurring species in the entire BioTrove dataset.
Treemap diagram of the BioTrove dataset, starting from Kingdom. The nested boxes represent phyla, (taxonomic) classes, orders, and families. Box size represents the relative number of samples.
Comparison of BioTrove dataset with existing biodiversity datasets.
BioTrove-Train is a curated subset comprising approximately 40M samples and 33K species drawn from seven categories (Aves, Arachnida, Insecta, Plantae, Fungi, Mollusca, and Reptilia), using iNaturalist data collected prior to January 27, 2024. Each species is represented by between 30 and 50,000 samples. We conducted semi-global shuffling and divided the data into mini-batches of approximately 50,000 samples each. From these mini-batches, 95% were randomly selected for training and validation, while the remaining 5% were reserved for testing.
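The batching and split procedure described above can be sketched in plain Python. This is a hedged illustration, not the biotrove-process implementation: the 50,000-sample mini-batch size and 95/5 split follow the text, while the data loading and the exact semi-global shuffling scheme are simplified to a seeded full shuffle.

```python
import random

def split_minibatches(sample_ids, batch_size=50_000, test_frac=0.05, seed=0):
    """Shuffle samples, carve them into ~50K-sample mini-batches, and
    reserve 5% of the mini-batches for testing (sketch of the procedure)."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)  # stand-in for the paper's semi-global shuffling
    batches = [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
    rng.shuffle(batches)  # randomize which mini-batches become the test split
    n_test = max(1, int(len(batches) * test_frac))
    test, trainval = batches[:n_test], batches[n_test:]
    return trainval, test

# With 1M samples: 20 mini-batches, of which 1 (5%) is held out for testing.
trainval, test = split_minibatches(range(1_000_000))
print(len(trainval), len(test))  # → 19 1
```

Splitting at the mini-batch level, rather than per sample, keeps each split a clean union of whole batches.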
From BioTrove, we created three new benchmark datasets for fine-grained image classification: BioTrove-Balanced, BioTrove-Unseen, and BioTrove-LifeStages.
Our GitHub repository includes the data-preparation pipeline and installation instructions for the biotrove-process package. The metadata can be downloaded from the HuggingFace dataset cards: BioTrove-Train and BioTrove (main). The pipeline generates machine-learning-ready image-text pairs from the downloaded metadata in four steps:
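The individual steps are documented in the repository. As a minimal, hedged sketch of the end product, an image-text pair combines the annotated scientific name, taxonomic hierarchy, and common name into a caption; the field names and caption template below are illustrative assumptions, not the package's actual schema.

```python
def make_caption(record):
    """Build a training caption from one metadata record (illustrative schema)."""
    taxonomy = " ".join(record[k] for k in ("kingdom", "phylum", "class",
                                            "order", "family", "genus", "species"))
    # Combine common name, scientific name, and full hierarchy into one string.
    return f"a photo of {record['common_name']}, {record['scientific_name']} ({taxonomy})"

record = {
    "kingdom": "Animalia", "phylum": "Chordata", "class": "Aves",
    "order": "Passeriformes", "family": "Turdidae", "genus": "Turdus",
    "species": "migratorius",
    "scientific_name": "Turdus migratorius", "common_name": "American Robin",
}
print(make_caption(record))
```

Pairing each downloaded image with such a caption yields the image-text pairs consumed by CLIP-style training.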
We use BioTrove-Train to train new CLIP-style foundation models, and then evaluate them on zero-shot image classification tasks.
A ViT-B/16 backbone initialized from the OpenAI CLIP weights, a ViT-L/14 backbone initialized from the MetaCLIP checkpoint, and a ViT-B/16 backbone initialized from the BioCLIP checkpoint were trained to develop these models.
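Zero-shot classification with CLIP-style models works by embedding each candidate label as text and selecting the label whose text embedding is most similar to the image embedding. Below is a dependency-free sketch of that scoring step, with toy 3-dimensional vectors standing in for real encoder outputs; the embeddings and labels are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(image_emb, label_embs):
    """Return the label whose text embedding best matches the image embedding."""
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))

# Toy embeddings standing in for CLIP image/text encoder outputs.
label_embs = {
    "Aves":    [0.9, 0.1, 0.0],
    "Insecta": [0.1, 0.9, 0.1],
    "Plantae": [0.0, 0.1, 0.9],
}
print(zero_shot_predict([0.8, 0.2, 0.1], label_embs))  # → Aves
```

Because no label-specific classifier head is trained, the same model can score any label set, which is what enables the zero-shot evaluations across life stages, rare species, and taxonomic levels.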
The BioTrove-CLIP models are designed to analyze and categorize a wide range of species using advanced machine learning techniques. They leverage a vast dataset and sophisticated algorithms to achieve high accuracy in species recognition.
The BioTrove-CLIP models were developed for the benefit of the AI community as an open-source product. We released three BioTrove-CLIP models based on the CLIP architecture, all trained on the BioTrove-Train dataset. Please use our HuggingFace model card to access the model checkpoints.
This work was supported by the AI Research Institutes program of the NSF and USDA-NIFA under the AI Institute for Resilient Agriculture (AIIRA), Award No. 2021-67021-35329. It was also partly supported by the NSF under CPS Frontier grant CNS-1954556. Additionally, we gratefully acknowledge the support of NYU IT's High Performance Computing (NYU Greene) resources, services, and staff expertise.
@misc{yang2024BioTrovelargemultimodaldataset,
  title={BioTrove: A Large Multimodal Dataset Enabling AI for Biodiversity},
  author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and
    Andre Nakkab and Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and
    Nirmal Baishnab and Asheesh K Singh and Arti Singh and Soumik Sarkar and
    Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
  year={2024},
  eprint={2406.17720},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2406.17720}
}