Towards Large Reasoning Models for Agriculture

Hossein Zaremehrjerdi1†, Shreyan Ganguly1†, Ashlyn Rairdin1†, Elizabeth Tranel1, Benjamin Feuer2, Juan Ignacio Di Salvo1, Srikanth Panthulugiri1, Hernan Torres Pacin1, Victoria Moser1, Sarah Jones1, Joscif G Raigne1, Yanben Shen1, Heidi M. Dornath1, Aditya Balu1, Adarsh Krishnamurthy1, Asheesh K Singh1, Arti Singh1*, Baskar Ganapathysubramanian1, Chinmay Hegde2*, Soumik Sarkar1*
1 Iowa State University   2 New York University
†Equal contribution   *Corresponding authors
Workflow for the development of AgThoughts and AgReason: (1) Base templates are expanded into detailed question-answer pairs using LLMs; (2) Expert feedback on 200 sampled examples identifies common issues; (3) Expert and LLM-based feedback is used to iteratively filter and finalize 44.6K Q&A pairs; (4) The AgReason benchmark evaluates candidate LLMs on 100 questions with expert-curated gold-standard answers, scored by an LLM-as-judge.

Abstract

Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short in navigating this nuanced problem due to limited reasoning capacity.

We hypothesize that recent advances in large reasoning models (LRMs) can better handle such structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark with 100 questions for agricultural reasoning.

Evaluations across fifteen open-source and proprietary models reveal that LRMs outperform conventional ones, though notable challenges persist, with the strongest Gemini-based baseline achieving 36% accuracy.

We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and equipped with synthetically generated reasoning traces.

Using AgThoughts, we develop AgThinker, a suite of small reasoning models that can be run on consumer-grade GPUs, and show that our dataset can be effective in unlocking agricultural reasoning abilities in LLMs.

Model Performance Leaderboard

Model | Abiotic Harvest Qs | Plant & Seed Health Qs | Abiotic Soil Qs | Abiotic Weather Qs | Biotic Diseases Qs | Biotic Insects Qs | Biotic Weeds Qs | Cover Crop Qs | Crop Management Qs | Crop Inputs Qs | Overall
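
As a rough illustration of how per-category and overall scores like the columns above could be derived from judge verdicts, here is a minimal Python sketch. It is not the authors' evaluation code. The record fields (`category`, `question`, `gold`, `answer`), the function `score_by_category`, and the stand-in `exact_match_judge` are all hypothetical; in the actual benchmark the verdict comes from an LLM-as-judge, and we assume here that "Overall" is the fraction of all questions judged correct.

```python
from collections import defaultdict


def exact_match_judge(question, gold, answer):
    # Hypothetical placeholder for an LLM-as-judge call: a real judge would
    # compare an open-ended answer against the expert gold-standard answer.
    return gold.strip().lower() == answer.strip().lower()


def score_by_category(items, judge):
    """Return accuracy per category plus an 'Overall' entry.

    items: iterable of dicts with keys 'category', 'question', 'gold', 'answer'
           (field names are assumptions made for this sketch).
    judge: callable returning True when the answer is judged correct.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for item in items:
        totals[item["category"]] += 1
        if judge(item["question"], item["gold"], item["answer"]):
            correct[item["category"]] += 1

    scores = {cat: correct[cat] / totals[cat] for cat in totals}
    n = sum(totals.values())
    # Assumption: 'Overall' is computed over all questions (micro-average),
    # not as the mean of the per-category scores.
    scores["Overall"] = sum(correct.values()) / n if n else 0.0
    return scores
```

Calling `score_by_category(records, exact_match_judge)` on a list of such records returns a dictionary keyed by category name, which maps onto one leaderboard row per model.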

