Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short at navigating such nuanced problems due to limited reasoning capacity.
We hypothesize that recent advances in large reasoning models (LRMs) can better handle such structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark with 100 questions for agricultural reasoning.
Evaluations across fifteen open-source and proprietary models reveal that LRMs outperform conventional LLMs, though notable challenges persist: the strongest Gemini-based baseline achieves only 36% accuracy.
We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and equipped with synthetically generated reasoning traces.
Using AgThoughts, we develop AgThinker, a suite of small reasoning models that can be run on consumer-grade GPUs, and show that our dataset can be effective in unlocking agricultural reasoning abilities in LLMs.
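As a minimal sketch of how a dataset like AgThoughts could be used to fine-tune a small reasoning model, the snippet below runs supervised fine-tuning with Hugging Face TRL on records holding a question, a synthetic reasoning trace, and an answer. The base model, field names, and placeholder record are illustrative assumptions, not the authors' released recipe or data.

```python
# Hypothetical sketch: supervised fine-tuning of a small open model on
# AgThoughts-style (question, reasoning trace, answer) records.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed small, consumer-GPU-sized base model

# Placeholder record mimicking the described schema; swap in the released
# AgThoughts data once its identifier is known.
records = [
    {
        "question": "Which cover crop fits a short fall planting window in the upper Midwest?",
        "reasoning": "A short window favors fast-establishing, winter-hardy species ...",
        "answer": "Cereal rye is a common choice for late fall seeding.",
    },
]

def to_text(example):
    # Fold the question, reasoning trace, and final answer into one training string.
    return {
        "text": (
            f"Question: {example['question']}\n"
            f"Reasoning: {example['reasoning']}\n"
            f"Answer: {example['answer']}"
        )
    }

dataset = Dataset.from_list(records).map(to_text)

trainer = SFTTrainer(
    model=BASE_MODEL,  # SFTTrainer loads the model and tokenizer from this identifier
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="agthinker-sft",
        num_train_epochs=1,
        per_device_train_batch_size=1,
    ),
)
trainer.train()
```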
| Model | Abiotic Harvest Qs | Plant & Seed Health Qs | Abiotic Soil Qs | Abiotic Weather Qs | Biotic Diseases Qs | Biotic Insects Qs | Biotic Weeds Qs | Cover Crop Qs | Crop Management Qs | Crop Inputs Qs | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|