arxiv:2502.01159

AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science

Published on Feb 3, 2025
AI-generated summary

AtmosSci-Bench is a benchmark that assesses large language models in atmospheric science through multiple-choice and open-ended questions across five categories: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography.

Abstract

The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges and accelerating scientific discovery in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. Toward this end, we present AtmosSci-Bench, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. AtmosSci-Bench features a dual-format design comprising both multiple-choice questions (MCQs) and open-ended questions (OEQs), enabling scalable automated evaluation alongside deeper analysis of conceptual understanding. We employ a template-based MCQ generation framework to create diverse, graduate-level problems with symbolic perturbation, while OEQs are used to probe open-ended reasoning. We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis yields insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe AtmosSci-Bench can serve as a critical step toward advancing LLM applications in climate services by offering a standard and rigorous evaluation framework. Our source code is available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.
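To make the template-based MCQ generation with symbolic perturbation concrete, here is a minimal sketch of the idea. This is not the authors' implementation (see the linked repository for that); the question template, parameter ranges, and distractor rules below are illustrative assumptions.

```python
import random

# Hypothetical sketch of template-based MCQ generation with symbolic
# perturbation: a question template carries symbolic parameters, each
# instantiation perturbs the parameter values, and the answer (plus
# plausible distractors) is recomputed from the governing formula.

def hydrostatic_mcq(seed: int):
    """Generate one MCQ instance from a hydrostatic-pressure template.

    The rule p = rho * g * h is the template's symbolic core; rho and h
    are perturbed per instance so memorized answers do not transfer.
    """
    rng = random.Random(seed)
    rho = rng.uniform(1000.0, 1030.0)   # water density, kg/m^3
    h = rng.uniform(10.0, 500.0)        # depth, m
    g = 9.81                            # gravitational acceleration, m/s^2

    answer = rho * g * h                # pressure, Pa

    # Distractors derived from common formula/unit mistakes.
    distractors = [rho * h, answer / g, answer * 10.0]
    options = distractors + [answer]
    rng.shuffle(options)

    question = (
        f"A water column of density {rho:.0f} kg/m^3 extends to a depth "
        f"of {h:.0f} m. What is the hydrostatic pressure at the bottom (in Pa)?"
    )
    labels = "ABCD"
    correct = labels[options.index(answer)]
    return question, dict(zip(labels, options)), correct


if __name__ == "__main__":
    q, opts, key = hydrostatic_mcq(seed=42)
    print(q)
    for label, value in opts.items():
        print(f"  {label}) {value:.3e}")
    print("Answer:", key)
```

Because every instance is regenerated from the perturbed parameters, such a template scales to arbitrarily many graduate-level variants while keeping automated grading exact.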
