Panahi, Solmaz and Kelleher, John D. (2026) When LLMs Annotate: Reliability Challenges in Low-Resource NLI. In: EACL-2026, March 24-29, 2026, Rabat, Morocco.
Available under License Creative Commons Attribution Non-commercial Share Alike.
Abstract
This paper systematically evaluates LLM reliability on the complex semantic task of Natural Language Inference (NLI) in Farsi, assessing six prominent models across eight prompt variations through a multi-dimensional framework that measures accuracy, prompt sensitivity, and intra-class consistency. Our results demonstrate that prompt design, particularly the order of premise and hypothesis, significantly impacts prediction stability. Proprietary models (Claude-Opus-4, GPT-4o) exhibit superior stability and accuracy compared to open-weight alternatives. Across all models, the 'Neutral' class emerges as the most challenging and least stable category. Crucially, we redefine model instability as a diagnostic tool for benchmark quality, demonstrating that observed disagreement often reflects valid challenges to ambiguous or erroneous gold-standard labels.
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Keywords: | LLM annotation; reliability; low-resource NLI |
| Academic Unit: | Faculty of Science and Engineering > Research Institutes > Hamilton Institute |
| Item ID: | 21248 |
| Depositing User: | IR Editor |
| Date Deposited: | 26 Feb 2026 15:44 |
| Refereed: | Yes |
| Use Licence: | This item is available under a Creative Commons Attribution Non Commercial Share Alike Licence (CC BY-NC-SA). |