Panahi, Solmaz and Kelleher, John D. (2026) When LLMs Annotate: Reliability Challenges in Low-Resource NLI. In: EACL-2026, March 24-29, 2026, Rabat, Morocco.
Available under License Creative Commons Attribution Non-commercial Share Alike.
Abstract
This paper systematically evaluates LLM reliability on the complex semantic task of Natural Language Inference (NLI) in Farsi, assessing six prominent models across eight prompt variations through a multi-dimensional framework that measures accuracy, prompt sensitivity, and intra-class consistency. Our results demonstrate that prompt design, particularly the order of premise and hypothesis, significantly impacts prediction stability. Proprietary models (Claude-Opus-4, GPT-4o) exhibit superior stability and accuracy compared to open-weight alternatives. Across all models, the 'Neutral' class emerges as the most challenging and least stable category. Crucially, we redefine model instability as a diagnostic tool for benchmark quality, demonstrating that observed disagreement often reflects valid challenges to ambiguous or erroneous gold-standard labels.
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| Keywords: | LLM annotation; reliability; low-resource NLI |
| Academic Unit: | Faculty of Science and Engineering > Research Institutes > Hamilton Institute |
| Item ID: | 21248 |
| Depositing User: | IR Editor |
| Date Deposited: | 26 Feb 2026 15:44 |
| Refereed: | Yes |
| Use Licence: | This item is available under a Creative Commons Attribution Non Commercial Share Alike Licence (CC BY-NC-SA). |