MURAL - Maynooth University Research Archive Library



    When LLMs Annotate: Reliability Challenges in Low-Resource NLI


    Panahi, Solmaz and Kelleher, John D. (2026) When LLMs Annotate: Reliability Challenges in Low-Resource NLI. In: EACL-2026, March 24-29, 2026, Rabat, Morocco.

    Abstract

    This paper systematically evaluates LLM reliability on the complex semantic task of Natural Language Inference (NLI) in Farsi, assessing six prominent models across eight prompt variations through a multi-dimensional framework that measures accuracy, prompt sensitivity, and intra-class consistency. Our results demonstrate that prompt design, particularly the order of premise and hypothesis, significantly impacts prediction stability. Proprietary models (Claude-Opus-4, GPT-4o) exhibit superior stability and accuracy compared to open-weight alternatives. Across all models, the 'Neutral' class emerges as the most challenging and least stable category. Crucially, we reframe model instability as a diagnostic tool for benchmark quality, demonstrating that observed disagreement often reflects valid challenges to ambiguous or erroneous gold-standard labels.
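    The prompt variation the abstract highlights, swapping the order in which the premise and hypothesis are presented, can be sketched as below. The template wording and label set are illustrative assumptions, not the paper's actual prompts.

    ```python
    # Hypothetical sketch of a premise/hypothesis order variation for NLI
    # prompting. Templates and label names are assumptions for illustration.

    def make_prompts(premise: str, hypothesis: str) -> dict:
        """Build two NLI prompts that differ only in argument order."""
        question = ("Does the premise entail the hypothesis? "
                    "Answer entailment, contradiction, or neutral.")
        return {
            "premise_first": (
                f"Premise: {premise}\nHypothesis: {hypothesis}\n{question}"
            ),
            "hypothesis_first": (
                f"Hypothesis: {hypothesis}\nPremise: {premise}\n{question}"
            ),
        }

    prompts = make_prompts("A man is playing guitar.", "Someone is making music.")
    # Querying a model with both variants and comparing its answers gives
    # one simple measure of order sensitivity.
    ```

    Running each example through both orderings (and, per the paper's framework, across repeated prompt variants) lets instability be quantified per class rather than only overall.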
    Item Type: Conference or Workshop Item (Paper)
    Keywords: LLMs Annotate; Reliability; Challenges; Low-Resource NLI;
    Academic Unit: Faculty of Science and Engineering > Research Institutes > Hamilton Institute
    Item ID: 21248
    Depositing User: IR Editor
    Date Deposited: 26 Feb 2026 15:44
    Refereed: Yes
    Use Licence: This item is available under a Creative Commons Attribution Non Commercial Share Alike Licence (CC BY-NC-SA).
