Hi all,
The next Research Showcase will be live-streamed next Wednesday, September 18, at 9:30 AM PDT / 16:30 UTC. Find your local time here: https://zonestamp.toolforge.org/1726677000. The theme for this showcase is *Curation of Wikimedia AI Datasets*.
You are welcome to watch via the YouTube stream: https://youtube.com/live/USzLGJ5LLC8?feature=share. As usual, you can join the conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:

Supporting Community-Driven Data Curation for AI Evaluation on Wikipedia through Wikibench
By *Tzu-Sheng Kuo, Carnegie Mellon University*
AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia
By *Yufang Hou, IBM Research Europe - Ireland*
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict at https://ibm.biz/wikicontradict.

Best,
Kinneret
Hello everyone,
Quick reminder that we'll be starting this month's showcase, focused on *Curation of Wikimedia AI Datasets*, in about 45 minutes at https://youtube.com/live/USzLGJ5LLC8?feature=share.
Best, Kinneret
On Fri, Sep 13, 2024 at 1:56 PM Kinneret Gordon kgordon@wikimedia.org wrote:
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation https://wikimediafoundation.org/