Hi all,
The next Research Showcase will be live-streamed next Wednesday, September 18, at 9:30 AM PDT / 16:30 UTC. Find your local time here: https://zonestamp.toolforge.org/1726677000. The theme for this showcase is *Curation of Wikimedia AI Datasets*.
You are welcome to watch via the YouTube stream: https://youtube.com/live/USzLGJ5LLC8?feature=share. As usual, you can join the conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:

Supporting Community-Driven Data Curation for AI Evaluation on Wikipedia through Wikibench
By *Tzu-Sheng Kuo, Carnegie Mellon University*
AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia
By *Yufang Hou, IBM Research Europe - Ireland*
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage and RAG with two contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict at https://ibm.biz/wikicontradict.

Best,
Kinneret
Hello everyone,
Quick reminder that we'll be starting this month's showcase focused on *Curation of Wikimedia AI Datasets* in about 45 minutes at https://youtube.com/live/USzLGJ5LLC8?feature=share.
Best, Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation https://wikimediafoundation.org/
wiki-research-l@lists.wikimedia.org