Hi Everyone,

The next Research Showcase will be live-streamed this Wednesday, June 21, 2017, at 11:30 AM (PDT) / 18:30 UTC.

YouTube stream: https://www.youtube.com/watch?v=i2jpKRwPT-Q

As usual, you can join the conversation on IRC at #wikimedia-research. You can also watch our past research showcases here.

This month's presentations:

Title: Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia

By Allen Yilun Lin

Abstract: Wikipedia-based studies and systems frequently assume that each article describes a separate concept. However, in this paper, we show that this article-as-concept assumption is problematic due to editors’ tendency to split articles into parent articles and sub-articles when articles get too long for readers (e.g. “United States” and “American literature” in the English Wikipedia). We present evidence that this issue can have significant impacts on Wikipedia-based studies and systems and introduce the sub-article matching problem. The goal of the sub-article matching problem is to automatically connect sub-articles to parent articles to help Wikipedia-based studies and systems retrieve complete information about a concept. We then describe the first system to address the sub-article matching problem. We show that, using a diverse feature set and standard machine learning techniques, our system can achieve good performance on most of our ground truth datasets, significantly outperforming baseline approaches.

Title: Understanding Wikidata Queries

By Markus Kroetzsch

Abstract: Wikimedia provides a public service that lets anyone answer complex questions over the sum of all knowledge stored in Wikidata. These questions are expressed in the query language SPARQL and range from the most simple fact retrievals ("What is the birthday of Douglas Adams?") to complex analytical queries ("Average lifespan of people by occupation"). The talk presents ongoing efforts to analyse the server logs of the millions of queries that are answered each month. It is an important but difficult challenge to draw meaningful conclusions from this dataset. One might hope to learn relevant information about the usage of the service and Wikidata in general, but at the same time one has to be careful not to be misled by the data. Indeed, the dataset turned out to be highly heterogeneous and unpredictable, with strongly varying usage patterns that make it difficult to draw conclusions about "normal" usage. The talk will give a status report, present preliminary results, and discuss possible next steps.

Sarah R. Rodlund

Senior Project Coordinator-Product & Technology, Wikimedia Foundation

srodlund@wikimedia.org