Hi all,
The next Research Showcase will be live-streamed on Wednesday, November 20, 2019, at 9:30 AM PST/17:30 UTC. We’ll have a presentation from Martin Potthast of Leipzig University on text reuse in Wikipedia and another presentation from the Wikimedia Foundation’s Isaac Johnson on the demographics and interests of Wikipedia’s readers.
YouTube stream: https://www.youtube.com/watch?v=tIko_V1k09s
As usual, you can join the conversation on IRC at #wikimedia-research. You can also watch our past research showcases here: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
Wikipedia Text Reuse: Within and Without
By Martin Potthast, Leipzig University
We study text reuse related to Wikipedia at scale by compiling the first corpus of text reuse cases within Wikipedia as well as without (i.e., reuse of Wikipedia text in a sample of the Common Crawl). To discover reuse beyond verbatim copy and paste, we employ state-of-the-art text reuse detection technology, scaling it for the first time to process the entire Wikipedia as part of a distributed retrieval pipeline. We further report on a pilot analysis of the 100 million reuse cases inside, and the 1.6 million reuse cases outside Wikipedia that we discovered. Text reuse inside Wikipedia gives rise to new tasks such as article template induction, fixing quality flaws, or complementing Wikipedia’s ontology. Text reuse outside Wikipedia yields a tangible metric for the emerging field of quantifying Wikipedia’s influence on the web. To foster future research into these tasks, and for reproducibility’s sake, the Wikipedia text reuse corpus and the retrieval pipeline are made freely available. Paper, Demo
Characterizing Wikipedia Reader Demographics and Interests
By Isaac Johnson, Wikimedia Foundation