Wiki-research-l September 2020

wiki-research-l@lists.wikimedia.org

16 participants
13 discussions

WikiHist.html: English Wikipedia's Full Revision History in HTML Format

by Robert West

Hi all,

*TL;DR:* So far, Wikipedia's full revision history has been available only in wiki markup, not in HTML -- a big limitation for researchers. We are changing this by releasing WikiHist.html, Wikipedia's full history (up until March 2019) in HTML: https://zenodo.org/record/3605388 <https://t.co/ZhK7kKaPCi?amp=1> Caveat emptor: 7 TB!

Tweet: https://twitter.com/cervisiarius/status/1301791239558311936

*More details:* Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext.

Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.

We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia’s full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.
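To make limitation (2) concrete: rendering an old revision faithfully means expanding each template with the version that existed when the revision was saved, not with today's version. The Python below is only a minimal sketch of that lookup under stated assumptions, not the system described above. The names `TEMPLATE_HISTORY` and `template_as_of` and the sample template texts are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical history of a single template: (timestamp, wikitext) pairs,
# sorted by timestamp. ISO 8601 UTC strings of equal length sort
# chronologically when compared as plain strings.
TEMPLATE_HISTORY: List[Tuple[str, str]] = [
    ("2005-03-01T00:00:00Z", "Population: {{{1}}}"),
    ("2012-07-15T00:00:00Z", "Population (census): {{{1}}}"),
    ("2018-01-10T00:00:00Z", "Pop.: {{{1}}} inhabitants"),
]


def template_as_of(history: List[Tuple[str, str]], article_ts: str) -> Optional[str]:
    """Return the template wikitext that was current at article_ts.

    Returns None if the template did not exist yet at that time.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the first template revision saved strictly after article_ts.
    idx = bisect_right(timestamps, article_ts)
    if idx == 0:
        return None
    return history[idx - 1][1]
```

For example, `template_as_of(TEMPLATE_HISTORY, "2010-06-01T00:00:00Z")` selects the 2005 version of the template, whereas parsing that same 2010 article revision today would silently use the 2018 version.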
For more details, please refer to https://zenodo.org/record/3605388 <https://t.co/ZhK7kKaPCi?amp=1> and to the dataset paper:

Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In *Proceedings of the 14th International AAAI Conference on Web and Social Media,* 2020. https://arxiv.org/abs/2001.10256

Best regards,
Bob

3 years, 7 months

DBpedia Autumn Hackathon, starting Sept 21st

by Sebastian Hellmann

Apologies for cross-posting

Dear DBpedians, Linked Data savvies and Ontologists,

We would like to invite you to join the DBpedia Autumn Hackathon 2020 as a new format to contribute to DBpedia, gain fame, win small prizes and experience the latest technology provided by DBpedia Association members. The hackathon is part of the Knowledge Graphs in Action conference on October 6, 2020. Please check here: https://wiki.dbpedia.org/meetings/KnowledgeGraphsInAction

# Timeline

* Registration of participants - main communication channel will be the #hackathon channel in DBpedia Slack (sign up https://dbpedia-slack.herokuapp.com/, then add yourself to the channel). If you wish to receive a reminder email on Sep 21st, you can leave your email address in this form: https://tinyurl.com/y24ps5jt
* Until September 14th - preparation phase: participating organisations prepare details and form tracks. Additional tracks can be proposed; please contact dbpedia-events(a)infai.org <mailto:dbpedia-events@infai.org>
* September 21st - Announcement of details for each track, including prizes, participating data, demos, tools and tasks. Check updates on the hackathon website https://wiki.dbpedia.org/events/dbpedia-autumn-hackathon-2020
* September 21st to October 1st - hacking period, coordinated via DBpedia Slack
* October 1st, 23:59 Hawaii Time - Submission of hacking results (3 min video and 2-3 paragraph summary with links, if not stated otherwise in the track)
* October 5th, 16:00 CEST - Final Event: each track chair presents a short recap of the track, announces prizes or summarizes the result of hacking.
* October 6th, 9:50 - 15:30 CEST - Knowledge Graphs in Action Event
* Results and videos are documented on the DBpedia Website and the DBpedia YouTube channel.

# Member Tracks

The member tracks are hosted by DBpedia Association members, who are technology leaders in the area of Knowledge Engineering. Additional tracks can be proposed until Sep 14th; please contact dbpedia-events(a)infai.org <mailto:dbpedia-events@infai.org>.

* timbr SQL Knowledge Graph: Learn how to model, map and query ontologies in timbr and then model an ontology of GDELT, map it to the GDELT database, and answer a number of questions that currently are quite impossible to get from the BigQuery GDELT database. Cash prizes planned. https://www.timbr.ai/
* GNOSS Knowledge Graph Builder: Give meaning to your organisation’s documents and data with a Knowledge Graph. https://www.gnoss.com/en/products/semantic-framework
* ImageSnippets: Labeling images with semantic descriptions. Use DBpedia Spotlight and an entity matching lookup to select DBpedia terms to describe images. Then explore the resulting dataset through searches over inference graphs and explore the ImageSnippets dataset through our SPARQL endpoint. Prizes planned. http://www.imagesnippets.com
* Diffbot: Build Your Own Knowledge Graph! Use the Natural Language API to extract triples from natural language text and expand these triples with data from the Diffbot Knowledge Graph (10+ billion entities, 1+ trillion facts). Check out the demo http://demo.nl.diffbot.com/. All participants will receive access to the Diffbot KG and tools for (non-commercial) research for one year ($10,000 value).

# Dutch National Knowledge Graph Track

Following the DBpedia FlexiFusion approach, we are currently flexi-fusing a huge, DBpedia-style knowledge graph that will connect many Linked Data sources and data silos relevant to the country of the Netherlands. We hope that this will eventually crystallize a well-connected sub-community linked open data (LOD) cloud in the same manner as DBpedia crystallized the original LOD cloud with some improvements (you could call it LOD Mark II). Data and hackathon details will be announced on 21st of September.

# Improve DBpedia Track

A community track, where everybody can participate and contribute in improving existing DBpedia components, in particular the extraction framework, the mappings, the ontology, data quality test cases, new extractors, links and other extensions. Best individual contributions will be acknowledged on the DBpedia website by anointing the WebID/Foaf profile. (chaired by Milan Dojchinovski and Marvin Hofer from the DBpedia Association & InfAI and the DBpedia Hacking Committee)

# DBpedia Open Innovation Track (not part of the hackathon, pre-announcement)

For the DBpedia Spring Event 2021, we are planning an Open Innovation Track, where DBpedians can showcase their applications. This endeavour will not be part of the hackathon as we are looking for significant showcases with development effort of months & years built on the core infrastructure of DBpedia such as the SPARQL endpoint, the data, lookup, spotlight, DBpedia Live, etc. Details will be announced during the Hackathon Final Event on October 5. (chaired by Heiko Paulheim et al.)

Stay tuned and stay safe!

With kind regards,
The DBpedia Organizing-Team

3 years, 8 months

Upcoming WMF/Research-Team Office hours on September 1st, 2020

by Martin Gerlach

Hi all,

Join the Research Team at the Wikimedia Foundation [1] for their monthly Office hours on 2020-09-01 at 16:00-17:00 (UTC). Through these office hours, we aim to make ourselves more available to answer some of the research-related questions that you as Wikimedia volunteer editors, organizers, affiliates, staff, and researchers face in your projects and initiatives (*).

To participate, join the video call via this Wikimedia-meet link [2]. There is no set agenda - feel free to add your item to the list of topics in the etherpad [3] (you can do this after you join the meeting, too); otherwise, you are welcome to just hang out. More detailed information (e.g. about how to attend) can be found here [4].

The office hours started at the beginning of 2020 as an experiment [5]. After the first 6 editions, we evaluated the scope and format of the Research office hours. In order to decrease barriers of accessibility and to facilitate more direct interaction, we have switched the format from IRC to video call. We will re-evaluate the current format at the end of the year. We would also be glad to hear your feedback and/or comments.

(*) Some example cases we hope to be able to support you in:

- You have a specific research-related question that you suspect you should be able to answer with the publicly available data and you don’t know how to find an answer for it, or you just need some more help with it. For example, how can I compute the ratio of anonymous to registered editors in my wiki?
- You run into repetitive or very manual work as part of your Wikimedia contributions and you wish to find out if there are ways to use machines to improve your workflows. These types of questions can sometimes be harder to answer during an office hour; however, discussing them can help us understand your challenges better, and we may find ways to work with each other to support you in addressing them in the future.
- You want to learn what the Research team at the Wikimedia Foundation does and how we can potentially support you. Specifically for affiliates: if you are interested in building relationships with the academic institutions in your country, we would love to talk with you and learn more. We have a series of programs that aim to expand the network of Wikimedia researchers globally and we would love to collaborate more closely with those of you interested in this space.
- You want to talk with us about one of our existing programs [6].

Hope to see many of you,
Martin (WMF Research Team)

[1] https://research.wikimedia.org/team.html
[2] https://meet.wmcloud.org/ResearchOfficeHours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
[4] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[5] https://lists.wikimedia.org/pipermail/wiki-research-l/2019-December/007039.…
[6] https://research.wikimedia.org/projects.html

--
Martin Gerlach
Research Scientist
Wikimedia Foundation

3 years, 8 months
