Hi all,
*TL;DR:*
So far, Wikipedia's full revision history has been available only in wiki
markup, not in HTML -- a big limitation for researchers. We are changing
this by releasing WikiHist.html, Wikipedia's full history (up until March
2019) in HTML:
https://zenodo.org/record/3605388
Caveat emptor: 7 TB!
Tweet: https://twitter.com/cervisiarius/status/1301791239558311936
*More details:*
Wikipedia is written in the wikitext markup language. When serving content,
the MediaWiki software that powers Wikipedia parses wikitext to HTML,
thereby inserting additional content by expanding macros (templates and
modules). Hence, researchers who intend to analyze Wikipedia as seen by its
readers should work with HTML, rather than wikitext. Since Wikipedia’s
revision history is made publicly available by the Wikimedia Foundation
exclusively in wikitext format, researchers have had to produce HTML
themselves, typically by using Wikipedia’s REST API for ad-hoc
wikitext-to-HTML parsing. This approach, however, (1) does not scale to
very large amounts of data and (2) does not correctly expand macros in
historical article revisions.
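
For illustration, the ad-hoc approach described above might look like the
following minimal Python sketch. It assumes the REST API's
transform/wikitext/to/html route and a JSON request body; the function
names, the User-Agent string, and the sample title are hypothetical and are
not part of WikiHist.html. Templates are expanded as they exist at request
time, which is the source of limitation (2).

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public REST API base URL for English Wikipedia.
REST_BASE = "https://en.wikipedia.org/api/rest_v1"


def build_transform_request(wikitext, title):
    """Build (but do not send) a POST request asking the REST API to
    render `wikitext` as HTML in the context of the page `title`."""
    quoted_title = urllib.parse.quote(title, safe="")
    url = f"{REST_BASE}/transform/wikitext/to/html/{quoted_title}"
    # Assumption: the endpoint accepts a JSON body with these fields.
    body = json.dumps({"wikitext": wikitext, "body_only": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Hypothetical contact string identifying the client.
            "User-Agent": "wikihist-example/0.1 (research@example.org)",
        },
    )


def render(wikitext, title):
    """Send the request and return the HTML as a string (needs network)."""
    request = build_transform_request(wikitext, title)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


# Building the request is side-effect free; only render() touches the network.
example = build_transform_request("'''Hello''' {{citation needed}}", "Albert Einstein")
```

One request per revision is what makes this approach impractical for
hundreds of millions of revisions.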
We have solved these problems by developing a parallelized architecture for
parsing massive amounts of wikitext using local instances of MediaWiki,
enhanced with the capacity of correct historical macro expansion. By
deploying our system, we produce and hereby release WikiHist.html, English
Wikipedia’s full revision history in HTML format. It comprises the HTML
content of 580M revisions of 5.8M articles generated from the full English
Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019.
Boilerplate content such as page headers, footers, and navigation sidebars
is not included in the HTML.
For more details, please refer to https://zenodo.org/record/3605388
and to the dataset paper:
Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English
Wikipedia’s Full Revision History in HTML Format. In *Proceedings of the
14th International AAAI Conference on Web and Social Media,* 2020.
https://arxiv.org/abs/2001.10256
Best regards,
Bob
Apologies for cross-posting
Dear DBpedians, Linked Data savvies and Ontologists,
We would like to invite you to join the DBpedia Autumn Hackathon 2020 as
a new format to contribute to DBpedia, gain fame, win small prizes and
experience the latest technology provided by DBpedia Association
members. The hackathon is part of the Knowledge Graphs in Action
conference on October 6, 2020. Please check here:
https://wiki.dbpedia.org/meetings/KnowledgeGraphsInAction
# Timeline
* Registration of participants - the main communication channel will be
  the #hackathon channel in DBpedia Slack (sign up at
  https://dbpedia-slack.herokuapp.com/, then add yourself to the
  channel). If you wish to receive a reminder email on Sep 21st, you
  can leave your email address in this form: https://tinyurl.com/y24ps5jt
* Until September 14th - preparation phase: participating
  organisations prepare details and tracks are formed. Additional tracks
  can be proposed; please contact dbpedia-events@infai.org
* September 21st - Announcement of details for each track, including
  prizes, participating data, demos, tools and tasks. Check for updates
  on the hackathon website
  https://wiki.dbpedia.org/events/dbpedia-autumn-hackathon-2020
* September 21st to October 1st - hacking period, coordinated via
  DBpedia Slack
* October 1st, 23:59 Hawaii Time - Submission of hacking results (3
  min video and 2-3 paragraph summary with links, if not stated
  otherwise in the track)
* October 5th, 16:00 CEST - Final Event: each track chair presents a
  short recap of the track and announces prizes or summarizes the result
  of hacking.
* October 6th, 9:50 - 15:30 CEST - Knowledge Graphs in Action Event
* Results and videos will be documented on the DBpedia website and the
  DBpedia YouTube channel.
# Member Tracks
The member tracks are hosted by DBpedia Association members, who are
technology leaders in the area of Knowledge Engineering. Additional
tracks can be proposed until Sep 14th; please contact
dbpedia-events@infai.org.
* timbr SQL Knowledge Graph: Learn how to model, map and query
  ontologies in timbr and then model an ontology of GDELT, map it to
  the GDELT database, and answer a number of questions that currently
  are quite impossible to get from the BigQuery GDELT database. Cash
  prizes planned. https://www.timbr.ai/
* GNOSS Knowledge Graph Builder: Give meaning to your organisation’s
  documents and data with a Knowledge Graph.
  https://www.gnoss.com/en/products/semantic-framework
* ImageSnippets: Labeling images with semantic descriptions. Use
  DBpedia spotlight and an entity matching lookup to select DBpedia
  terms to describe images. Then explore the resulting dataset through
  searches over inference graphs and explore the ImageSnippets dataset
  through our SPARQL endpoint. Prizes planned.
  http://www.imagesnippets.com
* Diffbot: Build Your Own Knowledge Graph! Use the Natural Language
  API to extract triples from natural language text and expand these
  triples with data from the Diffbot Knowledge Graph (10+ billion
  entities, 1+ trillion facts). Check out the demo
  http://demo.nl.diffbot.com/. All participants will receive access to
  the Diffbot KG and tools for (non-commercial) research for one year
  ($10,000 value).
# Dutch National Knowledge Graph Track
Following the DBpedia FlexiFusion approach, we are currently
flexi-fusing a huge, dbpedia-style knowledge graph that will connect
many Linked Data sources and data silos relevant to the country of the
Netherlands. We hope that this will eventually crystallize a
well-connected sub-community linked open data (LOD) cloud in the same
manner as DBpedia crystallized the original LOD cloud with some
improvements (you could call it LOD Mark II). Data and hackathon details
will be announced on 21st of September.
# Improve DBpedia Track
A community track where everybody can participate and contribute to
improving existing DBpedia components, in particular the extraction
framework, the mappings, the ontology, data quality test cases, new
extractors, links and other extensions. Best individual contributions
will be acknowledged on the DBpedia website by anointing the WebID/Foaf
profile.
(chaired by Milan Dojchinovski and Marvin Hofer from the DBpedia
Association & InfAI and the DBpedia Hacking Committee)
# DBpedia Open Innovation Track
(not part of the hackathon, pre-announcement)
For the DBpedia Spring Event 2021, we are planning an Open Innovation
Track, where DBpedians can showcase their applications. This endeavour
will not be part of the hackathon as we are looking for significant
showcases with development effort of months & years built on the core
infrastructure of DBpedia such as the SPARQL endpoint, the data, lookup,
spotlight, DBpedia Live, etc. Details will be announced during the
Hackathon Final Event on October 5.
(chaired by Heiko Paulheim et al.)
Stay tuned and stay safe!
With kind regards,
The DBpedia Organizing-Team
Hi all,
Join the Research Team at the Wikimedia Foundation [1] for their monthly
Office hours on 2020-09-01 at 16.00-17.00 (UTC).
Through these office hours, we aim to make ourselves more available to
answer some of the research related questions that you as Wikimedia
volunteer editors, organizers, affiliates, staff, and researchers face in
your projects and initiatives (*).
To participate, join the video call via this Wikimedia-meet link [2]. There
is no set agenda - feel free to add your item to the list of topics in the
etherpad [3] (you can also do this after you join the meeting). Otherwise,
you are welcome to just hang out. More detailed information (e.g.
about how to attend) can be found here [4].
We started the Research office hours at the beginning of 2020 as an
experiment [5]. After the first 6 editions, we evaluated their scope and
format. To lower barriers to access and to facilitate more direct
interaction, we have switched the format from IRC to video
call. We will re-evaluate the current format at the end of the year. We
would also be glad to hear your feedback and/or comments.
(*) Some example cases we hope to be able to support you in:
- You have a specific research-related question that you suspect you
  should be able to answer with the publicly available data, but you don’t
  know how to find an answer for it, or you just need some more help with it.
  For example: how can I compute the ratio of anonymous to registered editors
  in my wiki?
- You run into repetitive or very manual work as part of your Wikimedia
  contributions and you wish to find out if there are ways to use machines to
  improve your workflows. These types of questions can sometimes be
  harder to answer during an office hour. However, discussing
  them can help us understand your challenges better, and we may find ways to
  work with each other to support you in addressing them in the future.
- You want to learn what the Research team at the Wikimedia Foundation
  does and how we can potentially support you. Specifically for affiliates:
  if you are interested in building relationships with the academic
  institutions in your country, we would love to talk with you and learn
  more. We have a series of programs that aim to expand the network of
  Wikimedia researchers globally, and we would love to collaborate more
  closely with those of you interested in this space.
- You want to talk with us about one of our existing programs [6].
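
As a rough illustration of the first example case above (the ratio of
anonymous to registered editors), here is a minimal Python sketch. It
assumes you have already extracted one (editor, is_anonymous) pair per
revision from the public data; the function name and the sample records are
hypothetical, and counting distinct editors (rather than edits) is just one
possible definition.

```python
def anon_to_registered_ratio(revisions):
    """Return (#distinct anonymous editors) / (#distinct registered editors).

    `revisions` is an iterable of (editor, is_anonymous) pairs,
    one pair per revision.
    """
    anonymous, registered = set(), set()
    for editor, is_anonymous in revisions:
        # Sets deduplicate editors who made several revisions.
        (anonymous if is_anonymous else registered).add(editor)
    if not registered:
        raise ValueError("no registered editors in the sample")
    return len(anonymous) / len(registered)


# Hypothetical sample: anonymous editors appear under their IP address.
sample = [
    ("192.0.2.1", True),
    ("Alice", False),
    ("192.0.2.1", True),
    ("Bob", False),
    ("198.51.100.7", True),
    ("Carol", False),
    ("Dave", False),
]
ratio = anon_to_registered_ratio(sample)
print(ratio)
```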
Hope to see many of you,
Martin (WMF Research Team)
[1] https://research.wikimedia.org/team.html
[2] https://meet.wmcloud.org/ResearchOfficeHours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
[4] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[5] https://lists.wikimedia.org/pipermail/wiki-research-l/2019-December/007039.…
[6] https://research.wikimedia.org/projects.html
--
Martin Gerlach
Research Scientist
Wikimedia Foundation