Hi all,
*TL;DR:*
So far, Wikipedia's full revision history has been available only in wiki
markup, not in HTML -- a big limitation for researchers. We are changing
this by releasing WikiHist.html, Wikipedia's full history (up until March
2019) in HTML:
https://zenodo.org/record/3605388
Caveat emptor: 7 TB!
Tweet:
https://twitter.com/cervisiarius/status/1301791239558311936
*More details:*
Wikipedia is written in the wikitext markup language. When serving content,
the MediaWiki software that powers Wikipedia parses wikitext to HTML,
thereby inserting additional content by expanding macros (templates and
modules). Hence, researchers who intend to analyze Wikipedia as seen by its
readers should work with HTML, rather than wikitext. Since Wikipedia’s
revision history is made publicly available by the Wikimedia Foundation
exclusively in wikitext format, researchers have had to produce HTML
themselves, typically by using Wikipedia’s REST API for ad-hoc
wikitext-to-HTML parsing. This approach, however, (1) does not scale to
very large amounts of data and (2) does not correctly expand macros in
historical article revisions.
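To make the ad-hoc approach concrete, here is a minimal Python sketch of a single wikitext-to-HTML call. It is an illustration only, not code from our pipeline: the endpoint path, the JSON field name "wikitext", and the User-Agent string are assumptions on my side, so please check them against the current REST API documentation before relying on them.

```python
import json
import urllib.request

# Assumed endpoint: the wikitext-to-HTML transform route of Wikipedia's REST API.
REST_URL = "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html"


def build_parse_request(wikitext):
    """Build (but do not send) a POST request asking the API to parse wikitext."""
    body = json.dumps({"wikitext": wikitext}).encode("utf-8")
    return urllib.request.Request(
        REST_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Hypothetical client name; use your own contact details.
            "User-Agent": "wikihist-example/0.1",
        },
        method="POST",
    )


def parse_to_html(wikitext):
    """Send the request and return the HTML response body (needs network access)."""
    with urllib.request.urlopen(build_parse_request(wikitext)) as resp:
        return resp.read().decode("utf-8")
```

Note that a call like this takes only the wikitext as input, so any templates it references are expanded in whatever form they have on the server at request time. Repeating one such network round trip per revision is also what makes the approach impractical for hundreds of millions of revisions.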
We have solved these problems by developing a parallelized architecture for
parsing massive amounts of wikitext using local instances of MediaWiki,
extended to expand macros correctly for historical revisions. By
deploying our system, we produce and hereby release WikiHist.html, English
Wikipedia’s full revision history in HTML format. It comprises the HTML
content of 580M revisions of 5.8M articles generated from the full English
Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019.
Boilerplate content such as page headers, footers, and navigation sidebars
is not included in the HTML.
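For intuition about historical macro expansion, here is a small Python sketch of the lookup I have in mind: given an article revision's timestamp, pick the template revision that was live at that time rather than today's version. This is a simplified illustration under that assumption, not the actual MediaWiki modification; the in-memory `history` structure and the function name `template_at` are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def template_at(history, name, when):
    """Return the template wikitext that was live at time `when`, or None.

    `history` maps a template name to a list of (timestamp, wikitext)
    pairs sorted by timestamp in ascending order (hypothetical layout).
    """
    revisions = history.get(name, [])
    timestamps = [ts for ts, _ in revisions]
    # Index of the first template revision strictly newer than `when`.
    idx = bisect_right(timestamps, when)
    if idx == 0:
        return None  # the template did not exist yet at that time
    return revisions[idx - 1][1]


# Toy data, made up for the example.
history = {
    "Template:Infobox": [
        (datetime(2005, 1, 1), "infobox v1"),
        (datetime(2012, 6, 1), "infobox v2"),
    ]
}
```

With this toy data, an article revision from 2010 would be expanded with "infobox v1", whereas parsing the same wikitext against current templates would use "infobox v2".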
For more details, please refer to
https://zenodo.org/record/3605388 and to the dataset paper:
Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English
Wikipedia’s Full Revision History in HTML Format. In *Proceedings of the
14th International AAAI Conference on Web and Social Media,* 2020.
https://arxiv.org/abs/2001.10256
Best regards,
Bob