As part of Wikimedia Germany's work around reference reuse, we wrote a
tool which processes the HTML dumps of all articles and produces
detailed information about how Cite references (and Kartographer maps)
are used on each page.
I'm writing to this list to ask for advice on how to publish the results
so that researchers can easily discover and consume the data.
Currently, the data consists of 3,100 JSON and NDJSON files hosted on a
Wikimedia Cloud VPS server, 3.4 GB in total. The outputs can be split
or merged into whatever form makes them more usable.
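For anyone curious what consuming the NDJSON outputs looks like, here is
a minimal sketch in Python. The field names ("page", "ref_count") are
hypothetical placeholders, not the real columns; see the Phabricator
task linked below for the actual schema.

```python
import io
import json

def iter_records(fileobj):
    """Yield one parsed record per NDJSON line, skipping blank lines."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)

# In-memory stand-in for one of the dump files; real usage would be
# open("some-output.ndjson") with the actual field names.
sample = io.StringIO(
    '{"page": "Example", "ref_count": 12}\n'
    '{"page": "Example2", "ref_count": 3}\n'
)
records = list(iter_records(sample))
```

Because NDJSON is one JSON object per line, files can be streamed,
split, or concatenated without any special tooling.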
For an overview of the columns and sample rows, please see this task:
https://phabricator.wikimedia.org/T341751
We plan to run the scraper again in the future, and its modular
architecture makes it simple to include or exclude information, so
suggestions about what else we might want to extract from rendered
articles are welcome. To read more about the tool itself and why we
decided to process HTML dumps directly, see this post:
https://mw.ludd.net/wiki/Elixir/HTML_dump_scraper
-Adam Wight
[[mw:Adamw]]
https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes