As part of Wikimedia Germany's work around reference reuse, we wrote a tool which processes the HTML dumps of all articles and produces detailed information about how Cite references (and Kartographer maps) are used on each page.
I'm writing to this list for advice on how to publish the results so that the data can be easily discovered and consumed by researchers. Currently, the data is contained in 3,100 JSON and NDJSON files hosted on a Wikimedia Cloud VPS server, with a total size of 3.4GB. The outputs can be split or merged into whatever form makes them most usable.
For an overview of the columns and sample rows, please see this task: https://phabricator.wikimedia.org/T341751
We plan to run the scraper again in the future, and its modular architecture makes it simple to add or drop extracted fields, so please let us know if there is anything else you would like to see extracted from rendered articles. To read more about the tool itself and why we decided to process HTML dumps directly, see this post: https://mw.ludd.net/wiki/Elixir/HTML_dump_scraper
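To give a rough picture of what the scraper looks for, here is a simplified sketch of counting <ref> uses in a single page's Parsoid HTML. This is illustrative Python, not our production code (the tool itself is written in Elixir), and it assumes Parsoid's typeof="mw:Extension/ref" / data-mw annotations; adjust the selectors if the markup differs.

    import json
    from collections import Counter
    from bs4 import BeautifulSoup

    def ref_usage(page_html):
        """Count how often each named <ref> is used on one page."""
        soup = BeautifulSoup(page_html, "html.parser")
        counts = Counter()
        # Parsoid marks Cite output with typeof="mw:Extension/ref" and keeps
        # the original arguments (including the ref name) in data-mw.
        for node in soup.select('[typeof~="mw:Extension/ref"]'):
            data_mw = json.loads(node.get("data-mw", "{}"))
            name = data_mw.get("attrs", {}).get("name")
            counts[name or "<unnamed>"] += 1
        return counts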
-Adam Wight [[mw:Adamw]]
We've just published the full dataset of <ref> (citation) and map usage across wikis. The metadata is here:
https://figshare.com/articles/dataset/Reference_and_map_usage_across_Wikimed...
and the raw data here:
https://analytics.wikimedia.org/published/datasets/one-off/html-dump-scraper...
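The NDJSON files can be streamed record by record rather than loaded whole. Here is a minimal Python sketch; the file name below is a placeholder, and the actual file names and columns are documented in the directory listing and the Phabricator task above.

    import json

    def iter_records(path):
        """Stream one NDJSON output file without loading it all into memory."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    # "enwiki-refs.ndjson" is a placeholder file name.
    for record in iter_records("enwiki-refs.ndjson"):
        ...  # each record describes ref/map usage for one rendered page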
Feel free to reply with questions or suggestions; I hope you find the results useful in your own work!
Kind regards, Adam W. [[mw:Adamw]]
for https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes