Dear All,
I am User:Hydriz on Wikimedia wikis and I am working on a grant proposal to facilitate browsing and downloading of Wikimedia datasets (including the database dumps as well as other datasets). It is a proposed rewrite of the existing system which focused primarily on archiving the datasets to the Internet Archive. [1]
My proposal aims to modernize the software used for automatically archiving datasets to the Internet Archive. More importantly, it aims to put researchers and downloaders first, by providing both a human-readable and a machine-readable interface for browsing and downloading datasets, whether present or historical. I also intend to integrate a "watchlist" feature that can automatically notify users when new datasets are available.
Please do express your support for this proposal and help make this project a reality. Thank you!
Warmest regards,
Hydriz Scholz
[1]: https://meta.wikimedia.org/wiki/Grants:Project/Hydriz/Balchivist_2.0
Hi all,
On 15.03.21 02:57, Hydriz Scholz wrote:
> I also intend to integrate a "watchlist" feature that can automatically notify users when new datasets are available.
I am not sure whether this, i.e. mailbox notification, is a killer feature for human users. We have been using the Wikimedia dumps for DBpedia for 13 years now and implemented a download function [1]. However, it does not run optimally. I think it still uses the links in the HTML page to find the download URLs.
The way we implemented it: you request a download for a date (e.g. 2021-01-01), and it tries to download the dumps from the beginning of that month. It fails if it doesn't find some of them, and you need to re-run it later.
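To make the behaviour described above concrete, here is a minimal sketch and not the actual DBpedia code. The function names, the `exists` callback, and the URL layout under `dumps.wikimedia.org` are assumptions for illustration only.

```python
from datetime import date
from typing import Callable, Iterable, List

# Assumed host and path layout; adjust if the real layout differs.
BASE_URL = "https://dumps.wikimedia.org"


def dump_url(wiki: str, requested: date, suffix: str) -> str:
    """Build the URL for one dump file, snapping the date to the 1st of its month."""
    stamp = requested.replace(day=1).strftime("%Y%m%d")
    return f"{BASE_URL}/{wiki}/{stamp}/{wiki}-{stamp}-{suffix}"


def missing_dumps(
    wiki: str,
    requested: date,
    suffixes: Iterable[str],
    exists: Callable[[str], bool],
) -> List[str]:
    """Return the URLs that `exists` reports as not yet published.

    `exists` is a hypothetical callback (e.g. an HTTP HEAD check) supplied
    by the caller, so this function itself makes no network requests.
    """
    urls = [dump_url(wiki, requested, suffix) for suffix in suffixes]
    return [url for url in urls if not exists(url)]
```

A caller would treat a non-empty result as "fail now, re-run later", which is the manual step an availability API could replace.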
It would be nice to have an API to check for availability and to define sets. We are in the process of open-sourcing databus.dbpedia.org, a registry offering this functionality for any files, i.e. shasums, download URLs, an API for querying, machine-readable and actionable licenses, etc. We will put the Wikimedia dumps on the bus eventually.
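As a rough illustration of what "check for availability and define sets" could mean for a client, here is a small sketch. It is not the Databus API; `DatasetFile`, `unavailable`, and `checksum_matches` are hypothetical names, and the registry is modelled as a plain dictionary.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class DatasetFile:
    """One registry entry (hypothetical shape): a file, its URL, and its checksum."""
    name: str
    download_url: str
    sha256: str


def unavailable(required: Iterable[str], registry: Dict[str, DatasetFile]) -> List[str]:
    """Return the names from a user-defined set that the registry does not list yet."""
    return [name for name in required if name not in registry]


def checksum_matches(payload: bytes, entry: DatasetFile) -> bool:
    """Check downloaded bytes against the checksum recorded in the registry."""
    return hashlib.sha256(payload).hexdigest() == entry.sha256
```

With such a check, a client can wait until `unavailable` returns an empty list before downloading, instead of failing partway through.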
For our part, we would value the ability to work with the dumps programmatically over yet another notification, but others might have different opinions.
-- Sebastian
[1] https://github.com/dbpedia/extraction-framework/blob/a334ac2af877531a082dc9a...
wiki-research-l@lists.wikimedia.org