Dear All,
I am User:Hydriz on Wikimedia wikis, and I am working on a grant proposal to facilitate browsing and downloading of Wikimedia datasets (including the database dumps as well as other datasets). It is a proposed rewrite of the existing system, which focuses primarily on archiving the datasets to the Internet Archive. [1]
My proposal aims to modernize the software used for automatically archiving datasets to the Internet Archive. More importantly, it aims to put researchers and downloaders first, by providing both a human-readable and a machine-readable interface for browsing and downloading datasets, whether present or historical. I also intend to integrate a "watchlist" feature that can automatically notify users when new datasets are available.
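As a rough illustration (the interface is not yet designed, so the endpoint, path, and field names below are purely hypothetical placeholders), a researcher could query the machine-readable interface like this:

import json
import urllib.request

# Hypothetical sketch only: Balchivist 2.0's API does not exist yet, so
# this base URL, the /datasets path, and the response fields are made up.
BASE = "https://balchivist.example.org/api/v1"

def list_datasets(project):
    # Ask the machine-readable interface for current and historical datasets.
    url = f"{BASE}/datasets?project={project}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

for ds in list_datasets("enwiki"):
    print(ds["date"], ds["name"], ds["archive_url"])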
Please do express your support for this proposal and help make this project a reality. Thank you!
Warmest regards,
Hydriz Scholz
[1]: https://meta.wikimedia.org/wiki/Grants:Project/Hydriz/Balchivist_2.0
Are you going to implement retention times for datasets, and removal of data under GDPR orders when asked?
Thank you for your question.
The datasets are intended to be retained forever, as researchers may want access to historical data. If any removal is necessary for compliance with local and international laws, it will be primarily handled by the Internet Archive, as they are the ones storing the data.
Warmest regards,
Hydriz Scholz
The dumps need retention dates applied to them for audit compliance. Removal under GDPR is also important to address, both for audit compliance and for public confidence that GDPR is being adhered to.
Colin
Hi all,
On 15.03.21 02:57, Hydriz Scholz wrote:
I also intend to integrate a "watchlist" feature that can automatically notify users when new datasets are available.
I'm not sure this is a killer feature for human users, i.e. mailbox notification. We have been using the Wikimedia dumps for 13 years now for DBpedia and have implemented a download function [1]. However, it is not running optimally; I think it still uses the links in the HTML page to find the download URLs.
The way we implemented it is: you call download with a date (e.g. download(2021-01-01)), and it tries to fetch the dumps from the beginning of the month; it fails if it doesn't find some yet, and you need to re-run later.
It would be nice to have an API to check for availability and to define sets. We are in the process of open-sourcing databus.dbpedia.org, which is a registry offering this functionality for any files, i.e. shasums, download URLs, an API for querying, machine-readable and actionable licenses, etc. We will put the Wikimedia dumps on the bus eventually.
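A minimal sketch of what such an availability check could look like even today, assuming the per-run dumpstatus.json file that dumps.wikimedia.org publishes (the function name, parameters, and polling interval here are made up):

import json
import time
import urllib.error
import urllib.request

def wait_for_dump(wiki, run_date, job="articlesdump", interval=3600):
    # Poll the run's status file until the job reports "done", then return
    # its file list (name -> size/url/checksums) instead of scraping HTML.
    url = f"https://dumps.wikimedia.org/{wiki}/{run_date}/dumpstatus.json"
    while True:
        try:
            with urllib.request.urlopen(url) as resp:
                status = json.load(resp)
            job_info = status["jobs"][job]
            if job_info.get("status") == "done":
                return job_info["files"]
        except (urllib.error.HTTPError, KeyError):
            pass  # run directory or job entry not published yet
        time.sleep(interval)  # wait and re-check instead of failing outright

files = wait_for_dump("enwiki", "20210101")
for name, meta in files.items():
    print(name, meta.get("sha1"))

Something like that would replace the fail-and-re-run loop with a single blocking call.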
For me/us, the ability to work with them programmatically would be more valuable than yet another notification, but others might have different opinions.
-- Sebastian
[1] https://github.com/dbpedia/extraction-framework/blob/a334ac2af877531a082dc9a...