Dear All,
I am User:Hydriz on Wikimedia wikis and I am working on a grant
proposal to facilitate browsing and downloading of Wikimedia datasets
(including the database dumps as well as other datasets). It is a
proposed rewrite of the existing system, which focuses primarily on
archiving the datasets to the Internet Archive. [1]
My proposal aims to modernize the software used for automatically
archiving datasets to the Internet Archive. More importantly, it aims
to put researchers and downloaders first, by providing both a
human-readable and a machine-readable interface for browsing and
downloading datasets, both current and historical. I also intend to
integrate a "watchlist" feature that can automatically notify users
when new datasets are available.
Please do express your support for this proposal and help make this
project a reality. Thank you!
Warmest regards,
Hydriz Scholz
[1]: https://meta.wikimedia.org/wiki/Grants:Project/Hydriz/Balchivist_2.0
I intend to collect all dumps in all languages, including images, and
I don't want to break any laws (I'm in Australia).
So far, I'm just experimenting with compressing the smaller dumps.
A few things to note:
By uncompressing and renaming the text dump files (removing the
"<wikiname>-<wikidumpdate>-*" prefix) and then using rdiff-backup, each
subsequent text dump takes only a fraction of a percent of the space of
the previous dump.
For example, with enwikinews (without multistream or .7z):
original compression: 1.79GB
recompressed with xz -9e: 974MB
uncompressed: 22GB
uncompressed with rdiff-backup: 22GB, plus only about 200MB for each
subsequent monthly dump (the 202???01 runs, not the 202???20 runs)
rdiff-backup uses the rsync delta library (librsync).
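A minimal sketch of that workflow (the wiki name, dump date, directory
layout, and file names here are all illustrative, not the real dump
naming scheme):

```shell
#!/bin/sh
# Sketch: decompress a monthly dump, strip the "<wikiname>-<wikidumpdate>-"
# prefix so file names stay stable from month to month, then snapshot the
# working copy with rdiff-backup so each later month costs only a delta.
set -eu

WIKI=enwikinews          # illustrative wiki name
DATE=20210201            # illustrative dump date
SRC=dumps/$WIKI/$DATE    # where the downloaded .bz2 files live
WORK=work/$WIKI          # uncompressed working copy with stable names
REPO=backup/$WIKI        # rdiff-backup repository

mkdir -p "$WORK" "$REPO"

for f in "$SRC"/*.bz2; do
    [ -e "$f" ] || continue            # nothing downloaded yet; skip
    base=$(basename "$f" .bz2)
    stable=${base#"$WIKI-$DATE-"}      # drop the wiki/date prefix
    bzip2 -dkc "$f" > "$WORK/$stable"  # decompress, keep the original
done

# Each run stores only a reverse delta against the previous snapshot,
# which is where the ~200MB-per-month figure comes from.
if command -v rdiff-backup >/dev/null 2>&1; then
    rdiff-backup "$WORK" "$REPO"
fi
```

The prefix strip is the important part: rdiff-backup (like rsync) matches
files by path, so the names must be identical across months for the
delta encoding to kick in.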
I intend to use the dumps for machine learning (when I study machine
learning in a few years), so I thought getting used to dealing with
huge amounts of data would be a good head start. I'm also concerned
about the dumps one day becoming corrupted, so I want copies now.
I suggest that SHA-256 checksums be calculated for distribution (as
well as for the uncompressed files!), since Google has demonstrated a
practical collision attack against SHA-1, and MD5 is already completely
broken.
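Generating and later verifying such checksums is straightforward with
coreutils; a small sketch (the file name and contents are illustrative):

```shell
#!/bin/sh
# Sketch: record SHA-256 checksums for both the compressed and the
# uncompressed form of a dump file, then verify them. File names and
# contents here are made up for illustration.
set -eu

mkdir -p demo
printf '<mediawiki>example page text</mediawiki>\n' > demo/pages-articles.xml
xz -kf demo/pages-articles.xml   # -k keeps the uncompressed copy

# Checksum both forms, so corruption is detectable either way.
( cd demo && sha256sum pages-articles.xml pages-articles.xml.xz > SHA256SUMS )

# Later, or on another machine: verify against the recorded sums.
( cd demo && sha256sum -c SHA256SUMS )
```

Checksumming the uncompressed files as well matters because a
recompression (e.g. bz2 to xz) changes the compressed checksum even
when the content is untouched.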
Anyway, I am just concerned about storing these dumps, especially the
images, but also the text.
What are the chances of any of the images being of illegal content?
Have there been cases of illegal images being stored in these dumps
before? What happened? What was the process?
If I find illegal images, do I just report them to this list?
What happens if I'm unaware of illegal content in these dumps?
Is there such a thing as text being illegal? Can you elaborate?
Also, the statistics below: are those sizes in page counts, bytes, or
megabytes?
Be awesome,
Griffin Tucker
On Fri, 5 Mar 2021 at 23:00, <xmldatadumps-l-request(a)lists.wikimedia.org> wrote:
>
> Send Xmldatadumps-l mailing list submissions to
> xmldatadumps-l(a)lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
> or, via email, send a message with subject or body 'help' to
> xmldatadumps-l-request(a)lists.wikimedia.org
>
> You can reach the person managing the list at
> xmldatadumps-l-owner(a)lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Xmldatadumps-l digest..."
>
>
> Today's Topics:
>
> 1. XML Dumps FAQ monthly update (noreply.xmldatadumps(a)wikimedia.org)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 04 Mar 2021 13:52:30 +0000
> From: noreply.xmldatadumps(a)wikimedia.org
> To: xmldatadumps-l(a)lists.wikimedia.org
> Subject: [Xmldatadumps-l] XML Dumps FAQ monthly update
> Message-ID: <20210304135230.bBj8c%noreply.xmldatadumps(a)wikimedia.org>
>
>
> Greetings XML Dump users and contributors!
>
> This is your automatic monthly Dumps FAQ update email. This update
> contains figures for the 20210201 full revision history content run.
>
> We are currently dumping 935 projects in total.
>
>
> ---------------------
> Stats for bugwiki on date 20210201
>
> Total size of page content dump files for articles, current content only:
> 19,778,622
>
> Total size of page content dump files for all pages, current content only:
> 24,718,009
>
> Total size of page content dump files for all pages, all revisions:
> 371,063,136
> ---------------------
> Stats for enwiki on date 20210201
>
> Total size of page content dump files for articles, current content only:
> 81,801,780,624
>
> Total size of page content dump files for all pages, current content only:
> 181,026,335,781
>
> Total size of page content dump files for all pages, all revisions:
> 22,229,491,552,833
> ---------------------
>
>
> Sincerely,
>
> Your friendly Wikimedia Dump Info Collector
>
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> Xmldatadumps-l mailing list
> Xmldatadumps-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
>
> ------------------------------
>
> End of Xmldatadumps-l Digest, Vol 127, Issue 1
> **********************************************