Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20210801 full revision history content run.
We are currently dumping 942 projects in total.
---------------------
Stats for ilowiki on date 20210801
Total size of page content dump files for articles, current content only:
110,426,279
Total size of page content dump files for all pages, current content only:
120,468,480
Total size of page content dump files for all pages, all revisions:
1,661,587,793
---------------------
Stats for enwiki on date 20210801
Total size of page content dump files for articles, current content only:
84,775,627,212
Total size of page content dump files for all pages, current content only:
186,886,040,561
Total size of page content dump files for all pages, all revisions:
23,185,464,570,213
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
I've been using the monthly page view summaries from pagecounts-ez. Now on https://dumps.wikimedia.org/other/pagecounts-ez/ it says:
"NOTE: This dataset has had some problems and we are no longer generating new data, since September 2020. We are phasing it out in favor of Pageviews Complete... When it's finished we will announce it widely and explain how to migrate."
Are the announcement and explanation available somewhere? I'm having problems because:
1. The "totals" files, such as https://dumps.wikimedia.org/other/pagecounts-ez/merged/pagecounts-2020-08-v…, which are of the order of 500Mb per month seem to have no equivalents in the new pageview complete dump archives. The monthly files at https://dumps.wikimedia.org/other/pageview_complete/monthly/2020/2020-08/ are 10x larger (and I can't find any description of what the "automated" "user" and "spider" files represent, although I can guess)
2. If I download (say) https://dumps.wikimedia.org/other/pageview_complete/monthly/2020/2020-08/pa… and peek at the file using bzless, it seems to contain lots of binary characters; it's not clear to me what the format is or how to decode it (a sketch of how I'm peeking at it is below). Is there any information online that would help me?
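In case it helps, here is roughly how I have been trying to inspect the file locally: a minimal sketch that assumes the payload is plain bzip2-compressed text (the filename is only an example standing in for whichever monthly file I downloaded).

import bz2
import itertools

# Example filename only; substitute the monthly file you actually downloaded.
path = "pageviews-2020-08-user.bz2"

# Stream-decompress and show the first few lines; if the format really is
# plain text, these should come out as readable space-separated fields.
with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
    for line in itertools.islice(f, 10):
        print(line.rstrip("\n"))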
Thanks for any pointers that might help.
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20210701 full revision history content run.
We are currently dumping 941 projects in total.
---------------------
Stats for azwikisource on date 20210701
Total size of page content dump files for articles, current content only:
54,945,209
Total size of page content dump files for all pages, current content only:
62,037,845
Total size of page content dump files for all pages, all revisions:
349,972,631
---------------------
Stats for enwiki on date 20210701
Total size of page content dump files for articles, current content only:
84,219,557,413
Total size of page content dump files for all pages, current content only:
185,886,341,450
Total size of page content dump files for all pages, all revisions:
23,026,883,007,314
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20210601 full revision history content run.
We are currently dumping 938 projects in total.
---------------------
Stats for bowikibooks on date 20210601
Total size of page content dump files for articles, current content only:
21,939
Total size of page content dump files for all pages, current content only:
120,566
Total size of page content dump files for all pages, all revisions:
239,642
---------------------
Stats for enwiki on date 20210601
Total size of page content dump files for articles, current content only:
83,761,691,097
Total size of page content dump files for all pages, current content only:
184,983,464,749
Total size of page content dump files for all pages, all revisions:
22,873,288,105,970
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20210501 full revision history content run.
We are currently dumping 938 projects in total.
---------------------
Stats for knwikiquote on date 20210501
Total size of page content dump files for articles, current content only:
2,424,605
Total size of page content dump files for all pages, current content only:
3,075,235
Total size of page content dump files for all pages, all revisions:
137,872,620
---------------------
Stats for enwiki on date 20210501
Total size of page content dump files for articles, current content only:
83,279,070,675
Total size of page content dump files for all pages, current content only:
184,008,595,449
Total size of page content dump files for all pages, all revisions:
22,704,503,957,165
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20210401 full revision history content run.
We are currently dumping 938 projects in total.
---------------------
Stats for zhwikiversity on date 20210401
Total size of page content dump files for articles, current content only:
43,134,248
Total size of page content dump files for all pages, current content only:
47,565,022
Total size of page content dump files for all pages, all revisions:
2,950,412,627
---------------------
Stats for enwiki on date 20210401
Total size of page content dump files for articles, current content only:
82,787,618,854
Total size of page content dump files for all pages, current content only:
183,015,180,891
Total size of page content dump files for all pages, all revisions:
22,543,999,212,092
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Hello everyone,
I am looking for gzipped Wikidata JSON dumps, of the “all” variety, for the following dates:
• 2020-05-18
• 2020-03-02
• 2017-08-21
• 2017-06-26
Please let me know if you have any of these.
Thank you,
James Hare
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20210301 full revision history content run.
We are currently dumping 938 projects in total.
---------------------
Stats for miwikibooks on date 20210301
Total size of page content dump files for articles, current content only:
30,327
Total size of page content dump files for all pages, current content only:
133,657
Total size of page content dump files for all pages, all revisions:
318,859
---------------------
Stats for enwiki on date 20210301
Total size of page content dump files for articles, current content only:
82,267,477,533
Total size of page content dump files for all pages, current content only:
181,997,120,787
Total size of page content dump files for all pages, all revisions:
22,373,809,360,853
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Dear All,
I am User:Hydriz on Wikimedia wikis and I am working on a grant
proposal to facilitate browsing and downloading of Wikimedia datasets
(including the database dumps as well as other datasets). It is a
proposed rewrite of the existing system, which focuses primarily on
archiving the datasets to the Internet Archive [1].
My proposal aims to modernize the software used for automatically
archiving datasets to the Internet Archive. More importantly, it aims
to put researchers and downloaders first, by providing both a
human-readable and a machine-readable interface for browsing and
downloading datasets, whether present or historical. I also intend to
integrate a "watchlist" feature that can automatically notify users
when new datasets are available.
Please do express your support for this proposal and help make this
project a reality. Thank you!
Warmest regards.
Hydriz Scholz
[1]: https://meta.wikimedia.org/wiki/Grants:Project/Hydriz/Balchivist_2.0
I intend to collect all dumps in all languages, including images,
and I don't want to break any laws (I'm in Australia).
So far, I'm just experimenting with compression of the smaller dumps.
A few things to note:
By uncompressing and renaming the text dump files (removing the
"<wikiname>-<wikidumpdate>-" prefix) and then using rdiff-backup, each
subsequent text dump takes only a fraction of a percent of the space of
the previous dump.
For example, with enwikinews (without multistream or .7z):
original compression: 1.79 GB
recompressed with xz -9e: 974 MB
uncompressed: 22 GB
uncompressed with rdiff-backup: 22 GB for the first dump, then only about
200 MB for each subsequent monthly dump (the 202???01 runs, not the
202???20 runs)
rdiff-backup uses the rsync delta algorithm (via librsync), so only the
differences between successive dumps are stored.
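Roughly, the workflow looks like this (a sketch only, not a polished script;
the paths, the prefix pattern, and the dump date are just examples, and it
assumes rdiff-backup accepts the plain "source destination" invocation):

import bz2
import pathlib
import re
import shutil
import subprocess

# Example locations only.
DUMP_DIR = pathlib.Path("downloads/enwikinews/20210801")  # freshly downloaded dump files
WORK_DIR = pathlib.Path("work/enwikinews")                 # stable filenames live here
BACKUP_REPO = pathlib.Path("backups/enwikinews")           # rdiff-backup repository

WORK_DIR.mkdir(parents=True, exist_ok=True)
BACKUP_REPO.mkdir(parents=True, exist_ok=True)

for dump_file in DUMP_DIR.glob("*.bz2"):
    # Drop the "<wikiname>-<wikidumpdate>-" prefix and the .bz2 suffix so the
    # same logical file gets the same name every month.
    stable_name = re.sub(r"^[a-z_]+-\d{8}-", "", dump_file.name)[: -len(".bz2")]
    target = WORK_DIR / stable_name
    with bz2.open(dump_file, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Each run stores only the deltas against the previous snapshot.
subprocess.run(["rdiff-backup", str(WORK_DIR), str(BACKUP_REPO)], check=True)

With that in place the repository keeps the latest uncompressed tree plus
reverse deltas, which is where the roughly 200 MB per month comes from.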
I intend to use the dumps for machine learning (when I study machine
learning in a few years), so I thought it would be good to get a head
start on dealing with huge amounts of data. I'm also concerned about the
dumps one day becoming corrupted, so I want copies now.
I suggest that SHA-256 checksums be calculated for distribution (as well
as for the uncompressed files!), since Google has demonstrated practical
collision attacks against SHA-1, and MD5 is already completely broken.
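For what it's worth, here is a rough sketch of the kind of thing I mean:
hashing both the file as downloaded and its uncompressed contents in a
streaming pass each (the filename is only an example; this isn't anything
the dumps publish today).

import bz2
import hashlib

path = "enwikinews-20210801-pages-articles.xml.bz2"  # example filename only

def sha256_of(stream):
    # Stream a file object through SHA-256 in 1 MiB chunks.
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(1024 * 1024), b""):
        digest.update(chunk)
    return digest.hexdigest()

with open(path, "rb") as f:        # digest of the file as downloaded
    print("sha256 (compressed):  ", sha256_of(f))

with bz2.open(path, "rb") as f:    # digest of the decompressed contents
    print("sha256 (uncompressed):", sha256_of(f))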
Anyway, I am mostly concerned about storing these dumps, especially the
images, but also the text:
What are the chances of any of the images being illegal content?
Have there been cases of illegal images being stored in these dumps
before? What happened? What was the process?
If I find illegal images, do I just report them to this list?
What happens if I'm unaware of illegal content in these dumps?
Is there such a thing as text being illegal? Can you elaborate?
Also, the statistics below: are those sizes in page counts, bytes, or
megabytes?
Be awesome,
Griffin Tucker
On Fri, 5 Mar 2021 at 23:00, <xmldatadumps-l-request(a)lists.wikimedia.org> wrote:
> Message: 1
> Date: Thu, 04 Mar 2021 13:52:30 +0000
> From: noreply.xmldatadumps(a)wikimedia.org
> To: xmldatadumps-l(a)lists.wikimedia.org
> Subject: [Xmldatadumps-l] XML Dumps FAQ monthly update
> Message-ID: <20210304135230.bBj8c%noreply.xmldatadumps(a)wikimedia.org>
>
>
> Greetings XML Dump users and contributors!
>
> This is your automatic monthly Dumps FAQ update email. This update
> contains figures for the 20210201 full revision history content run.
>
> We are currently dumping 935 projects in total.
>
>
> ---------------------
> Stats for bugwiki on date 20210201
>
> Total size of page content dump files for articles, current content only:
> 19,778,622
>
> Total size of page content dump files for all pages, current content only:
> 24,718,009
>
> Total size of page content dump files for all pages, all revisions:
> 371,063,136
> ---------------------
> Stats for enwiki on date 20210201
>
> Total size of page content dump files for articles, current content only:
> 81,801,780,624
>
> Total size of page content dump files for all pages, current content only:
> 181,026,335,781
>
> Total size of page content dump files for all pages, all revisions:
> 22,229,491,552,833
> ---------------------
>
>
> Sincerely,
>
> Your friendly Wikimedia Dump Info Collector