Hello,
Just a heads-up for anyone using the HTML dumps: apart from the missing-namespaces issue already mentioned on this list, entire pages also seem to be missing, and some of the included page data is outdated and does not contain the latest changes. I have no idea how many pages are affected.
Phabricator ticket with more details: https://phabricator.wikimedia.org/T305407
– Jan
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20220301 full revision history content run.
We are currently dumping 945 projects in total.
---------------------
Stats for csbwiktionary on date 20220301
Total size of page content dump files for articles, current content only:
2,005,634
Total size of page content dump files for all pages, current content only:
2,900,582
Total size of page content dump files for all pages, all revisions:
46,688,480
---------------------
Stats for enwiki on date 20220301
Total size of page content dump files for articles, current content only:
88,064,759,123
Total size of page content dump files for all pages, current content only:
193,569,131,075
Total size of page content dump files for all pages, all revisions:
24,339,335,307,700
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Hello data dump enthusiasts,
I've been working on an API (written in Python) for downloading files from the Wikimedia data dumps, and I've just released its earliest version here: https://github.com/jon-edward/wiki_dump.
I'd love to know what you all think (how it can be improved, how you are interested in using it, positive/negative remarks) if you can spare the time.
All the best,
jon-edward
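
For anyone on the list wondering what downloading a dump file involves at the lowest level, here is a minimal standard-library sketch. It is not the wiki_dump API (see the repository for that); the names `dump_file_url` and `download` are hypothetical helpers made up for illustration. It assumes the usual dumps.wikimedia.org layout of `<wiki>/<date>/<wiki>-<date>-<suffix>` and a file suffix such as `pages-articles.xml.bz2`. Older runs such as 20220301 may no longer be hosted on the server.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: the public dumps host and its <wiki>/<date>/ directory layout.
DUMPS_HOST = "https://dumps.wikimedia.org"


def dump_file_url(wiki: str, date: str, suffix: str) -> str:
    """Build the URL of one file in a dump run (hypothetical helper).

    wiki   -- database name, e.g. "enwiki" or "csbwiktionary"
    date   -- run date as YYYYMMDD, e.g. "20220301"
    suffix -- file name tail, e.g. "pages-articles.xml.bz2"
    """
    if not (len(date) == 8 and date.isdigit()):
        raise ValueError(f"date must look like YYYYMMDD, got {date!r}")
    return f"{DUMPS_HOST}/{wiki}/{date}/{wiki}-{date}-{suffix}"


def download(url: str, dest_dir: str = ".") -> Path:
    """Stream the URL into dest_dir and return the local path.

    This performs a network request; large dump files can take a long time.
    """
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all in memory
    return dest


if __name__ == "__main__":
    # Only prints the URL; call download(url) to actually fetch the file.
    print(dump_file_url("csbwiktionary", "20220301", "pages-articles.xml.bz2"))
```

Streaming with `shutil.copyfileobj` avoids holding a whole dump file in memory, which matters given the sizes reported in the FAQ figures above.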