Hi all,
I’m researching gender bias in Wikipedia texts over time. To do this, I need old pages-articles.xml dumps. I am still looking for dumps from 2009 and 2011-2013. Does anyone know how I can get these, or does anyone have copies stored themselves?
Thanks in advance,
Katja Schmahl
Hi,
First of all, excuse me, as I guess this is not the appropriate channel to ask
this.
The dumps are no longer accessible from some Kubernetes pods on the
ToolLabs server: https://phabricator.wikimedia.org/T247455
Could anyone please help me improve this ticket so that it is taken into
account?
Kind regards,
For the past few years we have not dumped private tables at all; they would
not be accessible to the public in any case, and they do not suffice as a
backup in case of catastrophic failure.
We are therefore removing the feature to dump private tables along with
public tables in a dump run. Anyone who wishes to use the dump scripts in
our Python repo to dump private tables in their wiki will need to create a
separate dumps configuration file and tables yaml file describing which
tables to dump and where to put them, as a separate dump run.
This change will be committed by April 20, 2020, in time for the second
dump run of the month.
Note that this does not impact the actual output of the Wikimedia SQL/XML
dumps at all, since we have not been dumping private tables since late 2016.
See T249508 to follow along.
Ariel
Hi,
Sorry if this is not the right place to report this.
In the last Spanish Wikipedia dump (still in progress):
https://dumps.wikimedia.org/eswiki/20200401/
the "pages-articles" dump appears twice. Judging by the dump from
March (and the earlier ones), these are really two different
files, so the filenames should distinguish them as before.
I am aware there have been recent modifications to the way the multistream
dumps are built, so maybe the issue lies there.
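For anyone who wants to check a listing for this kind of duplication, here is a minimal sketch. It assumes the file names have already been collected into a plain list; the `listing` below is typed by hand for illustration and is not copied from the actual dump page, and `duplicated_names` is a hypothetical helper, not part of the dump scripts.

```python
from collections import Counter


def duplicated_names(file_names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(file_names)
    return [name for name in counts if counts[name] > 1]


# Hypothetical listing for illustration only (not taken from the real page).
listing = [
    "eswiki-20200401-pages-articles.xml.bz2",
    "eswiki-20200401-pages-articles.xml.bz2",
    "eswiki-20200401-pages-meta-current.xml.bz2",
]

for name in duplicated_names(listing):
    print("listed more than once:", name)
```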
Best Regards,
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20200301 full revision history content run.
We are currently dumping 910 projects in total.
---------------------
Stats for euwiktionary on date 20200301
Total size of page content dump files for articles, current content only:
67765999
Total size of page content dump files for all pages, current content only:
68901734
Total size of page content dump files for all pages, all revisions:
532573202
---------------------
Stats for enwiki on date 20200301
Total size of page content dump files for articles, current content only:
76154026008
Total size of page content dump files for all pages, current content only:
169370554542
Total size of page content dump files for all pages, all revisions:
20379079466354
---------------------
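The figures above are raw byte counts, which are hard to compare at a glance. The following is a small sketch that converts them to binary units (1024-based); the `SIZES` table simply repeats the numbers from this email, and `human_size` is a hypothetical helper, not part of the dump tooling.

```python
# Byte counts copied from the 20200301 figures in this email.
SIZES = {
    "euwiktionary": {
        "articles, current": 67765999,
        "all pages, current": 68901734,
        "all pages, all revisions": 532573202,
    },
    "enwiki": {
        "articles, current": 76154026008,
        "all pages, current": 169370554542,
        "all pages, all revisions": 20379079466354,
    },
}


def human_size(num_bytes):
    """Format a byte count using binary (1024-based) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


for wiki, figures in SIZES.items():
    for label, size in figures.items():
        print(f"{wiki:14s} {label:26s} {human_size(size)}")
```

Decimal units (powers of 1000) would work just as well; the binary form is only one possible choice.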
Sincerely,
Your friendly Wikimedia Dump Info Collector