Because the first run of the month was delayed, we now need a couple of
days' delay before the second run starts, so that the last of the wikis
(dewiki) can finish up the first run. Once started, however, I expect the
second monthly run to finish on time.
CCing the data dumps mailing list, which is the recommended venue for
questions like this (https://meta.wikimedia.org/wiki/Data_dumps#Where_to_go_
On Wed, Nov 1, 2017 at 8:44 AM, Shubhanshu Mishra <shubhanshumishra(a)gmail.com> wrote:
> Also, important categories like Computer Architecture, Human-based
> computation, Programming language theory, Software Engineering, and Theory
> of Computation are missing from the subcategories of Areas of Computer
> Science.
> On Wed, Nov 1, 2017 at 10:42 AM, Shubhanshu Mishra <shubhanshumishra(a)gmail.com> wrote:
>> When using the Wikipedia dump files, I am unable to find many categories
>> and pages in the dump.
>> E.g. under the Areas_of_computer_science category I get only 13
>> subcategories and 2 pages, instead of the expected 17 subcategories and 2
>> pages. Furthermore, one page, "Computational_creativity", is not present
>> as a member of the category.
>> I am using the following Wikipedia dump files to extract the
>> categorylinks and page details:
>> 1.6G Sep 21 00:45 enwiki-20170920-page.sql.gz
>> 21M Sep 21 00:45 enwiki-20170920-category.sql.gz
>> 113M Sep 21 00:55 enwiki-20170920-redirect.sql.gz
>> 2.2G Sep 21 03:10 enwiki-20170920-categorylinks.sql.gz
>> 221M Sep 21 03:13 enwiki-20170920-page_props.sql.gz
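A quick way to sanity-check these files without importing them into MySQL
is to stream the gzipped INSERT statements and look for a given page ID. A
minimal Python sketch (not part of WikiUtils; it assumes the ID is the
first column of each row, as it is in the page and categorylinks tables):

import gzip
import sys

def find_rows(dump_path, page_id):
    # Each line of a MediaWiki SQL dump is one long INSERT statement; rows
    # look like "(16300571,0,'Computational_creativity',...)".
    needle = ("(%d," % page_id).encode()
    with gzip.open(dump_path, "rb") as f:
        for line in f:
            pos = line.find(needle)
            while pos != -1:
                # Print the start of each matching tuple for inspection.
                print(line[pos:pos + 120].decode("utf-8", "replace"))
                pos = line.find(needle, pos + 1)

if __name__ == "__main__":
    # e.g.: python find_rows.py enwiki-20170920-page.sql.gz 16300571
    find_rows(sys.argv[1], int(sys.argv[2]))

If this prints nothing for page.sql.gz but does match rows in
categorylinks.sql.gz, the page row really is absent from that dump rather
than being lost by the parser.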
>> I use https://github.com/napsternxg/WikiUtils to parse the sql.gz dump
>> files, but I also tried searching the sql.gz files directly and couldn't
>> find any entry for 16300571 in either the page.sql.gz or the
>> category.sql.gz file. 16300571 supposedly refers to the
>> Computational_creativity page, as the following categories are linked to
>> this page ID:
>> 16300571 'All_NPOV_disputes' 'page'
>> 16300571 'All_articles_needing_additional_references' 'page'
>> 16300571 'All_articles_with_dead_external_links' 'page'
>> 16300571 'All_articles_with_unsourced_statements' 'page'
>> 16300571 'Areas_of_computer_science' 'page'
>> 16300571 'Articles_needing_additional_references_from_May_2013' 'page'
>> 16300571 'Articles_with_French-language_external_links' 'page'
>> 16300571 'Articles_with_dead_external_links_from_November_2016' 'page'
>> 16300571 'Articles_with_permanently_dead_external_links' 'page'
>> 16300571 'Articles_with_unsourced_statements_from_April_2015' 'page'
>> 16300571 'Articles_with_unsourced_statements_from_April_2016' 'page'
>> 16300571 'Articles_with_unsourced_statements_from_December_2015' 'page'
>> 16300571 'Articles_with_unsourced_statements_from_January_2010' 'page'
>> 16300571 'Articles_with_unsourced_statements_from_October_2016' 'page'
>> 16300571 'Artificial_intelligence' 'page'
>> 16300571 'Arts' 'page'
>> 16300571 'CS1_maint:_Extra_text:_authors_list' 'page'
>> 16300571 'Cognitive_psychology' 'page'
>> 16300571 'Computational_fields_of_study' 'page'
>> 16300571 'Creativity_techniques' 'page'
>> 16300571 'NPOV_disputes_from_January_2013' 'page'
>> 16300571 'Philosophical_movements' 'page'
>> 16300571 'Webarchive_template_wayback_links' 'page'
>> 16300571 'Wikipedia_articles_needing_clarification_from_November_2008' 'page'
>> More details can be found at: https://twitter.com/TheShu
>> Is there something I am doing wrong, or are these rows just missing from
>> the dumps?
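One way to quantify the discrepancy is to diff the ID sets of the two
dumps: every cl_from in categorylinks should have a matching page_id in
page. A rough sketch along those lines (file names taken from the listing
above; the regex is a heuristic rather than a full SQL parser, and the two
sets need several GB of RAM for enwiki):

import gzip
import re

FIRST_COL = re.compile(rb"\((\d+),")  # first integer column of each tuple

def first_column_ids(path):
    # Collect the leading ID of every row in a .sql.gz dump.
    ids = set()
    with gzip.open(path, "rb") as f:
        for line in f:
            if line.startswith(b"INSERT INTO"):
                ids.update(int(m) for m in FIRST_COL.findall(line))
    return ids

linked = first_column_ids("enwiki-20170920-categorylinks.sql.gz")
pages = first_column_ids("enwiki-20170920-page.sql.gz")
orphans = linked - pages
print(len(orphans), "cl_from IDs with no page row, e.g.:", sorted(orphans)[:10])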
As was previously announced on the xmldatadumps-l list, the sql/xml dumps
generated twice a month will be written to an internal server, starting
with the November run. This is in part to reduce load on the web/rsync/nfs
server, which until now has also been doing this work; we also want a
separation of roles for some other reasons.
Because I want to get this right, there are a lot of moving parts, and I
don't want to have to rsync all the prefetch data over to these boxes again
next month after cancelling the move, two things may happen:
If needed, the November full run will be delayed for a few days.
If the November full run takes too long, the partial run, which usually
starts on the 20th of the month, will not take place.
Additionally, as described in an earlier email on the xmldatadumps-l list:
files will show up on the web server/rsync server with a substantial
delay. Initially this may be a day or more. This includes index.html and
other status files.
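For scripts that mirror the dumps, it may be safer to poll the per-run
status file and wait for jobs to be marked done before fetching, rather
than trusting directory listings that now lag. A sketch, assuming the
dumpstatus.json layout served alongside each run (the wiki and date here
are only examples):

import json
import urllib.request

URL = "https://dumps.wikimedia.org/enwiki/20171101/dumpstatus.json"

with urllib.request.urlopen(URL) as resp:
    status = json.load(resp)

# dumpstatus.json maps job names to their state, e.g. "done" or "in-progress".
pending = [name for name, job in status["jobs"].items()
           if job.get("status") != "done"]
print("run complete" if not pending
      else "still waiting on: %s" % ", ".join(sorted(pending)))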
You can keep track of developments here:
If you know folks not on the lists in the recipients field for this email,
please forward it to them and suggest that they subscribe to this list.
These jobs are currently written uncompressed. Starting with the next run,
I plan to write them as gzip-compressed files, which will save a lot of
space for the larger abstracts dumps. After that, only status and html
files will be uncompressed, which is convenient for maintenance.
If anyone has a strong objection to this, please raise it now. There's a
ticket open for it: https://phabricator.wikimedia.org/T178046
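For consumers, the switch mostly means not hard-coding whether a given file
is compressed. A small defensive sketch (my own suggestion, not part of the
dumps tooling): sniff the two-byte gzip magic number instead of trusting
the file extension.

import gzip

def open_dump(path):
    # Open a dump file as text whether or not it has been gzip-compressed;
    # gzip files always begin with the magic bytes 0x1f 0x8b.
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")

# Works unchanged before and after the switch, e.g.:
# with open_dump("enwiki-20171120-abstract.xml.gz") as f:
#     for line in f:
#         ...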