CCing the data dumps mailing list, which is the recommended venue for questions like this (https://meta.wikimedia.org/wiki/Data_dumps#Where_to_go_for_help).
On Wed, Nov 1, 2017 at 8:44 AM, Shubhanshu Mishra <shubhanshumishra@gmail.com> wrote:
Also, important categories like Computer Architecture, Human-based computation, Programming language theory, Software Engineering, and Theory of Computation are missing from the subcategories of Areas of Computer Science.
*Regards,* *Shubhanshu Mishra* Research Assistant, iSchool at University of Illinois at Urbana-Champaign
*Website:* http://shubhanshu.com *LinkedIn Profile: *http://www.linkedin.com/in/shubhanshumishra
Blog http://shubhanshu.com/blog || Facebook http://www.facebook.com/shubhanshu.mishra || Twitter http://www.twitter.com/TheShubhanshu || LinkedIn http://www.linkedin.com/in/shubhanshumishra
On Wed, Nov 1, 2017 at 10:42 AM, Shubhanshu Mishra <shubhanshumishra@gmail.com> wrote:
Hi,
When using the Wikipedia dump files, I am unable to find many categories and pages in the dump.
E.g., under the Areas_of_computer_science category I get only 13 subcategories and 2 pages, instead of 17 subcategories and 2 pages. Furthermore, one page, "Computational_creativity", is not present as a subcategory.
I am using the following Wikipedia dump files to extract the categorylinks and page details:
1.6G Sep 21 00:45 enwiki-20170920-page.sql.gz
21M Sep 21 00:45 enwiki-20170920-category.sql.gz
113M Sep 21 00:55 enwiki-20170920-redirect.sql.gz
2.2G Sep 21 03:10 enwiki-20170920-categorylinks.sql.gz
221M Sep 21 03:13 enwiki-20170920-page_props.sql.gz
I use https://github.com/napsternxg/WikiUtils to parse the sql.gz dump files, but I also tried searching the sql.gz files directly and couldn't find any entry for 16300571 in either page.sql.gz or category.sql.gz. 16300571 supposedly refers to the Computational_creativity page, since the following categories are linked to this ID:
16300571 'All_NPOV_disputes' 'page'
16300571 'All_articles_needing_additional_references' 'page'
16300571 'All_articles_with_dead_external_links' 'page'
16300571 'All_articles_with_unsourced_statements' 'page'
16300571 'Areas_of_computer_science' 'page'
16300571 'Articles_needing_additional_references_from_May_2013' 'page'
16300571 'Articles_with_French-language_external_links' 'page'
16300571 'Articles_with_dead_external_links_from_November_2016' 'page'
16300571 'Articles_with_permanently_dead_external_links' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2016' 'page'
16300571 'Articles_with_unsourced_statements_from_December_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_January_2010' 'page'
16300571 'Articles_with_unsourced_statements_from_October_2016' 'page'
16300571 'Artificial_intelligence' 'page'
16300571 'Arts' 'page'
16300571 'CS1_maint:_Extra_text:_authors_list' 'page'
16300571 'Cognitive_psychology' 'page'
16300571 'Computational_fields_of_study' 'page'
16300571 'Creativity_techniques' 'page'
16300571 'NPOV_disputes_from_January_2013' 'page'
16300571 'Philosophical_movements' 'page'
16300571 'Webarchive_template_wayback_links' 'page'
16300571 'Wikipedia_articles_needing_clarification_from_November_2008' 'page'
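For anyone who wants to reproduce the direct search described above, here is a minimal sketch in Python. It is not how WikiUtils parses the dumps. The function name find_rows_for_id is hypothetical, and the code assumes the usual mysqldump layout, where each "INSERT INTO ... VALUES (...),(...);" statement sits on one line and the ID of interest is the first column of a tuple.

```python
import gzip
import re


def find_rows_for_id(path, row_id, max_hits=5):
    """Return up to max_hits raw tuple snippets whose first column equals row_id.

    Hypothetical helper: assumes a mysqldump-style .sql.gz file in which the
    numeric ID is the first field of each "(...)" tuple.
    """
    # A tuple starts right after "VALUES " or after the ")," that closes the
    # previous tuple. The trailing comma keeps 16300571 from matching 163005710.
    pattern = re.compile(r"(?:VALUES |\),)\((%d,[^)]*)" % row_id)
    hits = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.startswith("INSERT INTO"):
                continue  # skip comments, DDL, and lock statements
            for match in pattern.finditer(line):
                hits.append(match.group(1))
                if len(hits) >= max_hits:
                    return hits
    return hits
```

Calling find_rows_for_id("enwiki-20170920-page.sql.gz", 16300571) returns an empty list if no tuple in the file starts with that ID. The regex is naive: a string value that happens to contain "),(" followed by the same digits would produce a false hit, and a captured snippet is cut at the first ")" inside a value, so treat the result as a quick check rather than a parse.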
More details can be found at: https://twitter.com/TheShubhanshu/status/925736635572072449
Is there something I am doing wrong, or are these rows just missing from the dumps?
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics