I checked the files directly, both the page.sql.gz and the
categorylinks.sql.gz files for 20170920. The page is listed:
$ zcat enwiki-20170920-page.sql.gz | sed -e 's/),/),\n/g;' | grep Computational_creativity | more
(16300571,0,'Computational_creativity','',0,0,0,0.718037721126,'20170903222622','20170903222623',798803037,59318,'wikitext',NULL),
(16390036,1,'Computational_creativity','',0,0,0,0.20741249006,'20170831064438','20170831084246',786288354,107057,'wikitext',NULL),
The first entry is the page, the second is the talk page.
$ zcat enwiki-20170920-categorylinks.sql.gz | sed -e 's/),/),\n/g;' | grep 16300571 | cat -vte
(16300571,'All_NPOV_disputes','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-01-27 10:43:57','','uca-default-u-kn','page'),$
(16300571,'All_articles_needing_additional_references','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-05-19 16:52:06','','uca-default-u-kn','page'),$
(16300571,'All_articles_with_dead_external_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-29 07:32:22','','uca-default-u-kn','page'),$
(16300571,'All_articles_with_unsourced_statements','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2008-11-21 10:36:21','','uca-default-u-kn','page'),$
(16300571,'Areas_of_computer_science','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_needing_additional_references_from_May_2013','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-05-19 16:52:06','','uca-default-u-kn','page'),$
(16300571,'Articles_with_French-language_external_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-06-20 04:05:59','','uca-default-u-kn','page'),$
(16300571,'Articles_with_dead_external_links_from_November_2016','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-29 07:32:22','','uca-default-u-kn','page'),$
(16300571,'Articles_with_permanently_dead_external_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-29 07:32:22','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_April_2015','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_April_2016','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_December_2015','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2015-12-01 14:40:27','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_January_2010','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2010-01-09 05:50:15','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_October_2016','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-10-10 21:27:12','','uca-default-u-kn','page'),$
(16300571,'Artificial_intelligence','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2008-03-19 03:45:58','','uca-default-u-kn','page'),$
(16300571,'Arts','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'CS1_maint:_Extra_text:_authors_list','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2017-06-04 08:45:09','','uca-default-u-kn','page'),$
(16300571,'Cognitive_psychology','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Computational_fields_of_study','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-10 15:53:12','','uca-default-u-kn','page'),$
(16300571,'Creativity_techniques','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'NPOV_disputes_from_January_2013','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-05-19 15:48:55','','uca-default-u-kn','page'),$
(16300571,'Philosophical_movements','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2017-01-07 20:24:38','','uca-default-u-kn','page'),$
(16300571,'Webarchive_template_wayback_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2017-01-27 20:04:18','','uca-default-u-kn','page'),$
(16300571,'Wikipedia_articles_needing_clarification_from_November_2008','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2009-02-13 10:49:28','','uca-default-u-kn','page'),$
That list of categorylinks entries matches your results.
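If it helps to rule out the parsing tools, the same check can be done in Python. This is only a minimal sketch, not how WikiUtils works: the function names are made up for illustration, and it relies on the same heuristic as the sed commands, namely that rows inside an INSERT statement are separated by "),(". A quoted value that happens to contain that sequence would be split incorrectly.

```python
import gzip
import re


def find_rows(lines, page_id):
    """Yield row tuples whose first field equals page_id.

    `lines` is any iterable of text lines from a MediaWiki SQL dump.
    Rows are split at '),(' boundaries, which is a heuristic and not
    a real SQL parser.
    """
    prefix = "(%d," % page_id
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue
        # Drop the "INSERT INTO `table` VALUES " header and trailing ";".
        start = line.index("VALUES ") + len("VALUES ")
        values = line[start:].rstrip().rstrip(";")
        # Split on the comma that sits between ")" and "(".
        for row in re.split(r"(?<=\)),(?=\()", values):
            if row.startswith(prefix):
                yield row


def find_rows_in_dump(path, page_id):
    """Stream a .sql.gz dump file and yield matching rows."""
    # errors="replace" because cl_sortkey holds raw binary collation keys.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        yield from find_rows(f, page_id)
```

Calling find_rows_in_dump("enwiki-20170920-page.sql.gz", 16300571) and printing each row should show the page entry if it is present in the downloaded file; an empty result would point at the download rather than at the dump.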
Is it possible that your download of the page.sql.gz file is corrupted? Do
the md5 sums check out? Or perhaps it is an issue with the tools.
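To check the md5 sums without extra tools, something like the following Python sketch could be used. It assumes the checksum list for the run has been downloaded as a text file with lines of the form "<md5> <filename>"; the helper names and that file layout are assumptions for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5sums_text, filename):
    """Look up filename in text whose lines look like '<md5> <filename>'.

    Returns the listed digest, or None if the file is not listed.
    """
    for line in md5sums_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == filename:
            return parts[0]
    return None
```

If md5_of_file("enwiki-20170920-page.sql.gz") differs from the value expected_md5 returns for that filename, the download is damaged and should be fetched again.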
Ariel
On Wed, Nov 1, 2017 at 7:40 PM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:
CCing the data dumps mailing list, which is the recommended venue for
questions like this (
https://meta.wikimedia.org/wiki/Data_dumps#Where_to_go_for_help ).
On Wed, Nov 1, 2017 at 8:44 AM, Shubhanshu Mishra <
shubhanshumishra(a)gmail.com> wrote:
Also, important categories like Computer Architecture, Human-based
computation, Programming language theory, Software Engineering, and Theory
of Computation are missing from the subcategories of Areas of Computer
Science.
*Regards,*
*Shubhanshu Mishra*
Research Assistant,
iSchool at University of Illinois at Urbana-Champaign
--------------------------------------------------
*Website:*
http://shubhanshu.com
*LinkedIn Profile: *http://www.linkedin.com/in/shubhanshumishra
Blog <http://shubhanshu.com/blog> || Facebook
<http://www.facebook.com/shubhanshu.mishra> || Twitter
<http://www.twitter.com/TheShubhanshu> || LinkedIn
<http://www.linkedin.com/in/shubhanshumishra>
On Wed, Nov 1, 2017 at 10:42 AM, Shubhanshu Mishra <
shubhanshumishra(a)gmail.com> wrote:
Hi,
When using the Wikipedia dump files, I am unable to find many categories
and pages in the dump.
For example, under the Areas_of_computer_science category I get only 13
subcategories and 2 pages instead of 17 subcategories and 2 pages.
Furthermore, one page, "Computational_creativity", is not listed under
this category.
I am using the following Wikipedia dump files to extract the
categorylinks and page details:
1.6G Sep 21 00:45 enwiki-20170920-page.sql.gz
21M Sep 21 00:45 enwiki-20170920-category.sql.gz
113M Sep 21 00:55 enwiki-20170920-redirect.sql.gz
2.2G Sep 21 03:10 enwiki-20170920-categorylinks.sql.gz
221M Sep 21 03:13 enwiki-20170920-page_props.sql.gz
I use https://github.com/napsternxg/WikiUtils to parse the sql.gz dump
files, but I also tried searching the sql.gz files directly and couldn't
find any entry for 16300571 in the page.sql.gz or category.sql.gz
files. 16300571 supposedly refers to the Computational_creativity page, as
the following categories are linked to this page:
16300571 'All_NPOV_disputes' 'page'
16300571 'All_articles_needing_additional_references' 'page'
16300571 'All_articles_with_dead_external_links' 'page'
16300571 'All_articles_with_unsourced_statements' 'page'
16300571 'Areas_of_computer_science' 'page'
16300571 'Articles_needing_additional_references_from_May_2013' 'page'
16300571 'Articles_with_French-language_external_links' 'page'
16300571 'Articles_with_dead_external_links_from_November_2016' 'page'
16300571 'Articles_with_permanently_dead_external_links' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2016' 'page'
16300571 'Articles_with_unsourced_statements_from_December_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_January_2010' 'page'
16300571 'Articles_with_unsourced_statements_from_October_2016' 'page'
16300571 'Artificial_intelligence' 'page'
16300571 'Arts' 'page'
16300571 'CS1_maint:_Extra_text:_authors_list' 'page'
16300571 'Cognitive_psychology' 'page'
16300571 'Computational_fields_of_study' 'page'
16300571 'Creativity_techniques' 'page'
16300571 'NPOV_disputes_from_January_2013' 'page'
16300571 'Philosophical_movements' 'page'
16300571 'Webarchive_template_wayback_links' 'page'
16300571 'Wikipedia_articles_needing_clarification_from_November_2008' 'page'
More details can be found at:
https://twitter.com/TheShubhanshu/status/925736635572072449
Is there something I am doing wrong, or are these rows just missing
from the dumps?
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l