Hi,
When using the Wikipedia dump files, I am unable to find many categories and pages in the dump.
E.g. under the Areas_of_computer_science category I get only 13 subcategories and 2 pages instead of 17 subcategories and 2 pages. Furthermore, one page, "Computational_creativity", is not present as a subcategory.
I am using the following Wikipedia dump files to extract the categorylinks and page details:
1.6G Sep 21 00:45 enwiki-20170920-page.sql.gz
21M Sep 21 00:45 enwiki-20170920-category.sql.gz
113M Sep 21 00:55 enwiki-20170920-redirect.sql.gz
2.2G Sep 21 03:10 enwiki-20170920-categorylinks.sql.gz
221M Sep 21 03:13 enwiki-20170920-page_props.sql.gz
I use https://github.com/napsternxg/WikiUtils to parse the sql.gz dump files, but I also tried searching in the sql.gz files and couldn't find any entry for 16300571 in the page.sql.gz and in category.sql.gz files. 16300571 supposedly refers to the Computational_creativity page as the following categories are linked to this page:
16300571 'All_NPOV_disputes' 'page'
16300571 'All_articles_needing_additional_references' 'page'
16300571 'All_articles_with_dead_external_links' 'page'
16300571 'All_articles_with_unsourced_statements' 'page'
16300571 'Areas_of_computer_science' 'page'
16300571 'Articles_needing_additional_references_from_May_2013' 'page'
16300571 'Articles_with_French-language_external_links' 'page'
16300571 'Articles_with_dead_external_links_from_November_2016' 'page'
16300571 'Articles_with_permanently_dead_external_links' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2016' 'page'
16300571 'Articles_with_unsourced_statements_from_December_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_January_2010' 'page'
16300571 'Articles_with_unsourced_statements_from_October_2016' 'page'
16300571 'Artificial_intelligence' 'page'
16300571 'Arts' 'page'
16300571 'CS1_maint:_Extra_text:_authors_list' 'page'
16300571 'Cognitive_psychology' 'page'
16300571 'Computational_fields_of_study' 'page'
16300571 'Creativity_techniques' 'page'
16300571 'NPOV_disputes_from_January_2013' 'page'
16300571 'Philosophical_movements' 'page'
16300571 'Webarchive_template_wayback_links' 'page'
16300571 'Wikipedia_articles_needing_clarification_from_November_2008' 'page'
More details can be found at: https://twitter.com/TheShubhanshu/status/925736635572072449
Is there something I am doing wrong, or are these rows just missing from the dumps?
Regards,
Shubhanshu Mishra
Research Assistant, iSchool at University of Illinois at Urbana-Champaign
Website: http://shubhanshu.com
LinkedIn Profile: http://www.linkedin.com/in/shubhanshumishra
Blog http://shubhanshu.com/blog || Facebook http://www.facebook.com/shubhanshu.mishra || Twitter http://www.twitter.com/TheShubhanshu || LinkedIn http://www.linkedin.com/in/shubhanshumishra
Also, important categories like Computer Architecture, Human based computation, Programming language theory, Software Engineering, and Theory of Computation are missing from the subcategories of Areas of Computer Science.
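One way to cross-check the subcategory count without involving the page table at all is to tally the categorylinks rows for the category by their type. The following is only a rough sketch: the function names are made up for illustration, and it assumes each categorylinks tuple has the seven columns cl_from, cl_to, cl_sortkey, cl_timestamp, cl_sortkey_prefix, cl_collation, cl_type, with strings backslash-escaped the way mysqldump normally writes them. Category names containing characters that get escaped (such as apostrophes) would need extra handling.

```python
import gzip
import re

# One single-quoted SQL string, allowing backslash-escaped characters inside.
_STR = rb"'((?:[^'\\]|\\.)*)'"

# Assumed column order: cl_from, cl_to, cl_sortkey, cl_timestamp,
# cl_sortkey_prefix, cl_collation, cl_type.
_ROW = re.compile(
    rb"\((\d+)," + _STR + rb"," + _STR + rb"," + _STR + rb","
    + _STR + rb"," + _STR + rb",'(page|subcat|file)'\)",
    re.DOTALL,
)


def category_members(lines, category):
    """Yield (cl_from, cl_type) for every tuple whose cl_to equals `category`.

    `lines` is any iterable of bytes objects, e.g. a gzip file opened in
    binary mode. The sort key is binary data, so nothing is decoded as text
    except the type column.
    """
    wanted = category.encode("utf-8")
    for line in lines:
        for match in _ROW.finditer(line):
            if match.group(2) == wanted:
                yield int(match.group(1)), match.group(7).decode("ascii")


def category_members_in_dump(path, category):
    """Scan a gzipped categorylinks dump (hypothetical helper)."""
    with gzip.open(path, "rb") as handle:
        return list(category_members(handle, category))
```

Counting the returned pairs by their second element (for example with collections.Counter) gives the number of 'subcat' and 'page' members that the dump itself records, which can then be compared with what the parsing tool reports.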
On Wed, Nov 1, 2017 at 10:42 AM, Shubhanshu Mishra <shubhanshumishra@gmail.com> wrote:
[...]
CCing the data dumps mailing list, which is the recommended venue for questions like this (https://meta.wikimedia.org/wiki/Data_dumps#Where_to_go_for_help).
On Wed, Nov 1, 2017 at 8:44 AM, Shubhanshu Mishra <shubhanshumishra@gmail.com> wrote:
[...]
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I checked the files directly, both the page.sql.gz and the categorylinks.sql.gz files for 20170920. The page is listed:
$ zcat enwiki-20170920-page.sql.gz | sed -e 's/),/),\n/g;' | grep Computational_creativity | more
(16300571,0,'Computational_creativity','',0,0,0,0.718037721126,'20170903222622','20170903222623',798803037,59318,'wikitext',NULL),
(16390036,1,'Computational_creativity','',0,0,0,0.20741249006,'20170831064438','20170831084246',786288354,107057,'wikitext',NULL),
The first entry is the page, the second is the talk page.
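For anyone who would rather run the same check from Python than from the shell, here is a minimal sketch of the lookup. It is not how WikiUtils does it; the function names are invented for illustration, and it assumes each tuple in the page dump starts with page_id, page_namespace, page_title, as in the output above. Titles containing characters that mysqldump backslash-escapes (such as apostrophes) are not handled.

```python
import gzip
import re


def find_page_rows(lines, title):
    """Yield (page_id, namespace) for each tuple whose page_title is `title`.

    `lines` is any iterable of bytes objects, for example a gzip file
    opened in binary mode.
    """
    pattern = re.compile(
        rb"\((\d+),(-?\d+),'" + re.escape(title.encode("utf-8")) + rb"',"
    )
    for line in lines:
        for match in pattern.finditer(line):
            yield int(match.group(1)), int(match.group(2))


def find_page_rows_in_dump(path, title):
    """Scan a gzipped page dump for `title` (hypothetical helper)."""
    with gzip.open(path, "rb") as handle:
        return list(find_page_rows(handle, title))
```

Calling find_page_rows_in_dump("enwiki-20170920-page.sql.gz", "Computational_creativity") streams the file once, so it needs little memory. If it returns nothing on a local copy while the rows above exist in the published file, that points at the download rather than at the dump.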
$ zcat enwiki-20170920-categorylinks.sql.gz | sed -e 's/),/),\n/g;' | grep 16300571 | cat -vte
(16300571,'All_NPOV_disputes','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2013-01-27 10:43:57','','uca-default-u-kn','page'),$
(16300571,'All_articles_needing_additional_references','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2013-05-19 16:52:06','','uca-default-u-kn','page'),$
(16300571,'All_articles_with_dead_external_links','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-11-29 07:32:22','','uca-default-u-kn','page'),$
(16300571,'All_articles_with_unsourced_statements','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2008-11-21 10:36:21','','uca-default-u-kn','page'),$
(16300571,'Areas_of_computer_science','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_needing_additional_references_from_May_2013','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2013-05-19 16:52:06','','uca-default-u-kn','page'),$
(16300571,'Articles_with_French-language_external_links','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2013-06-20 04:05:59','','uca-default-u-kn','page'),$
(16300571,'Articles_with_dead_external_links_from_November_2016','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-11-29 07:32:22','','uca-default-u-kn','page'),$
(16300571,'Articles_with_permanently_dead_external_links','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-11-29 07:32:22','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_April_2015','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_April_2016','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_December_2015','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2015-12-01 14:40:27','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_January_2010','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2010-01-09 05:50:15','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_October_2016','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-10-10 21:27:12','','uca-default-u-kn','page'),$
(16300571,'Artificial_intelligence','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2008-03-19 03:45:58','','uca-default-u-kn','page'),$
(16300571,'Arts','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'CS1_maint:_Extra_text:_authors_list','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2017-06-04 08:45:09','','uca-default-u-kn','page'),$
(16300571,'Cognitive_psychology','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'Computational_fields_of_study','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-11-10 15:53:12','','uca-default-u-kn','page'),$
(16300571,'Creativity_techniques','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2016-04-15 15:40:40','','uca-default-u-kn','page'),$
(16300571,'NPOV_disputes_from_January_2013','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2013-05-19 15:48:55','','uca-default-u-kn','page'),$
(16300571,'Philosophical_movements','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2017-01-07 20:24:38','','uca-default-u-kn','page'),$
(16300571,'Webarchive_template_wayback_links','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2017-01-27 20:04:18','','uca-default-u-kn','page'),$
(16300571,'Wikipedia_articles_needing_clarification_from_November_2008','+C?EOM'M7CA'=^D+I/'M7Q7MW^A^^AM-^O^[','2009-02-13 10:49:28','','uca-default-u-kn','page'),$
That list of categorylinks entries matches your results. Is it possible that your download of the page.sql.gz file is corrupted? Do the md5 sums check out? Or perhaps it is an issue with the tools.
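In case it helps with the checksum question, here is a small sketch of how the comparison could be done in Python. It assumes you have saved the checksum listing for the dump run as text in the usual md5sum layout (hex digest, whitespace, file name); the helper names are hypothetical and not part of any Wikimedia tooling.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse md5sum-style lines into a {file name: digest} dict."""
    sums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # md5sum marks binary mode with a leading '*' on the name.
            sums[parts[1].strip().lstrip("*")] = parts[0].lower()
    return sums


def download_is_intact(path, sums):
    """True if the local file's digest matches the listed one."""
    expected = sums.get(os.path.basename(path))
    return expected is not None and md5_of_file(path) == expected
```

A mismatch for enwiki-20170920-page.sql.gz would suggest a truncated or corrupted download; a match would shift suspicion to the parsing step.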
Ariel
On Wed, Nov 1, 2017 at 7:40 PM, Tilman Bayer <tbayer@wikimedia.org> wrote:
[...]
-- Tilman Bayer Senior Analyst Wikimedia Foundation IRC (Freenode): HaeB
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l