Xmldatadumps-l November 2017

xmldatadumps-l@lists.wikimedia.org

2 participants
4 discussions

Delaying the second November run by 2 days

by Ariel Glenn WMF

Because the first run of the month was delayed, we need a couple of days' delay now before the second run starts, so that the last of the wikis (dewiki) can finish up the first run. I expect the second monthly run to finish on time, however, once started. Ariel

6 years, 5 months

Re: [Xmldatadumps-l] [Analytics] Missing categorylinks and pages in Wikipedia dumps

by Tilman Bayer

CCing the data dumps mailing list, which is the recommended venue for questions like this (https://meta.wikimedia.org/wiki/Data_dumps#Where_to_go_for_help).

On Wed, Nov 1, 2017 at 8:44 AM, Shubhanshu Mishra <shubhanshumishra(a)gmail.com> wrote:

> Also, important categories like Computer Architecture, Human based
> computation, Programming language theory, Software Engineering, and Theory
> of Computation, are missing from the subcategories of Areas of Computer
> Science.
>
> *Regards,*
> *Shubhanshu Mishra*
> Research Assistant,
> iSchool at University of Illinois at Urbana-Champaign
> --------------------------------------------------
> *Website:* http://shubhanshu.com
> *LinkedIn Profile:* http://www.linkedin.com/in/shubhanshumishra
>
> Blog <http://shubhanshu.com/blog> || Facebook
> <http://www.facebook.com/shubhanshu.mishra> || Twitter
> <http://www.twitter.com/TheShubhanshu> || LinkedIn
> <http://www.linkedin.com/in/shubhanshumishra>
>
> On Wed, Nov 1, 2017 at 10:42 AM, Shubhanshu Mishra <shubhanshumishra(a)gmail.com> wrote:
>
>> Hi,
>>
>> When using the wikipedia dump files, I am unable to find many categories
>> and pages in the dump.
>>
>> E.g. under the Areas_of_computer_science category I get only 13
>> subcategories and 2 pages instead of 17 subcategories, 2 pages.
>> Furthermore, 1 page "Computational_creativity" is not present as a
>> subcategory.
>>
>> I am using the following wikipedia dump files to extract the
>> categorylinks, and page details:
>>
>> 1.6G Sep 21 00:45 enwiki-20170920-page.sql.gz
>> 21M  Sep 21 00:45 enwiki-20170920-category.sql.gz
>> 113M Sep 21 00:55 enwiki-20170920-redirect.sql.gz
>> 2.2G Sep 21 03:10 enwiki-20170920-categorylinks.sql.gz
>> 221M Sep 21 03:13 enwiki-20170920-page_props.sql.gz
>>
>> I use https://github.com/napsternxg/WikiUtils to parse the sql.gz dump
>> files, but I also tried searching in the sql.gz files and couldn't find any
>> entry for 16300571 in the page.sql.gz and in category.sql.gz
>> files. 16300571 supposedly refers to the Computational_creativity page as
>> the following categories are linked to this page:
>>
>> 16300571 'All_NPOV_disputes' 'page'
>> 16300571 'All_articles_needing_additional_references' 'page'
>> 16300571 'All_articles_with_dead_external_links' 'page'
>> 16300571 'All_articles_with_unsourced_statements' 'page'
>> 16300571 'Areas_of_computer_science' 'page'
>> 16300571 'Articles_needing_additional_references_from_May_2013' 'page'
>> 16300571 'Articles_with_French-language_external_links' 'page'
>> 16300571 'Articles_with_dead_external_links_from_November_2016' 'page'
>> 16300571 'Articles_with_permanently_dead_external_links' 'page'
>> 16300571 'Articles_with_unsourced_statements_from_April_2015' 'page'
>> 16300571 'Articles_with_unsourced_statements_from_April_2016' 'page'
>> 16300571 'Articles_with_unsourced_statements_from_December_2015' 'page'
>> 16300571 'Articles_with_unsourced_statements_from_January_2010' 'page'
>> 16300571 'Articles_with_unsourced_statements_from_October_2016' 'page'
>> 16300571 'Artificial_intelligence' 'page'
>> 16300571 'Arts' 'page'
>> 16300571 'CS1_maint:_Extra_text:_authors_list' 'page'
>> 16300571 'Cognitive_psychology' 'page'
>> 16300571 'Computational_fields_of_study' 'page'
>> 16300571 'Creativity_techniques' 'page'
>> 16300571 'NPOV_disputes_from_January_2013' 'page'
>> 16300571 'Philosophical_movements' 'page'
>> 16300571 'Webarchive_template_wayback_links' 'page'
>> 16300571 'Wikipedia_articles_needing_clarification_from_November_2008' 'page'
>>
>> More details can be found at:
>> https://twitter.com/TheShubhanshu/status/925736635572072449
>>
>> Is there something I am doing wrong, or are these rows just missing from
>> the dumps?
>>
>> *Regards,*
>> *Shubhanshu Mishra*
>> Research Assistant,
>> iSchool at University of Illinois at Urbana-Champaign
>> --------------------------------------------------
>> *Website:* http://shubhanshu.com
>> *LinkedIn Profile:* http://www.linkedin.com/in/shubhanshumishra
>>
>> Blog <http://shubhanshu.com/blog> || Facebook
>> <http://www.facebook.com/shubhanshu.mishra> || Twitter
>> <http://www.twitter.com/TheShubhanshu> || LinkedIn
>> <http://www.linkedin.com/in/shubhanshumishra>
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
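
For readers who want to repeat the check described in the quoted message (looking for a given ID in a `.sql.gz` dump), the following is a minimal Python sketch, not the method used by WikiUtils or by the dumps themselves. It assumes the dump consists of `INSERT INTO ... VALUES (...),(...);` lines with single-quoted, backslash-escaped strings, and that the ID of interest sits in the first column; the function names `parse_values` and `find_rows` are hypothetical helpers introduced here for illustration.

```python
import gzip
from typing import Iterator, List


def parse_values(line: str) -> Iterator[List[str]]:
    """Yield each (...) tuple of an INSERT ... VALUES line as a list of field strings.

    Simplifications (assumptions): quotes are stripped, a backslash escape is
    reduced to the character that follows it, and NULL is returned as "NULL".
    """
    start = line.find(" VALUES ")
    if not line.startswith("INSERT INTO") or start == -1:
        return
    i = start + len(" VALUES ")
    n = len(line)
    row: List[str] = []
    field: List[str] = []
    in_row = False
    in_str = False
    while i < n:
        c = line[i]
        if in_str:
            if c == "\\" and i + 1 < n:
                # Escaped character inside a string: keep the next char literally.
                field.append(line[i + 1])
                i += 2
                continue
            if c == "'":
                in_str = False
            else:
                field.append(c)
        elif c == "'":
            in_str = True
        elif c == "(" and not in_row:
            in_row, row, field = True, [], []
        elif c == "," and in_row:
            row.append("".join(field))
            field = []
        elif c == ")" and in_row:
            row.append("".join(field))
            yield row
            in_row = False
        elif in_row:
            field.append(c)
        i += 1


def find_rows(path: str, wanted: str, column: int = 0) -> Iterator[List[str]]:
    """Stream a gzip-compressed SQL dump and yield rows whose `column` equals `wanted`."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            for row in parse_values(line):
                if len(row) > column and row[column] == wanted:
                    yield row


# Example call, using a file name from the message above:
# for row in find_rows("enwiki-20170920-page.sql.gz", "16300571"):
#     print(row)
```

Because the file is streamed line by line, memory use is bounded by the longest `INSERT` line rather than by the size of the dump. Which column holds which ID differs per table, so the `column` argument should be checked against the `CREATE TABLE` statement at the top of each dump file.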

6 years, 5 months

Important news about the November dumps run!

by Ariel Glenn WMF

As was previously announced on the xmldatadumps-l list, the sql/xml dumps generated twice a month will be written to an internal server, starting with the November run. This is in part to reduce load on the web/rsync/nfs server which has been doing this work also until now. We want separation of roles for some other reasons too.

Because I want to get this right, and there are a lot of moving parts, and I don't want to rsync all the prefetch data over to these boxes again next month after cancelling the move:

********
If needed, the November full run will be delayed for a few days. If the November full run takes too long, the partial run, usually starting on the 20th of the month, will not take place.
*********

Additionally, as described in an earlier email on the xmldatadumps-l list:

*********
files will show up on the web server/rsync server with a substantial delay. Initially this may be a day or more. This includes index.html and other status files.
*********

You can keep track of developments here: https://phabricator.wikimedia.org/T178893

If you know folks not on the lists in the recipients field for this email, please forward it to them and suggest that they subscribe to this list.

Thanks,
Ariel

6 years, 5 months

IMPORTANT: Changes to abstracts and siteinfo-namespaces jobs

by Ariel Glenn WMF

These jobs are currently written uncompressed. Starting with the next run, I plan to write these as gzip compressed files. This means that we'll save a lot of space for the larger abstracts dumps. Additionally, only status and html files will be uncompressed, which is convenient for maintenance reasons. If anyone has a strong objection to this, please raise it now. There's a ticket open for it: https://phabricator.wikimedia.org/T178046 Thanks! Ariel
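
For consumers of these files, one way to cope with the switch is to open a file the same way whether or not it is gzip compressed. The sketch below is only an illustration under that assumption; `open_maybe_gzip` is a hypothetical helper name, and the detection relies on the standard two-byte gzip magic number rather than on the file extension.

```python
import gzip
from typing import IO

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path: str) -> IO[str]:
    """Open `path` as UTF-8 text, decompressing on the fly if it is gzip data."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

A script written this way should keep working on older uncompressed runs as well as on runs produced after the change.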

6 years, 5 months
