Hi all!
We’ve recently made Spark 2.1 available in the Analytics Hadoop cluster.
It is installed on stat1004 and stat1005 alongside Spark 1.6. To use Spark
2, you should access it via the spark2* (and pyspark2) executables, rather
than the usual spark-shell, spark-submit, etc.
I’ve added a little bit of documentation
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark> about
this on wikitech.
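For example, a minimal PySpark 2 job submitted via the Spark 2 launcher
(shown here as spark2-submit; adjust to whatever the spark2* executable is
actually called on stat1004/stat1005, and the script/app names are just
illustrative) looks something like this:

    # Submit with the Spark 2 launcher instead of the usual spark-submit, e.g.:
    #   spark2-submit --master yarn smoke_test.py
    from pyspark.sql import SparkSession

    # In Spark 2.x, SparkSession is the single entry point
    # (replacing the separate SparkContext/SQLContext setup from 1.6).
    spark = SparkSession.builder.appName("spark2-smoke-test").getOrCreate()

    # Trivial sanity check that the session is up and running.
    df = spark.range(1000)
    print("row count:", df.count())
    print("spark version:", spark.version)

    spark.stop()

The same kind of thing works interactively via pyspark2.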
We’d like to deploy Spark 2.2, but we first need to upgrade Hadoop to use
Java 8 rather than Java 7. Hopefully this will happen in early 2018.
analytics/refinery/source
<https://github.com/wikimedia/analytics-refinery-source> still uses Spark
1, but we’d also like to update jobs and dependencies there to use Spark 2
soon.
Anyway, let me know if there are any questions. Enjoy!
- Andrew Otto
Systems Engineer, WMF
Hello,
The slide deck from today's quarterly metrics presentation of the Wikimedia
Foundation's Readers team (which is an appendix to the main quarterly
check-in presentation) has been published. [1] [2]
This deck gives an overview of the core metrics regarding readership of
Wikimedia sites, including data about search, maps, and the Wikipedia
portal from the Discovery team.
[1]
https://commons.wikimedia.org/wiki/File%3AWikimedia_Foundation_Readers_metr…
[2]
https://commons.wikimedia.org/wiki/File%3AAudiences_2_check-in_Q1_October_2…
--
deb tankersley
Product Manager, Discovery
Wikimedia Foundation
Hi,
When using the Wikipedia dump files, I am unable to find many categories
and pages in the dump.
E.g. under the Areas_of_computer_science category I get only 13
subcategories and 2 pages instead of 17 subcategories and 2 pages.
Furthermore, one page, "Computational_creativity", is not present as a
member of the category.
I am using the following Wikipedia dump files to extract the categorylinks
and page details:
1.6G Sep 21 00:45 enwiki-20170920-page.sql.gz
21M Sep 21 00:45 enwiki-20170920-category.sql.gz
113M Sep 21 00:55 enwiki-20170920-redirect.sql.gz
2.2G Sep 21 03:10 enwiki-20170920-categorylinks.sql.gz
221M Sep 21 03:13 enwiki-20170920-page_props.sql.gz
I use https://github.com/napsternxg/WikiUtils to parse the sql.gz dump
files, but I also tried searching the sql.gz files directly and couldn't
find any entry for 16300571 in either the page.sql.gz or the
category.sql.gz file. 16300571 supposedly refers to the
Computational_creativity page, as the following categories are linked to
this page:
16300571 'All_NPOV_disputes' 'page'
16300571 'All_articles_needing_additional_references' 'page'
16300571 'All_articles_with_dead_external_links' 'page'
16300571 'All_articles_with_unsourced_statements' 'page'
16300571 'Areas_of_computer_science' 'page'
16300571 'Articles_needing_additional_references_from_May_2013' 'page'
16300571 'Articles_with_French-language_external_links' 'page'
16300571 'Articles_with_dead_external_links_from_November_2016' 'page'
16300571 'Articles_with_permanently_dead_external_links' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2016' 'page'
16300571 'Articles_with_unsourced_statements_from_December_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_January_2010' 'page'
16300571 'Articles_with_unsourced_statements_from_October_2016' 'page'
16300571 'Artificial_intelligence' 'page'
16300571 'Arts' 'page'
16300571 'CS1_maint:_Extra_text:_authors_list' 'page'
16300571 'Cognitive_psychology' 'page'
16300571 'Computational_fields_of_study' 'page'
16300571 'Creativity_techniques' 'page'
16300571 'NPOV_disputes_from_January_2013' 'page'
16300571 'Philosophical_movements' 'page'
16300571 'Webarchive_template_wayback_links' 'page'
16300571 'Wikipedia_articles_needing_clarification_from_November_2008' 'page'
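For reference, the kind of search I mean is essentially the sketch below
(just a string match over the gzipped SQL, not a real parser, so a hit or
miss is only a hint; the filename is the page dump listed above):

    import gzip

    # Look for an INSERT tuple whose first column (page_id) is 16300571,
    # i.e. a tuple that starts with "(16300571,". This is a rough check:
    # it scans the raw SQL text rather than parsing it.
    needle = b"(16300571,"
    found = False
    with gzip.open("enwiki-20170920-page.sql.gz", "rb") as f:
        for line in f:
            if needle in line:
                found = True
                break
    print("page id 16300571 present in page dump:", found)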
More details can be found at:
https://twitter.com/TheShubhanshu/status/925736635572072449
Is there something I am doing wrong, or are these rows just missing from
the dumps?
Regards,
Shubhanshu Mishra
Research Assistant,
iSchool at University of Illinois at Urbana-Champaign
--------------------------------------------------
Website: http://shubhanshu.com
LinkedIn Profile: http://www.linkedin.com/in/shubhanshumishra
Blog <http://shubhanshu.com/blog> || Facebook
<http://www.facebook.com/shubhanshu.mishra> || Twitter
<http://www.twitter.com/TheShubhanshu> || LinkedIn
<http://www.linkedin.com/in/shubhanshumishra>
As was previously announced on the xmldatadumps-l list, the sql/xml dumps
generated twice a month will be written to an internal server, starting
with the November run. This is in part to reduce load on the web/rsync/nfs
server, which until now has also been doing this work. We also want
separation of roles for a few other reasons.
Because I want to get this right, because there are a lot of moving parts,
and because I don't want to rsync all the prefetch data over to these boxes
again next month after cancelling the move:
********
If needed, the November full run will be delayed for a few days.
If the November full run takes too long, the partial run, usually starting
on the 20th of the month, will not take place.
*********
Additionally, as described in an earlier email on the xmldatadumps-l list:
*********
files will show up on the web server/rsync server with a substantial
delay. Initially this may be a day or more. This includes index.html and
other status files.
*********
You can keep track of developments here:
https://phabricator.wikimedia.org/T178893
If you know folks who are not on the lists in the recipients field for this
email, please forward it to them and suggest that they subscribe to this list.
Thanks,
Ariel