Hello all,
The Code of Conduct Committee has published the list of candidates for the next six-month term:
https://www.mediawiki.org/wiki/Code_of_Conduct/Committee/Candidates/2018-I
If nominated, these candidates will be trusted to enforce the Code of Conduct for Wikimedia technical spaces. You can read it at https://www.mediawiki.org/wiki/Code_of_Conduct.
Any feedback or concerns about a candidate can be submitted in private to techconduct@wikimedia.org for the next two weeks, until Tuesday 2018-04-24.
If the candidate slate needs to change following community feedback, the committee will submit a new list, and a new two-week feedback period will take place.
--
For the Code of Conduct Committee,
Sébastien Santoro aka Dereckson
https://www.dereckson.be/
Hello Analytics!
I have a licensing question. If someone were to share a screenshot of the Pageviews Analysis <https://tools.wmflabs.org/pageviews> tool (or similar), are they bound by the REST API licenses described at https://wikimedia.org/api/rest_v1/?
I assume so, but wanted to make sure. I was prompted to ask because a user is considering using screenshots in a scholarly article, which they wish to publish under a CC-BY license.
Please tell me if this statement is accurate:
*The underlying pageviews data shown in Pageviews Analysis is provided by
the Wikimedia RESTBase API <https://wikimedia.org/api/rest_v1/>, released
under the CC-BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0/>
and GFDL <https://www.gnu.org/copyleft/fdl.html> licenses. Any use of this
data, including screenshots, is bound by these licenses, and you
irrevocably agree to release modifications or additions under these
licenses.*
Thanks!
~Leon
Hullo,
Page Previews is now fully deployed to all but two of the Wikipedias. In deploying it, we've created a new way to interact with pages without navigating to them. This impacts the overall and per-page pageview metrics that are used in myriad reports, e.g. to editors about the readership of their articles and in monthly reports to the board. Consequently, we need to be able to count a user reading the preview of a page just as we count them navigating to it.
Readers Web are planning to instrument Page Previews such that when a
preview is available and open for longer than X ms, a "page interaction" is
recorded. We're aware of a couple of mechanisms for recording something
like this from the client:
1. All files viewed with the media viewer are recorded by the client
requesting the /beacon/media?duration=X&uri=Y URL at some point [0] – as
Nuria points out in that thread, requests to /beacon/... are already
filtered and a canned response is sent immediately by Varnish [1].
2. Requesting a URL with the X-Analytics header [2] set to "preview". In
this context, we'd make a HEAD request to the URL of the page with the
header set.
IMO #1 is preferable from the operations and performance perspectives, as the response is always served from the edge and includes very few headers, whereas the request in #2 may be served by the application servers if the user is logged in (or in the mobile site's beta cohort). However, the requests in #2 are already
We're currently considering recording page interactions when previews are
open for longer than 1000 ms. We estimate that this would increase overall
web requests by 0.3% [3].
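For illustration, here's a minimal Python sketch of the two request shapes; in production the client-side JavaScript would issue these, and the /beacon/preview path is one of the options under discussion here, not a deployed endpoint:

    import requests

    PAGE_URL = "https://en.wikipedia.org/wiki/Example"  # hypothetical previewed page
    DURATION_MS = 1200  # time the preview was open, above the 1000 ms threshold

    # Mechanism 1: a beacon-style GET answered immediately at the edge by Varnish.
    requests.get(
        "https://en.wikipedia.org/beacon/preview",
        params={"duration": DURATION_MS, "uri": PAGE_URL},
    )

    # Mechanism 2: a HEAD request for the page itself, tagged via the
    # X-Analytics header so it can be told apart from a regular pageview.
    requests.head(PAGE_URL, headers={"X-Analytics": "preview"})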
Are there other ways of recording this information? We're fairly confident that #1 is the best choice here, but it's been referred to as the "virtual file view hack". Is this really the case? Moreover, should we request a distinct URL, e.g. /beacon/preview?duration=X&uri=Y, or should we consolidate the URLs, as both essentially represent the same thing?
Thanks,
-Sam
Timezone: GMT
IRC (Freenode): phuedx
[0] https://lists.wikimedia.org/pipermail/analytics/2015-March/003633.html
[1] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb;1bce79d58e03bd02888beef986c41989e8345037$269
[2] https://wikitech.wikimedia.org/wiki/X-Analytics
[3] https://phabricator.wikimedia.org/T184793#3901365
Hi everyone,
This is a friendly reminder about the Wikimedia Communities and
Contributors Survey.
*We have only heard from 50 Wikimedia volunteer developers. The survey will
close Sunday 22 April 2018.*
If you are a volunteer developer and have contributed code to any piece of MediaWiki, gadgets, or tools, please complete the survey. The opinions you share will affect the work of the Wikimedia Foundation.
*Follow this link to take the survey:* https://wikimedia.qualtrics.com/jfe/form/SV_5ABs6WwrDHzAeLr?aud=DEV
If you have already seen a similar message on Phabricator, MediaWiki.org, Discourse, or other platforms for volunteer developers, please don't take the survey twice.
It is available in various languages and will take between 20 and 40
minutes to complete.
You can find more information about this survey on the project page
<https://meta.wikimedia.org/wiki/Community_Engagement_Insights/About_CE_> and
see how your feedback helps the Wikimedia Foundation support contributors
like you. This survey is hosted by a third-party service and governed by this
privacy statement
<https://wikimediafoundation.org/wiki/Community_Engagement_Insights_2018_Sur…>.
Please visit our frequently asked questions page
<https://meta.wikimedia.org/wiki/Community_Engagement_Insights/Frequently_as…>
to find more information about this survey.
Feel free to email me directly with any questions you may have.
Thank you!
Edward Galvez from the Community Engagement department
Wikimedia Foundation
Hi everyone!
*tl;dr stop using notebook1001 by Monday April 2nd, use notebook1003
instead.*
*(If you don’t have production access, you can ignore this email.)*
As part of https://phabricator.wikimedia.org/T183145, we’ve ordered new hardware to replace the aging notebook1001. The new servers are ready to go, so we need to set a deprecation date for notebook1001. That date is Monday, April 2nd. After that, your work on notebook1001 will no longer be accessible. Instead, you should use notebook1003 (or notebook1004).
But there is good news too! Last week I rsynced everyone’s home directories from notebook1001 over to notebook1003. I also upgraded the default virtualenv your notebooks run from. Your notebook files should all be accessible on notebook1003. However, the version of Python 3 changed from 3.4 to 3.5 during this upgrade, so dependencies you installed on notebook1001 may not be available at first. You might need to re-run pip install for those dependencies in the new notebook Python 3.5 virtualenv. (I can’t really give you explicit instructions for that, as I don’t know what you use in your notebooks.)
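For example, from inside a notebook you can install a missing dependency into the active virtualenv with something like the sketch below; "some-package" is just a placeholder for whatever your notebook needs:

    import subprocess
    import sys

    # Installs into whatever virtualenv this notebook's kernel is running from;
    # "some-package" is a placeholder, not an actual requirement.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "some-package"])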
I’ll do a final rsync of any newer files in home directories from notebook1001 on Monday April 2nd. If you’ve been working on notebook1001 since March 15th, this should get everything up to date on notebook1003 before notebook1001 goes away. BUT! *Do not work on both notebook1001 and notebook1003*! My final rsync will keep the most recently modified version of files from either server.
OOooOo and there’s even more good news! I’ve made the notebooks able to
access system site packages, and installed a ton of useful packages
<https://github.com/wikimedia/puppet/blob/production/modules/statistics/mani…>
by default. pandas, scipy, requests, etc. If there’s something else you
think you might need, let us know. Or just pip install it into your
notebook.
pyhive has also been installed, so you should be able to access Hive directly from a Python notebook more easily.
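A minimal sketch, assuming the standard HiveServer2 port and placeholder host and user names (check the docs linked below for the actual connection settings):

    from pyhive import hive

    # Host and username are placeholders; see the SWAP docs for real values.
    conn = hive.connect(host="hive-server.example.org", port=10000,
                        username="your-shell-username")
    cursor = conn.cursor()
    cursor.execute("SHOW DATABASES")
    for (db,) in cursor.fetchall():
        print(db)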
I’ve updated docs at https://wikitech.wikimedia.org/wiki/SWAP#Usage, please
take a look.
If you have any questions, please don’t hesitate to ask, either here or on Phabricator: https://phabricator.wikimedia.org/T183145.
- Andrew Otto & Analytics Engineering
Hi everybody,
In T189051 the Analytics team introduced a new feature in the Hadoop cluster, namely the HDFS Trash directory. This means that if you now use the hdfs dfs -rm CLI command, you will not directly delete a file or a directory; instead, it will be moved under /user/$yourusername/.Trash. The Trash directory is "partitioned" into daily directories (called checkpoints), and files are kept for a month before being deleted. Here's a quick FAQ about how to recover data if needed:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster#recover_files…
If you want to skip the Trash directory, you can pass the -skipTrash option to hdfs dfs -rm, but of course this should be done only when you are really sure about what you are doing :)
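As a sketch of what this looks like in practice (all usernames, checkpoint names, and paths below are made-up examples; see the FAQ above for the authoritative steps), wrapped in Python for use from a script or notebook:

    import subprocess

    # Restore a file from a daily Trash checkpoint back to its original location.
    subprocess.check_call([
        "hdfs", "dfs", "-mv",
        "/user/yourusername/.Trash/180401000000/user/yourusername/data/myfile",
        "/user/yourusername/data/myfile",
    ])

    # Delete while bypassing the Trash entirely (use with care!).
    subprocess.check_call(
        ["hdfs", "dfs", "-rm", "-r", "-skipTrash", "/user/yourusername/tmp-dir"])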
We hope that this extra safety net will help all Hadoop users preserve data that might otherwise get deleted by mistake.
If you have comments, suggestions, etc., feel free to reach out to the Analytics team via the mailing list or IRC.
Thanks!
Luca (on behalf of the Analytics team)
Hi from the Analytics team!
Just dropping a quick email to let you all know that after our latest
release, the new Wikistats is now a responsive site!
We spent a bit of time last quarter making sure that the experience of browsing the stats on mobile is at least as good as it is in a desktop browser. Please take a look at the site with your phone or tablet and let us know if you see something weird or have any suggestions on how to make it better.
https://stats.wikimedia.org/v2/#/all-projects
Thank you!
Fran/The A Team
--
*Francisco Dans*
Software Engineer, Analytics Team
Wikimedia Foundation
Hi all!
I just upgraded spark2 across the cluster to Spark 2.3.0
<https://spark.apache.org/releases/spark-release-2-3-0.html>. If you are
using the pyspark2*, spark2-*, etc. executables, you will now be using
Spark 2.3.0.
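As a quick sanity check, something like the following in a pyspark2 session (or submitted with spark2-submit) should report the new version; this is just a minimal sketch:

    from pyspark.sql import SparkSession

    # In the pyspark2 shell a `spark` session already exists; when submitting
    # a script you build one yourself, as here.
    spark = SparkSession.builder.appName("spark-version-check").getOrCreate()
    print(spark.version)              # should now print 2.3.0
    print(spark.range(1000).count())  # trivial job to confirm it runs end to end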
We are moving towards making Spark 2 the default Spark for all Analytics production jobs. We don’t have a deprecation plan for Spark 1 yet, so you should be able to continue using Spark 1 for the time being. However, in order to support large YARN Spark 2 jobs, we need to upgrade the default YARN Spark Shuffle Service to Spark 2. This means that large Spark 1 jobs may no longer work properly. We don’t know of any large productionized Spark 1 jobs other than the ones the Analytics team manages, but if you have any that you are worried about, please let us know ASAP.
-Andrew & Analytics Engineering
Dear all,
We are studying workloads that include a sample of Wikipedia's traffic over a certain period of time, and we need patterns of user access to web servers in a decentralized hosting environment. The access patterns need to include real hits on the servers over time for one language. In other words, each trace record we require should contain at least four features: timestamp (like MM:DD:SS), web server id, page size, and operation (e.g., create, read, or update a page).
We have already reviewed some of the available downloadable datasets, such as https://dumps.wikimedia.org/other/pagecounts-raw/. However, they do not match our requirements. Does anyone know if it is possible to download a dataset with these four features from the Wikimedia website? Or should we use the REST API to acquire it? Thank you!
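For concreteness, here is a minimal sketch of querying the REST API's per-article pageviews endpoint (the article and date range are arbitrary examples); note it returns aggregate counts per article, not per-server hits, page sizes, or operations:

    import requests

    # Daily pageviews for one article from the Wikimedia REST API.
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/Albert_Einstein/daily/"
           "20180401/20180407")
    resp = requests.get(url, headers={
        "User-Agent": "research-script/0.1 (contact: you@example.org)"})
    resp.raise_for_status()
    for item in resp.json()["items"]:
        print(item["timestamp"], item["views"])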
--
Sincerely,
TA-YUAN
Hi everyone,
The Wikimedia Foundation is asking for your feedback in a survey. We want
to know how well we are supporting your work on and off wiki, and how we
can change or improve things in the future. The opinions you share will
affect the current and future work of the Wikimedia Foundation.
If you are a volunteer developer and have contributed code to any piece of MediaWiki, gadgets, or tools, please complete the survey. It is available in various languages and will take between 20 and 40 minutes to complete.
*Follow this link to the Survey:*
https://wikimedia.qualtrics.com/jfe/form/SV_5ABs6WwrDHzAeLr?aud=DEV
If you have already seen a similar message on Phabricator, MediaWiki.org, Discourse, or other platforms for volunteer developers, please don't take the survey twice.
You can find more information about this survey on the project page
<https://meta.wikimedia.org/wiki/Community_Engagement_Insights/About_CE_>
and see how your feedback helps the Wikimedia Foundation support
contributors like you. This survey is hosted by a third-party service and
governed by this privacy statement
<https://wikimediafoundation.org/wiki/Community_Engagement_Insights_2018_Sur…>.
Please visit our frequently asked questions page
<https://meta.wikimedia.org/wiki/Community_Engagement_Insights/Frequently_as…>
to find more information about this survey.
Feel free to email me directly with any questions you may have.
Thank you!
Edward Galvez
--
Edward Galvez
Evaluation Strategist, Surveys
Learning & Evaluation
Community Engagement
Wikimedia Foundation