Hello all,
The Code of Conduct Committee has published the list of candidates for the next six-month term:
https://www.mediawiki.org/wiki/Code_of_Conduct/Committee/Candidates/2018-I
If nominated, these candidates will be trusted to enforce the Code of Conduct for Wikimedia technical spaces. You can read it at https://www.mediawiki.org/wiki/Code_of_Conduct.
Any feedback or concerns about a candidate can be submitted in private to techconduct@wikimedia.org for the next two weeks, until Tuesday 2018-04-24.
If the candidate slate needs to change following community feedback, the committee will submit a new list, and a new two-week feedback period will take place.
--
For the Code of Conduct Committee,
Sébastien Santoro aka Dereckson
https://www.dereckson.be/
Hello Analytics!
I have a licensing question. If someone were to share a screenshot of the Pageviews Analysis <https://tools.wmflabs.org/pageviews> tool (or similar), are they bound by the REST API licenses described at https://wikimedia.org/api/rest_v1/?
I assume so, but wanted to make sure. I was prompted to ask because a user is considering using screenshots in a scholarly article, which they wish to publish under a CC-BY license.
Please tell me if this statement is accurate:
*The underlying pageviews data shown in Pageviews Analysis is provided by
the Wikimedia RESTBase API <https://wikimedia.org/api/rest_v1/>, released
under the CC-BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0/>
and GFDL <https://www.gnu.org/copyleft/fdl.html> licenses. Any use of this
data, including screenshots, is bound by these licenses, and you
irrevocably agree to release modifications or additions under these
licenses.*
Thanks!
~Leon
Hullo,
Page Previews is now fully deployed to all but two of the Wikipedias. In deploying it, we've created a new way to interact with pages without navigating to them. This impacts the overall and per-page pageview metrics that are used in myriad reports, e.g. to editors about the readership of their articles and in monthly reports to the board. Consequently, we need to be able to count a user reading the preview of a page just as we count them navigating to it.
Readers Web are planning to instrument Page Previews such that when a
preview is available and open for longer than X ms, a "page interaction" is
recorded. We're aware of a couple of mechanisms for recording something
like this from the client:
1. All files viewed with the media viewer are recorded by the client
requesting the /beacon/media?duration=X&uri=Y URL at some point [0] – as
Nuria points out in that thread, requests to /beacon/... are already
filtered and a canned response is sent immediately by Varnish [1].
2. Requesting a URL with the X-Analytics header [2] set to "preview". In
this context, we'd make a HEAD request to the URL of the page with the
header set.
IMO #1 is preferable from the operations and performance perspectives, as the response is always served from the edge and includes very few headers, whereas the request in #2 may be served by the application servers if the user is logged in (or in the mobile site's beta cohort). However, the requests in #2 are already
We're currently considering recording page interactions when previews are
open for longer than 1000 ms. We estimate that this would increase overall
web requests by 0.3% [3].
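For illustration, here's a minimal Python sketch of the two request shapes; in production the client-side JavaScript would issue these, and the /beacon/preview path is one of the options under discussion here, not a deployed endpoint:

    import requests

    PAGE_URL = "https://en.wikipedia.org/wiki/Example"  # hypothetical previewed page
    DURATION_MS = 1200  # time the preview was open, above the 1000 ms threshold

    # Mechanism 1: a beacon-style GET answered immediately at the edge by Varnish.
    requests.get(
        "https://en.wikipedia.org/beacon/preview",
        params={"duration": DURATION_MS, "uri": PAGE_URL},
    )

    # Mechanism 2: a HEAD request for the page itself, tagged via the
    # X-Analytics header so it can be told apart from a regular pageview.
    requests.head(PAGE_URL, headers={"X-Analytics": "preview"})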
Are there other ways of recording this information? We're fairly confident that #1 is the best choice here, but it's been referred to as the "virtual file view hack". Is this really the case? Moreover, should we request a distinct URL, e.g. /beacon/preview?duration=X&uri=Y, or should we consolidate the URLs, as both essentially represent the same thing?
Thanks,
-Sam
Timezone: GMT
IRC (Freenode): phuedx
[0] https://lists.wikimedia.org/pipermail/analytics/2015-March/003633.html
[1] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb;1bce79d58e03bd02888beef986c41989e8345037$269
[2] https://wikitech.wikimedia.org/wiki/X-Analytics
[3] https://phabricator.wikimedia.org/T184793#3901365
Hi everyone,
This is a friendly reminder about the Wikimedia Communities and
Contributors Survey.
*We have only heard from 50 Wikimedia volunteer developers. The survey will
close Sunday 22 April 2018.*
If you are a volunteer developer and have contributed code to any piece of MediaWiki, gadgets, or tools, please complete the survey. The opinions you share will affect the work of the Wikimedia Foundation.
*Follow this link to take the survey:* https://wikimedia.qualtrics.com/jfe/form/SV_5ABs6WwrDHzAeLr?aud=DEV
If you have already seen a similar message on Phabricator, MediaWiki.org, Discourse, or other platforms for volunteer developers, please don't take the survey twice.
It is available in various languages and will take between 20 and 40
minutes to complete.
You can find more information about this survey on the project page
<https://meta.wikimedia.org/wiki/Community_Engagement_Insights/About_CE_> and
see how your feedback helps the Wikimedia Foundation support contributors
like you. This survey is hosted by a third-party service and governed by this
privacy statement
<https://wikimediafoundation.org/wiki/Community_Engagement_Insights_2018_Sur…>.
Please visit our frequently asked questions page
<https://meta.wikimedia.org/wiki/Community_Engagement_Insights/Frequently_as…>
to find more information about this survey.
Feel free to email me directly with any questions you may have.
Thank you!
Edward Galvez from the Community Engagement department
Wikimedia Foundation
Hi everyone!
*tl;dr stop using notebook1001 by Monday April 2nd, use notebook1003
instead.*
*(If you don’t have production access, you can ignore this email.)*
As part of https://phabricator.wikimedia.org/T183145, we’ve ordered new hardware to replace the aging notebook1001. The new servers are ready to go, so we need to set a deprecation date for notebook1001. That date is Monday, April 2nd. After that, your work on notebook1001 will no longer be accessible. Instead, you should use notebook1003 (or notebook1004).
But there is good news too! Last week I rsynced everyone’s home directories from notebook1001 over to notebook1003. I also upgraded the default virtualenv your notebooks run from. Your notebook files should all be accessible on notebook1003. However, the version of Python 3 changed from 3.4 to 3.5 during this upgrade, so dependencies you installed on notebook1001 may not be available at first. You might need to re-run pip install for those dependencies in the new notebook Python 3.5 virtualenv. (I can’t really give you explicit instructions for that, as I don’t know what you use in your notebooks.)
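For example, from inside a notebook you can install a missing dependency into the active virtualenv with something like the sketch below; "some-package" is just a placeholder for whatever your notebook needs:

    import subprocess
    import sys

    # Installs into whatever virtualenv this notebook's kernel is running from;
    # "some-package" is a placeholder, not an actual requirement.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "some-package"])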
I’ll do a final rsync of any newer files in home directories from notebook1001 on Monday April 2nd. If you’ve been working on notebook1001 since March 15th, this should get everything up to date on notebook1003 before notebook1001 goes away. BUT! *Do not work on both notebook1001 and notebook1003*! My final rsync will keep the most recently modified version of files from either server.
OOooOo and there’s even more good news! I’ve made the notebooks able to
access system site packages, and installed a ton of useful packages
<https://github.com/wikimedia/puppet/blob/production/modules/statistics/mani…>
by default. pandas, scipy, requests, etc. If there’s something else you
think you might need, let us know. Or just pip install it into your
notebook.
pyhive has also been installed, so you should be able to access Hive directly from a Python notebook more easily.
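A minimal sketch, assuming the standard HiveServer2 port and placeholder host and user names (check the docs linked below for the actual connection settings):

    from pyhive import hive

    # Host and username are placeholders; see the SWAP docs for real values.
    conn = hive.connect(host="hive-server.example.org", port=10000,
                        username="your-shell-username")
    cursor = conn.cursor()
    cursor.execute("SHOW DATABASES")
    for (db,) in cursor.fetchall():
        print(db)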
I’ve updated docs at https://wikitech.wikimedia.org/wiki/SWAP#Usage, please
take a look.
If you have any questions, please don’t hesitate to ask, either here or on Phabricator: https://phabricator.wikimedia.org/T183145.
- Andrew Otto & Analytics Engineering
Hi everybody,
In T189051 the Analytics team introduced a new feature in the Hadoop cluster, namely the HDFS Trash directory. This means that if you now use the hdfs dfs -rm CLI command, you will not directly delete a file or a directory; instead, it will be moved under /user/$yourusername/.Trash. The Trash directory is "partitioned" into daily directories (called checkpoints), and files are kept for a month before being deleted. Here's a quick FAQ about how to recover data if needed:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster#recover_files…
If you want to skip the Trash directory, you can pass the -skipTrash option to hdfs dfs -rm, but of course this should be done only when you are really sure about what you are doing :)
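As a sketch of what this looks like in practice (all usernames, checkpoint names, and paths below are made-up examples; see the FAQ above for the authoritative steps), wrapped in Python for use from a script or notebook:

    import subprocess

    # Restore a file from a daily Trash checkpoint back to its original location.
    subprocess.check_call([
        "hdfs", "dfs", "-mv",
        "/user/yourusername/.Trash/180401000000/user/yourusername/data/myfile",
        "/user/yourusername/data/myfile",
    ])

    # Delete while bypassing the Trash entirely (use with care!).
    subprocess.check_call(
        ["hdfs", "dfs", "-rm", "-r", "-skipTrash", "/user/yourusername/tmp-dir"])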
We hope that this extra safety net will help all Hadoop users preserve data that might otherwise get deleted by mistake.
If you have comments, suggestions, etc., feel free to reach out to the Analytics team via the mailing list or IRC.
Thanks!
Luca (on behalf of the Analytics team)
Hi from the Analytics team!
Just dropping a quick email to let you all know that after our latest
release, the new Wikistats is now a responsive site!
We spent a bit of time last quarter making sure that the experience of browsing the stats on mobile is at least as good as it is in a desktop browser. Please take a look at the site with your phone or tablet and let us know if you see something weird or have any suggestions on how to make it better.
https://stats.wikimedia.org/v2/#/all-projects
Thank you!
Fran/The A Team
--
*Francisco Dans*
Software Engineer, Analytics Team
Wikimedia Foundation
Hi all!
I just upgraded spark2 across the cluster to Spark 2.3.0
<https://spark.apache.org/releases/spark-release-2-3-0.html>. If you are
using the pyspark2*, spark2-*, etc. executables, you will now be using
Spark 2.3.0.
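As a quick sanity check, something like the following in a pyspark2 session (or submitted with spark2-submit) should report the new version; this is just a minimal sketch:

    from pyspark.sql import SparkSession

    # In the pyspark2 shell a `spark` session already exists; when submitting
    # a script you build one yourself, as here.
    spark = SparkSession.builder.appName("spark-version-check").getOrCreate()
    print(spark.version)              # should now print 2.3.0
    print(spark.range(1000).count())  # trivial job to confirm it runs end to end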
We are moving towards making Spark 2 the default Spark for all Analytics production jobs. We don’t have a deprecation plan for Spark 1 yet, so you should be able to continue using Spark 1 for the time being. However, in order to support large YARN Spark 2 jobs, we need to upgrade the default YARN Spark Shuffle Service to Spark 2. This means that large Spark 1 jobs may no longer work properly. We don’t know of any large productionized Spark 1 jobs other than the ones the Analytics team manages, but if you have any that you are worried about, please let us know ASAP.
-Andrew & Analytics Engineering
Dear all,
We are studying workloads that include a sample of Wikipedia's traffic over a certain period of time, and we need patterns of user access to web servers in a decentralized hosting environment. The access patterns need to include real hits on the servers over time for one language. In other words, each trace record we require should contain at least four features: timestamp (like MM:DD:SS), web server id, page size, and operation (e.g., create, read, or update a page).
We have already reviewed some of the available downloadable datasets, such as https://dumps.wikimedia.org/other/pagecounts-raw/. However, they do not match our requirements. Does anyone know if it is possible to download a dataset with these four features from the Wikimedia website? Or should we use the REST API to acquire it? Thank you!
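For concreteness, here is a minimal sketch of querying the REST API's per-article pageviews endpoint (the article and date range are arbitrary examples); note it returns aggregate counts per article, not per-server hits, page sizes, or operations:

    import requests

    # Daily pageviews for one article from the Wikimedia REST API.
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/Albert_Einstein/daily/"
           "20180401/20180407")
    resp = requests.get(url, headers={
        "User-Agent": "research-script/0.1 (contact: you@example.org)"})
    resp.raise_for_status()
    for item in resp.json()["items"]:
        print(item["timestamp"], item["views"])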
--
Sincerely,
TA-YUAN
Hi everyone,
The Wikimedia Foundation is asking for your feedback in a survey. We want
to know how well we are supporting your work on and off wiki, and how we
can change or improve things in the future. The opinions you share will
affect the current and future work of the Wikimedia Foundation.
If you are a volunteer developer and have contributed code to any piece of MediaWiki, gadgets, or tools, please complete the survey. It is available in various languages and will take between 20 and 40 minutes to complete.
*Follow this link to the Survey:*
https://wikimedia.qualtrics.com/jfe/form/SV_5ABs6WwrDHzAeLr?aud=DEV
If you have already seen a similar message on Phabricator, MediaWiki.org, Discourse, or other platforms for volunteer developers, please don't take the survey twice.
You can find more information about this survey on the project page
<https://meta.wikimedia.org/wiki/Community_Engagement_Insights/About_CE_>
and see how your feedback helps the Wikimedia Foundation support
contributors like you. This survey is hosted by a third-party service and
governed by this privacy statement
<https://wikimediafoundation.org/wiki/Community_Engagement_Insights_2018_Sur…>.
Please visit our frequently asked questions page
<https://meta.wikimedia.org/wiki/Community_Engagement_Insights/Frequently_as…>
to find more information about this survey.
Feel free to email me directly with any questions you may have.
Thank you!
Edward Galvez
--
Edward Galvez
Evaluation Strategist, Surveys
Learning & Evaluation
Community Engagement
Wikimedia Foundation