Hi Analytics,
the Gerrit Cleanup Day on Wed 23rd is approaching fast - less than one
week left. More info: https://phabricator.wikimedia.org/T88531
Do you feel prepared for the day, and do all team members know what to
do?
If not, what are you missing and how can we help?
Some Gerrit queries for each team are listed under "Gerrit queries per
team/area" in https://phabricator.wikimedia.org/T88531
Are they helpful and a good start? Or do they miss some areas? Do you
have existing Gerrit team queries to use instead or to "integrate",
e.g. for parts of MediaWiki core that you might work on?
Also, which person will be the main team contact for the day (available
in #wikimedia-dev on IRC) and help organize review work in your areas,
so that other teams can easily reach out?
Some teams' plates are emptier than others, so they are wondering where
and how to lend a helping hand (and would like to find out in advance,
due to timezones).
Thanks for your help in making the Gerrit Cleanup Day a success!
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
Hi everyone. End of quarter is rapidly approaching and I wanted to ask a
quick question about one of the endpoints we want to push out. We want to
let you ask "what are the top articles" but we're not sure how to structure
the URL so it's most useful to you. Here are the choices:
Choice 1. /top/{project}/{access}/{days-in-the-past}
Example: top articles via all en.wikipedia sites for the past 30 days:
/top/en.wikipedia/all-access/30
Choice 2. /top/{project}/{access}/{start}/{end}
Example: top articles via all en.wikipedia sites from June 12th, 2014 to
August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30
(in both choices,
* {project} means en.wikipedia, commons.wikimedia, etc.
* {access} means access method as in desktop, mobile web, mobile app
)
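To make the two options concrete, here is a tiny client-side sketch of
how each style would be constructed (the base URL and helper names are
made up, purely for illustration):

    from datetime import date

    BASE = "https://example.org/metrics"  # hypothetical base URL

    def top_url_choice_1(project, access, days_in_past):
        # Choice 1: a relative window counting back from today
        return f"{BASE}/top/{project}/{access}/{days_in_past}"

    def top_url_choice_2(project, access, start, end):
        # Choice 2: an explicit, reproducible date range
        return f"{BASE}/top/{project}/{access}/{start:%Y-%m-%d}/{end:%Y-%m-%d}"

    print(top_url_choice_1("en.wikipedia", "all-access", 30))
    print(top_url_choice_2("en.wikipedia", "all-access",
                           date(2014, 6, 12), date(2015, 8, 30)))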
Which do you prefer? Would any other query style be useful?
Hello, lists.
You may have heard of projects <https://www.mediawiki.org/wiki/Wikimedia_Research#Highlights> such as revision scoring or article recommendations. We’re now looking for a full-stack software engineer to join Wikimedia Research and support and scale up these and similar projects.
Job description below; please help us find the best possible candidates.
Dario
Software Engineer - Research <http://grnh.se/b12qur>
Summary
Help us create a world in which every single human being can freely share in the sum of all knowledge.
We are a team of scientists and UX researchers at the Wikimedia Foundation using data to understand and empower millions of users – readers, contributors, and donors – who interact with Wikipedia and its sister projects on a daily basis. We turn research questions into publicly shared knowledge; we design and test new technology; we produce data-driven insights to support product and engineering decisions; and we publish research informing the organization’s and the movement’s strategy. We are strongly committed to the principles of transparency, privacy, and collaboration; we use free and open source technology; and we collaborate with researchers in industry and academia. As a full member of the Wikimedia Research department, you will help us build and scale the infrastructure our team needs for research and experimentation, implementing new technology and data-intensive applications.
Description
Collaborate with researchers to expose algorithms and machine learning systems through APIs and web applications.
Design, develop, test, and deploy new features, improvements and upgrades to the infrastructure that supports research and powers data-intensive applications.
Support our data science team in optimizing computationally intensive data processing.
Support our UX research capacity by growing, expanding and maintaining our user testing platform and instrumentation stack.
Work in coordination with other infrastructure teams such as Services and Analytics Engineering as well as Product teams to grow and scale research-driven services and applications.
Requirements
Real-world experience writing applications using both scripting languages (e.g. Python, JavaScript, PHP) and compiled languages (e.g. Java, Scala, C, C#)
Experience with MySQL/Postgres or similar database technology
Experience developing APIs for data retrieval
Understanding of basic statistical concepts
BS, MS, or PhD in Computer Science, Mathematics, or equivalent work experience
Pluses
Experience with high-traffic web architectures and operations
Production experience with Hadoop and ecosystem technology (Pig, Hive, streaming)
Experience with web UI design (JavaScript, HTML, CSS)
Familiarity with scientific computing libraries in Python and R
Experience working with volunteers
Big ups if you are a contributor to Wikipedia or other open collaboration projects
Show us your stuff! Please provide us with information you feel would be useful to us in gaining a better understanding of your technical background and accomplishments. Links to GitHub, your technical blogs, publications, personal projects, etc. are exceptionally useful. We especially appreciate pointers to your best contributions to open source projects.
About the Wikimedia Foundation
The Wikimedia Foundation is the non-profit organization that operates Wikipedia, the free encyclopedia. Wikipedia and the other projects operated by the Wikimedia Foundation receive more than 431 million unique visitors per month, making them the 5th most popular web property worldwide. Available in more than 287 languages, Wikipedia contains more than 32 million articles contributed by a global volunteer community of more than 100,000 people. Based in San Francisco, California, the Wikimedia Foundation is an audited, 501(c)(3) charity that is funded primarily through donations and grants. The Wikimedia Foundation was created in 2003 to manage the operation of Wikipedia and its sister projects. It currently employs over 208 staff members. Wikimedia is supported by local chapter organizations in 40 countries or regions.
The Wikimedia Foundation offers competitive benefits. Fully paid medical, dental, and vision coverage for employees and their eligible families (yes, fully paid premiums!). A Wellness Program which provides reimbursement for mind, body and soul activities such as fitness memberships, massages, cooking classes and much more. 401(k) retirement plan with matched contributions of 4% of annual salary.
More Information
http://wikimediafoundation.org
http://blog.wikimedia.org
http://wikimediafoundation.org/wiki/Vision
About Wikimedia Research
https://www.mediawiki.org/wiki/Wikimedia_Research
Examples of code
https://github.com/wiki-ai/revscoring
https://github.com/wiki-ai/ores
https://github.com/halfak/MediaWiki-Utilities
https://github.com/halfak/mwstreaming
Dario Taraborelli, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
This app is really cool. I wonder if, besides future predictions, it
could be modified to support another use case: assessing the impact of
past events and software changes on our pageviews.
As many of us are aware, the Wikimedia movement has been struggling
for a long time to understand the effects of our work (and of outside
events) on our readership. And while WMF engineering teams are getting
better about doing, say, A/B tests, it's often not possible to provide
a controlled environment for such experiments.
There's an established statistical technique aimed at such situations,
called "intervention analysis"; see e.g. [1]. It requires modeling the
time series (here: monthly pageviews) with an ARIMA model, just as has
been done in the app. One then does a backdated forecast from the time
of the intervention, and uses the difference between that forecast and
the actual data to estimate the effect of the intervention. I've been
wondering recently if this has ever been used for Wikipedia pageviews;
yesterday, while attending Morten's research showcase talk about their
"misalignment" paper, I noticed that that paper has indeed applied it
(to views of individual articles, where it may be easier to isolate
effects).[2] Is anyone aware of other examples?
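For the curious, here is a rough sketch of what such a backdated
forecast could look like in Python with statsmodels (the data file,
intervention date, and ARIMA order below are all made up, just to show
the shape of the analysis):

    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Monthly pageviews indexed by month; file and column names are made up.
    views = pd.read_csv("monthly_pageviews.csv",
                        index_col="month", parse_dates=True)["views"]

    pre = views.loc[:"2015-05"]   # everything before the (made-up) intervention
    post = views.loc["2015-06":]  # everything from the intervention onward

    # Fit an ARIMA model on pre-intervention data only;
    # the (1, 1, 1) order is just a placeholder.
    fit = ARIMA(pre, order=(1, 1, 1)).fit()

    # Backdated forecast: what the model says would have happened
    # without the intervention.
    counterfactual = fit.forecast(steps=len(post))

    # The estimated effect of the intervention is the gap between the
    # actual data and the counterfactual forecast.
    effect = post.values - counterfactual.values
    print(f"Cumulative effect: {effect.sum():,.0f} pageviews")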
Would it be possible to modify the app to support such backdated
forecasts, as a first step, and also to calculate their difference
from the actual data?
[1] https://onlinecourses.science.psu.edu/stat510/node/76
[2] http://www-users.cs.umn.edu/~morten/publications/icwsm2015-popularity-quali…
(p.8)
On Tue, Sep 15, 2015 at 5:28 PM, Dario Taraborelli
<dtaraborelli(a)wikimedia.org> wrote:
>
> An updated version of a pageview forecasting application written by Ellery (Research & Data team) has just been released:
>
> https://ewulczyn.shinyapps.io/pageview_forecasting
> https://twitter.com/WikiResearch/status/643942154549592064
>
> The data is refreshed monthly and it includes breakdowns by country and platform.
>
> Dario
>
>
>
> Dario Taraborelli Head of Research, Wikimedia Foundation
> wikimediafoundation.org • nitens.org • @readermeter
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
This discussion is about needed updates to the definition and the Analytics
implementation of the mobile apps page view metrics. There is also an
associated Phab task[4]. Please add the proper Analytics project there.
Background / Changes
As you probably remember, the Android app splits a page view into two
requests: one for the lead section and metadata, plus another one for the
remainder.
The mobile apps are going to change the way they load pages in two
different ways:
1. We'll add a link preview when someone clicks on a link from a page.
2. We're planning on switching over to using RESTBase for loading pages
and also the link preview (initially just the Android beta, later more).
This will have implications for the pageviews definition and how we count
user engagement.
The big question is:
Should we count link previews as page views, since they are an indication
of user engagement? Or should there be a separate metric for link previews?
Counting page views
IIRC we currently count api.php requests with the query parameters
action=mobileview&sections=0 as a page view. When we release link previews
to all Android app users, we would want to either also count the calls to
action=query&prop=extracts as page views or add them to another metric.
Once the apps use RESTBase, the HTTPS requests will be very different:
- Page view: Instead of action=mobileview&sections=0, the app would call
the RESTBase lead-section endpoint[1] rather than the PHP API mentioned
above. Then it would call [2].
- Link preview: Instead of action=query&prop=extracts, it would call the
lead request[1], too, since there is a lot of overlap. At least that's our
current plan. The advantage is that the client doesn't need to execute
the lead request a second time if the user clicks on the link preview
(either through caching or app logic).
So, in the RESTBase case we want to count either the
mobile-html-sections-lead requests or the
mobile-html-sections-remaining requests,
depending on what our definition of page views actually is. We could also
add a query parameter or an extra HTTP header to the
mobile-html-sections-lead requests if we need to distinguish between
previews and page views.
Both the current PHP API-based and the RESTBase-based metrics would need
to be compatible and collected in parallel, since we cannot control when
users update their apps.
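To make the alternatives concrete, here is a rough sketch (not actual
Analytics code; the labels are made up) of how a request classifier
could tell these cases apart during the transition period:

    from urllib.parse import urlsplit, parse_qs

    def classify(uri):
        parts = urlsplit(uri)
        query = parse_qs(parts.query)

        # Current PHP API scheme
        if parts.path.endswith("/api.php"):
            if (query.get("action") == ["mobileview"]
                    and query.get("sections") == ["0"]):
                return "pageview"
            # prop may carry multiple pipe-separated values
            if (query.get("action") == ["query"]
                    and "extracts" in query.get("prop", [""])[0]):
                return "link-preview"

        # RESTBase scheme: a lead request alone is ambiguous, since it
        # would back both page views and link previews (hence the idea
        # of a distinguishing query parameter or header).
        if "/page/mobile-html-sections-lead/" in parts.path:
            return "pageview-or-preview"
        if "/page/mobile-html-sections-remaining/" in parts.path:
            return "pageview"

        return "other"

    print(classify("https://en.wikipedia.org/api/rest_v1/"
                   "page/mobile-html-sections-lead/Dilbert"))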
[1]
https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-lead/Dilbert
[2]
https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-remaining/Di…
[3]
https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/RESTBase_services_for_ap…
[4] https://phabricator.wikimedia.org/T109383
Cheers,
Bernd
Hi Analytics listeners,
We are experiencing issues on the Hadoop cluster (more precisely, logs
don't flow from Kafka into HDFS; even more precisely, the Camus job seems
broken). The consequence is that data is late starting from yesterday,
2015-09-16, hour 21.
We are actively working on it and will let you know when it is solved.
--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
Hi,
My name is Diana. I am a developer at AOL.
We are partners of Wikipedia. Currently, we consume the pagecounts raw
files that you generate every day as an input to our Wikipedia
infrastructure.
http://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-09/
We have noticed that there are no new files since yesterday at 18:00, and
because of that we are having issues on our side.
I would like to know whether this issue has already been noticed on your
side and whether there is ongoing work to fix the generation of these
files.
Thanks in advance,
Diana
When we process Event Logging events, we hash the origin IP address and add
it to the event as part of the "capsule". We salt the hash function and
rotate the salt frequently for security, but within each salt period the
same IP would get hashed to the same value, and some people depended on
that.
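For context, the scheme is essentially this (a minimal sketch, not the
actual EventLogging code; the salt value and names are made up):

    import hashlib

    SALT = "rotated-every-period"  # hypothetical; the real salt and its rotation are internal

    def hash_client_ip(ip, salt=SALT):
        # Same IP + same salt -> same digest, so analysts can group
        # events by client within one salt period without ever seeing
        # the raw IP.
        return hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()

    # The bug described below: each parallel processor instance ended up
    # with its own salt, so the same IP could map to different hashes
    # depending on which instance handled the event.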
We recently made the Event Logging processor parallel, but forgot to make
this hashing consistent across all the parallel instances.
So from September 10, 2015 until we fix the bug, client IPs will not be
hashed consistently.
We are tracking this issue here: https://phabricator.wikimedia.org/T112688
If you have some data crunching that's affected by this, come talk to us.
We are already adding a temporary fix to the scripts that generate the
edit-analysis dashboard [1].
[1] https://edit-analysis.wmflabs.org/compare/