Hi Everyone,
Many of you have read Magnus' post on his
blog<http://magnusmanske.de/wordpress/?p=173>.
I've commented on his blog and I wanted to repost here and address Magnus'
concerns directly.
First of all, I'm sorry that we let Magnus and other folks down on the page
view APIs -- we made some commitments late last year that we weren't able
to meet. Not only that, these failures echoed previous points of
frustration with the Foundation.
I do want to note that we actively support the infrastructure that feeds
data to stats.grok.se. We've fixed a number of issues with that pipeline,
most recently last week. We understand the importance of this data to the
community.
The page view API project has been challenging for a number of reasons --
the size of the data, the fact that definitions of page views have not been
updated to stay in line with the changing traffic (mobile, bots, API
requests, etc.), and the challenges in aggregating various aliases. We've
needed to revisit our definitions of page views in order to get this right
as well as design and build a global architecture for collecting these and
other metrics. In addition, we've tried to do this with a perspective of
privacy and respect for our users.
To this end, we presented an approach to measuring page views in MediaWiki
at FOSDEM in January and have made progress towards our new infrastructure
by deploying middleware delivering unsampled page view data from mobile
devices from our globally distributed datacenters to our compute cluster
for analysis.
However, these initiatives are complex and will take several months to
complete at the earliest. In the meantime, we're working with Henrik to
scale up stats.grok.se.
I also want to call out that the Analytics team has been supporting a wide
range of users and stakeholders during the year. We've developed
WikiMetrics, a tool for measuring editor productivity that is used by WMF
program evaluation and community members; provided dashboards and support
for Wikipedia Zero, our program that works with mobile partners to
enable mobile Wikipedia access free from data charges; and supported
product teams and researchers both inside and outside of the Foundation.
We've been prioritizing and working on these projects as our resources
allow and it's important to understand that the team has not been idle.
While we've done a less than stellar job in communicating our progress to
the community, information on what we've been doing is available via
our planning
pages <https://www.mediawiki.org/wiki/Analytics/Prioritization_Planning> on
mediawiki.org. In the future, we will be more proactive in communicating with
the community regarding our goals and projects.
-Toby
I moved all coordination cards (think of them as “epics” or collections of tasks, with checklists pointing to individual cards) to a dedicated list: this should make it easier to follow what we’re working on.
Also, if you update a card, please move it to the top of the list (wondering if there’s an automatic way to do so).
Dario
The issue with missing server-side EventLogging events [1] has been resolved and data recovery from the raw logs into the log DB (i.e. the source used to populate most dashboards) is currently underway. It will take approximately 4 days to restore the complete logs.
Many thanks to Ori for helping with this.
Dario
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=60550
Hello all.
I'm CCing the analytics list in case this question is also relevant for them. Not sure about research-l, so please forward this message if that's the case.
I have a question regarding the registration date for new user accounts (table user). The information is (apparently) public, as it can be retrieved from Special:ListUsers:
http://en.wikipedia.org/w/index.php?title=Special%3AListUsers&username=DarT…
Furthermore, Dario uploaded to DataHub a CSV file with an hourly series of registration dates in 2008-2011, from enwiki:
http://datahub.io/dataset/wikipedia-new-user-registrations
It would be quite interesting to study the whole series (say, back to 2004) and compare it with other languages. However, this info is not available on the DB replicas in Tool-Labs (whole column 'user_registration' in table 'user' is NULL).
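For illustration only, here is a minimal Python sketch of how such an hourly series could be built once the registration timestamps are available. It assumes the values arrive as 14-character MediaWiki timestamps (YYYYMMDDHHMMSS), possibly as bytes, with NULLs represented as None; the function name and the sample values are made up for this example and are not taken from the DataHub file or the replicas.

```python
from collections import Counter
from datetime import datetime


def hourly_registration_counts(timestamps):
    """Count registrations per hour from MediaWiki-style timestamps.

    `timestamps` is any iterable of 14-character strings or bytes
    (YYYYMMDDHHMMSS). None values (NULL user_registration) are skipped.
    """
    counts = Counter()
    for ts in timestamps:
        if ts is None:
            continue  # NULL user_registration, nothing to count
        if isinstance(ts, bytes):
            ts = ts.decode("ascii")
        parsed = datetime.strptime(ts, "%Y%m%d%H%M%S")
        # Truncate to the hour so all registrations in that hour share a key.
        counts[parsed.strftime("%Y-%m-%d %H:00")] += 1
    return counts


# Made-up sample values, not real registration data.
sample = ["20080101000512", b"20080101004501", None, "20080101013000"]

for hour, n in sorted(hourly_registration_counts(sample).items()):
    print(hour, n)
```

The same bucketing could be done directly in SQL on the replica; the Python version is just easier to reuse across wikis when comparing languages.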
My question is: are there any reasons for redacting this (apparently public) info? I can't figure out why this could be sensitive data.
Thanks in advance, best regards.
Felipe.
Hey folks, it’s that time of the month. Please review and add anything worth reporting:
http://etherpad.wikimedia.org/p/RD201401
As usual, we are not including minor internal items.
I’m planning to post this by EOD PT today.
Dario
Hi Everyone,
I'd like to introduce Leila Zia, a Research Scientist, to the Analytics
Team. We're really excited about bringing Leila's skills and expertise to
the Foundation!
In her own words:
Leila Zia's areas of expertise are data mining, linear and non-linear
optimization, dynamic programming, queueing theory, and game theory. She
holds a PhD degree in Management Science and Engineering from Stanford
University, an MSc degree from Rutgers University, and a BSc degree from
Sharif University of Technology. Leila's past research has focused on
predictive modeling and optimization in two healthcare applications,
namely, appointment non-attendance and colorectal cancer mortality. She is
excited to apply her research skills to Wikipedia-related questions. Leila can
be contacted at leila(a)wikimedia.org.
She'll be working out of the San Francisco office -- Welcome Leila!
-Toby
Hi all,
On the Growth team, we (and by we, I mean Aaron Halfaker) have been doing a
great deal of work to understand trends in new article creation,[1]
particularly from the new user perspective. Along with this and our launch
of the new Draft namespace, we've discovered that our current data sources
for tracking page creations, moves, and deletions are far too slow and
awkward to use on a daily or weekly basis.
To solve this problem and answer on-going questions about how many page
creators there are, how successful they are, and what workflows they use,
we've created three new schemas:
- https://meta.wikimedia.org/wiki/Schema:PageCreation
- https://meta.wikimedia.org/wiki/Schema:PageDeletion
- https://meta.wikimedia.org/wiki/Schema:PageMove
We envision using these similarly to how we're using schemas like
Schema:ServerSideAccountCreation and Schema:PrefUpdate. We will likely be
implementing these in our team's next sprint, starting on February 5th, so
if you have feedback please speak up soon. :)
1. https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation
--
Steven Walling,
Product Manager
https://wikimediafoundation.org/
I was a bit irritated yesterday to learn that we can automate the
creation of Limn graphs and speed up the process.
I had become so tired of manually copying and pasting existing graphs
and manually editing them to work for a new graph that I knocked up a
script to do this for me. The script simply takes an SQL query and the
config file and generates all the necessary JSON files so that the
graph shows up on the Limn dashboard.
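
As a rough illustration of the idea (this is not the script in [1], and the file names and JSON fields below are placeholders rather than Limn's real schema), a generator of this kind could look something like the following sketch. It takes a graph title and an SQL query and writes the query plus two JSON stubs into an output directory:

```python
import json
import re
import tempfile
from pathlib import Path


def slugify(title):
    """Turn a graph title into a lowercase, dash-separated identifier."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def generate_graph_files(title, sql, out_dir):
    """Write the SQL query plus two JSON stubs for one graph.

    The layout (<slug>.sql, <slug>.datasource.json, <slug>.graph.json)
    and every JSON key are hypothetical, chosen only for this example.
    """
    slug = slugify(title)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    sql_path = out / f"{slug}.sql"
    sql_path.write_text(sql.strip() + "\n", encoding="utf-8")

    datasource = {"id": slug, "name": title, "query_file": sql_path.name}
    graph = {"id": slug, "title": title, "datasource": slug}

    paths = [sql_path]
    for kind, payload in (("datasource", datasource), ("graph", graph)):
        path = out / f"{slug}.{kind}.json"
        path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Example run against a temporary directory with a placeholder query.
with tempfile.TemporaryDirectory() as tmp:
    written = generate_graph_files("Watchlist Activity", "SELECT 1;", tmp)
    written_names = [p.name for p in written]
    print(written_names)
```

Whatever the standard approach ends up being, keeping the query and the generated JSON side by side like this makes it easy to regenerate a graph after editing the query.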
With this script I was able to generate 5 graphs in the time it takes
me to generate 1.
However since uploading the script [1] I have now learnt other scripts
like this exist. Please can we standardise on a way to generate these
graphs (either locally or on the server) and detail it in the README
to make this whole process of graph generation nicer for everyone
involved?
I've added some graphs [2] (which should update soon) that show activity
in the left navigation menu, on the watchlist page and on the diff
page. We had this data so it seemed silly not to display it somewhere.
When the data becomes available you'll notice that, interestingly, the
'Home' link in the main menu is our most widely used feature. It will
be great to see how that changes when search becomes available on
special pages. Likewise, Random is a very widely used feature - we
should continue experimenting with it and try to use it to engage
new editors.
[1] https://gerrit.wikimedia.org/r/#/c/110271/2/generate-graph.py
[2] http://mobile-reportcard.wmflabs.org/#other-graphs-tab