I'm looking for a method to determine the parameters of the distribution of
page views per visit. I would also love to know the distribution for the
length of time between visits. Does anyone know of any studies already done
on this topic? Google is not my friend today -- I haven't yet found
If the data doesn't exist, the best I think I can do is use the average
number of page views per visit. I have a problem, though: the comScore
numbers available at http://reportcard.wmflabs.org/ are broken out by
region, not by site. Using this data I'll only be able to get the average
for all our properties worldwide -- which is a little rough. Does
anyone have access to the raw data? If so -- does it tell us the number of
uniques per site, or is it really only by region?
Does anyone have any better ideas?
-- Context --
I'm trying to model some fundraising data to solve the optimal banner
distribution problem (effectively what's the best way to show people
banners). Our data on the 'number of banner impressions till donation'
indicates that people are far more likely to donate on the first banner
impression, and that this likelihood decays over subsequent impressions. My
hypothesis is that there is no difference between showing a user only one
banner per visit over multiple visits and showing multiple banners 100% of the time.
If this hypothesis is true, it will lead to fundraising developing a
banner display function that solves the following problem statement:
"show P percent of all unique visitors, under time T, N banners with M
banners displayed per session".
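For what it's worth, here is a minimal R sketch of how I'm thinking about the comparison, assuming page views per visit follow a geometric distribution parameterised only by an average, and using an invented donation-decay curve; every number and name here is a placeholder, not real data.

    # Minimal sketch (invented numbers): compare "one banner per visit over
    # several visits" vs "a banner on every page view", assuming page views
    # per visit are geometric with a given mean and a donation probability
    # that decays with each additional impression seen.
    set.seed(42)

    avg_views_per_visit <- 4.5          # assumed average, not a real comScore figure
    p_geom <- 1 / avg_views_per_visit   # geometric on {1, 2, ...} with mean 1/p

    n_users  <- 20000                   # simulated unique visitors
    n_visits <- 3                       # assumed visits per user in the campaign window

    # Assumed decaying donation probability per cumulative banner impression.
    p_donate <- function(k) 0.004 * 0.5^(k - 1)

    simulate <- function(one_per_visit) {
      donations <- 0
      for (u in seq_len(n_users)) {
        views <- rgeom(n_visits, p_geom) + 1                 # page views in each visit
        impressions <- if (one_per_visit) rep(1, n_visits) else views
        k <- seq_len(sum(impressions))                       # cumulative impression index
        if (any(runif(length(k)) < p_donate(k))) donations <- donations + 1
      }
      donations / n_users                                    # donation rate per user
    }

    c(one_per_visit = simulate(TRUE), every_page_view = simulate(FALSE))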
Fundraising Technology Team
On Sunday, February 17, 2013 at 10:56 PM, Tim Starling wrote:
> On 16/02/13 07:55, Steven Walling wrote:
> > I didn't see it in the docs above, so thought I'd ask... Is this going
> > to include rollout of the CodeEditor extension, or will that be done
> > separately?
> CodeEditor will be enabled, but with $wgCodeEditorEnableCore = false,
> i.e. in the Module namespace only, not in the MediaWiki namespace.
> This is the same way we deployed it to mediawiki.org.
> As I said to Ori when he asked me about this: I'm fine with it being
> deployed with $wgCodeEditorEnableCore = true, I just don't want to
> have to project manage it, since large JS apps are not the sort of
> thing I usually do. With $wgCodeEditorEnableCore = true, it's a fairly
> disruptive extension, so it would be good to have someone handling
> community notifications and bug reports.
I'm interested in seeing this through (= enabling CodeEditor on MediaWiki NS), but I need a bit more time with the extension first so I can size up how much ongoing work it will require. There's already a bug asking for it to be deployed: https://bugzilla.wikimedia.org/show_bug.cgi?id=39654. I propose we (meaning anyone interested in this deployment, Tim exempted per above) track progress there.
> > This is exciting! Do we have plans for further measurement when it
> > comes to Lua's impact on page load times/publishing any results so
> > far? In addition to the general benefit of not having to program using
> > wikitext/parser functions, I seem to remember the performance
> > improvements being the big selling point of Scribunto.
> It will be possible to gather some retrospective data from slow-parse.log.
To expand a little: slow-parse.log is a log file on fluorine (accessible to users with shell; the file is in /a/mw-log) that gets an entry every time an article takes more than three seconds to render. Each entry looks like this:
2013-02-18 12:55:18 mw1058 enwiki: 4.25 War_of_the_First_Coalition
The fields are (from left to right) current date, current time, host, wiki, rendering time, title.
We have six months' worth of logs, broken down by calendar day, in /a/mw-log/archive. The oldest is 2012-08-22 (we may have older files on tape backup). Log files are about 14-15 MB each, gzipped; six months' worth is 2.4 GB.
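As a rough illustration of what one could do with these logs once a copy is somewhere accessible, here's a short R sketch that tallies the slowest titles; the file path is a placeholder and the format is the whitespace-separated one shown above.

    # Illustrative only: read a local copy of slow-parse.log (placeholder path)
    # using the whitespace-separated fields described above.
    log <- read.table("slow-parse.log",
                      col.names = c("date", "time", "host", "wiki", "seconds", "title"),
                      stringsAsFactors = FALSE)
    enwiki <- subset(log, wiki == "enwiki:")
    # Twenty titles with the highest median rendering time
    head(sort(tapply(enwiki$seconds, enwiki$title, median), decreasing = TRUE), 20)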
If this information were made more visible, it could give editors a palpable sense of accomplishment as expensive templates are ported to Lua. I don't think the logs contain any sensitive data, so it should be doable to set up an rsync job to sync them to labs and thus make them available for people to analyze and visualize. Would anyone be interested in that? (I'm CCing the analytics list as well.)
As Rob notes, deployment of Lua is not by itself expected to have an impact on rendering time; rather, it is the porting of templates to use it that will speed things up. The full picture will only emerge in the weeks and months following deployment.
Anyways, I'm pretty excited about this. It's a big change. Congrats to all involved.
FYI - a new hire relevant to this list, working on data-type stuff :)
---------- Forwarded message ----------
From: Oona Castro <ocastro(a)wikimedia.org>
Date: Fri, Feb 15, 2013 at 6:09 AM
Subject: [Wmfall] Welcome Henrique Andrade! (our new data consultant for
the Brazil Catalyst Program)
To: "Staff (All)" <wmfall(a)lists.wikimedia.org>
I'm glad to announce our new hire in Brazil, Henrique Andrade, who will be
a Data and Experiments consultant. The position was created to 1) better
track/measure the impact and results of our activities in Brazil; and also
to 2) support the community with data, so that every change to policies or
engagement campaigns is carefully measured and everyone can make more
data-driven decisions.
Henrique Andrade is an Information Technology (IT) professional and teacher
of Web Technologies, Distance Learning Skills, Entrepreneurship and Digital
Culture for graduate and undergraduate students, and a researcher who is
interested in free software, free culture, distance learning,
entrepreneurship and databases.
He holds a bachelor's degree in Information Systems from UNIRIO
(Federal University of the State of Rio de Janeiro) and is a Master's
student in Computing and Society in the Computing and Systems Engineering
Program, COPPE/UFRJ (Federal University of Rio de Janeiro). He has also
taken an undergraduate course in Business at UFRRJ (Federal Rural
University of Rio de Janeiro).
Henrique has been the DBA (DataBase Administrator) at UNIRIO and the Lead
Developer of e-UNI (electronic UNIversity), a Learning Management System
customized for universities, built on the free software Moodle
platform. He has been a registered user of Wikipedia since 2009, and has
been familiar with wikis since 2003, having used TWiki in nationwide projects.
Henrique also has a lot of experience in public speaking, and has spoken at
many major Brazilian IT conferences, such as FISL, LATINOWARE, CONSEGI,
ENTI and CSBC. He has been part of Brazilian free software communities,
such as PSL-Brasil and SLRJ, and was a member of the Free Software
Implementation Technical Committee of the Brazilian Federal Government.
Since life is not only made of working hours, Henrique also likes to brew
his own beer (and drink it, I guess!) and plays basketball on the Campo
Grande Athletic Club amateur team (we're not going into details on physical
profile here, but I can assure you that thanks to him we considerably increased
the average height of the Brazilian team).
I wish Henrique very good luck in his new job. Let's welcome him, and I hope we
can all work together on the challenges we'll face to improve our data
analysis capacities. He went through a long hiring process, and I warmly
welcome him and wish him a great time with us and a lot of collaboration.
About the pronunciation of his name (as many have asked us): "Enricke" - stress on
"i" and you don't need to pronounce the "h", just ignore it.
Please find below more about the job position and the background behind it.
You'll also find more details about what he's expected to do, how the
hiring process took place, and so on, on the office wiki.
The Wikimedia Foundation is the non-profit organization that operates
Wikipedia, the free encyclopedia. Our commitment: Imagine a world in which
every single human being can freely share in the sum of all
knowledge. According to comScore Media Metrix, Wikipedia and the other
projects operated by the Wikimedia Foundation receive more than 482 million
unique visitors per month, making them the fifth-most popular web property
world-wide (comScore, January 2012). Available in 282 languages, Wikipedia
contains more than 21 million articles contributed by a global volunteer
community of more than 100,000 people. Based in San Francisco, California,
the Wikimedia Foundation is an audited, 501(c)(3) charity that is funded
primarily through donations and grants. The Wikimedia Foundation was
created in 2003 to manage the operation of Wikipedia and its sister
projects. It currently employs 150 staff members. Wikimedia works with
local chapter organizations in 39 countries or regions to advance the
mission of the Wikimedia movement.
How can we experiment with, test, and track solutions and best practices within
the editing community around the social norms, policies, and initiatives
that will create renewed openness and promote general community health? How
can we better fulfill the mission statement of the Foundation and attract
more editors, especially from under-represented demographics? What types of
initiatives best fit the improvements to be made, and how can we make sure
they are reaching our expected goals? What new tools should we develop
for the community to make their editing experience better?
The need for this position emerged from a debate with the Brazilian
community regarding WMF activities in Brazil.
While we had previously planned to hire a community organizer to catalyse
processes which had been stagnant in the Portuguese Wikimedia projects, the
community demanded, as a priority, a position to provide support in collecting
data and enabling data-driven decisions.
The Portuguese Wikipedia has seen a significant decrease in active
editors over the last two years: while the average number of active editors was
1,678 per month in 2010, in 2011 this average was 1,588 – a decrease of
5.4%. The Brazil Catalyst Project was created to help build a
better environment so that recovering these losses, and even growing, becomes possible.
The Brazil Catalyst Program contractors, together with the community, have
worked on a plan of activities that would help improve the
Portuguese Wikipedia with regard to collaboration, attracting new editors,
retaining new editors, and improving the quality of content. However, the
Brazil team has very little data to 1) measure the impact of its
projects and work; 2) identify pros and cons of experiments and changes on
the Portuguese Wikipedia; and 3) lead rational, data-driven discussions on
the impact of projects and policies.
For this reason, we agreed on hiring a data and experiment analyst, in
order to provide us with qualified information and data to develop projects
and address changes.
What will be Henrique's main goals and duties?
We expect Henrique to support the community and the Brazil Catalyst Program
contractors in tracking results of our actions, projects and experiments,
so we can better analyse and learn from them. Sometimes we develop
activities with little effort to measure their impact and results in the
short and medium term, and we also struggle to gather data that could contribute
to long-term results analyses.
The purpose of his job is therefore to work closely with the community, the
WMF staff and the Brazil Catalyst Program contractors in creating ways of
measuring the impact of our work and experiments, as well as identifying
trends within the community and editor engagement. He is meant to do that
by turning ideas from the community into some kind of reasonable
experimental design and by replacing anecdotes on the impact of feature or
policy changes with basic empirical evidence. We also expect that he'll engage
with the community (and with volunteers already engaged in data analyses)
in order to build a plan for the coming years and deploy it in a
collaborative way.
I've written up a little overview of the current and proposed Kraken architecture.
This is just an overview, so it does not try to explain how each of the individual pieces (Hadoop, Kafka, etc.) work. Comments and questions please!
I'm forwarding this to the Analytics list, which is a better place to discuss it.
-------- Original Message --------
Subject: [Wikitech-l] Page view stats we can believe in
Date: Wed, 13 Feb 2013 22:18:44 +0100
From: Lars Aronsson <lars(a)aronsson.se>
Reply-To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
I stumbled on the Danish Wiktionary, of all projects.
Danish is the 68th biggest language of Wiktionary, and
has a little more than 8,000 articles in total.
Most of these articles are very short and provide no
value to a reader. There is no reason to link to them,
so it is very unlikely that the next user to stumble
upon them is anyone other than me.
Yet, wikistats tries to make me believe that this tiny
project has 400,000 or 500,000 page views each month,
and has had for a long time.
(I'm not talking about January 2012, which seems to have
been an error, and reports 2-3 times that many views.)
My guess is that da.wiktionary has 4,000 page views per
month, not 400,000. It's more likely that 400,000 is
some background noise, an offset number that should be
subtracted from the number of page views for any project.
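One quick way to sanity-check this offset idea (purely illustrative, with invented numbers) would be to regress reported views on some proxy for real activity across several tiny projects and look at the intercept:

    # Invented placeholder numbers for a handful of tiny wikis; if reported
    # views were roughly real views plus a shared offset, the intercept of
    # this fit would estimate that background offset.
    reported_views <- c(450000, 430000, 520000, 480000)
    active_editors <- c(4, 3, 8, 5)
    fit <- lm(reported_views ~ active_editors)
    coef(fit)["(Intercept)"]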
If you look at the log files for just one day, you should
see my IP address (85.228.something) and 3-4 other users
who have been editing lately, and not many more people,
but perhaps a bunch of interwiki bots.
We need an explanation for these vastly inflated page view numbers.
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Victor is releasing a video tomorrow for Valentine's Day, and whilst I was
discussing it with him, the topic of "how many users actually watch our
videos" came up. Do we currently have a way of collecting play/click stats
for content played by the TimedMediaHandler off of Commons?
If not, does anyone have any ideas on how to go about getting this data?
Fundraising Technology Team
After having spent some time reviewing the analytics github repo,
playing observer to the quarterly review last December, and witnessing today's
security/architecture mixup, I have a few opinions and suggestions that I'd
like to share. They may upset some or step on toes. Sorry about that.
Main suggestion - all logging, etl, storage, and compute infrastructure
should be owned, implemented, and maintained by the operations team. There
should be a clear set of deliverables for ops: the entirety of the current
udp stream ingested, processed via an extensible etl layer with a minimum
of IP anonymization in place, and stored in hdfs in a standardized format
with logical access controls. Technology and implementation choices should
ultimately rest with ops so long as all deliverables are met, though
external advice and assistance (including from industry experts outside of
wmf) will be welcome and solicited.
The analytics team owns everything above this. Is pig the best tool to
analyze log data in hdfs? Does hive make sense for some things? Want to
add and analyze wiki revisions via map reduce jobs? Visualize everything
imaginable? Add more sophisticated transforms to the etl pipeline? Go for it.
I see the work accomplished to date under the heading of kraken as falling
into three categories:
1) Data querying. This includes pig integration, repeatable queries run
via pig, and ad hoc map reduce jobs written by folks like Diederik to
analyze data. While modifications may be needed if there are changes to
how data is stored in hdfs (such as file name conventions or format) or to
access controls, this category of work isn't tied to infrastructure details
and should be reusable on any generic hadoop implementation containing wmf
data.
2) Devops work. This includes everything Andrew Otto has done to puppetize
various pieces of the existing infrastructure. I'd consider all of this
experimental. Some might be reusable, some may need refactoring, some
should be chalked up as a learning exercise and abandoned. Even if the
majority were to fall under that last category, this has undoubtedly been a
valuable learning experience. Were Andrew to join the ops team and
collaborate with others on a from-scratch implementation (let's say I'd
prefer us using the beta branch of actual apache hadoop instead of
cloudera), I'm sure the experience he's gained to date will be of use to
everyone.
3) Bound for mordor. Never happened, never to be spoken of again. This
includes things like the map reduce job executed via cron to transfer data
from kafka to hdfs, and... oh wait, never happened, never to be spoken of again.
Unless I'm missing anything major, I don't see any reasons not to pursue
this new approach, nor does it appear that any significant amount of work
would be lost. Instead, the most useful bits (category 1) should still be
useful. And since that seems to be where analytics has been most
successful, perhaps it makes sense to let them focus fully on this sort of
thing instead of infrastructure.
Hi everybody, here's a quick summary of notes / take-aways from the
Analytics (Kraken) security review meeting.
* analytics1001 has been wiped and reimaged (restoring /home from backup)
* All proxies and externally-facing services have been disabled.
* Work is under way to bring everything that was puppetized under the
analytics1001 puppetmaster into operations-puppet after proper review.
Andrew is working closely with a number of people in ops to make this
happen.
normal code review. Other than performance testing, these puppet confs will
be tested in labs.
* The rest of the cluster will be wiped and reimaged out of puppet; data in
HDFS will be preserved. This can be a rolling process allowing work to
proceed while it's under way.
* Schedule an Architectural Review meeting sometime during the SF Ops
hackathon, including a look at additional services and auth methods that
provide access to internal dashboards like Hue and such.
* Ensure all current "application" code (stuff written by WMF) gets reviewed:
** Cron doing HDFS import from Kafka
** Pig UDFs and other data tools used in processing
** Future: Storm ETL layer
We all agreed the overall goal is to get to an acceptable security state.
During that process, the Analytics team still needs to continue to meet
stakeholder needs and deliver on promises. We decided on keeping a
"minimum viable cluster" running while reimaging boxes and civilizing the
cluster:
* Wall off some portion of the boxes to continue receiving data and running
jobs; all other boxes can be wiped (preserving HDFS partitions). Boxes
would be incrementally removed from the "unsanitary" cluster, reimaged, and
then added to the "sanitary" cluster. Stupid bathroom-related jokes to be
expected.
* Team Analytics to enumerate data processing jobs that will be running in
the intermediate period; their configurations and tooling will be reviewed.
* Analytics and Ops engineers continue to have shell access. Jobs can be
submitted and managed using the CLI tools; internal dashboards can be
accessed via SSH tunnelling. Analysts working on the cluster will be
approved for shell access on a case-by-case basis (afaik, just Evan Rosen
(full-time analyst for Grantmaking & Programs), and Stefan Petrea
(contractor for Analytics)). If more analysts desire access in the interim,
we can work it out on a case-by-case basis.
* No public, external access to any box in either zone (including proxied
dashboards like Hue, or even static files) that hasn't gone through review.
* Analytics and Ops will work together to find a simple, acceptable
mechanism for data export.
== Next Steps ==
* Analytics puppet manifests fully reviewed and merged into master
** Andrew to come pow-wow before the SF Ops hackathon and buddy it up with
ops to plow through some of this.
* Schedule Architecture Review
* Rolling reimaging of all analytics boxes (including hadoop data nodes but
preserving data) implementing this "minimal viable cluster" plan.
Questions very welcome! It's entirely possible I've missed things or
gotten details wrong.
Any R nerds with Stat1 access: Stat1 now has the RMySQL package installed.
What this means is the following: if you have something from the db you
want to analyse, there's no need to faff around on the command line or
inside MySQL to manually export it as a file, import it into R, analyse it,
and then leave a big juicy TSV around for people to poke at when they're
bored. You can run SQL queries against the slaves from inside R, and have
these queries and associated data fall off and die as soon as you close the
session or run rm(). It's also good for replicability, since people will
now (largely) be able to retrieve the same data and run the same analysis
using a single script. Thanks to Jeff Green and Andrew Otto for getting it
up and running :).
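A minimal sketch of that workflow (the host, database name, and query below are placeholders, not real cluster names):

    # Placeholders throughout: adjust host/dbname/credentials to the real slaves.
    library(RMySQL)
    con <- dbConnect(MySQL(), dbname = "enwiki",
                     host = "db-slave.example",
                     default.file = "~/.my.cnf")   # read credentials from an option file
    edits <- dbGetQuery(con, "SELECT rev_user_text, COUNT(*) AS edits
                              FROM revision GROUP BY rev_user_text
                              ORDER BY edits DESC LIMIT 10")
    dbDisconnect(con)
    # ...analyse `edits` here...
    rm(edits)   # and the data is gone when you clean up or close the session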
Community Liaison, Product Development
Today we made a very small change to the logging format of the Nginx
servers. We replaced the string "FAKE_CACHE_STATUS" with a dash (-). The
reason is to save disk space and reduce network traffic.
If you analyze Nginx server logs and you use a regular expression that
matches FAKE_CACHE_STATUS then you will need to update your regular
expression. But I don't expect anybody to be affected by this.
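For example, a pattern like the hypothetical one below (the surrounding fields are invented; only the cache-status alternation reflects the actual change) would need to accept both the old token and the new dash:

    # Hypothetical before/after for a log-parsing regex affected by the change.
    old_pattern <- "^(\\S+) (\\S+) FAKE_CACHE_STATUS (\\S+)$"
    new_pattern <- "^(\\S+) (\\S+) (?:FAKE_CACHE_STATUS|-) (\\S+)$"  # matches old and new logs
    grepl(new_pattern, "10.0.0.1 GET - /wiki/Main_Page", perl = TRUE)  # TRUE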