Analytics February 2013

analytics@lists.wikimedia.org

33 participants
26 discussions

comScore Stats / Information on Project Sessions
by Matthew Walker 19 Feb '13

19 Feb '13

Hey all, I'm looking for a method to determine the parameters of the distribution of page views per visit. I would also love to know the distribution for the length of time between visits. Does anyone know of any studies already done on this topic? Google is not my friend today -- I haven't yet found anything. If the data doesn't exist, the best I think I can have is the average number of page views per visit. I have a problem though: the comScore numbers available at http://reportcard.wmflabs.org/ is broken out by region; not by site. Using this data I'll only be able to get the average for all our properties worldwide -- which is a little bit rough. Does anyone have access to the raw data? If so -- does it tell us the number of uniques per site, or is it really only by region? Does anyone have any better ideas? Thanks! -- Context -- I'm trying to model some fundraising data to solve the optimal banner distribution problem (effectively what's the best way to show people banners) . Our data on the 'number of banner impressions till donation' indicates that people are far more likely to donate on the first banner impression. However, this decays over time. My hypothesis is that it's there is no difference between showing a user only one banner per visit over multiple visits and showing multiple banners 100% of the time. If this hypothesis is true; it will lead into fundraising developing a banner display function that will solve the following problem statement: "show P percent of all unique visitors, under time T, N banners with M banners displayed per session". ~Matt Walker Wikimedia Foundation Fundraising Technology Team

2 1

Re: [Analytics] [Wikitech-l] Lua rollout to en.wikipedia.org and a few others
by Ori Livneh 18 Feb '13

18 Feb '13

On Sunday, February 17, 2013 at 10:56 PM, Tim Starling wrote: > On 16/02/13 07:55, Steven Walling wrote: > > I didn't see it in the docs above, so thought I'd ask... Is this going > > to include rollout of the CodeEditor extension, or will that be done > > separately? > > > > > CodeEditor will be enabled, but with $wgCodeEditorEnableCore = false, > i.e. in the Module namespace only, not in the MediaWiki namespace. > This is the same way we deployed it to mediawiki.org (http://mediawiki.org) > > As I said to Ori when he asked me about this: I'm fine with it being > deployed with $wgCodeEditorEnableCore = true, I just don't want to > have to project manage it, since large JS apps are not the sort of > thing I usually do. With $wgCodeEditorEnableCore = true, it's a fairly > disruptive extension, so it would be good to have someone handling > community notifications and bug reports. > > I'm interested in seeing this through (= enabling CodeEditor on MediaWiki NS), but I need a bit more time with the extension first so I can size up how much ongoing work it will require. There's already a bug asking for it to be deployed: https://bugzilla.wikimedia.org/show_bug.cgi?id=39654. I propose we (meaning anyone interested in this deployment, Tim exempted per above) track progress there. > > > This is exciting! Do we have plans for further measurement when it > > comes to Lua's impact on page load times/publishing any results so > > far? In addition to the general benefit of not having to program using > > wikitext/parser functions, I seem to remember the performance > > improvements being the big selling point of Scribunto. > > > > > It will be possible to gather some retrospective data from slow-parse.log. To expand a little: slow-parse.log is a log file on fluorine (accessible to users with shell; the file is in /a/mw-log) that gets an entry every time an article takes more than three seconds to render. Each entry looks like this: 2013-02-18 12:55:18 mw1058 enwiki: 4.25 War_of_the_First_Coalition The fields are (from left to right) current date, current time, host, wiki, rendering time, title. We have six months' worth of logs, broken down by calendar day, in /a/mw-log/archive. The oldest is 2012-08-22. (we may have older files on tape backup). Log files are about 14-15 MB, gzipped. Six months' worth is 2.4 GB. If this information were made more visible, it could give editors a palpable sense of accomplishment as expensive templates are ported to lua. I don't think the logs contain any sensitive data, so it should be doable to set up an rsync job to sync them to labs and thus make them available for people to analyze and visualize. Would anyone be interested in that? (I'm CCing the analytics list as well.) As Rob notes, deployment of lua is not by itself expected to have an impact on rendering time; rather, it is the porting of templates to use it that will speed things up. The full picture will only emerge in the weeks / months + following deployment. Anyways, I'm pretty excited about this. It's a big change. Congrats to all involved. -Ori

3 2

Fwd: [Wmfall] Welcome Henrique Andrade! (our new data consultant for the Brazil Catalyst Program)
by Jessie Wild 15 Feb '13

15 Feb '13

FYI - new relevant member working on data-type stuff :) ---------- Forwarded message ---------- From: Oona Castro <ocastro(a)wikimedia.org> Date: Fri, Feb 15, 2013 at 6:09 AM Subject: [Wmfall] Welcome Henrique Andrade! (our new data consultant for the Brazil Catalyst Program) To: "Staff (All)" <wmfall(a)lists.wikimedia.org> Dear all, I'm glad to announce our new hire in Brazil, Henrique Andrade, who will be a Data and Experiments consultant. The position was created to 1) better track/measure the impact and results of our activities in Brazil; and also to 2) support the community with data so every change on policies or engagement campaigns are carefully measured, so all can make more data driven decisions. Henrique Andrade is an Information Technology (IT) professional and teacher of Web Technologies, Distance Learning Skills, Entrepreneurship and Digital Culture for graduate and undergraduate studants, and a researcher who is interested in free software, free culture, distance learning, entrepreneurship and databases. He holds a bachelor's degree in Information Systems degree in UNIRIO (Federal University of the State of Rio de Janeiro) and is a Masters's student in Computing and Society in the Computing and Systems Engineering Program, COPPE/UFRJ (Federal University of Rio de Janeiro). He has also taken undergratuate course in Business at UFRRJ (Federal University of the Countryside of Rio de Janeiro). Henrique has been the DBA (DataBase Administrator) at UNIRIO and the Lead Developer of e-UNI (eletronic UNIversity), a Learning Management System customized for universities developed over the free software Moodle platform. He is a registered user of Wikipedia since 2009, and is familiar with wikis since 2003, using TWiki in nationwide projects. Henrique has also a lot of experience in public speaking, and has spoken in many major Brazilian IT conferences, such as FISL, LATINOWARE, CONSEGI, ENTI and CSBC. He is been part of Brazillians free software communities, such as PSL-Brasil and SLRJ and was a member of the Free Software Implementation Technical Committee of the Brazilian Federal Government. Since life is not only made of working hours, Henrique also likes to brew his own beer (and drink it, I guess!) and plays basketball at the Campo Grande Athletic Club amatour team (we're not going into details on physical profile here, but I can assure that thanks to him we considerably increased the average height of the Brazilian team). I wish Henrique very good luck on his job. Let's welcome him and I hope we can all work together on the challenges we'll face to improve our data analyses capacities. He went through a long hiring process and I warmly welcome him and wish him a great time with us and a lot of collaboration across teams. About his name pronnounciation (as many asked us): "Enricke" - stress on "i" and you don't need to pronounce the "h", just ignore it. Please find below more about the job position and the background behind it. You'll also find more details about what he's expected to do, how the process took place and so on in the office Wiki https://office.wikimedia.org/wiki/User:Ocastro/Hirings/Henrique_Andrade Regards Oona Background The Wikimedia Foundation is the non-profit organization that operates Wikipedia, the free encyclopedia. Our commitment: Imagine a world in which every single human being can freely share in the sum of all knowledge. According to comScore Media Metrix, Wikipedia and the other projects operated by the Wikimedia Foundation receive more than 482 million unique visitors per month, making them the fifth-most popular web property world-wide (comScore, January 2012). Available in 282 languages, Wikipedia contains more than 21 million articles contributed by a global volunteer community of more than 100,000 people. Based in San Francisco, California, the Wikimedia Foundation is an audited, 501(c)(3) charity that is funded primarily through donations and grants. The Wikimedia Foundation was created in 2003 to manage the operation of Wikipedia and its sister projects. It currently employs 150 staff members. Wikimedia works with local chapter organizations in 39 countries or regions to advance the mission of the Wikimedia movement. How can we experiment, test and track solutions and best practices among the editing community about the social norms, policies, and initiatives that will create renewed openness and promote general community health? How can we better fulfill the mission statement of the Foundation and attract more editors, especially from under-represented demographics? What type of initiative better fits improvements to be made and how can we make sure they are reaching our expected goals? What are and how to develop new tools for the community to make their editing experience better? The need for this position emerged from a debate with the Brazilian community regarding WMF activities in Brazil<https://meta.wikimedia.org/wiki/Talk:Programa_Catalisador_do_Brasil/Planeja…>. While we had previously planned to have a community organizer, to catalyse processes which had been stagnant in the Portuguese Wikimedia projects, the community demanded as a priority position to provide support in collecting data and allowing data driven decisions to be taken. The Portuguese Wikipedia has been seen significant decrease of active editors in the last 2 years – While the average of active editors was of 1678 per month in 2010, in 2011 this average was of 1588 – a decrease of 5,4%. The Brazil Catalyst Project was created to contribute enabling a better environment so recovering losses and even growing becomes possible. The Brazil Catalyst Program contractors, together with the community, have worked throughout a planning of activities that would help improvements on the Portuguese Wikipedia regarding collaboration, attracting new editors, retaining new editors as well as improving quality of content. However, the Brazil team has very much little data to 1) measure the impact of the its projects and work; 2) identify pros and cons on experiments and changes on the Portuguese Wikpedia; 3) lead rational and data driven discussions on the impact of projects and policies. For this reason, we agreed on hiring a data and experiment analyst, in order to provide us with qualified information and data to develop projects and address changes. What will be Henrique's main goals and duties? We expect Henrique to support the community and the Brazil Catalyst Program contractors in tracking results of our actions, projects and experiments, so we can better analyse and learn from them. Sometimes we develop activities with little efforts to measure their impact and results in the short and middle term, and we also struggle with data that may contribute to long term results analyses. The purpose of his job is therefore to work closely with the community, the WMF staff and the Brazil Catalyst Program contractors in creating ways of measuring the impact or our work and experiments, as well as identifying trends within the community and editor engagement. He is meant to do that by turning ideas from the community into some kind of reasonable experimental design and by replacing anecdotes on the impact of feature or policy changes with basic empirical evidence. We also expect he'll engage with the community (and with volunteers already engaged in data analyses) in order to build a plan for the next years and deploy it in a collaborative way. _______________________________________________ Wmfall mailing list Wmfall(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wmfall -- *Jessie Wild Learning & Evaluation * *Wikimedia Foundation* * * Imagine a world in which every single human being can freely share in the sum of all knowledge. Help us make it a reality! Donate to Wikimedia <https://donate.wikimedia.org/>

3 3

Kraken Overview
by Andrew Otto 14 Feb '13

14 Feb '13

Hi yalls! I've written up a little overview of the current and proposed Kraken architecture. https://www.mediawiki.org/wiki/Analytics/Kraken/Overview This is just an overview, so it does not try to explain how each of the individual pieces (Hadoop, Kafka, etc.) work. Comments and questions please! Okthxbye. -Andrew O

2 1

Fwd: [Wikitech-l] Page view stats we can believe in
by Matthew Flaschen 14 Feb '13

14 Feb '13

I'm forwarding to the Analytics list, which is a better place to discuss this. Matt Flaschen -------- Original Message -------- Subject: [Wikitech-l] Page view stats we can believe in Date: Wed, 13 Feb 2013 22:18:44 +0100 From: Lars Aronsson <lars(a)aronsson.se> Reply-To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org> I stumbled on the Danish Wiktionary, of all projects. Danish is the 68th biggest language of Wiktionary, and has a little more than 8,000 articles in total. Most of these articles are very short and provide no value to a reader. There is no reason to link to them, and so very unlikely that the next user should stumble upon them unless they are me. Yet, wikistats tries to make be believe that this tiny project has 400,000 or 500,000 page views each month, and has had so for a long time, http://stats.wikimedia.org/wiktionary/EN/TablesPageViewsMonthly.htm (I'm not talking about January 2012, which seems to have been an error, and reports 2-3 times that many views.) My guess is that da.wiktionary has 4,000 page views per month, not 400,000. It's more likely that 400,000 is some background noise, an offset number that should be subtracted from the number of page views for any project. If you look at the log files for just one day, you should see my IP address (85.228.something) and 3-4 other users who have been editing lately, and not many more people, but perhaps a bunch of interwiki bots. We need an explanation to these vastly inflated page view statistics. -- Lars Aronsson (lars(a)aronsson.se) Aronsson Datateknik - http://aronsson.se _______________________________________________ Wikitech-l mailing list Wikitech-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

5 8

'Page View' Stats for the Timed Media Handler
by Matthew Walker 14 Feb '13

14 Feb '13

Greetings all, Victor is releasing a video tomorrow for valentines day and whilst I was discussing it with him, the topic of "how many users actually watch our videos" came up. Do we currently have a way of collecting play/click stats for content played by the TimedMediaHandler off of commons? If not, does anyone have any ideas on how to go about getting this information? ~Matt Walker Wikimedia Foundation Fundraising Technology Team

3 3

RFC: Analytics Reboot
by Asher Feldman 13 Feb '13

13 Feb '13

Howdy, After having spent some time reviewing the analytics github repo and playing observer to the quarterly review last December, and today's security/architecture mixup, I have a few opinions and suggestions that I'd like to share. They may upset some or step on toes. Sorry about that. Main suggestion - all logging, etl, storage, and compute infrastructure should be owned, implemented, and maintained by the operations team. There should be a clear set of deliverables for ops: the entirety of the current udp stream ingested, processed via an extensible etl layer with a minimum of IP anonymization in place, and stored in hdfs in a standardized format with logical access controls. Technology and implementation choices should ultimately rest with ops so long as all deliverables are met, though external advice and assistance (including from industry experts outside of wmf) will be welcome and solicited. The analytics team owns everything above this. Is pig the best tool to analyze log data in hdfs? Does hive make sense for some things? Want to add and analyze wiki revisions via map reduce jobs? Visualize everything imaginable? Add more sophisticated transforms to the etl pipeline? Go, go, analytics! I see the work accomplished to date under the heading of kraken as falling into three categories: 1) Data querying. This includes pig integration, repeatable queries run via pig, and ad hoc map reduce jobs meant to analyze data written by folks like Diederik. While modifications may be needed if there are changes to how data is stored in hdfs (such as file name conventions or format) or to access controls, this category of work isn't tied to infrastructure details and should be reusable on any generic hadoop implementation containing wmf log data. 2) Devops work. This includes everything Andrew Otto has done to puppetize various pieces of the existing infrastructure. I'd consider all of this experimental. Some might be reusable, some may need refactoring, some should be chalked up as a learning exercise and abandoned. Even if the majority was to fall under that last category, this has undoubtedly been a valuable learning experience. Were Andrew to join the ops team and collaborate with others on a from scratch implementation (let's say I'd prefer us using the beta branch of actual apache hadoop instead of cloudera), I'm sure the experience he's gained to date will be of use to all. 3) Bound for mordor. Never happened, never to be spoken of again. This includes things like the map reduce job executed via cron to transfer data from kafka to hdfs, and... oh wait, never happened, never to be spoken of again. Unless I'm missing anything major, I don't see any reasons not to pursue this new approach, nor does it appear that any significant amount of work would be lost. Instead, the most useful bits (category 1) should still be useful. And since that seems to be where analytics has been most successful, perhaps it makes sense to let them focus fully on this sort of thing instead of infrastructure. -Asher

3 3

Analytics Security Review Meeting Notes
by David Schoonover 13 Feb '13

13 Feb '13

Hi everybody, here's a quick summary of notes / take-aways from the Analytics (Kraken) security review meeting. * analytics1001 has been wiped and reimaged (restoring /home from backup) * All proxies and externally-facing services have been disabled. * Work is under way to bring everything that was puppetized under the analytics1001 puppetmaster into operations-puppet after proper review. Andrew is working closely with a number of people in ops to make this happen. * All future deployments to the cluster will be puppetized and go through normal code review. Other than performance testing, these puppet confs will be tested in labs. * The rest of the cluster will be wiped and reimaged out of puppet; data in HDFS will be preserved. This can be a rolling process allowing work to proceed while its under way. * Schedule an Architectural Review meeting sometime during the SF Ops hackathon, including a look at additional services and auth methods that provide access to internal dashboards like Hue &such. * Ensure all current "application" code (stuff written by WMF) gets reviewed: ** Cron doing HDFS import from Kafka ** Pig UDFs and other data tools used in processing ** Future: Storm ETL layer We all agreed the overall goal is to get to an acceptable security state. During that process, the Analytics team still needs to continue to meet stakeholder needs and deliver on promises. We decided on keeping running a "minimum viable cluster" while reimaging boxes and civilizing cluster configuration: * Wall off some portion of the boxes to continue recieving data and running jobs; all other boxes can be wiped (preserving HDFS partitions). Boxes would be incrementally removed from the "unsanitary" cluster, reimaged, and then added to the "sanitary" cluster. Stupid bathroom-related jokes to be avoided. * Team Analytics to enumerate data processing jobs that will be running in the intermediate period; their configurations and tooling will be reviewed. * Analytics and Ops engineers continue to have shell access. Jobs can be submitted and managed using the CLI tools; internal dashboards can be accessed via SSH tunnelling. Analysts working on the cluster will be approved for shell access on a case-by-case basis (afaik, just Evan Rosen (full-time analyst for Grantmaking & Programs), and Stefan Petrea (contractor for Analytics)). If more analysts desire access in the interim, we can work it out in a case-by-case basis. * No public, external access of any box in either zone (including proxied, dashboards like Hue, or even static files) that hasn't gone through review. * Analytics and Ops will work together to find a simple, acceptable mechanism for data export. == Next Steps == * Analytics puppet manfiests fully reviewed and merged into master operations-puppet repository ** Andrew to come pow-wow before the SF Ops hackathon and buddy it up with ops to plow through some of this. * Schedule Architecture Review * Rolling reimaging of all analytics boxes (including hadoop data nodes but preserving data) implementing this "minimal viable cluster" plan. Questions very welcome! It's entirely possible I've missed things or mistranslated them. Cheers, Team Analytics -- David Schoonover dsc(a)wikimedia.org

1 0

Attention, R nerds
by Oliver Keyes 12 Feb '13

12 Feb '13

Any R nerds with Stat1 access: Stat1 now has the RMySQL package installed. What this means is the following; if you have something from the db you want to analyse, there's no need to faff around on the command line or inside MySQL to manually export it as a file, import it into R, analyse it, and then leave a big juicy TSV around for people to poke at when they're bored. You can run SQL queries against the slaves from inside R, and have these queries and associated data fall off and die as soon as you close the session or run rm(). It's also good for replicability, since people will now (largely) be able to retrieve the same data and run the same analysis using a single script. Thanks to Jeff Green and Andrew Otto for getting it up and running :). -- Oliver Keyes Community Liaison, Product Development Wikimedia Foundation

5 7

Small change to Nginx log format
by Diederik van Liere 12 Feb '13

12 Feb '13

Heya, Today we made a very small change to the logging format of the Nginx servers. We replaced the string "FAKE_CACHE_STATUS" with a dash (-). The reason is to save diskspace and reduce network traffic. If you analyze Nginx server logs and you use a regular expression that matches FAKE_CACHE_STATUS then you will need to update your regular expression. But I don't expect anybody to be affected by this. Best, Diederik

1 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Analytics February 2013