Hi,
since the work that happens around the Analytics Cluster and on the
Ops side of Analytics is not very visible, it was suggested that we
improve visibility with a weekly write-up.
Posting it to the public list for a start, but if this is too much noise
for you, please let us know.
In the week from 2014-08-18 to 2014-08-24, Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* Hadoop worker memory limits now automatically configured
* Automatic data removal was prepared and activated for webrequest data
* Adjusting access to raw webrequest data
* Learning from data ingestion alarms
* Webstatscollector and kafka
* Distupgrade on stat1003
* Packet loss alarm on oxygen on 2014-08-16 (Bug 69663)
* Geowiki data aggregation failed on 2014-08-19 (Bug 69812)
(details below)
Have fun,
Christian
* Hadoop worker memory limits now automatically configured
Previously, each worker had the same memory limit, regardless of the
resources the worker actually had. By now allowing different memory
limits on different workers, we can better utilize each worker's
resources.
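For illustration only, a rough sketch of the idea, not the actual
puppet code (the function name and the reserved amount are assumptions):

    # Sketch: derive a per-worker YARN memory limit from the worker's own RAM,
    # instead of using one hard-coded value for every worker.
    def yarn_nodemanager_memory_mb(total_ram_mb, reserved_for_os_mb=8192):
        # Leave some RAM for the OS and Hadoop daemons, hand the rest to YARN.
        return max(total_ram_mb - reserved_for_os_mb, 1024)

    print(yarn_nodemanager_memory_mb(65536))  # 64 GiB worker -> 57344 MB for containers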
* Automatic data removal was prepared and activated for webrequest data
Kraken's setup to remove raw webrequest data after a given number of
days (currently: 31) was brought over to refinery and turned on.
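Roughly speaking, the pruning boils down to something like the
following sketch (the HDFS base path and the daily partition layout
are assumptions on my part, not the actual refinery job):

    import subprocess
    from datetime import datetime, timedelta

    RETENTION_DAYS = 31                    # current retention from above
    BASE = '/wmf/data/raw/webrequest'      # hypothetical HDFS base path

    cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)
    listing = subprocess.check_output(['hdfs', 'dfs', '-ls', BASE]).decode()
    for line in listing.splitlines():
        path = line.split()[-1]
        try:
            # assume daily partition directories named .../YYYY-MM-DD
            day = datetime.strptime(path.rsplit('/', 1)[-1], '%Y-%m-%d')
        except ValueError:
            continue
        if day < cutoff:
            subprocess.check_call(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', path])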
* Adjusting access to raw webrequest data
In order to have proper privilege separation on the cluster, access
paths have been split into different groups.
* Learning from data ingestion alarms
With the new monitoring in place, we started to look at the alarms
and are trying to make sense of them. Monitoring seems to work fine:
the partitions that got flagged really did have issues. On the flip
side, the samples we checked that passed monitoring look valid
too. So monitoring seems effective in both directions.
Of the flagged partitions, most are due to races on varnish (Bug
69615). No log lines get lost or duplicated in such races.
There was one incident where a leader re-election caused a drop of
a few hundred log lines (Bug 69854). Leader re-election currently
may cause such hiccups, but there is already a theory about the real
root cause of such drops, and it should be fixable.
The only other issue was one hour this Saturday (Bug 69971). It
seems to affect only esams, but all four sources there. A real
investigation is still pending.
So the raw data that is flowing into the cluster is generally
good, and we're starting to iron out the glitches exposed by the
monitoring.
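For those curious how a partition gets flagged: the check essentially
compares, per source host, the number of lines received against the
range of the per-host sequence numbers. A minimal sketch of the idea
(not the actual refinery code; field names are assumptions):

    def flag_partition(records):
        # records: iterable of (hostname, sequence_number) pairs for one partition.
        per_host = {}
        for host, seq in records:
            lo, hi, count = per_host.get(host, (seq, seq, 0))
            per_host[host] = (min(lo, seq), max(hi, seq), count + 1)
        flagged = {}
        for host, (lo, hi, count) in per_host.items():
            expected = hi - lo + 1
            if count != expected:  # lines missing (count < expected) or duplicated
                flagged[host] = expected - count
        return flagged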
* Webstatscollector and kafka
We started working on making webstatscollector consume from
Kafka. It's a bit more involved than we hoped (burstiness of Kafka,
buffer receive errors, other processes blocking I/O, ...), but the
latest build and setup, which has been running since about midnight,
has worked without issues so far.
*Knocking on wood*
* Distupgrade on stat1003
stat1003 had its distribution upgraded. New shiny software for
researchers :-)
* Packet loss alarm on oxygen on 2014-08-16 (Bug 69663)
Packet loss was limited to two periods of a few minutes each. The
root cause of the issue was Bug 69661, which backfired.
* Geowiki data aggregation failed on 2014-08-19 (Bug 69812)
A database connection got dropped, which made the aggregation fail
on 2014-08-19. The root cause of the connection drop is
unknown. Nothing noteworthy happened on the database server in use,
nor on stat1003 (the dist-upgrade coincidentally took place on the
same day, but happened later in the day). Since this happened for
the first time, we're writing it off as a fluke for now.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Yes, we could look at Google's infoboxes as doing us a favor because they
decrease the load on our servers. We would need to account for those views
in some way if we are interested in quantifying success in the sense of
total views of our content regardless of where it is reproduced.
However, I think Analytics said in a WMF Metrics Meeting presentation that
the number of Google search referrals was not going down enough to explain
the drop in pageviews. I'm copying this email to Analytics in the hope that
they'll comment about the probable causes of the pageview decreases.
Pine
On Sun, Aug 24, 2014 at 6:06 PM, MZMcBride <z(a)mzmcbride.com> wrote:
> Risker wrote:
> >Given the mission is sharing information, I'd suggest that if we have a
> >95% drop in readership, we're failing the mission. Donations are only a
> >means to an end.
>
> I think this assumes a direct correlation between pageviews and sharing
> information and I'm not sure such a direct correlation exists.
>
> When you do a Google search for "abraham lincoln", there's now an infobox
> on the search results page with content from Wikipedia. This could easily
> result in a drop in the number of Wikipedia pageviews, but does that mean
> that Wikipedia is failing its mission? The goal is a world in which we
> freely share in the sum of all human knowledge. If third parties are
> picking up and re-using our free content (and they are), I think we're
> certainly not losing. We may even be winning(!).
>
> We offer bulk-download options for our content, as well as the ability to
> directly query for article content on-demand via the MediaWiki API. Both
> of these access methods very likely result in 0 pageviews being
> registered (XML dump downloads and api.php hits aren't considered
> pageviews, as far as I'm aware), but we're directly sharing content.
>
> As a metric, pageviews are probably not very meaningful. One way we can
> observe whether we're fulfilling our mission is to see how ubiquitous
> our content has become. An even better metric might be the quality of the
> articles we have. Anecdotal evidence suggests that higher article quality
> is not really tied to the readership rate, though perhaps article size is.
>
> MZMcBride
>
>
>
> _______________________________________________
> Wikimedia-l mailing list, guidelines at:
> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> Wikimedia-l(a)lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
>
Sean:
Could you explain a little bit why the following bug affects EL data
going public (for the schemas that have public data and can be made
public more easily than others)?
https://bugzilla.wikimedia.org/show_bug.cgi?id=67450
Thanks,
Nuria
Hi,
TL;DR: When consuming EventLogging data, only rely on the 'log'
database available from m2 replicas, like analytics-store.eqiad.wmnet.
Other representations might not get updated, might not get fix-ups or
may (on purpose) give you unvalidated data.
----------------------------------
Due to the versatile design of EventLogging, its data exists/existed
in many different representations, which left me confused about the
data quality expectations. I also could not find them publicly
documented. After talking about different aspects with a few people, I
wanted to put my current understanding up for public discussion.
Please let me know (either in private or on list) if something looks
wrong or does not match your use of EventLogging data.
* MySQL / MariaDB database on m2
This database is the best place to consume EventLogging data from.
Available as 'log' database on m2 replicas, such as
analytics-store.eqiad.wmnet.
Only validated events enter the database.
In case of bugs, this database is the only place that gets fixes like
cleanup of historic data, or live fixes.
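For example, a minimal sketch of consuming from a replica (assuming
pymysql and a working ~/.my.cnf; the schema table name below is purely
illustrative, not a real table):

    import os
    import pymysql

    conn = pymysql.connect(host='analytics-store.eqiad.wmnet', db='log',
                           read_default_file=os.path.expanduser('~/.my.cnf'))
    with conn.cursor() as cur:
        # 'SomeSchema_1234567' stands in for a real <Schema>_<revision> table.
        cur.execute("SELECT COUNT(*) FROM SomeSchema_1234567 WHERE timestamp >= %s",
                    ('20140818000000',))
        print(cur.fetchone()[0])
    conn.close()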
* 'all-events' JSON log files [1]
Use this data source only to debug issues around ingestion into the m2
database.
Entries are JSON objects.
Only validated events get written.
In case of bugs, historic data does not get fixed.
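A minimal sketch of reading one of these archived files (path pattern
from [1]; the date is just an example, and I assume one JSON object
per line as described above):

    import gzip
    import json

    path = '/a/eventlogging/archive/all-events.log-20140820.gz'
    counts = {}
    with gzip.open(path, 'rt') as f:
        for line in f:
            event = json.loads(line)
            schema = event.get('schema', 'unknown')
            counts[schema] = counts.get(schema, 0) + 1
    print(counts)  # events per schema for that day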
* Raw client and server side log files [2]
Use this data source only to debug issues around ingestion into the m2
database.
Entries are the parameters of the event.gif request. They are not
decoded at all.
In case of bugs, historic data does not get fixed, nor do hot-fixes
necessarily reach those files.
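If you do need to peek at these, the payload is URL-encoded JSON; a
rough sketch of decoding it (the exact surrounding line format in the
archives is not assumed here):

    import json
    from urllib.parse import unquote

    def decode_query(query_string):
        # query_string: the raw '?%7B...%7D;' parameter part of an event.gif request.
        payload = query_string.lstrip('?').rstrip(';')
        return json.loads(unquote(payload))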
* Kafka:
EventLogging data is no longer fed into Kafka since 2014-06-12 [3].
The EventLogging data in Kafka had no users.
Turning it on again is tracked in bug 66528 [4].
* MongoDB:
EventLogging data is no longer fed into MongoDB since 2014-02-13 [5].
The EventLogging data in MongoDB did not appear to get used.
I am not aware of plans to revive feeding the data into MongoDB.
* ZMQ:
ZMQ is available from vanadium.
In case of bugs, historic data cannot get fixed :-)
Data coming from the forwarders (ports 8421, 8422) is not validated
and does not necessarily see hot-fixes.
Data coming from the processors (ports 8521, 8522) and the
multiplexer (port 8600) is validated.
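A minimal subscriber sketch (assuming pyzmq and network access to
vanadium; the fully qualified host name is an assumption):

    import zmq

    context = zmq.Context()
    sock = context.socket(zmq.SUB)
    sock.connect('tcp://vanadium.eqiad.wmnet:8600')  # multiplexer port from above
    sock.setsockopt_string(zmq.SUBSCRIBE, '')        # subscribe to everything
    while True:
        print(sock.recv_string())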
Have fun,
Christian
[1] Available as
stats1002:/a/eventlogging/archive/all-events.log-$DATE.gz
stats1003:/srv/eventlogging/archive/all-events.log-$DATE.gz
vanadium:/var/log/eventlogging/...
[2] Available as
stats1002:/a/eventlogging/archive/client-side-events.log-$DATE.gz
stats1002:/a/eventlogging/archive/server-side-events.log-$DATE.gz
stats1003:/srv/eventlogging/archive/client-side-events.log-$DATE.gz
stats1003:/srv/eventlogging/archive/server-side-events.log-$DATE.gz
vanadium:/var/log/eventlogging/...
[3] https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/f85b1dbcd61bbb…
[4] https://bugzilla.wikimedia.org/show_bug.cgi?id=66528
[5] https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/05b4027973c59b…
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hey Dan G and Analytics team,
I wanted to continue and finish the discussion that happened during the
Analytics showcase earlier today.
We're implementing a new feature in Wikimetrics where you can upload a
cohort and check a box so that every user's accounts on other wikis
(projects) will be added to the cohort (using CentralAuth). The purpose is
to see if editors are active on other projects.
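For reference, the expansion roughly amounts to a lookup like the
following sketch (the host name, and the assumption that CentralAuth's
'localuser' table is reachable this way, are mine, not the actual
Wikimetrics code):

    import os
    import pymysql

    # Sketch: find all wikis on which CentralAuth has a local account
    # attached to the given global user name.
    conn = pymysql.connect(host='analytics-store.eqiad.wmnet', db='centralauth',
                           read_default_file=os.path.expanduser('~/.my.cnf'))
    with conn.cursor() as cur:
        cur.execute("SELECT lu_wiki FROM localuser WHERE lu_name = %s",
                    ('ExampleUser',))
        print([wiki for (wiki,) in cur.fetchall()])
    conn.close()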
The research scientists pointed out that there are issues with CentralAuth
and they are showing up in EventLogging (
https://bugzilla.wikimedia.org/show_bug.cgi?id=66101 ).
Let me try to sum up the issue here:
Suppose someone has an unattached account. She then goes to an
editathon and volunteers her name to be included in a cohort. The
resulting cohort, when expanded with CentralAuth, would include users
from other wikis.
Dan pointed out that it would be extremely unlikely for a cohort
expanded using CentralAuth to include unattached users.
I'm inclined not to worry about the issue and move ahead with
releasing the feature.
Please discuss if I'm missing something.
The next Research & Data showcase
<https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase> will
be live-streamed this Wednesday, 8/20 at 11:30 PT.
The streaming link will be posted on the lists a few minutes before the
showcase starts and as usual, you can join the conversation on IRC at
#wikimedia-research.
We look forward to seeing you!
Leila
This month:
*Everything You Know About Mobile Is Wr^WRight: Editing and Reading Pattern
Variation Between User Types*
By *Oliver Keyes*: Using new geolocation tools, we look at reader and
editor behaviour to understand how and when people access and contribute to
our content. This is largely exploratory research, but has potential
implications for our A/B testing and how we understand both cultural
divides between reader and editor groups from different countries, and how
we understand the differences between types of edit and the editors who
make them.
*Wikipedia article curation: understanding quality, recommending tasks*
By *Morten Warncke-Wang*: In this talk we look at article curation in
Wikipedia through the lens of task suggestions and article quality. The
first part of the talk presents SuggestBot, the Wikipedia article
recommender. SuggestBot connects contributors with articles similar to
those they previously edited. In the second part of the talk, we discuss
Wikipedia article quality using “actionable” features, features that
contributors can easily act upon to improve article quality. We will first
discuss these features’ ability to predict article quality, before coming
back to SuggestBot and show how these predictions and actionable features
can be used to improve the suggestions.
*Bio: Morten Warncke-Wang is a PhD student at the GroupLens research lab,
University of Minnesota. His main research focus is artefact quality and
task recommendations in peer production communities. On the task
recommendation side he has maintained the Wikipedia article recommender
SuggestBot (http://en.wikipedia.org/wiki/User:SuggestBot) since 2010,
expanding it to support six languages and additional information about
recommended articles. His work on artefact quality looks at understanding
quality through features contributors can easily improve, using them to
both predict Wikipedia article quality and suggest improvement tasks to
Wikipedia contributors.
You can find more information about his research on his homepage:
http://www-users.cs.umn.edu/~morten/
Hello,
Ehsan Shahghasemi, a PhD candidate in Communication, is doing
research for his dissertation on the cross-cultural schemata Americans
have of another nation. I would appreciate it if you could kindly
help him by answering his questionnaire. It doesn't take more than 4
minutes:
https://docs.google.com/forms/d/1jnbxpxZdsUkJ7237bSL3daRBDNGEkqT1s8kqUobzAG…
Thanks in advance
Hi,
the dev team has committed to the following user stories for the sprint
starting today, ending August 19.
Bug ID  Component    Summary                                                                    Points
68731   Wikimetrics  Backing up wikimetrics data fails if data is written while we back it up   5
68833   Wikimetrics  session management                                                         21
68840   EEVS         Wikimetrics can't run a lot of recurrent reports at the same time          8
67806   Wikimetrics  Story: EEVSUser loads static site in accordance to Pau's design            13
68507   Wikimetrics  replication lag may affect recurrent reports                               8

Total Points: 55
You can see the sprint here:
http://sb.wmflabs.org/t/analytics-developers/2014-08-07/
Cheers,
Kevin Leduc
Hello Everyone
I want to work on a project from the project list
<https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects>:
"Wikimedia Performance Portal
<https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Wikime…>".
It aims to present data graphically so that it represents
performance metrics about the Wikimedia cluster, and also to organize
the data so that important data doesn't get mixed with unimportant
data.
To work on it, I need access to the data, or at least some glimpse of
it and its annotations/descriptions. Where can I access them?
I am new to the FOSS world and want to work on this project because it
is related to data analytics, which has always attracted me. I am not
proficient in data analysis yet, but I want to be; working on this
project will give me good experience that leads toward that goal.
I have a good hand in Python and Java, know the basics of R, PHP, C,
C++, and JavaScript, and am also willing to learn whatever else is
needed.
I mailed the project's listed mentor about this, but unfortunately did
not get a response, probably because he is busy. So could I please
have some guidance on where to start with this project and on what it
is all about?
Thanks!!!
Shaifali Agrawal
about.me/shaifaliagrawal