Hi,
for the sprint 2014-07-10–2014-07-22 the dev team committed to
the following features:
+--------+--------+-------------+----------------------------------+
| Bug Nr | Points | Component   | Bug Summary                      |
+--------+--------+-------------+----------------------------------+
| 67172  | Spike  | EEVS        | AnalyticsEng decide on stack     |
|        |        |             | for EEVS dashboard (60 hours)    |
| 67128  | 34     | Refinery    | Story: Admin has duplicate       |
|        |        |             | monitoring in Icinga             |
| 67129  | 8      | Refinery    | Story: Admin has versioned and   |
|        |        |             | sync'ed files in HDFS            |
| 67458  | 13     | Wikimetrics | Story: a WikimetricsUser runs    |
|        |        |             | 'Rolling Monthly Active Editors' |
|        |        |             | report                           |
+--------+--------+-------------+----------------------------------+
Points Total: 55 + Spike (60 hours)
Points per Component:
(Spike) EEVS
42 Refinery
13 Wikimetrics
You can follow the current sprint at:
http://sb.wmflabs.org/t/analytics-developers/2014-07-10/
Have fun,
Christian
P.S.: Bugs 67128 and 67129 have been carried over from the previous
sprint. 67129 has since been done, and 67128 saw a requirements
change and got bumped from 5 to 34 points. I'll add details to the
commitment email from last sprint and to bug 67128.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email: christian(a)quelltextlich.at
4293 Gutau, Austria          Phone: +43 7946 / 20 5 81
                             Fax: +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hi,
I am working on adding MediaViewer opt-out result tracking to
multimedia-metrics.wmflabs.org (which maps to limn1). Opt-out data is
stored in the MediaWiki databases, but only as the current state, so I have
to store the daily results somewhere to be able to show a timechart. I'm
asking for advice on the best way to do that.
The two obvious approaches are:
- store the results in mysql on the same server that holds the wiki db
(analytics-store.eqiad.wmnet)
- store them in mysql locally, on the limn1 instance
The first seems easier to me, since the second would mean transferring data
between different DB servers, which is awkward in MySQL; but I don't know
the setup of limn1 and analytics-store well. Is there any reason to take
the other route (or some third way)? If not, what's the way to get a new DB
created on analytics-store where I can store the results?
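For context, the daily-snapshot pattern described above can be sketched as
follows. This is a minimal illustration using sqlite3 as a stand-in for the
actual MariaDB servers; the table and column names are hypothetical, not the
real MediaWiki schema:

```python
import sqlite3

# Stand-in for the wiki database (real data lives on analytics-store.eqiad.wmnet).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table: current opt-out state only,
    -- one row per opted-out user (not the real MediaWiki schema).
    CREATE TABLE user_properties (user_id INTEGER, prop TEXT);
    -- Destination: one row per day, so a timechart can be drawn later.
    CREATE TABLE mediaviewer_optout_daily (day TEXT PRIMARY KEY,
                                           optout_count INTEGER);
""")
conn.executemany("INSERT INTO user_properties VALUES (?, ?)",
                 [(1, "mediaviewer-disabled"), (2, "mediaviewer-disabled")])

def snapshot(conn, day):
    """Record the current opt-out count for `day`; run once daily, e.g. from cron."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM user_properties WHERE prop = 'mediaviewer-disabled'"
    ).fetchone()
    conn.execute("INSERT OR REPLACE INTO mediaviewer_optout_daily VALUES (?, ?)",
                 (day, count))
    return count

print(snapshot(conn, "2014-07-10"))  # prints 2 with the sample rows above
```

The same pattern works whichever server holds the destination table; keeping
it next to the source data just avoids the cross-server transfer.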
thanks
Gergő
Hello,
We have restored "per schema" monitoring for Event Logging in graphite.
Users of the Event Logging system can use the schema monitoring to see how
big (or small) their usage of EventLogging is compared to the total
throughput of events.
See for example the overall rate of incoming events versus some Mobile and
MediaViewer events:
http://tinyurl.com/nwdx7w7
The main graphite instance can be accessed here:
https://graphite.wikimedia.org
Thanks,
Nuria
FYI, I just managed to crash the TokuDB storage engine on analytics-store
MariaDB while replicating some schema changes for mediawiki.
It's busy recovering from the transaction log now; probably an hour or so
of delay for that, plus catching up on replication lag. s1-analytics-slave
is unaffected if you need emergency access to m2 (eventlogging), s1, or s2.
Luckily we have a stack trace and a candidate upstream bug fix for next
time.
Sean
--
DBA @ WMF
Hi Pine,
Many thanks for this. I've had some trouble with wifi while traveling, so
will probably be able to send you the final responses as soon as I'm back
in SF (I'm in Frankfurt en route to SF). Apologies for the delay!
Hope you're well,
Anasuya
Hello analytics,
My last round of reducing our EL consumption was right after the launch of
Media Viewer to all wikis. I see now that the per-schema graphite stats are
back:
http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1404769845…
Is that a reasonable amount of EventLogging usage for MediaViewer? Or
should we do another pass of reducing our usage?
Forwarding to Analytics in case anyone there is interested. Please discuss
on the Research list.
Thanks,
Pine
On Sun, Jul 6, 2014 at 6:21 AM, Anders Wennersten <mail(a)anderswennersten.se>
wrote:
> A standard on measurement quality levels on articles would be excellent
> and enable much better comparisons between language versions.
>
> I give some ideas of quality levels below, but I also want to stress that
> I believe quality is also related to coverage. En-wp has the most
> 100%-quality articles in many subject areas, like films and albums, but it
> has low coverage of poets whose work is not available in English; worse
> than de-wp, for example. How do we evaluate something like that?
>
> My intuitive quality levels for articles are:
> -1 - Unacceptable quality
>     Machine-translated articles, vandal-infested articles, severe POV
>     content, articles shorter than 300 characters with no sources, etc. No
>     bot should be allowed to generate such lousy articles. They ought all
>     to be deleted, and I would expect there to be no articles at all of
>     this inferior quality on the bigger versions.
> 0 - Missing articles that ought to exist
> 1 - Rudimentary articles
>     Articles with proper sources, categories, and infoboxes but short in
>     substance, or articles with proper substance but missing appropriate
>     sources. Most proper bot-generated articles fall in this level.
> 2 - OK articles
>     Have both proper substance and sources, but are not complete and do
>     not cover all aspects of the subject. A few bot-generated articles
>     fall in this level.
> 3 - Good articles
>     Cover the subject.
>
> For each of these levels it should be possible to develop detailed
> criteria that would enable us to machine-read articles and classify them
> by quality level as above.
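As a rough illustration, such machine-readable criteria might take the shape
of a simple classifier like the one below. The thresholds and feature names
are invented for the sketch, not the detailed criteria called for above; level
0 (a missing article) cannot apply to an existing page and is omitted:

```python
def quality_level(chars, has_sources, has_categories, has_infobox,
                  covers_all_aspects):
    """Map simple article features onto the intuitive -1..3 scale above.

    All thresholds here are illustrative placeholders, not agreed criteria.
    """
    if chars < 300 and not has_sources:
        return -1  # unacceptable: very short and unsourced
    if not (has_sources and has_categories and has_infobox):
        return 1   # rudimentary: sourcing or basic structure is missing
    if not covers_all_aspects:
        return 2   # OK: proper substance and sources, but incomplete
    return 3       # good: covers the subject

# A short unsourced stub vs. a complete, well-sourced article:
print(quality_level(200, False, False, False, False))  # prints -1
print(quality_level(2000, True, True, True, True))     # prints 3
```

Feeding such a function with features extracted from article wikitext would
give the cross-language comparability the thread is after.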
>
> Anders
>
> Han-Teng Liao (OII) wrote on 2014-07-06 13:29:
>
> We need overview quality-minded metrics for different language versions of
> Wikipedia. Otherwise, the current "number games" played by bots across
> certain language versions will keep distorting the direction and focus of
> editorial development. I therefore propose an altmetric of
> "do-not-spread-oneself-too-thin" to counterbalance the situation.
>
> (Sorry I was late to join the conversation "[Wiki-research-l] Quality
> on different language versions
> <http://www.mail-archive.com/wiki-research-l@lists.wikimedia.org/msg03168.ht…>".
> This is a follow-up reply and a suggestion for that discussion thread.)
>
> For example, in the Chinese Wikipedia community, there are ongoing
> discussions about the current ranking of Chinese Wikipedia in terms of
> number of articles, and how the *neighboring* versions (those that have
> similar numbers of articles) use bots to generate new articles.
>
> # The stats report generated and used by the Chinese community to
> compare itself against neighboring language versions:
> #* Link
> <http://zh.wikipedia.org/wiki/Wikipedia:%E7%BB%9F%E8%AE%A1/%E4%B8%8E%E9%82%B…>
>
> #* Google translated
> <https://translate.google.com/translate?hl=en&sl=zh-CN&tl=en&u=http%3A%2F%2F…>
>
> # One current discussion:
> #* Link
> <http://zh.wikipedia.org/wiki/Wikipedia:%E4%BA%92%E5%8A%A9%E5%AE%A2%E6%A0%88…>
> #* Google translated
> <https://translate.google.com/translate?sl=auto&tl=en&js=y&prev=_t&hl=en&ie=…>
> # One recently archived discussion:
> #* Link
> <http://zh.wikipedia.org/wiki/Wikipedia:%E4%BA%92%E5%8A%A9%E5%AE%A2%E6%A0%88…>
> #* Google translated
> <https://translate.google.com/translate?hl=en&sl=zh-CN&tl=en&u=http%3A%2F%2F…>
>
> To counterbalance such nonsensical comparison and competition, I
> personally think it is better to have an altmetric in place of the crude
> (and often distorting) measure of the number of articles.
>
> One would expect a better encyclopedia to contain a set of core articles
> of human knowledge.
>
> Indeed, Meta has a list of 1000 articles that "every Wikipedia should
> have".
> http://meta.wikimedia.org/wiki/List_of_articles_every_Wikipedia_should_have
>
> We can use this to generate a quantifiable metric of the development of
> the core articles in each language version, perhaps using the following
> numbers:
>
> * number of references (total and per article)
> * number of footnotes (total and per article)
> * number of citations (total and per article)
> * number of distinct wiki internal links to other articles
> * number of good and featured articles (judged by each language version's
> community)
>
> Based on the above numbers, it is conceivable to come up with a metric
> that measures both the depth and breadth of the quality of the core
> articles. I admit that other measurements can and should be applied, but
> still the above numbers have the following merits:
>
> * they reflect the nature of Wikipedia as dependent on other reliable
> secondary and primary information sources.
> * they can be applied across languages automatically, without the need to
> analyze texts, which would require more tools and raise issues of
> comparability.
>
> For the sake of simplicity, let us say that one language version
> (possibly English or German) has the highest score; that language version
> can then serve as the baseline for comparison. Say this benchmark
> language version has:
>
> # the quality-metric number of QUAL (from the vital 1000)
> # the quantity number of total articles QUAN (from the existing metric)
>
> Then the "do-not-spread-oneself-too-thin" quality metric can be
> calculated as:
>
> QUAL/QUAN
>
> (It can be further discussed whether logarithmic scales should be
> applied here.)
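Put concretely, the proposed ratio could be computed as follows. The wiki
scores and article counts below are made-up illustrative numbers, not real
statistics for any language version:

```python
def thin_metric(qual, quan):
    """QUAL/QUAN: quality score of the vital-1000 core articles divided by
    the total article count. A higher value means the version is less
    spread-thin relative to its size."""
    return qual / quan

# Invented numbers: a benchmark version and a bot-heavy version with many
# articles but little work on the core 1000.
benchmark = thin_metric(qual=90_000, quan=4_500_000)
candidate = thin_metric(qual=9_000, quan=3_000_000)
print(candidate < benchmark)  # prints True: the candidate is "watery"
```

Whether to apply a logarithmic scale, as suggested above, only changes how
these ratios are compared, not how each one is computed.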
>
> The gist of this "quality metric" is to redirect the obsession with the
> number of articles towards the important core articles, hoping to get more
> references, footnotes, citations, internal links, and good/featured
> articles for the core 1000. It will hopefully indicate which language
> version is too "watery", or simply spreading itself too thin with
> inconsequential short articles.
>
> Let us have a discussion here [Wiki-research-l], before we extend the
> conversation to [Wikimedia-i].
>
> Best,
> han-teng liao
>
>
> --
> han-teng liao
>
> "[O]nce the Imperial Institute of France and the Royal Society of London
> begin to work together on a new encyclopaedia, it will take less than a
> year to achieve a lasting peace between France and England." - Henri
> Saint-Simon (1810)
>
> "A common ideology based on this Permanent World Encyclopaedia is a
> possible means, to some it seems the only means, of dissolving human
> conflict into unity." - H.G. Wells (1937)
>
>
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>
>