Analytics June 2018

analytics@lists.wikimedia.org

16 participants
17 discussions

Beeline as Hive client
by Madhumitha Viswanathan 03 Oct '18

03 Oct '18

Hi all, For all Hive users using stat1002/1004, you might have seen a deprecation warning when you launch the hive client - that claims it's being replaced with Beeline. The Beeline shell has always been available to use, but it required supplying a database connection string every time, which was pretty annoying. We now have a wrapper <https://github.com/wikimedia/operations-puppet/blob/production/modules/role…> script setup to make this easier. The old Hive CLI will continue to exist, but we encourage moving over to Beeline. You can use it by logging into the stat1002/1004 boxes as usual, and launching `beeline`. There is some documentation on this here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline. If you run into any issues using this interface, please ping us on the Analytics list or #wikimedia-analytics or file a bug on Phabricator <http://phabricator.wikimedia.org/tag/analytics>. (If you are wondering stat1004 whaaat - there should be an announcement coming up about it soon!) Best, --Madhu :)

3 3

Re: [Analytics] Monitor the number of Wikipedia sites and the number of articles in each site
by Dan Andreescu 01 Aug '18

01 Aug '18

Forwarding this question to the public Analytics list, where it's good to have these kinds of discussions. If you're interested in this data and how it changes over time, do subscribe and watch for updates, notices of outages, etc. Ok, so on to your question. You'd like the *total # of articles for each wiki*. I think the simplest way right now is to query the AQS (Analytics Query Service) API, documented here: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2 To get the # of articles for a wiki, let's say en.wikipedia.org, you can get the timeseries of new articles per month since the beginning of time: *https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900 <https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org…>* And to get a list of all wikis, to plug into that URL instead of " en.wikipedia.org", the most up-to-date information is here: https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form or via the mediawiki API: https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&form…. Sometimes new sites won't have data in the AQS API for a month or two until we add them and start crunching their stats. The way I figured this out is to look at how our UI uses the API: https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages. So if you were interested in something else, you can browse around there and take a look at the XHR requests in the browser console. Have fun! On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <zzn(a)google.com > wrote: > Hi Dan, > > How are you! This is Victor, It's been a while since we meet at the 2018 > Wikimedia Dev Summit. I hope you are doing great. > > As I mentioned to you, my team works on extracting the knowledge from > Wikipedia. Currently it's undergoing a project that expands language > coverage. My teammate Yuan Gao(cc'ed here) is tech leader of this > project.She plans to *monitor the list of all the current available > wikipedia's sites and the number of articles for each language*, so that > we can compare with our extraction system's output to sanity-check if there > is a massive breakage of the extraction logic, or if we need to add/remove > languages in the event that a new wikipedia site is introduced to/remove > from the wikipedia family. > > I think your team at Analytics at Wikimedia probably knows the best where > we can find this data. Here are 4 places we already know, but doesn't seem > to have the data. > > > - https://en.wikipedia.org/wiki/List_of_Wikipedias. has the > information we need, but the list is manually edited, not automatic > - https://stats.wikimedia.org/EN/Sitemap.htm, has the full list, but > the information seems pretty out of date(last updated almost a month ago) > - StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, I can't > find the full list nor the number of articles > - API https://wikimedia.org/api/rest_v1/ suggested by elukey on > #wikimedia-analytics channel, it doesn't seem to have # of article > information > > Do you know what is a good place to find this information? Thank you! > > Victor > > > > * • **Zainan Zhou(**周载南**) a.k.a. "Victor" * <http://who/zzn> > * • *Software Engineer, Data Engine > * •* Google Inc. > * • *zzn(a)google.com <ecarmeli(a)google.com> - 650.336.5691 > * • * 1600 Amphitheathre Pkwy, LDAP zzn, Mountain View 94043 > > ---------- Forwarded message ---------- > From: Yuan Gao <gaoyuan(a)google.com> > Date: Wed, Mar 28, 2018 at 4:15 PM > Subject: Monitor the number of Wikipedia sites and the number of articles > in each site > To: Zainan Victor Zhou <zzn(a)google.com> > Cc: Wenjie Song <wenjies(a)google.com>, WikiData <wikidata(a)google.com> > > > Hi Victor, > as we discussed in the meeting, I'd like to monitor: > 1) the number of Wikipedia sites > 2) the number of articles in each site > > Can you help us to contact with WMF to get a realtime or at least daily > update of these numbers? What we can find now is > https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of > Wikipedia sites is manually updated, and possibly out-of-date. > > > The monitor can help us catch such bugs. > > -- > Yuan Gao > >

8 13

Wikimedia Video tracking tool
by Agnes Bruszik 11 Jul '18

11 Jul '18

Dear All, We are experiencing problems with this tool: https://tools.wmflabs.org/commons-video-clicks/help.html#, the tool for measuring the number of plays is not functioning at present. Can you please suggest alternative tools to see the number of plays for these 14 x OER MOOC video snippets attached we have uploaded to Wikimedia Commons and inserted into relevant Wikipedia pages? Thank you very much! Agnes WMUK -- Best, *Agnes Bruszik - *Programme Evaluation Assistant Wikimedia UK +44 2033720769 *Wikimedia UK* is the national chapter for the global Wikimedia open knowledge movement. We rely on donations from individuals to support our work to make knowledge open for all. Have you considered supporting Wikimedia? https://donate.wikimedia.org.uk Wikimedia UK is a Company Limited by Guarantee registered in England and Wales, Registered No. 6741827. Registered Charity No.1144513. Registered Office 4th Floor, Development House, 56-64 Leonard Street, London EC2A 4LT. United Kingdom. Wikimedia UK is the UK chapter of a global Wikimedia movement. The Wikimedia projects are run by the Wikimedia Foundation (who operate Wikipedia, amongst other projects). *Wikimedia UK is an independent non-profit charity with no legal control over Wikipedia nor responsibility for its contents.*

4 5

dumps.wikimedia.org web and rsync services down
by Bryan Davis 28 Jun '18

28 Jun '18

The https://dumps.wikimedia.org web interface for downloading various dump files is currently offline. The rsync service for external mirroring is as well. Local network NFS consumers may or may not be working depending on which server the consumer is attached to. This unexpected outage is the result of hardware issues following a short planned maintenance. We are currently investigating the root cause of the outage and will post additional updates as they become available. Thanks for your patience. Bryan -- Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org> [[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA irc: bd808 v:415.839.6885 x6855

1 1

Piwik maintenance ongoing
by Luca Toscano 28 Jun '18

28 Jun '18

Hi everybody, as FYI piwik.wikimedia.org will be in maintenance mode for a couple of hours due to a software upgrade. More info in https://phabricator.wikimedia.org/T192298 Thanks! Luca

1 1

Defining "public" and "content" wikis
by Neil Patel Quinn 27 Jun '18

27 Jun '18

Hey everyone! As you probably know, the Wikimedia cluster includes not just "normal" wikis like English Wikipedia and Albanian Wiktionary, but odd ones like the Wikimedia Belgium chapter website <https://be.wikimedia.org>, Test Wikidata <https://test.wikidata.org/wiki/Wikidata:Main_Page>, and the English Wikipedia Working Group on Ethnic and Cultural Edit Wars wiki <https://wg-en.wikipedia.org/wiki/Main_Page>. As far as I know, however, there's no standard definition of "normal" wiki to use when doing analysis. So I've started meta:Research:Wiki <https://meta.wikimedia.org/wiki/Research:Wiki> with draft definitions of "public wikis" and "content wikis", along with some initial documentation about wiki metadata (names, project groups, etc.) which I plan to continue to work on. I encourage you to edit or comment on the talk page! -- Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF> (he/him/his) product analyst, Wikimedia Foundation

4 3

temporary drop in pageviews to ig.wikipedia
by Amir E. Aharoni 23 Jun '18

23 Jun '18

Hi, I was browsing Turnilo, and found an odd thing. Filter by: 1. Time, January 2017 to May 2018 2. Country: Nigeria 3. Project: ig.wikipedia.org Split by: Time Select line chart You'll immediately notice that there are almost zero pageviews from February 12 until April 15. What could be the reason? Thanks! -- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com ‪“We're living in pieces, I want to live in peace.” – T. Moore‬

4 4

Re: [Analytics] Statistical point of view to the visitors and promotional user names
by Leila Zia 20 Jun '18

20 Jun '18

[coming back from a private response to the public list, with Ladislav's permission.] On Fri, Jun 1, 2018 at 5:03 AM Ladislav Nešněra <ladislav.nesnera(a)otevrenamesta.cz> wrote: > > Re: to https://lists.wikimedia.org/pipermail/analytics/2018-May/006349.html > > Hi Leila, > I'm sorry for the delay but I'm not subscriber of this forum and registration doesn't work for me :-O (https://lists.wikimedia.org/mailman/listinfo/analytics). It means - I had no signal about your answer. No worries. I understand you're in now. :) > Yes, I'd like to know something about user behaviour. This paragraph limits user names based on potential abuse for promotional purpose. Would be fine to know how many users reach the pages where they can see user names (=discussion + history of the article + history of the discussion). Ideally separate human readers and editors. Is it possible? There is no immediate data available I can think of to point you, too (others should feel free, of course, to provide pointers): * If you're interested in anecdotal evidence: The easiest way I can think of that you can visually see this information would be using pageviews analysis tool [1]. * A properly set-up analysis will need to look at the pageviews to the destinations you mentioned before/after the change, controlling for seasonality, pageview changes over time, etc. * Separating editor pageviews vs. reader pageviews will be hard and that's by design. Even if we can set aside time to run this analysis for you, this can only be done over the data in the past 90 days (at most) and I would need to see a relatively strong editor community support for doing it. (We generally don't do in-depth analysis of what editors read, so some discussion is needed to make that happen.) * If you decide to pursue the above, please communicate the priority of this question on your end to help us prioritize. For example, is there a community discussion pending on this result? How important is that discussion? etc. > And one related question - Wikidata Query and Wikipedia statistics. [Link: https://www.wikidata.org/wiki/Wikidata:Request_a_query#Wikidata_Query_and_W…] I'll let others who know more answer the question above, my guess would be what's already said in the link above. Best, Leila > Thank you in advance for your time ;? [1] http://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-a… > > > On 2018-05-29 00:59, Ladislav Nešněra wrote: > > Dear analytic team ;), > > I'd like to know if the policy about promotional user names solve real problem or if this problem was only anticipated in 2007. > Is it possible to get statistic which distinguish between article visitors and people which can see potential promotion names i.e. discussion + history of the article + history of discussion? It'd by ideal exclude editors (it's hard to influence a persons familiar with the subject contrariwise they've be annoyed by promoting) but I don't believe it would be a significant difference. > Can you help me or can you direct me into the right way, please? > > Thank you in advance for your time > > Ladislav Nešněra ;? > +420 721 658 256 > >

2 1

Data ingestion issue with Webrequest 2018-06-14-11
by Luca Toscano 18 Jun '18

18 Jun '18

Hi everybody, we have been working on an issue while refining webrequest data for the 2018-06-14-11 hour, tracked in https://phabricator.wikimedia.org/T197281. We have a fix that will be deployed on Monday, so we apologize in advance if today and during the weekend some data will be missing. Brian (Cced) reported in another thread that hour 12 is missing from https://dumps.wikimedia.org/other/pageviews/2018/2018-06/: this is the same problem, the pageviews dumps follow another naming scheme but the missing hour is the same. Until we deploy the fix, we'll miss this hour from webrequest, pageviews hourly and daily. Please follow the Phabricator task for more info during the next days, or ping the analytics team on IRC (#wikimedia-analytics on Freenode). Thanks! Luca

1 1

Fwd: Modeling interactions on talk pages and detecting early signs of conversational failure: Research Showcase - June 18, 2018 (11:30 AM PDT| 18:30 UTC)
by Dario Taraborelli 18 Jun '18

18 Jun '18

Hey everyone, we're hosting a dedicated session in June on our joint work with Cornell and Jigsaw on predicting conversational failure <https://arxiv.org/abs/1805.05345> on Wikipedia talk pages. This is part of our contribution to WMF's Anti-Harassment program. The showcase <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2018> will be live-streamed <https://www.youtube.com/watch?v=m4vzI0k4OSg> on *Monday, June 18, 2018* at 11:30 AM (PDT), 18:30 (UTC). (Please note this falls on a Monday this month). Conversations Gone Awry. Detecting Early Signs of Conversational FailureBy *Justine Zhang and Jonathan Chang, Cornell University*One of the main challenges online social systems face is the prevalence of antisocial behavior, such as harassment and personal attacks. In this work, we introduce the task of predicting from the very start of a conversation whether it will get out of hand. As opposed to detecting undesirable behavior after the fact, this task aims to enable early, actionable prediction at a time when the conversation might still be salvaged. To this end, we develop a framework for capturing pragmatic devices—such as politeness strategies and rhetorical prompts—used to start a conversation, and analyze their relation to its future trajectory. Applying this framework in a controlled setting, we demonstrate the feasibility of detecting early warning signs of antisocial behavior in online discussions. Building a rich conversation corpus from Wikipedia Talk pagesWe present a corpus of conversations that encompasses the complete history of interactions between contributors to English Wikipedia's Talk Pages. This captures a new view of these interactions by containing not only the final form of each conversation but also detailed information on all the actions that led to it: new comments, as well as modifications, deletions and restorations. This level of detail supports new research questions pertaining to the process (and challenges) of large-scale online collaboration. As an example, we present a small study of removed comments highlighting that contributors successfully take action on more toxic behavior than was previously estimated. YouTube stream: https://www.youtube.com/watch?v=m4vzI0k4OSg As usual, you can join the conversation on IRC at #wikimedia-research. And, you can watch our past research showcases here <https://www.youtube.com/playlist?list=PLhV3K_DS5YfLQLgwU3oDFiGaU3K7pUVoW>. Hope to see you there on June 18! Dario

1 1

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Analytics June 2018