Hi all,
If you use Hive on stat1002/1004, you may have seen a deprecation
warning when launching the Hive client, claiming that it is being replaced
with Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
script
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
Forwarding this question to the public Analytics list, where it's good to
have these kinds of discussions. If you're interested in this data and how
it changes over time, do subscribe and watch for updates, notices of
outages, etc.
Ok, so on to your question. You'd like the *total # of articles for each
wiki*. I think the simplest way right now is to query the AQS (Analytics
Query Service) API, documented here:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
To get the # of articles for a wiki, let's say en.wikipedia.org, you can
get the timeseries of new articles per month since the beginning of time:
https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900
And to get a list of all wikis, to plug into that URL instead of "
en.wikipedia.org", the most up-to-date information is here:
https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form or via the
MediaWiki API:
https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&form….
Sometimes new sites won't have data in the AQS API for a month or two until
we add them and start crunching their stats.
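As a concrete illustration, here is a minimal sketch of that lookup (stdlib only; the response field names, such as `new_pages`, are assumptions based on the endpoint above, so double-check them against the AQS docs):

```python
import json
from urllib.request import urlopen

# Endpoint pattern from the AQS URL above; dates are YYYYMMDDHH.
AQS = ("https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/"
       "{project}/all-editor-types/all-page-types/monthly/{start}/{end}")

def total_new_pages(payload):
    """Sum the monthly counts in an AQS response dict.

    Assumes the usual AQS shape: items -> results -> new_pages.
    """
    return sum(result["new_pages"]
               for item in payload["items"]
               for result in item["results"])

def fetch_total(project, start="2001010100", end="2018032900"):
    """Fetch the full monthly timeseries for one wiki and sum it."""
    url = AQS.format(project=project, start=start, end=end)
    with urlopen(url) as resp:
        return total_new_pages(json.load(resp))
```

Note that summing the monthly new-page counts gives a running total of pages created, not pages currently existing (deletions aren't subtracted), so treat it as an approximation.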
The way I figured this out is to look at how our UI uses the API:
https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages.
So if you were interested in something else, you can browse around there
and take a look at the XHR requests in the browser console. Have fun!
On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <zzn(a)google.com
> wrote:
> Hi Dan,
>
> How are you! This is Victor. It's been a while since we met at the 2018
> Wikimedia Dev Summit. I hope you are doing great.
>
> As I mentioned to you, my team works on extracting the knowledge from
> Wikipedia. We are currently working on a project that expands language
> coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
> project. She plans to *monitor the list of all currently available
> Wikipedia sites and the number of articles for each language*, so that
> we can compare it with our extraction system's output to sanity-check
> whether there is a massive breakage of the extraction logic, or whether
> we need to add/remove languages in the event that a new Wikipedia site is
> introduced to/removed from the Wikipedia family.
>
> I think your team at Analytics at Wikimedia probably knows best where
> we can find this data. Here are 4 places we already know of, but they
> don't seem to have the data.
>
>
> - https://en.wikipedia.org/wiki/List_of_Wikipedias has the
> information we need, but the list is manually edited, not automatic
> - https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but
> the information seems pretty out of date (last updated almost a month ago)
> - StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, where I can't
> find the full list or the number of articles
> - API https://wikimedia.org/api/rest_v1/, suggested by elukey on the
> #wikimedia-analytics channel; it doesn't seem to have the # of articles
> information either
>
> Do you know what is a good place to find this information? Thank you!
>
> Victor
>
>
>
> • Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
> • Software Engineer, Data Engine
> • Google Inc.
> • zzn(a)google.com - 650.336.5691
> • 1600 Amphitheatre Pkwy, LDAP zzn, Mountain View 94043
>
> ---------- Forwarded message ----------
> From: Yuan Gao <gaoyuan(a)google.com>
> Date: Wed, Mar 28, 2018 at 4:15 PM
> Subject: Monitor the number of Wikipedia sites and the number of articles
> in each site
> To: Zainan Victor Zhou <zzn(a)google.com>
> Cc: Wenjie Song <wenjies(a)google.com>, WikiData <wikidata(a)google.com>
>
>
> Hi Victor,
> as we discussed in the meeting, I'd like to monitor:
> 1) the number of Wikipedia sites
> 2) the number of articles in each site
>
> Can you help us contact WMF to get a realtime, or at least daily,
> update of these numbers? What we can find now is
> https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of
> Wikipedia sites is manually updated, and possibly out-of-date.
>
>
> The monitor can help us catch such bugs.
>
> --
> Yuan Gao
>
>
Hey everyone,
we're hosting a dedicated session in June on our joint work with Cornell
and Jigsaw on predicting conversational failure
<https://arxiv.org/abs/1805.05345> on Wikipedia talk pages. This is part of
our contribution to WMF's Anti-Harassment program.
The showcase
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2018> will be
live-streamed <https://www.youtube.com/watch?v=m4vzI0k4OSg> on *Monday,
June 18, 2018* at 11:30 AM (PDT), 18:30 (UTC). (Please note this falls on
a Monday this month).
Conversations Gone Awry: Detecting Early Signs of Conversational Failure
By *Justine Zhang and Jonathan Chang, Cornell University*
One of the main challenges
online social systems face is the prevalence of antisocial behavior, such
as harassment and personal attacks. In this work, we introduce the task of
predicting from the very start of a conversation whether it will get out of
hand. As opposed to detecting undesirable behavior after the fact, this
task aims to enable early, actionable prediction at a time when the
conversation might still be salvaged. To this end, we develop a framework
for capturing pragmatic devices—such as politeness strategies and
rhetorical prompts—used to start a conversation, and analyze their relation
to its future trajectory. Applying this framework in a controlled setting,
we demonstrate the feasibility of detecting early warning signs of
antisocial behavior in online discussions.
Building a rich conversation corpus from Wikipedia Talk pages
We present a
corpus of conversations that encompasses the complete history of
interactions between contributors to English Wikipedia's Talk Pages. This
captures a new view of these interactions by containing not only the final
form of each conversation but also detailed information on all the actions
that led to it: new comments, as well as modifications, deletions and
restorations. This level of detail supports new research questions
pertaining to the process (and challenges) of large-scale online
collaboration. As an example, we present a small study of removed comments
highlighting that contributors successfully take action on more toxic
behavior than was previously estimated.
YouTube stream: https://www.youtube.com/watch?v=m4vzI0k4OSg
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.youtube.com/playlist?list=PLhV3K_DS5YfLQLgwU3oDFiGaU3K7pUVoW>.
Hope to see you there on June 18!
Dario
Hi all!
*If you are not an active user of the EventStreams service, you can ignore
this email.*
We’re in the process of upgrading
<https://phabricator.wikimedia.org/T152015> the backend infrastructure that
powers the EventStreams service. When we switch EventStreams to the new
infrastructure <https://phabricator.wikimedia.org/T185225>, the ‘offsets’
AKA Last-Event-IDs will change.
Connected EventStreams SSE clients will reconnect and not be able to
automatically consume from the exact position in the stream where they left
off. Instead, reconnecting clients will begin consuming from the latest
messages in the stream. This means that connected clients will likely miss
any messages that occurred during the reconnect period. Hopefully this
will be a very small number of messages, as your SSE client should
reconnect quickly.
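For those writing their own clients, here is a rough sketch (not our official client code; it assumes single-line `data:` fields and uses the public recentchange stream URL) of where the Last-Event-ID header fits in. After the switch, stale IDs will simply no longer resolve, and the server will fall back to the latest messages:

```python
import json
from urllib.request import Request, urlopen

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def reconnect_headers(last_event_id=None):
    """Build SSE request headers.

    Last-Event-ID asks the server to resume from where the previous
    connection left off. After the backend switch, old IDs no longer
    map to valid positions, so the server resumes from the latest
    messages instead.
    """
    headers = {"Accept": "text/event-stream"}
    if last_event_id:
        headers["Last-Event-ID"] = last_event_id
    return headers

def consume(url=STREAM_URL, last_event_id=None):
    """Yield decoded events; simplified to single-line data fields."""
    req = Request(url, headers=reconnect_headers(last_event_id))
    with urlopen(req) as resp:
        for raw in resp:
            line = raw.decode("utf-8").rstrip("\n")
            if line.startswith("data: "):
                yield json.loads(line[len("data: "):])
```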
This switch is scheduled to happen on June 5 2018, at around 17:30 UTC.
Let us know if you have any questions.
Thanks!
- Andrew Otto
Senior Systems Engineer, WMF
Dear analytics team ;),
I'd like to know whether the policy about promotional user names
<https://en.wikipedia.org/wiki/Wikipedia:Username_policy#Promotional_names>
solves a real problem, or whether the problem was only anticipated in 2007
<https://en.wikipedia.org/wiki/Wikipedia_talk:User_account_policy/Archive_4#…>.
Is it possible to get statistics that distinguish between article
visitors and people who can actually see potential promotional names, i.e.
those viewing the discussion page, the article history, and the discussion
history? Ideally editors would be excluded (it's hard to influence people
familiar with the subject; on the contrary, they'd just be annoyed by the
promotion), but I don't believe that would make a significant difference.
Can you help me, or point me in the right direction, please?
Thank you in advance for your time
Ladislav Nešněra <https://cs.wikipedia.org/wiki/Wikipedista:Nesnera> ;?
+420 721 658 256
Hi,
I see that some EventLogging tables have custom indexes. What's the process
for getting indexes added to a couple of schemas I need extra DB indexes
for? The "research" user on the analytics slave doesn't have ALTER rights,
and I couldn't find any documentation on the topic.
+ Analytics, our public analytics related mailing list [1]
Hi Jeff,
Let me give it a try:
* Re pageviews: a lot has changed since the Kaggle contest days you
refer to. :) I highly recommend you check out
https://dumps.wikimedia.org/other/pagecounts-ez/ where our hourly
pageviews per article live. In case you need it, abbreviations used in
the file names are documented. [2]
* Can you expand more on what you are trying to do? The short answer to
your category-related question is that you have to parse the XML dumps,
but we may have some good pointers for you to save you from that. If
you tell us more, we're more likely to be able to help.
* And, if you decide to continue research on Wiki(m|p)edia data (which
I hope you do :), consider signing up for our public research list at
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
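On the category question specifically, one lighter-weight option for per-article lookups (a sketch, not a substitute for the dump-parsing pointers above) is the MediaWiki API's `prop=categories` query, which returns the categories a given article belongs to:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def parse_categories(payload):
    """Extract category titles from an action=query/prop=categories response."""
    pages = payload["query"]["pages"]
    return [cat["title"]
            for page in pages.values()
            for cat in page.get("categories", [])]

def categories_of(title):
    """Look up the categories of one article by title."""
    params = urlencode({"action": "query", "prop": "categories",
                        "titles": title, "cllimit": "max", "format": "json"})
    with urlopen(f"{API}?{params}") as resp:
        return parse_categories(json.load(resp))
```

Note this returns the categories directly attached to the page (including maintenance categories), so iterating a list of article titles this way avoids the subcategory issue that `list=categorymembers` runs into.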
Best,
Leila
[1] https://lists.wikimedia.org/mailman/listinfo/analytics
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews
--
Leila Zia
Senior Research Scientist, Lead
Wikimedia Foundation
On Wed, May 23, 2018 at 3:22 PM, Wikimedia Answers
<answers(a)wikimedia.org> wrote:
> Forwarding for your evaluation :) Feel free to include the wider Research
> team.
>
> best,
> Joe
>
> ---------- Forwarded message ----------
> From: Jeffrey Levesque <jlevesqu(a)syr.edu>
> Date: Tue, May 22, 2018 at 7:48 AM
> Subject: Re: Jeff Levesque: List of Articles By Categories (College Project)
> To: "info-en(a)wikimedia.org" <info-en(a)wikimedia.org>
> Cc: "answers(a)wikimedia.org" <answers(a)wikimedia.org>
>
>
> Hi,
> Is there a known API, where I can supply the article name, and attain the
> corresponding "category" the article belongs to? I'm thinking I could write
> a Python script, iterate over the Kaggle dataset, and send requests to
> some existing API, to determine each article's "category".
>
> Thank you,
>
> Jeff Levesque
> https://github.com/jeff1evesque
>
> On May 22, 2018, at 10:37 AM, Jeffrey Levesque <jlevesqu(a)syr.edu> wrote:
>
> Hi,
> Do you guys have a more recent time series of Wikipedia article traffic? I'm
> noticing that the kaggle dataset does not have a lot of articles that are on
> Wikipedia. Do you guys have a good idea of how I can categorize the dataset
> I have?
>
> Thank you,
>
> Jeff Levesque
> https://github.com/jeff1evesque
>
> On May 22, 2018, at 8:40 AM, Jeffrey Levesque <jlevesqu(a)syr.edu> wrote:
>
> Hi,
>
> I am a master's student at Syracuse University. For my data science class, I am
> doing a project trying to analyze traffic patterns for Wikipedia. I’ve
> attained the Kaggle dataset for 2015-2016 data:
>
>
>
> https://www.kaggle.com/headsortails/wiki-traffic-forecast-exploration-wtf-e…
>
>
>
> However, the dataset only provides the frequency of visits to particular
> pages on a given day. Could I request to attain a list of articles grouped
> by “Categories”? I’ve tried to use the API (i.e.
> https://en.wikipedia.org/wiki/Special:Export). But, that doesn’t seem to
> generate a full output. Additionally, in the list it supplies subcategories.
> So, I tried using the URL API (i.e.
> https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitl…).
> But, that also seems to return an even shorter result set:
>
>
>
> {"batchcomplete":"","continue":{"cmcontinue":"page|2d2941313f2b292d3d0447454f31434f39293f011701dc16|55503653","continue":"-||"},"query":{"categorymembers":[{"pageid":22939,"ns":0,"title":"Physics"},{"pageid":24489,"ns":0,"title":"Outline
> of physics"},{"pageid":3445246,"ns":0,"title":"Glossary of classical
> physics"},{"pageid":1653925,"ns":100,"title":"Portal:Physics"},{"pageid":50926902,"ns":0,"title":"Action
> angle
> coordinates"},{"pageid":9079863,"ns":0,"title":"Aerometer"},{"pageid":52657328,"ns":0,"title":"Bayesian
> model of computational anatomy"},{"pageid":49342572,"ns":0,"title":"Group
> actions in computational
> anatomy"},{"pageid":50724262,"ns":0,"title":"Blasius\u2013Chaplygin
> formula"},{"pageid":33327002,"ns":0,"title":"Cabbeling"}]}}
>
>
>
>
>
> Thank you,
>
> Jeff Levesque
>
> (603) 969-5363
>
>
Hi all!
Your beloved Pivot may not be dying after all… :) It has been forked (and
forked again) and resurrected as an open source project, now named
Turnilo.
The fork seems to be backwards compatible, and much faster. It is
available now at turnilo.wikimedia.org. We will soon be configuring a
redirect from pivot.wikimedia.org to turnilo.wikimedia.org. Any bookmarked
links you have should transparently redirect and work in Turnilo. If not,
let us know!
We will be configuring the redirect this week on Wednesday May 23.
- Andrew Otto
Hi all,
We are slowly reinstalling the operating systems on analytics servers to
upgrade to Debian Stretch. We’d like to do stat1004 tomorrow, Tuesday May
22. We will preserve all home directories there. Downtime should only be
an hour or two.
Please plan to use stat1005 instead of stat1004 on May 22. Thanks!
- Andrew Otto
https://phabricator.wikimedia.org/T192640