Analytics July 2018

analytics@lists.wikimedia.org

16 participants
12 discussions

Beeline as Hive client
by Madhumitha Viswanathan 03 Oct '18

Hi all,

For all Hive users on stat1002/1004: you may have seen a deprecation warning when launching the hive client, saying that it is being replaced by Beeline. The Beeline shell has always been available, but it required supplying a database connection string every time, which was pretty annoying. We now have a wrapper <https://github.com/wikimedia/operations-puppet/blob/production/modules/role…> script set up to make this easier.

The old Hive CLI will continue to exist, but we encourage moving over to Beeline. You can use it by logging into the stat1002/1004 boxes as usual and launching `beeline`. There is some documentation here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.

If you run into any issues using this interface, please ping us on the Analytics list or in #wikimedia-analytics, or file a bug on Phabricator <http://phabricator.wikimedia.org/tag/analytics>.

(If you are wondering "stat1004 whaaat": there should be an announcement about it soon!)

Best,
--Madhu :)

Re: [Analytics] Monitor the number of Wikipedia sites and the number of articles in each site
by Dan Andreescu 01 Aug '18

Forwarding this question to the public Analytics list, where it's good to have these kinds of discussions. If you're interested in this data and how it changes over time, do subscribe and watch for updates, notices of outages, etc.

Ok, so on to your question. You'd like the *total # of articles for each wiki*. I think the simplest way right now is to query the AQS (Analytics Query Service) API, documented here: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2

To get the # of articles for a wiki, let's say en.wikipedia.org, you can get the timeseries of new articles per month since the beginning of time:

https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900

And to get a list of all wikis, to plug into that URL instead of "en.wikipedia.org", the most up-to-date information is here: https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form, or via the MediaWiki API: https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&form…

Sometimes new sites won't have data in the AQS API for a month or two, until we add them and start crunching their stats.

The way I figured this out was to look at how our UI uses the API: https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages. So if you are interested in something else, you can browse around there and take a look at the XHR requests in the browser console.

Have fun!

On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <zzn(a)google.com> wrote:

> Hi Dan,
>
> How are you! This is Victor. It's been a while since we met at the 2018
> Wikimedia Dev Summit. I hope you are doing great.
>
> As I mentioned to you, my team works on extracting knowledge from
> Wikipedia. It is currently running a project to expand language
> coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
> project. She plans to *monitor the list of all the currently available
> Wikipedia sites and the number of articles for each language*, so that
> we can compare with our extraction system's output to sanity-check whether
> there is a massive breakage of the extraction logic, or whether we need to
> add/remove languages in the event that a new Wikipedia site is introduced
> to/removed from the Wikipedia family.
>
> I think your team at Analytics at Wikimedia probably knows best where
> we can find this data. Here are 4 places we already know, but they don't
> seem to have the data.
>
> - https://en.wikipedia.org/wiki/List_of_Wikipedias has the
>   information we need, but the list is manually edited, not automatic
> - https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but
>   the information seems pretty out of date (last updated almost a month ago)
> - StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, I can't
>   find the full list nor the number of articles
> - API https://wikimedia.org/api/rest_v1/ suggested by elukey on the
>   #wikimedia-analytics channel, it doesn't seem to have # of articles
>   information
>
> Do you know a good place to find this information? Thank you!
>
> Victor
>
> • Zainan Zhou (周载南) a.k.a. "Victor" <http://who/zzn>
> • Software Engineer, Data Engine
> • Google Inc.
> • zzn(a)google.com <ecarmeli(a)google.com> - 650.336.5691
> • 1600 Amphitheathre Pkwy, LDAP zzn, Mountain View 94043
>
> ---------- Forwarded message ----------
> From: Yuan Gao <gaoyuan(a)google.com>
> Date: Wed, Mar 28, 2018 at 4:15 PM
> Subject: Monitor the number of Wikipedia sites and the number of articles
> in each site
> To: Zainan Victor Zhou <zzn(a)google.com>
> Cc: Wenjie Song <wenjies(a)google.com>, WikiData <wikidata(a)google.com>
>
> Hi Victor,
> as we discussed in the meeting, I'd like to monitor:
> 1) the number of Wikipedia sites
> 2) the number of articles in each site
>
> Can you help us contact WMF to get a realtime or at least daily
> update of these numbers? What we can find now is
> https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of
> Wikipedia sites is manually updated, and possibly out-of-date.
>
> The monitor can help us catch such bugs.
>
> --
> Yuan Gao
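
A minimal sketch of the approach described in the reply above: build the AQS URL for one wiki and sum the monthly new-page counts. The URL layout is taken from the message; the response shape (an `items` list whose entries hold a `results` list with a `new_pages` field), the helper names, and the User-Agent string are assumptions for illustration and should be checked against the AQS documentation. The sum counts pages created over the period, so it is only an approximation of the current number of articles.

```python
import json
import urllib.request

# Endpoint prefix copied from the URL quoted in the message above.
AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/edited-pages/new"


def build_url(project, start="2001010100", end="2018032900"):
    """Return the monthly new-pages URL for one wiki (hypothetical helper)."""
    return (
        f"{AQS_BASE}/{project}/all-editor-types/all-page-types"
        f"/monthly/{start}/{end}"
    )


def total_new_pages(payload):
    """Sum the monthly counts in a decoded response.

    Assumes the shape {"items": [{"results": [{"new_pages": N}, ...]}]};
    missing keys are treated as zero rather than raising.
    """
    total = 0
    for item in payload.get("items", []):
        for row in item.get("results", []):
            total += row.get("new_pages", 0)
    return total


def fetch_total(project):
    """Fetch the timeseries for one wiki and return the summed count."""
    request = urllib.request.Request(
        build_url(project),
        # Placeholder User-Agent; replace with one that identifies your script.
        headers={"User-Agent": "new-pages-monitor-example"},
    )
    with urllib.request.urlopen(request) as response:
        return total_new_pages(json.load(response))
```

Calling `fetch_total("en.wikipedia.org")` would perform the request; looping the same call over the project names from the site matrix would give one figure per wiki.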

Eventlogging parsing problem, down during the weekend
by Luca Toscano 30 Jul '18

Hi everybody,

there is an ongoing task for Eventlogging (https://phabricator.wikimedia.org/T200630) related to the parsing of weird user agents. The issue caused Eventlogging to be down during the weekend, but we hope to make it work again today. This means that recent data is delayed; apologies for the inconvenience.

For any questions, feel free to reach out to the Analytics team via the usual channels (IRC #wikimedia-analytics, emails, etc.).

Thanks!

Luca

New files for geo coded Wikimedia stats
by Erik Zachte 27 Jul '18

Today I released two new json files [2][4]. Both complement the visualization 'Wikipedia Views Visualized' [1] (aka WiViVi), but both can be useful in other contexts as well.

1) File 'demographics_from_world_bank_for_wikimedia.json' [2] resulted from harvesting World Bank API files. It contains yearly figures for four metrics (more could be added rather easily):
- population counts,
- percentage internet users,
- percentage mobile subscriptions,
- GDP per capita.

The following static demographics charts on meta are also based on these metrics: [3]

2) File 'datamaps-data.json' [4] contains the equivalent of 3 rather complex (*) csv files which feed WiViVi. It brings together demographics data and pageviews (by country, by region, and by language), and adds further meta info. This json file is meant for external use, as it's much easier to parse than the 3 csv files WiViVi uses itself [5].

(*) complex, as the csv files use a hierarchy based on nested delimiters

--
Details:
World Bank files have different formats (some csv, some json) and use a variety of indexes (some use ISO 3166-1 alpha-2 codes, others alpha-3). Script 1) first normalizes the data, then aggregates, filters, and indexes it. Json file 1) replaces two csv files which up to now were filled from Wikipedia pages [6][7]. Also, although Wikipedia lists nowadays also use World Bank data, this is not done consistently, see [8][9].

[1] Viz: https://stats.wikimedia.org/wikimedia/animations/wivivi/wivivi.html
[2] Json: https://stats.wikimedia.org/wikimedia/animations/wivivi/world-bank-demograp…
    Script: https://github.com/wikimedia/analytics-wikistats/tree/master/worldbank
[3] Charts: https://meta.wikimedia.org/wiki/World_Bank_demographics
[4] Json: https://stats.wikimedia.org/wikimedia/animations/wivivi/datamaps-data.json
    Script: https://github.com/wikimedia/analytics-wikistats/tree/master/traffic
[5] Syntax: https://stats.wikimedia.org/wikimedia/animations/wivivi/data.html
[6] Article: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_populat…
[7] Article: https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users
[8] Talk page: https://bit.ly/2L5Z2P4 section 'Wikipedia vs Worldbank population counts'
[9] Talk page: https://bit.ly/2NJUoIu section 'Wikipedia vs Worldbank internet percentages'

Fwd: [Wikitech-l] hewiki dump to be added to 'big wikis' and run with multiple processes
by Pine W 23 Jul '18

Forwarding in case this is of interest to anyone on the Analytics or Research lists who doesn't subscribe to Wikitech-l or Xmldatadumps-l.

Pine
( https://meta.wikimedia.org/wiki/User:Pine )

---------- Forwarded message ----------
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Date: Fri, Jul 20, 2018 at 5:53 AM
Subject: [Wikitech-l] hewiki dump to be added to 'big wikis' and run with multiple processes
To: Wikipedia Xmldatadumps-l <Xmldatadumps-l(a)lists.wikimedia.org>, Wikimedia developers <wikitech-l(a)lists.wikimedia.org>

Good morning!

The pages-meta-history dumps for hewiki take 70 hours these days, the longest of any wiki not already running with parallel jobs. I plan to add it to the list of 'big wikis' starting August 1st, meaning that 6 jobs will run in parallel producing the usual numbered file output; look at e.g. frwiki dumps for an example.

Please adjust any download/processing scripts accordingly.

Thanks!

Ariel
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
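
As a rough illustration of "adjusting processing scripts", the sketch below selects pages-meta-history files for one wiki whether the dump is a single file or several numbered parts, and returns them in part order. The file name pattern (an optional part number after `pages-meta-history` and an optional `-pNpM` page range) is an assumption modelled on how numbered dump output commonly looks; compare it with the actual frwiki listing before relying on it. The function and pattern names are hypothetical.

```python
import re

# Assumed naming scheme, e.g.
#   hewiki-20180701-pages-meta-history.xml.7z              (single file)
#   hewiki-20180801-pages-meta-history2.xml-p5000p9000.7z  (numbered part)
DUMP_RE = re.compile(
    r"^(?P<wiki>[a-z0-9_]+)-(?P<date>\d{8})-pages-meta-history"
    r"(?P<part>\d*)\.xml(?:-p(?P<start>\d+)p(?P<end>\d+))?\.(?:7z|bz2)$"
)


def history_files(names, wiki):
    """Return the history dump files for `wiki`, sorted by part and page range."""
    matched = []
    for name in names:
        m = DUMP_RE.match(name)
        if m and m.group("wiki") == wiki:
            # A missing part number or page range sorts first (treated as 0).
            part = int(m.group("part") or 0)
            start = int(m.group("start") or 0)
            matched.append((part, start, name))
    return [name for _, _, name in sorted(matched)]
```

A script written this way processes one file before the change and several files after it without further edits, as long as the assumed naming holds.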

Re: [Analytics] Kafka Main Eqiad outage and failover of Eventbus/Eventstreams to codfw
by Luca Toscano 12 Jul '18

[Adding some other mailing lists in Cc]

Hi everybody,

as a lot of you have probably already noticed yesterday reading the operations@ mailing list, we had an outage of the Kafka Main eqiad cluster that forced us to switch the Eventbus and Eventstreams services to codfw. All the precise timings will be listed in https://wikitech.wikimedia.org/wiki/Incident_documentation/20180711-kafka-e…, but for a quick glimpse:

2018-07-11 17:00 UTC - Eventbus service switched to codfw
2018-07-11 18:44 UTC - Eventstreams service switched to codfw

We are going to switch those services back to eqiad during the next couple of hours. The consumers of the Eventstreams service may get some failures or data drops; apologies in advance for the trouble.

Cheers,

Luca

On Thu, 12 Jul 2018 at 00:00, Luca Toscano <ltoscano(a)wikimedia.org> wrote:

> Hi everybody,
>
> as you might have seen from the operations channel on IRC, the Kafka Main
> Eqiad cluster (kafka100[1-3].eqiad.wmnet) suffered a long outage due to new
> topics pushed out with too long names (causing fs operation issues, etc.).
> I'll update this email thread tomorrow EU time with more details, tasks,
> precise root cause, etc., but the important bit to know is that Eventbus
> and Eventstreams have been failed over to the Kafka Main Codfw cluster.
> This should be transparent to everybody, but please let us know otherwise.
>
> Thanks for the patience!
>
> (a very sleepy :) Luca

Wikistats2 Better maps and new metric: Legacy Pageviews (a.k.a Pagecounts)
by Nuria Ruiz 11 Jul '18

Hello!

Just a brief note to announce that we have two new things in Wikistats2 this quarter.

We have reviewed maps and we now report more precise pageviews per country. Check, for example, pageviews for Portuguese Wikipedia around the world for last month: https://stats.wikimedia.org/v2/#/pt.wikipedia.org/reading/page-views-by-cou…

Also, we have included legacy pageviews in the UI. We used to call these pagecounts, and prior to June 2015 this is the metric that we reported as pageviews for all Wikimedia sites. See, for example, pagecounts for Portuguese Wikipedia from 2008 to 2016: https://stats.wikimedia.org/v2/#/pt.wikipedia.org/reading/legacy-page-views…

Info about the metric: https://wikitech.wikimedia.org/wiki/Analytics/Archive/Data/Pagecounts-raw

Also, all URLs are now bookmarkable.

As always, suggestions are welcome; please file bug reports on Phabricator.

Thanks,

Nuria

Wikimedia Research Showcase July 11, 2018 (11:30 AM PDT| 18:30 UTC)
by Sarah R 11 Jul '18

Hi Everyone,

The next Wikimedia Research Showcase will be live-streamed Wednesday, July 11, 2018 at 11:30 AM (PDT) 18:30 UTC.

YouTube stream: https://www.youtube.com/watch?v=uK7AvNKq0sg

As usual, you can join the conversation on IRC at #wikimedia-research. You can also watch our past research showcases here: <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#Upcoming_Showcase>

Hope to see you there!

This month's presentations:

Mind the (Language) Gap: Neural Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders
By *Lucie-Aimée Kaffee*

While Wikipedia exists in 287 languages, its content is unevenly distributed among them. It is therefore of the utmost social and cultural interest to address languages whose native speakers have access only to an impoverished Wikipedia. In this work, we investigate the generation of summaries for Wikipedia articles in underserved languages, given structured data as input. In order to address the information bias towards widely spoken languages, we focus on an important support for such summaries: ArticlePlaceholders, which are dynamically generated content pages in underserved Wikipedia versions. They enable native speakers to access existing information in Wikidata, a structured Knowledge Base (KB). Our system provides a generative neural network architecture, which processes the triples of the KB as they are dynamically provided by the ArticlePlaceholder, and generates a comprehensible textual summary. This data-driven approach is tested with the goal of understanding how well it matches the communities' needs on two underserved languages on the Web: Arabic, a language with a big community with disproportionate access to knowledge online, and Esperanto. With the help of the Arabic and Esperanto Wikipedians, we conduct an extended evaluation which exhibits not only the quality of the generated text but also the applicability of our end-system to any underserved Wikipedia version.

Token-level change tracking: data, tools and insights
By *Fabian Flöck*

This talk first gives an overview of the WikiWho infrastructure, which provides tracking of changes to single tokens (~words) in articles of different Wikipedia language versions. It exposes APIs for accessing this data in near-real time, and is complemented by a published static dataset. Several insights are presented regarding provenance, partial reverts, token-level conflict and other metrics that only become available with such data. Lastly, the talk will cover several tools and scripts that are already using the API and will discuss their application scenarios, such as investigation of authorship, conflicted content and editor productivity.

Wikimedia Video tracking tool
by Agnes Bruszik 11 Jul '18

Dear All,

We are experiencing problems with this tool: https://tools.wmflabs.org/commons-video-clicks/help.html#. The tool for measuring the number of plays is not functioning at present.

Can you please suggest alternative tools to see the number of plays for the 14 attached OER MOOC video snippets that we have uploaded to Wikimedia Commons and inserted into relevant Wikipedia pages?

Thank you very much!

Agnes
WMUK

--
Best,

*Agnes Bruszik* - Programme Evaluation Assistant
Wikimedia UK
+44 2033720769

*Wikimedia UK* is the national chapter for the global Wikimedia open knowledge movement. We rely on donations from individuals to support our work to make knowledge open for all. Have you considered supporting Wikimedia? https://donate.wikimedia.org.uk

Wikimedia UK is a Company Limited by Guarantee registered in England and Wales, Registered No. 6741827. Registered Charity No.1144513. Registered Office 4th Floor, Development House, 56-64 Leonard Street, London EC2A 4LT. United Kingdom. Wikimedia UK is the UK chapter of a global Wikimedia movement. The Wikimedia projects are run by the Wikimedia Foundation (who operate Wikipedia, amongst other projects).

*Wikimedia UK is an independent non-profit charity with no legal control over Wikipedia nor responsibility for its contents.*

Inspire Campaign on Measuring Community Health starts today!
by Sydney Poore 09 Jul '18

Hello Wikimedians,

I'm happy to announce the launch of the Inspire Campaign on Measuring Community Health.[1] The goal of this campaign is to gather your ideas on approaches to measure or evaluate the experience and quality of participating and interacting with others in Wikimedia projects.

So what is community health? Healthy projects promote high quality content creation, respectful collaboration, efficient workflows, and effective conflict resolution. Tasks and experiences that result in patterns of editor frustration, poor editor retention, harassment, broken workflows, and unresolved conflicts are unhealthy for a project.

As a movement, Wikimedians have always measured aspects of their communities. Data points, such as editor activity levels, are regularly collected. While these metrics provide some useful indications about the health of a project, they do not give major insights into challenges and specific areas needing improvement or what areas have been successful.

We want to hear from you what specific areas on your Wikimedia project should be evaluated or measured, and how it should be done. Share your ideas, contribute to other people’s submissions, and get involved in the new Inspire Campaign. After the campaign, grants and other paths are available to support the formal development of these measures and evaluation techniques.[2]

Warm regards,

Sydney

[1] https://meta.wikimedia.org/wiki/Special:MyLanguage/Grants:IdeaLab/Inspire
[2] https://meta.wikimedia.org/wiki/Grants:IdeaLab/Develop

--
Sydney Poore
Trust and Safety Specialist
Wikimedia Foundation
Trust and Safety team; Anti-harassment tools team
