Hi all,
If you use Hive on stat1002/1004, you may have seen a deprecation warning
when launching the hive client, saying that it is being replaced by
Beeline. The Beeline shell has always been available, but it used to
require supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
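For scripted use, something like the following might work (a rough sketch,
not tested; it assumes the wrapper passes standard Beeline flags such as
-e and --outputformat through):

# Rough sketch, untested: run a HiveQL query through the beeline wrapper
# and capture tab-separated output. Assumes the wrapper forwards the
# standard Beeline flags -e and --outputformat.
import subprocess

result = subprocess.run(
    ['beeline', '--outputformat=tsv2', '-e', 'SHOW TABLES IN wmf;'],
    stdout=subprocess.PIPE, universal_newlines=True, check=True)
for line in result.stdout.splitlines():
    print(line)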
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
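For example, a quick exploration with pandas might look like this (a
sketch only; the file name and the column names prev_title, curr_title,
type and n are assumptions based on the published TSV - check the figshare
page for the authoritative schema):

# Sketch: explore the clickstream dump with pandas.
# File and column names are assumptions; verify against figshare.
import pandas as pd

df = pd.read_csv('2015_01_clickstream.tsv', sep='\t',
                 usecols=['prev_title', 'curr_title', 'type', 'n'])

# Most common sources of traffic to a given article:
print(df[df.curr_title == 'London']
      .sort_values('n', ascending=False).head(10))

# Most clicked links from that article:
print(df[(df.prev_title == 'London') & (df.type == 'link')]
      .sort_values('n', ascending=False).head(10))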
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi everyone!
Wikimedia is releasing a new service today: EventStreams
<https://wikitech.wikimedia.org/wiki/EventStreams>. This service allows us
to publish arbitrary streams of JSON event data to the public. Initially,
the only stream available will be good ol’ RecentChanges
<https://www.mediawiki.org/wiki/Manual:RCFeed>. This event stream overlaps
functionality already provided by irc.wikimedia.org and RCStream
<https://wikitech.wikimedia.org/wiki/RCStream>. However, this new service
has advantages over these (now deprecated) services.
1. We can expose more than just RecentChanges.
2. Events are delivered over streaming HTTP (chunked transfer) instead of
IRC or socket.io. This requires less client-side code and fewer special
routing cases on the server side.
3. Streams can be resumed from the past. By using EventSource, a
disconnected client will automatically resume the stream from where it left
off, as long as it resumes within one week. In the future, we would like
to allow users to specify historical timestamps from which they would like
to begin consuming, if this proves safe and tractable.
I did say deprecated! Okay okay, we may never be able to fully deprecate
irc.wikimedia.org. It’s used by too many (probably sentient by now) bots
out there. We do plan to obsolete RCStream, and to turn it off in a
reasonable amount of time. The deadline iiiiiis July 7th, 2017. All
services that rely on RCStream should migrate to the HTTP-based
EventStreams service by this date. We are committed to assisting you in
this transition, so let us know how we can help.
Unfortunately, unlike RCStream, EventStreams does not have server-side
event filtering (e.g. by wiki) quite yet. Whether and how this should be
done is still under discussion <https://phabricator.wikimedia.org/T152731>.
The RecentChanges data you are used to remains the same, and is available
at https://stream.wikimedia.org/v2/stream/recentchange. However, we may
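For example, consuming it from Python might look like this (a sketch using
the third-party sseclient package; the field names come from the
RecentChanges event schema):

# Sketch: consume the RecentChanges stream over HTTP/SSE.
# Requires the third-party sseclient package (pip install sseclient).
import json
from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
# On disconnect, EventSource reconnects and sends the Last-Event-ID
# header, so the stream resumes where it left off (within one week).
for event in EventSource(url):
    if event.event == 'message' and event.data:
        change = json.loads(event.data)
        print('{user} edited {title} on {wiki}'.format(**change))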
However, we may have something else you will find useful. We have been
internally producing new MediaWiki-specific events
<https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema…>
for a while now, and could expose these via EventStreams as well.
Take a look at these events, and tell us what you think. Would you find
them useful? How would you like to subscribe to them? Individually as
separate streams, or would you like to be able to compose multiple event
types into a single stream via an API? These things are all possible.
I asked for a lot of feedback in the above paragraphs. Let’s try and
centralize this discussion over on the mediawiki.org EventStreams talk page
<https://www.mediawiki.org/wiki/Talk:EventStreams>. In summary, the
questions are:
- What RCStream clients do you maintain, and how can we help you migrate
to EventStreams? <https://www.mediawiki.org/wiki/Topic:Tkjkee2j684hkwc9>
- Is server-side filtering, by wiki or arbitrary event field, useful to
you? <https://www.mediawiki.org/wiki/Topic:Tkjkabtyakpm967t>
- Would you like to consume streams other than RecentChanges?
<https://www.mediawiki.org/wiki/Topic:Tkjk4ezxb4u01a61> (Currently
available events are described here
<https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema…>.)
Thanks!
- Andrew Otto
Hi,
Does anyone know of a way to look up the top editors for a certain
namespace (like "Module") across all Wikimedia sites?
I'm asking as I'm wondering how to get more aware of developer activity
outside of Wikimedia Git/Gerrit.
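(The closest I have come up with so far is a per-wiki query on the Labs
replicas - an untested sketch, assuming the Module namespace ID is 828
everywhere - which would still need looping over all wikis:)

# Untested sketch: top editors in the Module namespace (828) on one
# wiki's Labs replica; would need to be repeated for every wiki.
import os
import pymysql

conn = pymysql.connect(
    host='enwiki.labsdb', db='enwiki_p',
    read_default_file=os.path.expanduser('~/replica.my.cnf'))
with conn.cursor() as cur:
    cur.execute("""
        SELECT rev_user_text, COUNT(*) AS edits
        FROM revision JOIN page ON rev_page = page_id
        WHERE page_namespace = 828
        GROUP BY rev_user_text
        ORDER BY edits DESC
        LIMIT 20""")
    for user, edits in cur.fetchall():
        print(user, edits)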
Thanks for any ideas (or pointing out a better place to ask)!
Cheers,
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
Hello! Has anyone experienced issues `mvn package`-ing
analytics/refinery/source on a local machine?
The Wikimedia Analytics Refinery Jobs module fails to build for me as of
"Add mediawiki history spark jobs to refinery-job"
(https://gerrit.wikimedia.org/r/#/c/325312/).
Here's my `mvn --version`:
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T08:41:47-08:00)
Maven home: /usr/local/Cellar/maven/3.3.9/libexec
Java version: 1.8.0_121, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_121.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.12.4", arch: "x86_64", family: "mac"
When I set HEAD to the commit prior to that one, everything succeeds. Any
commit after it makes the Jobs tests fail with warnings and errors like:
[INFO] Checking for multiple versions of scala
[WARNING] Expected all dependencies to require Scala version: 2.10.4
[WARNING] com.twitter:chill_2.10:0.5.0 requires scala version: 2.10.4
[WARNING] org.spark-project.akka:akka-actor_2.10:2.2.3-shaded-protobuf requires scala version: 2.10.4
[WARNING] org.spark-project.akka:akka-remote_2.10:2.2.3-shaded-protobuf requires scala version: 2.10.4
[WARNING] org.spark-project.akka:akka-slf4j_2.10:2.2.3-shaded-protobuf requires scala version: 2.10.4
[WARNING] org.apache.spark:spark-core_2.10:1.6.0-cdh5.10.0 requires scala version: 2.10.4
[WARNING] org.json4s:json4s-jackson_2.10:3.2.10 requires scala version: 2.10.4
[WARNING] org.json4s:json4s-core_2.10:3.2.10 requires scala version: 2.10.4
[WARNING] org.json4s:json4s-ast_2.10:3.2.10 requires scala version: 2.10.4
[WARNING] org.json4s:json4s-core_2.10:3.2.10 requires scala version: 2.10.0
[WARNING] Multiple versions of scala libraries detected!
TestDenormalizedRevisionsBuilder:
17/03/31 13:38:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
populateDeleteTime
java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:317)
  at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219)
  at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
  at org.apache.spark.io.SnappyCompressionCodec$.liftedTree1$1(CompressionCodec.scala:169)
  at org.apache.spark.io.SnappyCompressionCodec$.org$apache$spark$io$SnappyCompressionCodec$$version$lzycompute(CompressionCodec.scala:168)
  at org.apache.spark.io.SnappyCompressionCodec$.org$apache$spark$io$SnappyCompressionCodec$$version(CompressionCodec.scala:168)
  at org.apache.spark.io.SnappyCompressionCodec.<init>(CompressionCodec.scala:152)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:72)
  at org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:65)
  at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
  at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
  at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
  at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
  at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1334)
  at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:924)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:923)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:923)
  at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:861)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1611)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1603)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1592)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
  at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
  at java.lang.Runtime.loadLibrary0(Runtime.java:870)
  at java.lang.System.loadLibrary(System.java:1122)
  at org.xerial.snappy.SnappyNativeLoader.loadLibrary(SnappyNativeLoader.java:52)
  ... 33 more
- should put max rev ts when no page state match *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.reflect.InvocationTargetException
Team:
We need to do some maintenance work on the eventlogging database that
requires renaming and archiving all tables that are currently receiving
events.
Tables will be renamed, for example, from WikipediaZeroUsage_14574251 to
WikipediaZeroUsage_14574251_15423246, where 15423246 is the capsule schema
version.
As new events come in for the 'WikipediaZeroUsage' schema with schema
version 14574251, the table WikipediaZeroUsage_14574251 will be recreated;
the only difference is that the new table will have different column
lengths for its varchar fields.
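To illustrate the naming pattern (illustrative snippet only):

# Illustrative only: how an archived eventlogging table name is formed.
def archived_table_name(schema, schema_rev, capsule_rev=15423246):
    current = '{}_{}'.format(schema, schema_rev)
    return '{}_{}'.format(current, capsule_rev)

print(archived_table_name('WikipediaZeroUsage', 14574251))
# -> WikipediaZeroUsage_14574251_15423246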
Details of this change are here: https://phabricator.wikimedia.org/T160454
We will start this maintenance on Thursday morning around 11 am PST, and
it should last a couple of hours. No eventlogging data will be lost as
part of this event.
Reportupdater queries that use eventlogging would need to be updated to
select from the newly renamed tables (and we will take care of doing that),
but other than those we do not think there are any automated scripts that
need updating. Please let us know otherwise.
Also, please be so kind as to let us know if this renaming is disruptive in
any way. We can also make these changes later next week.
Thanks,
Nuria
Hi Shannon,
I'm not sure about the Geographical Distribution, but I agree that the
information at the top of the page seems to imply that the data is from
2013.
Regretfully, my retention query (https://quarry.wmflabs.org/query/17500)
was killed because it was running too long. I'll start a new one on our
analytics servers. I'm flying to the same conference you are today, so
hopefully I'll be able to give you some good news when I land (tomorrow
afternoon-ish).
-Aaron
On Tue, Mar 28, 2017 at 6:25 PM, Shannon Keith <shannon(a)williamsworks.com>
wrote:
> Thank you all, this has been very helpful.
>
>
>
> A couple follow-up questions:
>
> · The geographical distribution
> <https://web.archive.org/web/20161024063241/https:/stats.wikimedia.org/wikim…>
> is from 2013, correct?
>
> · Would someone be able to create an updated graph of editor
> retention on English Wikipedia up to 2016? It doesn’t have to be just
> ‘good-faith’ editors; I know that would be much more work. This is the one
> we currently have, but it would be great to have it show recent years:
> https://meta.wikimedia.org/wiki/Research:The_Rise_and_Decline#/media/File:Desirable_newcomer_survival_over_time.png
>
> This will be part of a presentation from the strategy team, so if it’s
> possible to do by this Thursday, I would be so grateful.
>
>
>
> *From:* Aaron Halfaker [mailto:ahalfaker@wikimedia.org]
> *Sent:* Thursday, March 23, 2017 7:59 AM
> *To:* A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics. <analytics(a)lists.wikimedia.org>
> *Cc:* Shannon Keith <shannon(a)williamsworks.com>
> *Subject:* Re: [Analytics] Fwd: follow-up on editors
>
>
>
> Note that "editing session" is not at all similar to an "edit session"
> from this:
>
>
>
> Geiger, R. S., & Halfaker, A. (2013, February). Using edit sessions to
> measure participation in Wikipedia. In *Proceedings of the 2013
> conference on Computer supported cooperative work* (pp. 861-870). ACM.
> http://www-users.cs.umn.edu/~halfak/publications/Using_Edit_Sessions_to_Measure_Participation_in_Wikipedia/geiger13using-preprint.pdf
>
>
>
> On Thu, Mar 23, 2017 at 9:55 AM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>
> >· Average hours spent by editors by segment (5+ edits and 100+ edits)?
>
> We do not keep track of session length per editor, thus this data is not
> available.
>
> The most similar thing I can think of is the length of an editing session,
> which is quite different. That is a heavily sampled metric reported via
> eventlogging; some related data is reported here:
> https://edit-analysis.wmflabs.org/compare/
>
>
>
> On Wed, Mar 22, 2017 at 5:03 PM, Federico Leva (Nemo) <nemowiki(a)gmail.com>
> wrote:
>
> Aaron Halfaker, 22/03/2017 22:43:
>
> · Number of editors who contribute 1 edit per month?
>
>
> First column of https://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm .
>
> · Is it possible/feasible to run editor retention metrics
> globally (versus just based on a single project)?
>
>
> This depends on whether one just wants to (deduplicate and) sum different
> projects, or also consider interwiki events (such as a person stopping
> activity on a wiki but resuming activity on another wiki). I remember
> something was done a few years ago to see if Wikidata removed active
> editors from other projects and a few "migration" paths were identified in
> all directions. I can't find the chart/table now though.
>
> · Total number of editors on all projects over the past 16
> years (not just ENWP)?
>
>
> For a quick estimate I usually make a proportion:
> https://stats.wikimedia.org/EN/TablesWikipediaEN.htm#editor_activity_levels
> : https://stats.wikimedia.org/EN/TablesWikimediaAllProjects.htm
> ~ https://stats.wikimedia.org/EN/TablesWikipediaEN.htm#editdistribution : x .
> If the active editors in most classes are about 1 : 2, then probably the
> total number of editors in all Wikipedias + all other projects is more than
> twice the English Wikipedia's total, e.g. over 10 million (or over 2 million
> if you consider the usual 10-edit threshold).
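> In code, the crude scaling above is just the following (all figures are
> placeholders; read the real values off the linked tables):
>
> # Placeholder figures only - take the real values from the linked tables.
> en_active = 30000     # enwiki active editors (editor_activity_levels)
> all_active = 60000    # all-projects active editors (the ~1 : 2 ratio)
> en_total = 5000000    # enwiki total editors (editdistribution)
> x = en_total * all_active / en_active
> print(x)              # crude estimate of total editors across all projects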
>
>
> · Global distribution of editors by region (or country),
> 2016
>
>
> https://web.archive.org/web/20161024063241/https://stats.wikimedia.org/wikimedia/squids/SquidReportPageEditsPerCountryOverview.htm
> + https://web.archive.org/web/20161002042707/https://stats.wikimedia.org/wikimedia/squids/SquidReportPageEditsPerCountryBreakdown.htm
> ?
>
> Nemo
Forwarding.
Pine
---------- Forwarded message ----------
From: Leila Zia <leila(a)wikimedia.org>
Date: Tue, Mar 28, 2017 at 10:36 AM
Subject: [Wiki-research-l] Research Scientist position at WMF
To: Research into Wikimedia content and communities <wiki-research-l(a)lists.wikimedia.org>
Hi all,
The Research team at the Wikimedia Foundation has just opened a full-time
research scientist position
<https://boards.greenhouse.io/wikimedia/jobs/640434?gh_src=7t836o1#.WNqPliHyvCI>.
In the past years, the team has worked on a variety of projects, including:
building ML-based scoring systems for Wikipedia and Wikidata
<https://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/>,
recommendation systems for article creation
<https://blog.wikimedia.org/2016/04/27/article-recommendation-system/>,
models to detect harassment and personal attacks
<https://blog.wikimedia.org/2017/02/07/scaling-understanding-of-harassment/>,
and more. We are looking to add one more full-time role to our team to
expand our research capacity and strengthen our collaborations with
academia and industry.
If this is the kind of job you're interested in, please consider applying.
If you know people in your network who may be a good fit, please encourage
them to apply.
Best,
Leila
--
Leila Zia
Senior Research Scientist
Wikimedia Foundation