Analytics March 2016

analytics@lists.wikimedia.org

33 participants
23 discussions

Research showcase: Evolution of privacy loss in Wikipedia
by Dario Taraborelli 17 Mar '16

This month, our research showcase <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#March_2016> hosts Andrei Rizoiu (Australian National University) to talk about his work <http://cm.cecs.anu.edu.au/post/wikiprivacy/> on *how private traits of Wikipedia editors can be exposed from public data* (such as edit histories) using off-the-shelf machine learning techniques (abstract below).

If you're interested in learning what the combination of machine learning and public data means for privacy and surveillance, come and join us this *Wednesday March 16* at *1pm Pacific Time*.

The event will be recorded and publicly streamed <https://www.youtube.com/watch?v=Xle0oOFCNnk>. As usual, we will be hosting the conversation with the speaker and Q&A on the #wikimedia-research channel on IRC.

Looking forward to seeing you there,
Dario

Evolution of Privacy Loss in Wikipedia

The cumulative effect of collective online participation has an important and adverse impact on individual privacy. As an online system evolves over time, new digital traces of individual behavior may uncover previously hidden statistical links between an individual’s past actions and her private traits. To quantify this effect, we analyze the evolution of individual privacy loss by studying the edit history of Wikipedia over 13 years, including more than 117,523 different users performing 188,805,088 edits. We trace each Wikipedia’s contributor using apparently harmless features, such as the number of edits performed on predefined broad categories in a given time period (e.g. Mathematics, Culture or Nature). We show that even at this unspecific level of behavior description, it is possible to use off-the-shelf machine learning algorithms to uncover usually undisclosed personal traits, such as gender, religion or education. We provide empirical evidence that the prediction accuracy for almost all private traits consistently improves over time.

Surprisingly, the prediction performance for users who stopped editing after a given time still improves. The activities performed by new users seem to have contributed more to this effect than additional activities from existing (but still active) users. Insights from this work should help users, system designers, and policy makers understand and make long-term design choices in online content creation systems.

*Dario Taraborelli*
Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter <http://twitter.com/readermeter>
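To make the kind of feature described in the abstract concrete, here is a minimal sketch of counting edits per broad category up to a given point in time. The edit log, user names, and the `category_counts` helper are hypothetical and only illustrate the idea; this is not the study's code or data, and the classification step itself is left out.

```python
from collections import Counter, defaultdict

# Hypothetical edit log: (user, year, broad category). Not data from the study.
EDITS = [
    ("alice", 2005, "Mathematics"),
    ("alice", 2005, "Mathematics"),
    ("alice", 2006, "Culture"),
    ("bob", 2005, "Nature"),
    ("bob", 2006, "Nature"),
    ("bob", 2006, "Culture"),
]
CATEGORIES = ["Mathematics", "Culture", "Nature"]


def category_counts(edits, up_to_year):
    """Return {user: [edit count per category]} for edits made up to up_to_year."""
    counts = defaultdict(Counter)
    for user, year, category in edits:
        if year <= up_to_year:
            counts[user][category] += 1
    return {user: [c[cat] for cat in CATEGORIES] for user, c in counts.items()}


# The same users described at two points in time: the vectors grow as the
# system accumulates more history.
features_2005 = category_counts(EDITS, 2005)
features_2006 = category_counts(EDITS, 2006)
print(features_2005)
print(features_2006)
```

Vectors like these could then be passed to any off-the-shelf classifier; which one the authors used is not stated in the announcement.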

All the nodes in the Analytics Hadoop cluster will be rebooted today
by Luca Toscano 16 Mar '16

Hi folks! Due to a kernel upgrade for a security fix we need to reboot each node of the Hadoop cluster. The task will be started later on today and it will be done in small batches to avoid causing major delays to outstanding jobs. Please contact me if you notice any major issue (elukey or the #wikimedia-analytics channel on freenode). Thanks! Regards, Luca

Upcoming reboots of stat1002/1003
by Moritz Muehlenhoff 15 Mar '16

Hi, here's a heads-up for people with long-running queries/jobs: I need to reboot stat1002 and stat1003 for a kernel update tomorrow morning (16th of March) at 9am UTC. Please ping me if that happens to be a really bad time and we can possibly reschedule. Cheers, Moritz

Fwd: Usage of correct diacritics in readers of Romanian Wikipedia
by Strainu 15 Mar '16

Hi,

I'm forwarding this email here, in the hope I can gather more feedback and to explore whether Event Logging could be the right choice for gathering the data (the NDA for access does not look good :P)

Thanks,
Strainu

---------- Forwarded message ----------
From: Strainu <strainu10(a)gmail.com>
Date: 2016-03-11 14:44 GMT+02:00
Subject: Usage of correct diacritics in readers of Romanian Wikipedia
To: mobile-l(a)lists.wikimedia.org

Hi,

I have proposed a new research project about the support for correct diacritics in the readers of the Romanian Wikipedia [1]. The plan I made is (probably) limited to the desktop site, but Adam suggested there might be some overlap with the work you are doing around emerging communities.

So, if someone is interested in extending the study to mobile users or you have any feedback on the project, please leave a message on the talk page or contact me by email.

Thanks,
Strainu

P.S. Please keep me in the CC for any responses, as I don't get emails from mobile-l.

[1] https://meta.wikimedia.org/wiki/Research:Usage_of_correct_diacritics_in_rea…

[Eventlogging] Dropping Client IPs from EventCapsule
by Madhumitha Viswanathan 10 Mar '16

Hi all,

The analytics team, in an effort to reduce the amount of sensitive data we collect, plans to drop the clientIP field from the EventCapsule (https://meta.wikimedia.org/wiki/Schema:EventCapsule), which is the wrapper for all events flowing into Eventlogging. (Currently IPs and User Agents get purged after the 90-day mark.)

The field was originally meant only for debugging, but has served some research use cases. Most of these cases have been wrapped up at this point. It has also been used as a proxy to count the number of devices visiting sites like our blog. Since IPs are not a good measure of that anyway, we plan to move such cases to Piwik.

The rollout of the change will happen in stages (drop clientIPs first on the EL end, then in the EventCapsule on meta, and finally on the VarnishKafka end). It should be a clean deployment and there's no scheduled downtime - EL will keep working as is.

What does change? ClientIPs will start being set to NULL in your MySQL tables. If you update an Eventlogging schema you maintain, causing new tables to be created, the new tables will not have the clientIp field in them.

The change is planned to be rolled out the week of 11th or 18th March '16, pending the completion of data collection for the ongoing QuickSurveys-based research work. Let us know if you have any questions/concerns on the list or on #wikimedia-analytics. The related phab ticket is here - https://phabricator.wikimedia.org/T128407.

Thanks,
Madhu Viswanathan
Software Engineer, Analytics
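As a rough illustration of the two outcomes described above (NULL in existing tables, no column at all in newly created ones), here is a minimal sketch. The capsule contents and both helper functions are hypothetical, and the `clientIp` spelling is taken from the announcement; this is not the actual EventLogging code.

```python
import json


def null_client_ip(capsule: dict) -> dict:
    """Return a copy with clientIp set to None (NULL in MySQL), as in existing tables."""
    scrubbed = dict(capsule)
    if "clientIp" in scrubbed:
        scrubbed["clientIp"] = None
    return scrubbed


def drop_client_ip(capsule: dict) -> dict:
    """Return a copy without the clientIp field, as in newly created tables."""
    return {key: value for key, value in capsule.items() if key != "clientIp"}


# Hypothetical capsule: only clientIp is named in the announcement, the other
# fields are placeholders for illustration.
capsule = {
    "schema": "ExampleSchema",
    "clientIp": "198.51.100.7",
    "event": {"action": "click"},
}

print(json.dumps(null_client_ip(capsule), sort_keys=True))
print(json.dumps(drop_client_ip(capsule), sort_keys=True))
```

Both helpers return copies, so the original event is left untouched for any consumer that still needs it during a staged rollout.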

[wmf.webrequest data] one-time access
by Michal Bystricky 08 Mar '16

Hello Analytics Team, We would like to have one-time access to wmf.webrequest data. What is the correct way of accessing the data? In our research group, we want to simulate the requests for a specific version of WikiMedia. Thanks, Michal Bystricky

Spotify Kafka -> Google Pub/Sub article
by Andrew Otto 07 Mar '16

https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the… Interesting! Especially their experiences with MirrorMaker.

Hadoop - Last week data needs to be backfilled
by Joseph Allemandou 07 Mar '16

Hi,

*TL;DR: Please don't use hive / spark / hadoop before next week.*

Last week the Analytics Team performed an upgrade to the Hadoop Cluster. It went reasonably well, except that many of the hadoop processes were launched with a special option to NOT use utf-8 as the default encoding. This issue caused trouble particularly in page title extraction and was detected last Sunday (many kudos to the people who filed bugs on the Analytics API about encoding :)

We found the bug and fixed it yesterday, and backfill starts today, with the cluster recomputing every dataset from 2016-02-23 onward. This means you shouldn't query last week's data during this week, first because it is incorrect, and second because you'll curse the cluster for being too slow :)

We are sorry for the inconvenience. Don't hesitate to contact us if you have any questions.

--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
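To illustrate why a non-UTF-8 default encoding corrupts page titles, here is a minimal sketch. The message does not say which encoding the processes fell back to; Latin-1 is an assumption chosen purely for illustration, and the title is a made-up example.

```python
# A page title containing a non-ASCII character, stored as UTF-8 bytes.
title = "Amélie_(film)"
raw_bytes = title.encode("utf-8")

# Decoding those bytes with the wrong default encoding (assumed here to be
# Latin-1) does not fail, it silently yields a different string.
garbled = raw_bytes.decode("latin-1")
print(garbled)  # AmÃ©lie_(film)

# Each byte of the two-byte UTF-8 sequence became its own character, so the
# garbled title is longer and no longer matches the real one.
print(garbled == title)            # False
print(len(title), len(garbled))    # 13 14

# In this specific case the damage is reversible, but only if the wrong
# encoding is known and nothing else touched the string in between.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == title)           # True
```

Since the decode step raises no error, a problem like this tends to surface only downstream, for example when titles stop matching, which is consistent with it being noticed through bug reports rather than job failures.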

New maintenance window Mar 4 1 - 4 pm UTC (was Re: dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC)
by Ariel Glenn WMF 04 Mar '16

Fallback is: cable up the old 1GB nic (Chris has done this and set up the port), PXE install on that, move to 10gb NIC once we're back up. Gross but it gets the job done.

Set for tomorrow (Friday) 1 to 4 pm UTC, this time should be much smoother. Same caveats apply as before.

Ariel

On Wed, Mar 2, 2016 at 8:47 PM, Ariel Glenn WMF <ariel(a)wikimedia.org> wrote:

> PXE boot from non-embedded nic failed spectacularly despite our best
> efforts. This means we'll have to schedule another window once we have
> something new to try. I apologize for the extra inconvenience. All services
> are back exactly the way they were.
>
> Ariel
>
> On Wed, Mar 2, 2016 at 6:01 PM, Ariel Glenn WMF <ariel(a)wikimedia.org>
> wrote:
>
>> Extending this downtime window because we ran into unexpected issues with
>> PXE boot.
>>
>> On Tue, Mar 1, 2016 at 3:53 PM, Ariel Glenn WMF <ariel(a)wikimedia.org>
>> wrote:
>>
>>> Dataset1001, the host which serves dumps and other datasets to the
>>> public, as well as providing access to various datasets directly on
>>> stats100x, will be unavailable tomorrow for an upgrade to jessie. While I
>>> don't expect to need nearly 3 hours for the upgrade, better safe than
>>> sorry. In the meantime all files will be accessible via
>>> ms1001.wikimedia.org via the web, and all dumps and page view files
>>> from our mirrors as well.
>>>
>>> Thanks for your understanding.
>>>
>>> Ariel Glenn

Re: [Analytics] Requesting access to Wikimedia Pageview Dumps for Research
by Nuria Ruiz 03 Mar '16

cc-ing Analytics list and Ariel who maintains dumps.

On Wed, Mar 2, 2016 at 8:31 AM, Gonzalo Diaz <gonzalo.diaz(a)cs.ox.ac.uk> wrote:

> Dear Nuria Ruiz,
>
> My name is Gonzalo Diaz, and I am a PhD student of Computer Science at the
> University of Oxford. You can see my profile here:
> https://www.cs.ox.ac.uk/people/gonzalo.diaz/
>
> I am writing because I am currently working on a research project which
> would benefit from processing Wikipedia pagecount files.
>
> On Monday, 29 February 2016, we began downloading pagecount files from
> http://dumps.wikimedia.org/other/pagecounts-raw/. For the next 48 hours
> we managed to download ~15 months of raw pagecount files, using 3 different
> computers, and 3 instances of "wget" on each computer (for a total of 9
> concurrent downloads at any given moment).
>
> Since this morning, however, we are no longer able to download the
> pagecount files. Furthermore, the site dumps.wikimedia.org seems down.
>
> Hopefully, our downloads are not responsible for this. If they are,
> however, we would like to apologise for the inconvenience.
>
> In any case, we would like to request permission to continue downloading
> the raw pagecount files, as soon as the site is back online.
>
> I thank you very much for your time!
>
> Kindest regards,
> Gonzalo Diaz
> John Mittermeier
