After a break in September, we’re resuming our monthly Research and Data showcase. The next showcase will be live-streamed tomorrow, Wednesday, October 15, at 11:30 PT. As usual, you can join the conversation via IRC on freenode.net by joining the #wikimedia-research channel.
We look forward to seeing you there,
Dario
This month:
Emotions under Discussion: Gender, Status and Communication in Wikipedia
By David Laniado: I will present a large-scale analysis of emotional expression and communication style of editors in Wikipedia discussions. The talk will focus especially on how emotion and dialogue differ depending on the status, gender, and communication network of the roughly 12,000 editors who have written at least 100 comments on the English Wikipedia's article talk pages. The analysis is based on three different predefined lexicon-based methods for quantifying emotions: ANEW, LIWC and SentiStrength. The results unveil significant differences in the emotional expression and communication style of editors according to their status and gender, and can help to address issues such as gender gap and editor stagnation.
Wikipedia as a socio-technical system
By Aaron Halfaker: Wikipedia is a socio-technical system. In this presentation, I'll explain how the integration of human collective behavior ("social") and information technology ("technical") has led to a phenomenon that, while being massively productive, is poorly understood due to lack of precedent. Based on my work in this area, I'll describe five critical functions that healthy, Wikipedia-like socio-technical systems must serve in order to continue to function: allocation, regulation, quality control, community management and reflection. Next I'll argue the Wikimedia Foundation's analytics strategy currently focuses on outcomes related to a relatively narrow aspect of system health and all but completely ignores productivity. Finally, I'll conclude with an overview of three classes of new projects that should provide critical opportunities to both practically and academically understand the maintenance of Wikipedia's socio-technical fitness.
Hi all --
I have some good news to share. At the beginning of the month, we announced
that mobile page views were available on our servers. Somewhat belatedly, I
have follow-up information available here:
https://wikitech.wikimedia.org/wiki/Analytics/Pagecounts-all-sites
Thanks to the analytics development and WMF operations teams, whose
good work made this happen.
We are in the process of reaching out to developers who consume this data
to let them know about the new stream.
We have a follow-up question -- we are considering leaving the original
stream up until the end of the quarter. Does this seem reasonable? It is a
maintenance burden as it is created via different infrastructure.
thanks,
-Toby
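For developers who consume the stream, here is a minimal Python sketch of totalling views per project from pagecounts-style lines. It assumes the new files keep the four space-separated fields of the existing hourly pagecounts files (project code, page title, view count, bytes); that assumption should be checked against the wikitech page linked above. The sample lines and the project codes in them are made up for illustration.

```python
from collections import Counter


def parse_line(line):
    """Split one pagecounts-style line into (project, title, views, size)."""
    # Assumed layout: "<project> <title> <views> <bytes>", single spaces.
    project, title, views, size = line.rstrip("\n").split(" ")
    return project, title, int(views), int(size)


def views_per_project(lines):
    """Sum the view counts of all lines, grouped by project code."""
    totals = Counter()
    for line in lines:
        project, _title, views, _size = parse_line(line)
        totals[project] += views
    return totals


# Made-up sample lines; real files are far larger.
sample = [
    "en Main_Page 120 4096",
    "en.m Main_Page 80 2048",
    "de Hauptseite 30 1024",
    "en Example 5 512",
]
totals = views_per_project(sample)
```

Reading line by line keeps memory use flat, which matters because the hourly files can be large.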
(+CC: Analytics)
On Wed, Oct 15, 2014 at 8:54 AM, Giuseppe Lavagetto
<glavagetto(a)wikimedia.org> wrote:
> So, the somewhat-anticipated new SSL vulnerability is out, CVE-2014-3566.
> To make a long story short:
> - SSL 3.0 operating with CBC-mode ciphers is vulnerable to this
> attack and could allow an attacker to get plaintext info on the
> traffic being exchanged with the server
> - It's fairly easy for an attacker to induce a downgrade from a more
> modern version of TLS to SSL 3.0
>
> There are some patches that would eliminate the downgrading issues
> *for chrome users only at the moment*, but I'm not that happy with the
> idea of patching openssl and maintaining the patch.
>
> Another possibility would be to force use of RC4 in SSL 3.0, which as
> Brandon put it on IRC is almost like using rot13.
>
> So, the easiest (and best) way of getting rid of this vulnerability
> (and a bunch of others, to be honest) would be to drop SSL 3.0
> support. That would mean dropping HTTPS support for IE6 users, which
> is a decision we can't make lightly, but keeping SSL 3.0 exposes the
> vast majority of our users to this vulnerability.
>
> What should we do? This is not as serious as heartbleed but shouldn't
> be taken lightly anyway.
So to sum it up again:
SSL 3.0 is vulnerable/weak, enabling RC4 doesn't help much. Keeping
SSL 3.0 enabled with RC4 for clients that only support SSL 3.0 (mostly
IE 6 and below it seems) -might- be acceptable, if we could warn those
users when they use it, and if we can prevent every other browser from
downgrading to it. But that doesn't seem possible at the moment;
Google's TLS_FALLBACK_SCSV needs all clients and servers patched,
which will obviously take a long while and has maintainability issues.
It also depends on whether OpenSSL and/or Debian will incorporate this
nonstandard patch... at least this affects a lot of people.
Disabling SSL 3.0 could break SSL completely for something on the
order of 1.5% of HTML page requests, according to
http://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm
Perhaps the Analytics team could do some additional, specific
investigation on this to aid this decision?
I think if we can't allow secure HTTPS by keeping SSL 3.0 enabled, we
should probably disable it even if it breaks a small number of
requests. 1.5% doesn't seem insignificant though. Obviously we'd need
to communicate that very well, as we don't seem to have good options
for doing any sort of fallback to HTTP. I guess it also depends on
what the public awareness of this issue will be...
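To make "disable SSL 3.0" concrete, here is a minimal sketch using Python's standard ssl module. It only illustrates the idea of refusing the protocol on the server side; it is not the configuration of our HTTPS terminators, and the function name is made up for the example.

```python
import ssl


def make_server_context():
    """Build a server-side context that refuses SSL 3.0 handshakes."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Recent Python versions already set this flag by default; setting it
    # explicitly documents the intent and is harmless if it is already set.
    ctx.options |= ssl.OP_NO_SSLv3
    return ctx


ctx = make_server_context()
sslv3_disabled = bool(ctx.options & ssl.OP_NO_SSLv3)
```

A client that only speaks SSL 3.0 fails the handshake against such a context. That is exactly the breakage discussed above, and nothing in this sketch offers a fallback to plain HTTP.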
Hi,
in the week from 2014-10-06–2014-10-12 Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* ULSFO outage affecting webrequest logs (Bug 71876, Bug 71879)
* Revoked default Push grant for Analytics on gerrit's analytics/* projects
* Wikimetrics showing many requests to internal files
* Counting pageviews for the pages “undefined” / “Undefined” (Bug 66532)
* Counting redirect pageviews for Webstatscollector (Bug 71790)
* Reworking webstatscollector's build system
* Puppetization of MaxMind's Connection Type databases
* Wikihadoop now available on the Analytics Cluster
* Analytics Mini-Hackathon in San Francisco
(details below)
Have fun,
Christian
* ULSFO outage affecting webrequest logs (Bug 71876, Bug 71879)
It seems there have been connection issues from ULSFO, which caused a
minor hiccup in the webrequest logs on both udp2log and kafka [1]. Due
to kafka's buffering, kafka could nicely bridge the shorter dropouts,
and in total only a few minutes of data have been lost on kafka, while
udp2log was shaky for up to 2 hours.
* Revoked default Push grant for Analytics on gerrit's analytics/* projects
By default, all Analytics members had Push permission on all of
gerrit's analytics/* projects. As accidental pushes caused pain again,
we have now revoked the default Push grant, and made sure that our bots
still have the permissions they need to do their duty.
* Wikimetrics showing many requests to internal files
A fix for the mis-redirection of those monitoring requests has been
implemented (but it's not yet deployed).
* Counting pageviews for the pages “undefined” / “Undefined” (Bug 66532)
A brief increase in requests for the pages “undefined” and “Undefined”
impacted pageview trend graphs. So after the initial push-back that
bug 66532 received, it was picked up again, and we prepared patches
for both the C and Hive implementation of webstatscollector's pageview
definition to not count such requests. Deployment of those patches is
likely to happen around 2014-10-15.
* Counting redirect pageviews for Webstatscollector (Bug 71790)
The webstatscollector pageview definition has always counted
redirects, and was hence overcounting.
Since we're about to deploy webstatscollector anyway, we prepared
changes to fix this longstanding miscounting.
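The two pageview definition changes above (dropping “undefined” / “Undefined” and no longer counting redirects) can be sketched in Python as follows. This is not the C or Hive code from the patches. The function name is hypothetical, and it assumes that a redirect can be recognized by a 3xx HTTP status code.

```python
EXCLUDED_TITLES = ("undefined", "Undefined")


def is_countable(title, http_status):
    """Return True if a request should still count as a pageview."""
    # Bug 66532: do not count the bogus "undefined" / "Undefined" pages.
    if title in EXCLUDED_TITLES:
        return False
    # Bug 71790: do not count redirects (assumed here to be 3xx responses).
    if 300 <= http_status < 400:
        return False
    return True


# Made-up requests as (title, status) pairs, for illustration only.
requests = [
    ("Main_Page", 200),
    ("undefined", 200),
    ("Undefined", 200),
    ("Old_Title", 301),
    ("Example", 200),
]
counted = [title for title, status in requests if is_countable(title, status)]
```

Only the two ordinary 200 responses remain in `counted`. The redirect is expected to be counted once, when the client requests the redirect target.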
* Reworking webstatscollector's build system
Fresh compilations of webstatscollector's C implementation gave
executables that segfaulted. So we fixed some NULL dereferences, fixed
the build system, made it capable of compiling with optimization
turned on, and built a rudimentary testsuite for the collector
process. Thereby, we can now again build the collector executable, and
can automatically verify that it's working.
* Puppetization of MaxMind's Connection Type databases
MaxMind's Connection Type (NetSpeed) databases have been
puppetized. They are available, for example, on stat1002 and stat1003
at
  /usr/share/GeoIP/GeoIPNetSpeedCell.dat
  /usr/share/GeoIP/GeoIPNetSpeed.dat
* Wikihadoop now available on the Analytics Cluster
This allows for easier parsing of MediaWiki XML revision dumps.
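To illustrate what parsing a revision dump involves, here is a small single-machine Python sketch that streams revisions out of dump-like XML. It does not use Wikihadoop and is no substitute for it on the cluster. The sample document is made up and far simpler than a real dump.

```python
import io
import xml.etree.ElementTree as ET

# Made-up, heavily simplified stand-in for a MediaWiki XML revision dump.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id><timestamp>2014-10-01T00:00:00Z</timestamp></revision>
    <revision><id>2</id><timestamp>2014-10-02T00:00:00Z</timestamp></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}' if present."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream):
    """Yield (page title, revision id, timestamp) one revision at a time."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = local_name(elem.tag)
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            # Only look at direct children, so nested ids are ignored.
            fields = {local_name(child.tag): child.text for child in elem}
            yield title, fields.get("id"), fields.get("timestamp")
            elem.clear()  # free the revision's subtree once it is handled


revisions = list(iter_revisions(io.BytesIO(SAMPLE_DUMP)))
```

Streaming with `iterparse` and clearing each revision after use avoids holding the whole file in memory, which is the main difficulty with full-history dumps.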
* Analytics Mini-Hackathon in San Francisco
During this week, the Analytics Mini-Hackathon took place, and
more prototyping around
** Scoop and Oozification
** Streaming data into HDFS
happened, and some time was spent on hunting down the kafkatee issues.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
Hi,
in the week from 2014-09-29–2014-10-05 Andrew, Jeff, and I worked on
the following items around the Analytics Cluster and Analytics related
Ops:
* LDAP server renaming
* Esams bits having problems producing to Kafka
* Wikimetrics showing many requests to internal files
* Released new dataset pagecounts-all-sites
(details below)
Have fun,
Christian
* LDAP server renaming
WMF changed its LDAP server names.
For production machines, this was handled nicely through puppet.
Same for labs instances that run off puppet head.
But many of our instances fall in neither of those two categories. So
we had to bring in the necessary changes by hand, and make sure
nothing broke.
* Esams bits having problems producing to Kafka (Bug 71435)
We were seeing both duplicates and missing log lines from esams bits
caches on 2014-09-28 and 2014-09-29. It appeared that big caches for
Kafka and our ACK setup backfired under high load. So Kafka
caching got adjusted a bit, and we took a closer look at the ACK
handling.
This problem might occur again, so we might do further adjustments
there.
* Wikimetrics showing many requests to internal files (Bug 71606)
Recurring requests to internal resources of Wikimetrics have been
found in the logs. They turned out to be misdirected requests from
monitoring of puppet's apache default setup.
* Released new dataset pagecounts-all-sites
We released pagecounts-all-sites, the first dataset generated on the
Analytics cluster. The announcement is at
https://lists.wikimedia.org/pipermail/analytics/2014-October/002597.html
Ada Lovelace Day is celebrated on October 14 this year.
Augusta Ada King, Countess of Lovelace (born in the year 1815) was a
mathematician and computer programmer who worked on Charles Babbage's
Analytical Engine. She foresaw how computers could evolve into devices that
perform tasks more sophisticated than simple calculations. She is
controversially credited with authoring the world's first computer
program, and certainly worked extensively with Babbage. [1]
Ada Lovelace Day celebrates women's contributions to science, technology,
engineering, and mathematics.
Wikimedia Commons, English Wikipedia, and Persian Wikipedia have designated
a watercolor portrait of Lovelace as a featured picture. [2]
Happy Ada Lovelace Day,
Pine
[1] https://en.wikipedia.org/wiki/Ada_Lovelace
[2] https://commons.m.wikimedia.org/wiki/File:Ada_Lovelace_portrait.jpg
I've noticed in my time on this list that it seems like the reliability of
Wikimedia Analytics services is a bit spotty. I've worked in IT services,
and our web and email servers' reliability seemed pretty good, comparable
to the Wikimedia content delivery services. I'm curious if there is
something about the nature of Analytics services that makes them inherently
fragile, or if there is something about Wikimedia's particular
configuration that is an issue. This isn't intended as criticism; I'm just
curious.
Thanks for your work keeping this place running.
Pine
Thanks for this. I'm forwarding to the Analytics and Research lists.
Pine
---------- Forwarded message ----------
From: Rachel Farrand <rfarrand(a)wikimedia.org>
Date: Mon, Oct 6, 2014 at 1:12 PM
Subject: Re: [Wikitech-l] Tech Talk: The Dashboarding Problem: October 6
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Thank you for the great turnout today!
If you would like to view the recording of the talk, here is the link:
http://www.youtube.com/watch?v=hzMwwLfvh5g
If you have any questions about today's talk please feel free to get in
touch with Dan Andreescu <dandreescu(a)wikimedia.org> and Nuria Ruiz
<nuria(a)wikimedia.org>.
You can check out past tech talk recordings at the MediaWiki YouTube page
here: http://www.youtube.com/channel/UCg4wlhlN8RjP6_e_vMC4CTA
If you would like to nominate future tech talks or see what we have coming
up, go here:
https://www.mediawiki.org/wiki/Project:Calendar/How_to_schedule_an_event/Te…
Thanks!
On Mon, Oct 6, 2014 at 11:03 AM, Rachel Farrand <rfarrand(a)wikimedia.org>
wrote:
> Reminder: This tech talk starts in 1 hour
>
> On Wed, Oct 1, 2014 at 12:01 PM, Rachel Farrand <rfarrand(a)wikimedia.org>
> wrote:
>
>> Please join us for the following tech talk:
>>
>> Tech Talk: *The Dashboarding Problem*
>> Date: October 6
>> Time: 1900 UTC
>> <http://www.timeanddate.com/worldclock/fixedtime.html?msg=Tech+Talk%3A+The+D…>
>> Link to live YouTube stream <http://www.youtube.com/watch?v=hzMwwLfvh5g>
>> IRC channel for questions: #wikimedia-office
>> Google+ page
>> <https://plus.google.com/u/0/b/103470172168784626509/events/ch8uuivq05nqejql…>,
>> another place for questions
>>
>> Talk description:
>> The Analytics team has been busy exploring dashboarding and visualizing
>> editor engagement data. We found that while most people focus on
>> visualization, data access and information architecture are just as
>> important and separate problems.
>> Mike Bostock solved visualization and the design team took care of
>> information architecture, so we built a dashboard around their work.
>> In this talk we share our learnings from developing dashiki, our new
>> dashboard stack. We will talk about why we believe a server-less
>> javascript app was the right architecture for the problem, how with
>> about 900 lines of javascript we transform data into Vega grammar, and
>> how knockout components helped us stay modular.
>>
>> While we'll look at some javascript, the talk is high level, about 30
>> minutes long, and everyone that is interested in dashboarding,
>> visualization, and modularity is welcome to attend.
>>
>> Dashiki Code: https://github.com/wikimedia/analytics-dashiki
>>
>> Editor Dashboard: https://metrics.wmflabs.org/static/public/dash/
>>
>
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l