Amir Aharoni and I thought that this might be interesting for people here.
We wanted to answer the following question: for each language, how many of
the articles in the main namespace that appear in one Wikipedia (e.g., FR)
also appear in another (e.g., EN). We calculated this as the percentage of
articles that exist in both languages out of the total number of articles of
one of the languages (taken from ). That is, we calculated
intersection(EN, FR)/Count(FR). We did this for all of the languages.
1. The co-exist matrix of counts can be found in a Google Spreadsheet
- It was generated on 01/09/2015 using the langlinks table of every wiki.
The underlying query is based on this code (%s is the wiki code):
SELECT '%s' as source, ll_lang as target, COUNT(*) as count FROM
%s_p.langlinks LEFT JOIN %s_p.page
ON page_id = ll_from
WHERE page_namespace = 0
GROUP BY ll_lang;
- The links are not symmetrical. There is on average less than one percent
difference between the links from lang A to B compared to lang B to A.
- However, it wasn't perfect. Wikis with fewer than 3,500 links (which means
the wiki has fewer than 100 articles) have on average more than 20% more out
links (that is, counts taken from that language's langlinks table) than in
links (other wikis pointing at that language).
- As the number of langlinks gets bigger (and, in most cases, the size of
the wiki), the difference and variance between the in and out links get
smaller.
- Some out links pointed to mistakes (zh-cn, zh-tw, nn) - this has been fixed.
- The raw data can be sent on request.
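To make the computation concrete, here is a minimal sketch (the function name and the sample figures are hypothetical, not the script we actually ran) of turning raw langlinks counts into the percentage matrix described above:

```python
# Sketch: build the co-exist percentage matrix from raw langlinks counts.
# All names and sample figures below are made up for illustration.

def coexist_matrix(link_counts, article_counts):
    """link_counts[(src, dst)]: number of src articles with a langlink to
    dst (an approximation of intersection(src, dst));
    article_counts[lang]: total articles in that wiki.
    Returns percent[(src, dst)] = intersection(src, dst) / Count(dst)."""
    percent = {}
    for (src, dst), n in link_counts.items():
        total = article_counts.get(dst)
        if total:  # skip wikis we have no article count for
            percent[(src, dst)] = 100.0 * n / total
    return percent

# Toy example: if 1.2M EN articles link to FR and FR has 1.5M articles,
# intersection(EN, FR) / Count(FR) is 80%.
m = coexist_matrix({('en', 'fr'): 1200000}, {'fr': 1500000})
```

Because the two triangles divide by different article counts, m[('en', 'fr')] and m[('fr', 'en')] would generally differ, which is why the heat map below is not symmetrical.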
2. A heat map of the co-exist percentages for wikis with more than 50,000
articles. It is ordered by size. As I mentioned, the two triangles are not
symmetrical, because the counts (which are themselves not equal but are
close enough) are divided by the number of articles in each wiki. The heat
map ranges from red (high level of congruence) to yellow (low level).
[image: Inline image 1]
Points to notice:
1. Most languages have strong connections with English.
2. There is a group of interconnected wikis that are based on Swedish
(Dutch, Waray-Waray, Cebuano, Vietnamese, Indonesian, Minangkabau).
3. Piedmontese is highly interconnected with the Latin languages, as is
Latin itself. On the other hand, Chechen is mostly connected to Russian.
4. Arabic has 8% more in links than out links. There isn't one wiki that
causes this difference, so it's not a bot.
5. Telugu doesn't have many interlinks, not even to English, Hindi, or
Bengali.
6. There are other visible strong connections (such as Serbian and
Serbo-Croatian), but they are not as surprising.
 meta.wikimedia.org/wiki/List_of_Wikipedias updated on 01/12/2015.
 You might be wondering why we calculated both EN->FR and FR->EN, given
that the interlanguage links in Wikidata are one-to-one. We used the data
from the langlinks table of every Wikipedia and not from the Wikidata
interlanguage links. We did so for two reasons: 1) it was computationally
easier; 2) we wanted to see if there are any irregularities in the data.
This depends on  so we're not going to need that immediately, but in
order to help Erik Zachte with his RfC  to track unique media views in
Media Viewer, I'm going to need to use something almost exactly like
EventLogging. The main difference is that it should skip writing to the
database and write to a log file instead.
That's because we'll be recording around 20-25M image views per day, which
would needlessly overload EventLogging for little purpose since the data
will be used for offline stats generation and doesn't need to be made
available in a relational database. Of course if storage space and
EventLogging capacity were no object, we could just use EL and keep the
ever-growing table forever, but I have the impression that we want to be
reasonable here and only write to a log, since that's what Erik needs.
So here's the question: for a specific schema, can EventLogging work the
way it does but only record hits to a log file (maybe it already does that
before hitting the DB?) and not write to the DB? If not, how difficult
would it be to make EL capable of doing that?
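For illustration only (this is not EventLogging's actual code, and the event field names are assumptions), the behavior I'm after could be as simple as filtering the JSON event stream for one schema and appending matches to a file, never touching the database:

```python
# Hypothetical sketch of a log-only event writer: keep only events of a
# given schema and append them as JSON lines to a log, skipping the DB.
import io
import json

def log_only_writer(events, schema, out):
    """events: iterable of JSON strings; schema: schema name to keep;
    out: writable file-like object receiving one JSON line per match.
    Returns the number of events written."""
    written = 0
    for line in events:
        event = json.loads(line)
        if event.get('schema') == schema:
            out.write(json.dumps(event) + '\n')
            written += 1
    return written

# Toy run against an in-memory "log file".
buf = io.StringIO()
events = [
    json.dumps({'schema': 'MediaViewer', 'event': {'action': 'view'}}),
    json.dumps({'schema': 'SomethingElse', 'event': {}}),
]
n = log_only_writer(events, 'MediaViewer', buf)
```

At 20-25M events per day the real thing would need rotation and buffering, but the core idea is just this filter-and-append.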
I’m in the middle of a (slow) upgrade process for the Hadoop cluster. Currently, we are running CDH 5.0.2, and would like to upgrade to CDH 5.3. There are several steps to this process, the first of which is upgrading our OS to Ubuntu Trusty.
Along the way, I’m replacing our current NameNodes with different hardware. I am ready to do this now. I don’t see much opportunity to schedule this over the next couple of weeks, due to All-Hands travel, so I’d like to schedule this for tomorrow morning (Friday January 16th).
I expect this to be a relatively simple bit of downtime that should only take a few minutes. Just in case, I’d like to reserve a 2-hour window.
So, unless there are serious objections, plan for Hadoop to be offline from
2015-01-16 15:45 - 17:45 UTC
Also, please don’t start jobs before this time slot that you think will take a long time. If there are running jobs, I either can’t shut down the cluster, or I will have to kill the jobs. If I see running jobs, I’ll try to reach out to you before I kill anything.
If anyone is interested in a rough migration plan, it is here:
On Jan 13th 2015 between 22:20 and 23:18 UTC (~1 hour) stat1002 ceased
receiving TSV data from udp2log for the following data streams:
- Mobile requests stream
- Requests stream
- Zero requests stream
The cause was a routing problem in the firewall, introduced by a change to
the way iptables rules are created for udp2log.
The problem was quickly resolved.
Here is the Phabricator task: https://phabricator.wikimedia.org/T86973
Apologies for the late email,
Please join the Analytics Engineering team for...
Office Hours: EventLogging & Dashboarding
Hosts: Dan and Nuria
Date: January 14
Time: 20:00 UTC - Convert to Local Time
Teams need metrics on how their product or feature is performing, and then
they need to visualize those metrics. This is accomplished by instrumenting
code with EventLogging, mashing the data with some queries, and setting up a
Limn dashboard. The Analytics Engineering team is holding office hours to
answer questions about the process, help solve any issues, and listen to
feedback on the process. Feel free to drop into the Google Hangout linked
above or ask questions on the IRC channel during our Office Hours.
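As a hypothetical example of the "mashing data with some queries" step (the field names and timestamp format are assumptions, not a real schema), raw event rows can be rolled up into the per-day counts a Limn dashboard would graph:

```python
# Sketch: aggregate raw event rows into daily counts, the kind of small
# table a Limn dashboard plots. Field names here are made up.
from collections import Counter

def daily_counts(rows):
    """rows: iterable of (timestamp 'YYYYMMDDHHMMSS', action) tuples.
    Returns {'YYYY-MM-DD': number of events that day}."""
    counts = Counter()
    for ts, _action in rows:
        day = '%s-%s-%s' % (ts[0:4], ts[4:6], ts[6:8])
        counts[day] += 1
    return dict(counts)

# Toy input: two events on Jan 14, one on Jan 15.
rows = [('20150114200000', 'view'), ('20150114210000', 'view'),
        ('20150115090000', 'click')]
```

In practice this step is usually a SQL query against the EventLogging table rather than a script, but the shape of the output is the same.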
Are there metrics about which links in each article are the most clicked?
I can think there's a lot to be learned from it:
* Data-driven suggestions for manuals of style about linking (too many and
too few links are a perennial topic of argument)
* How people traverse between topics.
* Which terms in the article may need a short explanation in parentheses
rather than just a link.
* How far down the article people bother to read.
Anyway, I think that access to such data could optimize both
readership and editing.
And maybe this can be just taken right from the logs, without any new
instrumentation.
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
“We're living in pieces,
I want to live in peace.” – T. Moore
just a quick heads-up that Ops are about to add a “php” key to the
X-Analytics header (i.e., for sampled-1000 logs, Hive, ...):
This header key will hold the PHP implementation used.
Planned deployment is between 2014-09-01 and 2014-09-02.
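X-Analytics is a semicolon-separated list of key=value pairs, so once the key is deployed its value can be pulled out with a small parser like this sketch (the sample header value is made up):

```python
# Sketch: parse an X-Analytics header value ('k1=v1;k2=v2;...') into a
# dict so the new "php" key can be read. Sample value below is invented.
def parse_x_analytics(value):
    """Split on ';'; entries without '=' map to the empty string."""
    result = {}
    for part in value.split(';'):
        if not part:
            continue
        key, _, val = part.partition('=')
        result[key.strip()] = val.strip()
    return result

header = 'https=1;php=hhvm'  # hypothetical header value
info = parse_x_analytics(header)
```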
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
The upcoming Wikimedia Research showcase <https://www.mediawiki.org/wiki/Analytics/Research_and_Data/Showcase> (Wednesday January 14, 11.30 PT) will host two guest speakers: Felipe Ortega <https://en.wikipedia.org/wiki/User:GlimmerPhoenix> (University of Madrid) and Benjamin Mako Hill <https://en.wikipedia.org/wiki/User:Benjamin_Mako_Hill> (University of Washington).
As usual, the showcase will be broadcast on YouTube (the livestream link will follow on the list) and we’ll host the Q&A on the #wikimedia-research IRC channel on freenode.
We look forward to seeing you there.
Functional roles and career paths in Wikipedia
By Felipe Ortega <https://www.mediawiki.org/wiki/User:GlimmerPhoenix>
An understanding of participation dynamics within online production communities requires an examination of the roles assumed by participants. Recent studies have established that the organizational structure of such communities is not flat; rather, participants can take on a variety of well-defined functional roles. What is the nature of functional roles? How have they evolved? And how do participants assume these functions? Prior studies focused primarily on participants' activities, rather than functional roles. Further, extant conceptualizations of role transitions in production communities, such as the Reader to Leader framework, emphasize a single dimension: organizational power, overlooking distinctions between functions. In contrast, in this paper we empirically study the nature and structure of functional roles within Wikipedia, seeking to validate existing theoretical frameworks. The analysis sheds new light on the nature of functional roles, revealing the intricate “career paths” resulting from participants' role transitions.
Free Knowledge Beyond Wikipedia
A conversation facilitated by Benjamin Mako Hill <https://www.mediawiki.org/wiki/User:Benjamin_Mako_Hill>
In some of my research with Leah Buechley <http://mako.cc/academic/buechley_hill_DIS_10.pdf>, I’ve explored the way that increasing engagement and diversity in technology communities often means not just attacking systematic barriers to participation but also designing for new genres and types of engagement. I hope to facilitate a conversation about how WMF might engage new readers by supporting more non-encyclopedic production. I'd like to call out some examples from the new Wikimedia project proposals list <https://meta.wikimedia.org/wiki/Proposals_for_new_projects>, encourage folks to share entirely new ideas, and ask for ideas about how we could dramatically better support Wikipedia's sister projects.