The Wikimedia Language Engineering team invites everyone to join
the team's monthly office hour on July 10, 2013 at 1700 UTC / 1000 PDT
in #wikimedia-office. During this session we will be talking about
some of our recent activities, including the Universal Language
Selector (ULS) rollout and updates from ongoing projects.
See you all at the IRC office hour!
Date: July 10, 2013 (Wednesday)
Time: 1700-1800 UTC / 1000-1100 PDT
IRC channel: #wikimedia-office on irc.freenode.net
1. ULS Rollout
2. Other updates
3. Q/A - We shall be taking questions during the session. Questions
can also be sent to runa at wikimedia dot org or siebrand at wikimedia
dot org before the event, and will be addressed during the office hour.
Language Engineering - Outreach and QA Coordinator
A reply to all those who basically want to keep the current XML dumps:
I have decided to change the primary way of reading the dumps: it will now
be a command line application that outputs the data as uncompressed XML, in
the same format as current dumps.
This way, you should be able to use the new dumps with minimal changes to
your code.
Keeping the dumps in a text-based format doesn't make sense, because that
can't be updated efficiently, which is the whole reason for the new dumps.
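As a rough sketch of what "minimal changes" means in practice: a one-pass
parser that reads uncompressed XML on stdin keeps working unchanged, only
the command producing the XML changes (the tool name and invocation below
are placeholders, since the application has not been named yet):

    <?php
    // One-pass parser printing all page titles from uncompressed XML on stdin.
    // Today:  bzcat pages-meta-history.xml.bz2 | php titles.php
    // Later:  incremental-dump-tool cat enwiki.dump | php titles.php
    //         (placeholder name; nothing about the tool is decided yet)
    $reader = new XMLReader();
    $reader->open( 'php://stdin' );
    while ( $reader->read() ) {
        if ( $reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title' ) {
            // Print the text content of each <title> element.
            echo $reader->readString(), "\n";
        }
    }
    $reader->close();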
On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byrial(a)vip.cybercity.dk>wrote:
> As a regular user of dump files I would not want a "fancy" file format
> with indexes stored as trees etc.
> I parse all the dump files (both for SQL tables and the XML files) with a
> one pass parser which inserts the data I want (which sometimes is only a
> small fraction of the total amount of data in the file) into my local
> database. I will normally never store uncompressed dump files, but pipe the
> uncompressed data directly from bunzip or gunzip to my parser to save disk
> space. Therefore it is important to me that the format is simple enough for
> a one pass parser.
> I cannot really imagine who would use a library with an object oriented API
> to read dump files. No matter what, it would be inefficient and have fewer
> features and possibilities than using a real database.
> I could live with a binary format, but I have doubts if it is a good idea.
> It will be harder to make sure that your parser is working correctly, and
> you have to consider things like endianness, size of integers, format of
> floats etc. which give no problems in text formats. The binary files may be
> smaller uncompressed (which I don't store anyway) but not necessarily when
> compressed, as the compression will do better on text files.
> - Byrial
The problem is that appending is not enough, especially if you want to keep
the current format.
1. With the current format you could almost append new pages, but not new
revisions of existing pages, because they belong in the middle of the XML.
2. We also need to handle deletions (and undeletions) of pages and revisions.
3. There are also "current" dumps, which always contain only the most
recent revision of a page.
And another advantage of the binary format is that you *can* seek easily.
If you're looking for a specific page or revision, you don't have to go
through the whole file: you can tell the application what you want, and it
will look it up and output only that.
Also, even if you couldn't seek, I don't see how this is any worse than the
current situation, where you also can't seek to a specific position in the
compressed XML (unless you use multistream dumps).
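For comparison, this is roughly how random access into today's multistream
dumps works. It is only a sketch (it assumes the index file has been
decompressed, and fetchBlock() is a hypothetical helper, not an existing
API), but it shows that random access already requires an index on the side:

    <?php
    // Fetch the uncompressed XML block containing a given page from a
    // pages-articles-multistream dump. The index file has lines of the
    // form "offset:page_id:title", where offset is the start of an
    // independent bzip2 stream holding a block of pages.
    // Requires the bz2 extension for bzdecompress().
    function fetchBlock( $dumpFile, $indexFile, $wantedTitle ) {
        $start = null;
        $end = null;
        foreach ( new SplFileObject( $indexFile ) as $line ) {
            $parts = explode( ':', trim( $line ), 3 );
            if ( count( $parts ) < 3 ) {
                continue;
            }
            list( $offset, , $title ) = $parts;
            if ( $start === null && $title === $wantedTitle ) {
                $start = (int)$offset;
            } elseif ( $start !== null && (int)$offset > $start ) {
                $end = (int)$offset; // first stream after the one we want
                break;
            }
        }
        if ( $start === null ) {
            return null; // title not found in the index
        }
        if ( $end === null ) {
            $end = filesize( $dumpFile ); // wanted block is the last stream
        }
        $f = fopen( $dumpFile, 'rb' );
        fseek( $f, $start );
        $compressed = fread( $f, $end - $start );
        fclose( $f );
        // Each block is a complete, standalone bzip2 stream.
        return bzdecompress( $compressed );
    }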
On Wed, Jul 3, 2013 at 4:45 PM, Giovanni Luca Ciampaglia <
> Petr, could you please elaborate more on this last claim? If turning the
> dump generation into an incremental process is the task you are interested
> in solving, then I don't understand how text constitutes a problem. Text
> files can be appended to like any regular file, and it shouldn't be difficult
> to do this in a way that keeps the XML structure valid.
> As I said, having the possibility to seek and inspect the files manually
> is a tremendous boon when debugging your code. With what you propose that
> would be possible but more complicated, since one cannot seek to a specific
> position of stdout without going through the whole contents.
> On Jul 3, 2013 4:05 PM, "Petr Onderka" <gsvick(a)gmail.com> wrote:
>> A reply to all those who basically want to keep the current XML dumps:
>> I have decided to change the primary way of reading the dumps: it will
>> now be a command line application that outputs the data as uncompressed
>> XML, in the same format as current dumps.
>> This way, you should be able to use the new dumps with minimal changes to
>> your code.
>> Keeping the dumps in a text-based format doesn't make sense, because that
>> can't be updated efficiently, which is the whole reason for the new dumps.
>> Petr Onderka
>> On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byrial(a)vip.cybercity.dk>wrote:
>>> As a regular user of dump files I would not want a "fancy" file
>>> format with indexes stored as trees etc.
>>> I parse all the dump files (both for SQL tables and the XML files) with
>>> a one pass parser which inserts the data I want (which sometimes is only a
>>> small fraction of the total amount of data in the file) into my local
>>> database. I will normally never store uncompressed dump files, but pipe the
>>> uncompressed data directly from bunzip or gunzip to my parser to save disk
>>> space. Therefore it is important to me that the format is simple enough for
>>> a one pass parser.
>>> I cannot really imagine who would use a library with an object oriented API
>>> to read dump files. No matter what, it would be inefficient and have fewer
>>> features and possibilities than using a real database.
>>> I could live with a binary format, but I have doubts if it is a good
>>> idea. It will be harder to make sure that your parser is working correctly,
>>> and you have to consider things like endianness, size of integers, format
>>> of floats etc. which give no problems in text formats. The binary files may
>>> be smaller uncompressed (which I don't store anyway) but not necessarily when
>>> compressed, as the compression will do better on text files.
>>> - Byrial
Over the past few months a number of people have been poking at Composer
support for MediaWiki. Today I had a look at this and found that
though we are close to getting this to work, there are a few remaining
problems to be tackled.
1. MediaWiki needs to load the composer autoloader when present.
There is already some code in core to do this, though this code runs at
too early a point, resulting in things registered by extensions
getting overridden by DefaultSettings.
I made an attempt to fix this by placing the inclusion of the
autoloader after the one of LocalSettings (see patchset 2). As Hashar
pointed out, this prevents one from changing the default configuration of
extensions. Inclusion of the autoloader thus needs to happen before the end
of LocalSettings. It also needs to happen after the start of LocalSettings,
since some extensions need the core configuration to be set already or
require the user to define things before they get included. This basically
means the only place where we can put it is in LocalSettings itself, at the
same point where people typically include extensions. This is what is done
in PS3, as sketched below. Does anyone see a better way to do this?
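In other words, the idea is that LocalSettings.php ends up looking roughly
like this (a sketch only; the paths, extension name and setting below are
made-up examples):

    <?php
    // LocalSettings.php (sketch)

    // ... core configuration written by the installer comes first ...

    // Load the Composer autoloader where extensions are normally included:
    // after the core configuration, but before any setup code that relies
    // on Composer-installed classes.
    if ( is_readable( __DIR__ . '/vendor/autoload.php' ) ) {
        require_once __DIR__ . '/vendor/autoload.php';
    }

    // Regular extension includes and configuration follow, and can still
    // override the extensions' default settings.
    require_once "$IP/extensions/SomeExtension/SomeExtension.php";
    $wgSomeExtensionSetting = true; // example setting, made up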
2. Installing extensions leaves the composer.json file modified.
When installing one or more extensions via Composer, they will get added to
the require section in composer.json. composer.json is not in the gitignore
list. So you might well run into conflicts here, and in any case will have
a modified file that is tracked in your git repo, which is annoying. In
case of extension installation via LocalSettings, this stuff is in a file
that is in gitignore. We could just add composer.json there as well, but
this means that when we make changes to it on master, people will not get
them any time soon. This is problematic in case we were to make MW core
dependent on other packages, though this seems unlikely to happen, and is
thus perhaps just a theoretical concern. Does anyone see a way around this
problem better than putting composer.json in gitignore? Any concerns with
putting it in gitignore?
3. Not clear how to best install an extension
The best command to use for installation of an extension when you already
have MediaWiki seems to be "composer require", for instance "composer
require ask/ask:dev-master". When using this command, apparently the
require-dev packages are also installed. Since MW specifies PHPUnit in
require-dev, it is installed, together with all of its dependencies (quite
some code) for no good reason. In case of the require command, there
appears to be no way to specify it should not get the dev packages.
An alternate approach to installation is to add the things to install
manually in the require section of composer.json and do a "composer update
--no-dev". This approach might be fine when doing a manual install, though
it clearly does not work well when you want to automate it (ie in a CI
Again, does anyone know of a better way to do such an install? And if not,
perhaps we can simply get rid of the require-dev section in our
composer.json file so we do not run into the problem its causing when using
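For clarity, the manual approach amounts to hand-editing the require section
of composer.json (reusing the ask/ask example from above; whatever core
already lists there stays untouched):

    "require": {
        "ask/ask": "dev-master"
    }

and then running "composer update --no-dev".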
Jeroen De Dauw
Don't panic. Don't be evil. ~=[,,_,,]:3
AbuseFilter does not match word boundaries in Devanagari script, which is
logged at https://bugzilla.wikimedia.org/46773 (has some unit tests).
The root cause is that the regex patterns are not in Unicode mode (the 'u'
regexp flag) and thus \b is being dumb.
The fix would be to set the preg_match calls in AbuseFilter to Unicode mode,
but I am worried about the performance implications. I once wrote a
patch that used Unicode properties and that made the parser noticeably
slower.
Maybe the AbuseFilter code path is not that critical for performance :)
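A quick illustration of the problem and the proposed fix (the Devanagari
word is just a sample string; the expected results assume a PHP where the
'u' modifier also enables Unicode character properties for \w and \b):

    <?php
    $text = 'यह विकि है';

    // Without 'u' the subject is treated as raw bytes; the multi-byte
    // Devanagari characters are not word characters, so \b never finds a
    // boundary around the word.
    var_dump( preg_match( '/\bविकि\b/', $text ) );  // expected: int(0)

    // With 'u' the pattern and subject are treated as UTF-8 and the letters
    // count as word characters, so the boundary match succeeds.
    var_dump( preg_match( '/\bविकि\b/u', $text ) ); // expected: int(1)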
Antoine "hashar" Musso
On 01/07/13 23:21, Nicolas Torzec wrote:
> Hi there,
> In principle, I understand the need for binary formats and compression in a context with limited resources.
> On the other hand, plain text formats are easy to work with, especially for third-party users and organizations.
> Playing the devil's advocate, I could even argue that you should keep the data dumps in plain text and keep your processing dead simple, and then let distributed processing systems such as Hadoop MapReduce (or Storm, Spark, etc.) handle the scale and compute diffs whenever needed or on the fly.
> Reading the wiki page mentioned at the beginning of this thread, it is not clear to me what the requirements are for this new incremental update format, and why they are needed.
> Therefore, it is not easy to provide input and help.
> - Nicolas Torzec.
The simplest possible dump format is the best, and there's already a
thriving ecosystem around the current XML dumps, which would be broken
by moving to a binary format. Binary file formats and APIs defined by
code are not the way to go if you want long-term archival that can
endure through decades of technological change.
If more money is needed for dump processing, it should be budgeted for
and added to the IT budget, instead of over-optimizing by using a
potentially fragile, and therefore risky, binary format.
Archival in a stable format is not a luxury or an optional extra; it's a
core part of the Foundation's mission. The value is in the data, which
is priceless. Computers and storage are (relatively) cheap by
comparison, and Wikipedia is growing significantly more slowly than the
year-on-year improvements in storage, processing and communication
links. Moreover, re-making the dumps every time provides defence in
depth against subtle database corruption that might slowly corrupt a dump
that is only ever updated incrementally.
Please keep the dumps themselves simple and their format stable, and, as
Nicolas says, do the clever stuff elsewhere, where you can use
whatever efficient representation you like to do the processing.
We have a very early version of a community metrics dashboard! See the
details below or jump directly to
Even though the main discussion and work will happen on the Analytics
mailing list, there are some initial questions that you should be aware
of and help answer:
* Do we need to scan ALL the git repositories at wikimedia.org or are
there any that we could/should avoid?
* Do we need to scan ALL the Bugzilla products or... (the same).
* What mailing lists are worth scanning?
At some point we will also have individual statistics. We plan to
identify WMF employees, to help answer the old question about how many
WMF vs. non-WMF contributors we have and what the trend is. Do you think it
is a good idea to allow or encourage everybody to define their org?
How do you feel about statistics per country, meaning that contributors
would define where they are based?
These are the bigger questions. There are more at
Please answer there. Thank you!
-------- Original Message --------
Subject: Introducing Wikimedia tech community metrics
Date: Mon, 01 Jul 2013 14:29:28 -0700
From: Quim Gil <qgil(a)wikimedia.org>
To: A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics. <analytics(a)lists.wikimedia.org>,
Hi, today we are officially starting a new project to gather automatic
metrics from the Wikimedia tech / MediaWiki community:
We have a dashboard in an interim location, moving to Labs soon:
The immediate steps (end of this week?) are:
* Moving to Labs. :)
* Scanning the sources we want to scan for git repositories, Bugzilla
products and mailing lists.
Then we will follow with (end of this month?):
* Agreeing with the community what data to gather about contributors.
* Polishing the list of contributors, e.g. assigning to people their
affiliations.
Gerrit and IRC metrics are on the way (end of August?). I also expect
improvements in the interface, based on our feedback... and our patches.
This dashboard is based on the open source projects Metrics Grimoire and
VizGrimoire. Other organizations (prominently OpenStack) are using it as
well, so more features might come from other stakeholders.
We are working with Bitergia (maintainers of the Grimoire software and
other FLOSS metrics related projects) as contractors. Álvaro del
Castillo (CCed) is the main contact with the team.
The current contract will run for the next 12 months. The Engineering
Community team of the WMF is pushing this effort, and I'm coordinating
it. However, this clearly belongs to the Analytics team area. We have
agreed to hand over the project at some point during the next 6-12 months.
We will use the Analytics mailing list to share updates on and discuss this
project. Your feedback and help are welcome!
Technical Contributor Coordinator @ Wikimedia Foundation