The Wikimedia Language Engineering team invites everyone to join
the team's monthly office hour on July 10, 2013 at 1700 UTC / 1000 PDT
in #wikimedia-office. During this session we will be talking about
some of our recent activities, including the Universal Language
Selector (ULS) rollout and updates from ongoing projects.
See you all at the IRC office hour!
Date: July 10, 2013 (Wednesday)
Time: 1700-1800 UTC / 1000-1100 PDT
IRC channel: #wikimedia-office on irc.freenode.net
1. ULS Rollout
2. Other updates
3. Q/A - We shall be taking questions during the session. Questions
can also be sent to runa at wikimedia dot org or siebrand at wikimedia
dot org before the event, and will be addressed during the office hour.
Language Engineering - Outreach and QA Coordinator
A reply to all those who basically want to keep the current XML dumps:
I have decided to change the primary way of reading the dumps: it will now
be a command line application that outputs the data as uncompressed XML, in
the same format as current dumps.
This way, you should be able to use the new dumps with minimal changes to
your code.
Keeping the dumps in a text-based format doesn't make sense, because that
can't be updated efficiently, which is the whole reason for the new dumps.
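As a rough sketch of what "minimal changes" means in practice: a one-pass
parser that reads uncompressed XML on stdin keeps working unchanged, only
the command producing the XML changes (the tool name and invocation below
are placeholders, since the application has not been named yet):

    <?php
    // One-pass parser printing all page titles from uncompressed XML on stdin.
    // Today:  bzcat pages-meta-history.xml.bz2 | php titles.php
    // Later:  incremental-dump-tool cat enwiki.dump | php titles.php
    //         (placeholder name; nothing about the tool is decided yet)
    $reader = new XMLReader();
    $reader->open( 'php://stdin' );
    while ( $reader->read() ) {
        if ( $reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title' ) {
            // Print the text content of each <title> element.
            echo $reader->readString(), "\n";
        }
    }
    $reader->close();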
On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byrial(a)vip.cybercity.dk>wrote:
> As a regular user of dump files I would not want a "fancy" file format
> with indexes stored as trees etc.
> I parse all the dump files (both for SQL tables and the XML files) with a
> one pass parser which inserts the data I want (which sometimes is only a
> small fraction of the total amount of data in the file) into my local
> database. I will normally never store uncompressed dump files, but pipe the
> uncompressed data directly from bunzip or gunzip to my parser to save disk
> space. Therefore it is important to me that the format is simple enough for
> a one pass parser.
> I cannot really imagine who would use a library with an object oriented API
> to read dump files. No matter what, it would be inefficient and have fewer
> features and possibilities than using a real database.
> I could live with a binary format, but I have doubts if it is a good idea.
> It will be harder to make sure that your parser is working correctly, and
> you have to consider things like endianness, size of integers, format of
> floats etc. which give no problems in text formats. The binary files may be
> smaller uncompressed (which I don't store anyway) but not necessarily when
> compressed, as the compression will do better on text files.
> - Byrial
The problem is that appending is not enough, especially if you want to keep
the current format.
1. With the current format you could almost append new pages, but not new
revisions of existing pages, because they belong in the middle of the XML.
2. We also need to handle deletions (and undeletions) of pages and revisions.
3. There are also "current" dumps, which always contain only the most
recent revision of a page.
And another advantage of the binary format is that you *can* seek easily.
If you're looking for a specific page or revision, you don't have to go
through the whole file: you can tell the application what you want, and it
will look it up and output only that.
Also, even if you couldn't seek, I don't see how this is any worse than the
current situation, where you also can't seek to a specific position in the
compressed XML (unless you use multistream dumps).
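For comparison, this is roughly how random access into today's multistream
dumps works. It is only a sketch (it assumes the index file has been
decompressed, and fetchBlock() is a hypothetical helper, not an existing
API), but it shows that random access already requires an index on the side:

    <?php
    // Fetch the uncompressed XML block containing a given page from a
    // pages-articles-multistream dump. The index file has lines of the
    // form "offset:page_id:title", where offset is the start of an
    // independent bzip2 stream holding a block of pages.
    // Requires the bz2 extension for bzdecompress().
    function fetchBlock( $dumpFile, $indexFile, $wantedTitle ) {
        $start = null;
        $end = null;
        foreach ( new SplFileObject( $indexFile ) as $line ) {
            $parts = explode( ':', trim( $line ), 3 );
            if ( count( $parts ) < 3 ) {
                continue;
            }
            list( $offset, , $title ) = $parts;
            if ( $start === null && $title === $wantedTitle ) {
                $start = (int)$offset;
            } elseif ( $start !== null && (int)$offset > $start ) {
                $end = (int)$offset; // first stream after the one we want
                break;
            }
        }
        if ( $start === null ) {
            return null; // title not found in the index
        }
        if ( $end === null ) {
            $end = filesize( $dumpFile ); // wanted block is the last stream
        }
        $f = fopen( $dumpFile, 'rb' );
        fseek( $f, $start );
        $compressed = fread( $f, $end - $start );
        fclose( $f );
        // Each block is a complete, standalone bzip2 stream.
        return bzdecompress( $compressed );
    }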
On Wed, Jul 3, 2013 at 4:45 PM, Giovanni Luca Ciampaglia <
> Petr, could you please elaborate more on this last claim? If turning the
> dump generation into an incremental process is the task you are interested
> in solving, then I don't understand how text constitutes a problem. Text
> files can be appended to like any regular file, and it shouldn't be difficult
> to do this in a way that keeps the XML structure valid.
> As I said, having the possibility to seek and inspect the files manually
> is a tremendous boon when debugging your code. With what you propose that
> would be possible but more complicated, since one cannot seek to a specific
> position of stdout without going through the whole contents.
> On Jul 3, 2013 4:05 PM, "Petr Onderka" <gsvick(a)gmail.com> wrote:
>> A reply to all those who basically want to keep the current XML dumps:
>> I have decided to change the primary way of reading the dumps: it will
>> now be a command line application that outputs the data as uncompressed
>> XML, in the same format as current dumps.
>> This way, you should be able to use the new dumps with minimal changes to
>> your code.
>> Keeping the dumps in a text-based format doesn't make sense, because that
>> can't be updated efficiently, which is the whole reason for the new dumps.
>> Petr Onderka
>> On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byrial(a)vip.cybercity.dk>wrote:
>>> As a regular user of dump files I would not want a "fancy" file
>>> format with indexes stored as trees etc.
>>> I parse all the dump files (both for SQL tables and the XML files) with
>>> a one pass parser which inserts the data I want (which sometimes is only a
>>> small fraction of the total amount of data in the file) into my local
>>> database. I will normally never store uncompressed dump files, but pipe the
>>> uncompressed data directly from bunzip or gunzip to my parser to save disk
>>> space. Therefore it is important to me that the format is simple enough for
>>> a one pass parser.
>>> I cannot really imagine who would use a library with an object oriented API
>>> to read dump files. No matter what, it would be inefficient and have fewer
>>> features and possibilities than using a real database.
>>> I could live with a binary format, but I have doubts if it is a good
>>> idea. It will be harder to make sure that your parser is working correctly,
>>> and you have to consider things like endianness, size of integers, format
>>> of floats etc. which give no problems in text formats. The binary files may
>>> be smaller uncompressed (which I don't store anyway) but not necessarily when
>>> compressed, as the compression will do better on text files.
>>> - Byrial
Over the past few months a number of people have been poking at Composer
support for MediaWiki. Today I had a look at this and found that
though we are close to getting this to work, there are a few remaining
problems to be tackled.
1. MediaWiki needs to load the composer autoloader when present.
There is already some code in core to do this, though this code runs at
too early a point, resulting in things registered by extensions
getting overridden by DefaultSettings.
I made an attempt to fix this by placing the inclusion of the
autoloader after the one of LocalSettings (see patchset 2). As Hashar
pointed out, this prevents one from changing the default configuration of
extensions. Inclusion of the autoloader thus needs to happen before the end
of LocalSettings. It also needs to happen after the start of LocalSettings,
since some extensions need the core configuration to be set already or
require the user to define things before they get included. This basically
means the only place where we can put it is in LocalSettings itself, at the
same point where people typically include extensions. This is what is done
in PS3, as sketched below. Does anyone see a better way to do this?
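In other words, the idea is that LocalSettings.php ends up looking roughly
like this (a sketch only; the paths, extension name and setting below are
made-up examples):

    <?php
    // LocalSettings.php (sketch)

    // ... core configuration written by the installer comes first ...

    // Load the Composer autoloader where extensions are normally included:
    // after the core configuration, but before any setup code that relies
    // on Composer-installed classes.
    if ( is_readable( __DIR__ . '/vendor/autoload.php' ) ) {
        require_once __DIR__ . '/vendor/autoload.php';
    }

    // Regular extension includes and configuration follow, and can still
    // override the extensions' default settings.
    require_once "$IP/extensions/SomeExtension/SomeExtension.php";
    $wgSomeExtensionSetting = true; // example setting, made up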
2. Installing extensions leaves the composer.json file modified.
When installing one or more extensions via Composer, they will get added to
the require section in composer.json. composer.json is not in the gitignore
list. So you might well run into conflicts here, and in any case will have
a modified file that is tracked in your git repo, which is annoying. In
case of extension installation via LocalSettings, this stuff is in a file
that is in gitignore. We could just add composer.json there as well, but
this means that when we make changes to it on master, people will not get
them any time soon. This is problematic in case we were to make MW core
dependent on other packages, though this seems unlikely to happen, and is
thus perhaps just a theoretical concern. Does anyone see a way around this
problem better than putting composer.json in gitignore? Any concerns with
putting it in gitignore?
3. Not clear how to best install an extension
The best command to use for installation of an extension when you already
have MediaWiki seems to be "composer require", for instance "composer
require ask/ask:dev-master". When using this command, apparently the
require-dev packages are also installed. Since MW specifies PHPUnit in
require-dev, it is installed, together with all of its dependencies (quite
some code) for no good reason. In case of the require command, there
appears to be no way to specify it should not get the dev packages.
An alternate approach to installation is to add the things to install
manually in the require section of composer.json and do a "composer update
--no-dev". This approach might be fine when doing a manual install, though
it clearly does not work well when you want to automate it (ie in a CI
Again, does anyone know of a better way to do such an install? And if not,
perhaps we can simply get rid of the require-dev section in our
composer.json file so we do not run into the problem its causing when using
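For clarity, the manual approach amounts to hand-editing the require section
of composer.json (reusing the ask/ask example from above; whatever core
already lists there stays untouched):

    "require": {
        "ask/ask": "dev-master"
    }

and then running "composer update --no-dev".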
Jeroen De Dauw
Don't panic. Don't be evil. ~=[,,_,,]:3
AbuseFilter does not match word boundaries in Devanagari script, which is
logged at https://bugzilla.wikimedia.org/46773 (has some unit tests).
The root cause is that the regex patterns are not in Unicode mode (the 'u'
regexp flag) and thus \b is being dumb.
The fix would be to set the preg_match calls in AbuseFilter to Unicode mode,
but I am worried about the performance implications. I once wrote a
patch that used Unicode properties and that made the parser noticeably
slower.
Maybe the AbuseFilter code path is not that critical for performance :)
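A quick illustration of the problem and the proposed fix (the Devanagari
word is just a sample string; the expected results assume a PHP where the
'u' modifier also enables Unicode character properties for \w and \b):

    <?php
    $text = 'यह विकि है';

    // Without 'u' the subject is treated as raw bytes; the multi-byte
    // Devanagari characters are not word characters, so \b never finds a
    // boundary around the word.
    var_dump( preg_match( '/\bविकि\b/', $text ) );  // expected: int(0)

    // With 'u' the pattern and subject are treated as UTF-8 and the letters
    // count as word characters, so the boundary match succeeds.
    var_dump( preg_match( '/\bविकि\b/u', $text ) ); // expected: int(1)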
Antoine "hashar" Musso
On 01/07/13 23:21, Nicolas Torzec wrote:
> Hi there,
> In principle, I understand the need for binary formats and compression in a context with limited resources.
> On the other hand, plain text formats are easy to work with, especially for third-party users and organizations.
> Playing the devil's advocate, I could even argue that you should keep the data dumps in plain text and keep your processing dead simple, and then let distributed processing systems such as Hadoop MapReduce (or Storm, Spark, etc.) handle the scale and compute diffs whenever needed or on the fly.
> Reading the wiki page mentioned at the beginning of this thread, it is not clear to me what the requirements are for this new incremental update format, and why they are needed.
> Therefore, it is not easy to provide input and help.
> - Nicolas Torzec.
The simplest possible dump format is the best, and there's already a
thriving ecosystem around the current XML dumps, which would be broken
by moving to a binary format. Binary file formats and APIs defined by
code are not the way to go if you want long-term archival that can
endure through decades of technological change.
If more money is needed for dump processing, it should be budgeted for
and added to the IT budget, instead of over-optimizing by using a
potentially fragile, and therefore risky, binary format.
Archival in a stable format is not a luxury or an optional extra; it's a
core part of the Foundation's mission. The value is in the data, which
is priceless. Computers and storage are (relatively) cheap by
comparison, and Wikipedia is growing significantly more slowly than the
year-on-year improvements in storage, processing and communication
links. Moreover, re-making the dumps every time provides defence in
depth against subtle database corruption that might slowly corrupt a dump
that is only ever updated incrementally.
Please keep the dumps themselves simple and their format stable, and, as
Nicolas says, do the clever stuff elsewhere, where you can use
whatever efficient representation you like to do the processing.
We have a very early version of a community metrics dashboard! See the
details below or jump directly to
Even though the main discussion and work will happen on the Analytics
mailing list, there are some initial questions that you should be aware
of and help answer:
* Do we need to scan ALL the git repositories at wikimedia.org or are
there any that we could/should avoid?
* Do we need to scan ALL the Bugzilla products or... (the same).
* What mailing lists are worth scanning?
At some point we will also have individual statistics. We plan to
identify WMF employees, to help answer the old question about how many
WMF vs. non-WMF contributors we have and what the trend is. Do you think it
is a good idea to allow or encourage everybody to define their org?
How do you feel about statistics per country, meaning that contributors
would define where they are based?
These are the bigger questions. There are more at
Please answer there. Thank you!
-------- Original Message --------
Subject: Introducing Wikimedia tech community metrics
Date: Mon, 01 Jul 2013 14:29:28 -0700
From: Quim Gil <qgil(a)wikimedia.org>
To: A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics. <analytics(a)lists.wikimedia.org>,
Hi, today we are officially starting a new project to gather automatic
metrics from the Wikimedia tech / MediaWiki community:
We have a dashboard in an interim location, moving to Labs soon:
The immediate steps (end of this week?) are:
* Moving to Labs. :)
* Scanning the sources we want to scan for git repositories, Bugzilla
products and mailing lists.
Then we will follow with (end of this month?):
* Agreeing with the community what data to gather about contributors.
* Polishing the list of contributors, e.g. assigning to people their
affiliations.
Gerrit and IRC metrics are on the way (end of August?). I also expect
improvements in the interface, based on our feedback... and our patches.
This dashboard is based on the open source projects Metrics Grimoire and
VizGrimoire. Other organizations (prominently OpenStack) are using it as
well, so more features might come from other stakeholders.
We are working with Bitergia (maintainers of the Grimoire software and
other FLOSS metrics related projects) as contractors. Álvaro del
Castillo (CCed) is the main contact with the team.
The current contract will run for the next 12 months. The Engineering
Community team of the WMF is pushing this effort, and I'm coordinating
it. However, this clearly belongs to the Analytics team area. We have
agreed to hand over the project at some point during the next 6-12 months.
We will use the Analytics mailing list to share updates on and discuss this
project. Your feedback and help are welcome!
Technical Contributor Coordinator @ Wikimedia Foundation