Thanks a lot Gerard and Jesse!
Apologies for jumping to conclusions about MediaWiki not supporting Indian
languages. Will explore what you guys said. Thanks!
Quoting wikitech-l-request(a)lists.wikimedia.org:
Send Wikitech-l mailing list submissions to
wikitech-l(a)lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
or, via email, send a message with subject or body 'help' to
wikitech-l-request(a)lists.wikimedia.org
You can reach the person managing the list at
wikitech-l-owner(a)lists.wikimedia.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Wikitech-l digest..."
Today's Topics:
1. Re: Indian Language support in Mediawiki?
(Jesse Martin (Pathoschild))
2. PHP 5.2.5RC1 testing (Ilia Alshanetsky)
3. Re: Indian Language support in Mediawiki? (GerardM)
4. Re: Looking for a system administrator familiar with the
Squid setup (John Q)
5. New version WikiXRay Python parser (Felipe Ortega)
6. Incremental history dumps (Lars Aronsson)
7. Re: Incremental history dumps (Gregory Maxwell)
8. RFC: Incremental history dumps (Platonides)
9. Re: [MediaWiki-CVS] SVN: [26830] trunk/phase3 (Simetrical)
----------------------------------------------------------------------
Message: 1
Date: Fri, 19 Oct 2007 09:54:02 -0400
From: "Jesse Martin (Pathoschild)" <pathoschild(a)gmail.com>
Subject: Re: [Wikitech-l] Indian Language support in Mediawiki?
To: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<1913a8240710190654t2ff4beib02fa4343caf2279(a)mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Hello,
The Hindi and Kannada Wikipedias use JavaScript to do that; for
example, try typing in the box at
<http://hi.wikipedia.org/wiki/test?action=edit>.
The interface has also been translated into Hindi, Tamil, and Kannada.
You can set the interface language in the file LocalSettings.php
(globally) or through the wiki page "Special:Preferences" (per user).
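For illustration, a minimal LocalSettings.php fragment for the global setting
looks like this (the 'hi' code is only an example; use your wiki's own
language code):

    # LocalSettings.php -- default (site-wide) interface language.
    # 'hi' is the code for Hindi; individual users can still override
    # this in Special:Preferences.
    $wgLanguageCode = 'hi';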
Yours cordially,
Jesse Martin (Pathoschild)
------------------------------
Message: 2
Date: Thu, 18 Oct 2007 19:34:14 -0400
From: Ilia Alshanetsky <ilia(a)prohost.org>
Subject: [Wikitech-l] PHP 5.2.5RC1 testing
To: php-qa(a)lists.php.net
Cc: marc(a)phpmyadmin.net, serendipity(a)supergarv.de, php(a)fudforum.org,
contact-us(a)lists.geeklog.net, wikitech-l(a)lists.wikimedia.org,
bharat(a)menalto.com, jasper(a)album.co.nz, php-testing(a)phorum.org,
dev(a)sugarcrm.com, m(a)wordpress.org, Greg Beaver
<greg(a)chiaraquartet.net>, pear-qa(a)lists.php.net, matteo(a)beccati.com
Message-ID: <4D932D75-38FB-455A-864A-89F35D113917(a)prohost.org>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Hello!
You are receiving this email because your project has been selected
to take part in a new effort by the PHP QA Team to make sure that
your project still works with PHP versions to-be-released. With this
we hope to make sure that you are aware of things that might
break, and that we don't introduce any strange regressions.
With this effort we hope to build a better relationship between the
PHP Team and the major projects.
If you do not want to receive these heads-up emails, please reply to
me personally and I will remove you from the list; but we hope that
you will actively help us make PHP a better and more stable tool.
The first release candidate of PHP 5.2.5 was just released and can be
downloaded from
http://downloads.php.net/ilia/. Please try this
release candidate against your code and let us know about any
regressions you find. The goal is to have 5.2.5 out within
three weeks' time, so timely testing would be extremely helpful.
In case you think that other projects should also receive these kinds
of emails, please let me know privately, and I will add them to the
list of projects to contact.
Best Regards,
Ilia Alshanetsky
5.2 Release Master
------------------------------
Message: 3
Date: Fri, 19 Oct 2007 16:58:30 +0200
From: GerardM <gerard.meijssen(a)gmail.com>
Subject: Re: [Wikitech-l] Indian Language support in Mediawiki?
To: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<41a006820710190758i27879322kf366c08c269dc298(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Hoi,
When you look at the localisation statistics, you understand what the
problem is. Much work is needed to complete the localisation for MediaWiki.
The fact that the Hindi or other Wikipedias have been localised to a larger
extent makes no difference to the localisation of MediaWiki itself:
http://www.mediawiki.org/wiki/Localisation_statistics
Thanks,
GerardM
On 10/19/07, Jesse Martin (Pathoschild) <pathoschild(a)gmail.com> wrote:
Hello,
The Hindi and Kannada Wikipedias use JavaScript to do that; for
example, try typing in the box at
<http://hi.wikipedia.org/wiki/test?action=edit>.
The interface has also been translated into Hindi, Tamil, and Kannada.
You can set the interface language in the file LocalSettings.php
(globally) or through the wiki page "Special:Preferences" (per user).
Yours cordially,
Jesse Martin (Pathoschild)
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
------------------------------
Message: 4
Date: Fri, 19 Oct 2007 09:49:09 -0700
From: John Q <johnq(a)wikia.com>
Subject: Re: [Wikitech-l] Looking for a system administrator familiar
with the Squid setup
To: wikitech-l(a)lists.wikimedia.org
Message-ID: <4718E005.1020405(a)wikia.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hi Travis,
We just went through a big evolution and found a few things that helped
us, so how about we take a look. Also, Emil made that patch that
prevents Google Analytics from busting the cache as much... that's in
the Wikimedia SVN. Can you ask Jack to come up to San Mateo? We'll
sit down with him, and we'll also get Artur to connect with you online.
Thanks,
John Q.
-------- Original Message --------
Subject: [Wikitech-l] Looking for a system administrator familiar with
the Squid setup
Date: Thu, 18 Oct 2007 08:43:56 -0400
From: Travis Derouin <travis(a)wikihow.com>
Reply-To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hey,
We've been running into some performance problems lately, and I'm stumped.
I'm not sure if we need more hardware or not.
I'd like to find a system administrator familiar with the Squid,
Apache/MediaWiki, MySQL setup to take a look at our system and identify any
potential problems that we might have. We'd be comfortable with either a
one-time fee, or an hourly rate. We have a 6 server setup right now, with 1
Squid, 1 DB, 3 Apaches and 1 spare. If you or someone you know is
interested, send an e-mail directly to me: travis(a)wikihow.com with your
details and experience.
Sorry for the job-posting-like message, but I'm out of ideas and need
some help.
Thanks!
Travis
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
------------------------------
Message: 5
Date: Fri, 19 Oct 2007 20:39:07 +0200 (CEST)
From: Felipe Ortega <glimmer_phoenix(a)yahoo.es>
Subject: [Wikitech-l] New version WikiXRay Python parser
To: wiki-research-l(a)lists.wikimedia.org,
wikitech-l(a)lists.wikimedia.org
Message-ID: <31191.9735.qm(a)web27503.mail.ukl.yahoo.com>
Content-Type: text/plain; charset=iso-8859-1
Hi.
A new version of the Python parser in WikiXRay, along with improved
documentation, can be found here:
http://meta.wikimedia.org/wiki/WikiXRay_Python_parser
Basically, I've developed two flavors: the standard one, for people who want
an alternative to other tools for processing Wikipedia's dumps (including the
text table), and a research-oriented one, which ignores the text itself and
instead extracts useful info on the fly.
Both flavors use extended inserts (you can tune the size and number of rows),
and the --monitor mode calls a DB access module to avoid timeout errors.
Further improvements (--skipnamespaces and --inject, the latter should be
very easy) are on the way.
Best,
Felipe.
------------------------------
Message: 6
Date: Fri, 19 Oct 2007 21:10:36 +0200 (CEST)
From: Lars Aronsson <lars(a)aronsson.se>
Subject: [Wikitech-l] Incremental history dumps
To: wikitech-l(a)lists.wikimedia.org
Message-ID: <Pine.LNX.4.64.0710192057030.30952(a)localhost.localdomain>
Content-Type: TEXT/PLAIN; charset=US-ASCII
In the recent weeks I have been following the database dumps of
some languages of Wikipedia. I download and analyze a dump, do
various improvements, and then wait for the next dump to become
available for a new analysis. There are 2 or 3 weeks between each
dump. There appear to be two parallel dump processes continuously
running,
http://download.wikimedia.org/backup-index.html
What takes most time in each dump is the large file with complete
version history, pages-meta-history.xml.bz2 and
pages-meta-history.xml.7z
This is the largest file in compressed format, but since it
contains every version of every article it is also very highly
compressed, and expands to become enormous. I guess that very few
people find use for this file. In addition, only a very small
portion of its contents is changed between two dumps. So we spend
a lot of time and effort (and delay of other things) in order to
create very little for very few users.
I think that this dump should be made incremental. Every week,
only that week's additional versions need to be dumped. This can
then be added to the dump of the previous week, the week before
that, etc., which hasn't really changed. This way, the dump
process could be made much faster, and the two parallel dump
processes would complete the cycle in less time, so new dumps of
the same project could be made available more frequently.
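As a rough sketch of the idea (assuming the standard MediaWiki revision
table; the cut-off value is made up), the weekly increment would essentially
be the revisions selected by something like:

    <?php
    // Hypothetical sketch only: select just the revisions added since the
    // previous dump run, identified here by its snapshot timestamp.
    $lastDumpTimestamp = '20071012000000'; // illustrative cut-off (YYYYMMDDHHMMSS)
    $sql = "SELECT rev_id, rev_page, rev_timestamp
            FROM revision
            WHERE rev_timestamp > '$lastDumpTimestamp'
            ORDER BY rev_id";
    // Everything older than the cut-off is simply carried over from the
    // previous week's dump file instead of being re-extracted.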
Or is it already done this way, behind the scenes, only that it
isn't visible from the outside?
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se
------------------------------
Message: 7
Date: Fri, 19 Oct 2007 16:12:06 -0400
From: "Gregory Maxwell" <gmaxwell(a)gmail.com>
Subject: Re: [Wikitech-l] Incremental history dumps
To: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<e692861c0710191312g6079c2d6md5cb326a69f84d47(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
It already works that way on the backend, pretty much.
We can't make the old increments available forever because of things
we are obligated to discontinue distributing, so incremental dumps for
users would not be so useful.
On 10/19/07, Lars Aronsson <lars(a)aronsson.se> wrote:
In the recent weeks I have been following the database dumps of
some languages of Wikipedia. I download and analyze a dump, do
various improvements, and then wait for the next dump to become
available for a new analysis. There are 2 or 3 weeks between each
dump. There appear to be two parallel dump processes continuously
running,
http://download.wikimedia.org/backup-index.html
What takes most time in each dump is the large file with complete
version history, pages-meta-history.xml.bz2 and
pages-meta-history.xml.7z
This is the largest file in compressed format, but since it
contains every version of every article it is also very highly
compressed, and expands to become enormous. I guess that very few
people find use for this file. In addition, only a very small
portion of its contents is changed between two dumps. So we spend
a lot of time and effort (and delay of other things) in order to
create very little for very few users.
I think that this dump should be made incremental. Every week,
only that week's additional versions need to be dumped. This can
then be added to the dump of the previous week, the week before
that, etc., which hasn't really changed. This way, the dump
process could be made much faster, and the two parallel dump
processes would complete the cycle in less time, so new dumps of
the same project could be made available more frequently.
Or is it already done this way, behind the scenes, only that it
isn't visible from the outside?
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
------------------------------
Message: 8
Date: Fri, 19 Oct 2007 22:16:32 +0200
From: Platonides <Platonides(a)gmail.com>
Subject: [Wikitech-l] RFC: Incremental history dumps
To: wikitech-l(a)lists.wikimedia.org
Message-ID: <ffb3b0$7p9$1(a)ger.gmane.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Lars Aronsson wrote:
> Or is it already done this way, behind the scenes, only that it
> isn't visible from the outside?
No.
AFAIK it is done as follows:
Precondition: The last full dump (if not present, treat as empty).
1- Take a snapshot of the wiki status (page table?) and create
stub-meta-history.
2- Read stub-meta-history and fill in the page content from the last dump's
page contents. If a page's content is not in the previous dump, get it from
external storage in a blocking way.
Result: a bzip2-compressed full history dump.
The bzip2 dump is then uncompressed and re-compressed as 7z.
If there's an error in a call to external storage, the process can't
be resumed and the dump fails.
I have recently been thinking about this, and I think it could be done as follows:
Precondition: The last full dump (if not present, treat as empty) and
its greatest revid.
1a- Take a snapshot of the wiki status (page table?) and create
stub-meta-history.
1b- While reading the revisions, if a revid is greater than the last dump's
greatest revid (LDGR), add it to one of N files (a file per M revisions).
2- Run N processes grabbing these page contents. Store them in a
new-format dump (the external storage equivalent), one per revid list
file. If one fails, just rerun it.
3- Read stub-meta-history and fill in the page content from the last dump's
page contents. If a page text is not in the previous dump, grab it from the
list file when revid > LDGR; otherwise, get it from external storage, saving
it to a different file.
Revisions present neither in the last dump nor in the incremental dumps will
only occur for restored pages, and they can still block the process, but since
there are far fewer of them, it is much less likely that they fail.
4- Save the new LDGR along with the new bzipped dump.
By making the M+1 incremental dumps available, together with the smaller
stub-meta-history, the latest dump can be recreated from the previous one
(= a smaller download).
Wikimedia would still provide the full dumps, but you would only need them
the first time.
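For illustration, a rough sketch of the decision logic in step 3 (all
function and variable names here are made up; only the revid/LDGR branching
follows the proposal above):

    <?php
    // Hypothetical sketch of step 3. $previousDump and $incremental stand in
    // for the previous full dump and the N revid list files from step 2.
    function resolveText( $revId, $ldgr, array $previousDump, array $incremental ) {
        if ( $revId <= $ldgr && isset( $previousDump[$revId] ) ) {
            // Unchanged since the last dump: reuse its text.
            return $previousDump[$revId];
        }
        if ( $revId > $ldgr ) {
            // New since the last dump: comes from the list files of step 2.
            return $incremental[$revId];
        }
        // Old revision missing from the previous dump (e.g. a restored page):
        // the only case that still has to hit external storage (and gets
        // saved to a separate file).
        return fetchFromExternalStorage( $revId ); // hypothetical helper
    }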
Comments?
------------------------------
Message: 9
Date: Fri, 19 Oct 2007 16:53:15 -0400
From: Simetrical <Simetrical+wikilist(a)gmail.com>
Subject: Re: [Wikitech-l] [MediaWiki-CVS] SVN: [26830] trunk/phase3
To: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>
Message-ID:
<7c2a12e20710191353n7b6d2bc4wd2524e19898a2f2c(a)mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
On 10/19/07, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:
> > The bigger problem is your join condition. If a row matches
> > pl_namespace=rc_namespace and pl_title=rc_title, then it's joined to
> > *every row* of the page and redirect tables, because there are no
> > restrictions on them! The converse is true as well.
> This is the part I don't understand. What do you mean by there being
> no restrictions? Why does the pl_from=$id part not restrict it
> appropriately?
You have to keep in mind that when you're doing simple joins, the
result set is equal to the Cartesian product (every possible
combination of rows from each table), filtered according to the
WHERE/ON conditions (which are equivalent). The pl_from=$id restricts
it so that all the rows in the result set will have one particular
pagelinks row, no other. The remaining conditions must put enough
restrictions on the recentchanges, page, *and* redirect tables to keep
the number of returned rows small enough to be reasonable.
The problem is that your conditions state that either the
recentchanges table must obey certain conditions, relating to the one
pagelinks row already selected, *or* the page and redirect and
recentchanges tables must obey certain (different) conditions. If a
recentchanges row obeys the first set of conditions (i.e., it
corresponds to the pagelinks row), there are no restrictions on what
page or redirect rows can be associated with it, and therefore *every*
page row and *every* redirect row is associated with it, and so is
*every* combination thereof.
This will not actually appear in the result set, because the GROUP BY
will condense rows with identical recentchanges rows. I'm not sure
exactly how GROUP BY works here as opposed to DISTINCT, say, given
that there are no grouping operators or anything: I hardly qualify as
an SQL expert. But I could tell from the EXPLAIN that the query was
seriously inefficient, and I noticed the deficiency in the join
condition that was prompting a Cartesian join of the last two tables
(after Xgc in #mysql prompted me to take a closer look at the query).
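To make the shape of the problem concrete, here is a deliberately simplified,
schematic query with the same structure (it is not the actual query from
r26830):

    <?php
    // Schematic only, not the reverted query. $id is an illustrative page id.
    $id = 1234;
    $sql = "SELECT rc.*
            FROM pagelinks pl, recentchanges rc, page p, redirect rd
            WHERE pl.pl_from = $id
              AND ( ( rc.rc_namespace = pl.pl_namespace        -- branch 1:
                      AND rc.rc_title = pl.pl_title )          -- only rc tied down
                    OR ( p.page_id = rd.rd_from                -- branch 2:
                         AND rd.rd_namespace = rc.rc_namespace -- rc, p and rd tied
                         AND rd.rd_title = rc.rc_title ) )
            GROUP BY rc.rc_id";
    // When branch 1 matches a recentchanges row, nothing constrains p or rd,
    // so that row is paired with every (page, redirect) combination; the
    // GROUP BY collapses the duplicates in the output but not the work.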
> I put it through various tests before committing it, and it seemed to
> give the correct results (obviously, none of my tests revealed the
> error with the cutoff - that's a problem with testing on a test
> install, not a real world database). So is the query correct, just
> inefficient, or were my tests insufficient to catch the mistakes?
It may be correct. I'm not sure, because on my PC (which is my test
server) it sent mysqld to 90%+ CPU usage for somewhere well over a
minute while copying to tmp table, so I got bored and killed the
thread. Whether it would have returned the correct results half an
hour from now is a somewhat academic question. :)
Generally speaking, it's handy to have a relatively realistic local
database. At the suggestion of Yurik, I use the Simple English
Wikipedia because it's not gigantic and it's not gibberish to me.
It's still not really ideal, because for instance the user table is
practically nonexistent, recentchanges is unrealistically small, etc.,
so I could use the toolserver if I still weren't sure. That still
wouldn't be quite ideal, since it has a different version of MySQL
installed and so on, but it would be a pretty good approximation.
By the way, did you test your patch while logged in? It seems to
cause a fatal error before it even tries to execute the query.
Generally speaking, it's a bad idea to mix implicit join syntax (foo,
bar) with explicit join syntax (foo JOIN bar), like foo, bar, baz LEFT
OUTER JOIN quuz: it doesn't do what you expect.
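As an illustration (with made-up tables, not the query in question): in MySQL
5.0 and later the explicit JOIN binds more tightly than the comma, so an ON
clause in such a mixed query may not be able to see the comma-joined tables
at all.

    <?php
    // Illustrative tables/columns, not the real query. In MySQL >= 5.0.12 the
    // ON clause below is parsed against (bar LEFT OUTER JOIN quuz) only, so
    // the reference to foo.id fails ("Unknown column 'foo.id' in 'on clause'").
    $mixed    = "SELECT * FROM foo, bar
                 LEFT OUTER JOIN quuz ON quuz.id = foo.id";
    // Writing every join explicitly keeps the grouping unambiguous:
    $explicit = "SELECT * FROM foo
                 JOIN bar ON bar.foo_id = foo.id
                 LEFT OUTER JOIN quuz ON quuz.id = foo.id";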
Due to all these issues, I've reverted this in r26848.
------------------------------
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
End of Wikitech-l Digest, Vol 51, Issue 38
******************************************