Dear wikitech developers,
First of all, we (my friends and I, who agreed to work on the Buginese
Wikipedia) are terribly sorry for this change. It should have been made
before the creation of the domain, but it turns out that most of our PCs
(including mine) cannot display the fonts in the edit box. Therefore, we'd
like to cancel the Unicode input/writing for our edit box as well as our
display, and change to a standard Wikipedia like the English/Malay/
Indonesian wikis (meaning everything will be in roman script, with no
Unicode, like the English wiki). This problem actually occurred before the
creation of the domain, but I was rather busy at the time and forgot to
tell wikitech. However, we'd like wikitech to keep the Lontara font
family; we will be needing it in the future. Again, we're terribly,
terribly sorry for this mess-up.
Secondly, I wish to talk to the person in charge of the Buginese
Wikipedia about setting things up: how to change titles like the special
pages, the sandbox, and categories from English to Buginese. I can't
connect to the freenode server due to a software abort (I don't even know
what that means), and because I'm the only one free most of the time, my
friends are depending on me to do the talking to the wiki.
Again, we're terribly, terribly sorry for this change. We wish we had
told wiki sooner.
Zaid Zainuddin
Dear wikitech,
Another thousand apologies. I think I just found out how this whole system
works. Darn!! I was so..... stupid. You guys can ignore my previous
message.
Thank you,
Muhammad Zaid Zainuddin
I've checked in a small experimental feature on REL1_5 and HEAD for
explicitly specifying the utf8 charset on tables and setting the
connection so it doesn't mangle the utf8 data on the wire.
As previously discussed, this is insufficient for Wikimedia, since it will
fail when you try to insert data with non-BMP Unicode characters in
various places, but those running their own sites and having to use
newer MySQL might appreciate more mundane text being stored correctly.
Activating this mode on an existing site could cause interesting
problems, however; use caution engaging it.
The server communications mode is controlled by $wgDBmysql5 (true to
send 'SET NAMES utf8' on connect); if selected at installation, table
definitions are pulled from maintenance/mysql5/tables.sql, which adds
'DEFAULT CHARSET=utf8' to the tables.
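For anyone wanting to experiment, here's a minimal sketch of what enabling
it looks like (the CREATE TABLE fragment is illustrative rather than
copied from tables.sql):

  # LocalSettings.php: enable the experimental MySQL 4.1/5.0 charset mode.
  # With this set, MediaWiki sends 'SET NAMES utf8' on connect so the
  # server doesn't mangle UTF-8 data on the wire.
  $wgDBmysql5 = true;

  # On a fresh install with this selected, table definitions come from
  # maintenance/mysql5/tables.sql, which tags each table roughly like:
  #   CREATE TABLE ... ( ... ) TYPE=InnoDB DEFAULT CHARSET=utf8;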
I've tested this briefly on 5.0.15, but I think it will work on 4.1 as well.
-- brion vibber (brion @ pobox.com)
Hi All,
Are other people having grief importing the new XML format database-dumps?
Today I tried three different methods of importing the EN
20051009_pages_articles.xml.bz2 dump, and not one of them seems to work
properly.
Incidentally, I have verified that the md5sum of the dump is correct,
so as to eliminate downloading problems:
ludo:/home/nickj/wikipedia# md5sum 20051009_pages_articles.xml.bz2
4d18ffa1550196f3a6a0abc9ebbd7d06 20051009_pages_articles.xml.bz2
------------------------------------------------------------------------------------
Method 1:
Importing using importDump.php from MediaWiki 1.5.0 running on PHP 4.1.2.
I knew this one might have problems, due to the age of that PHP version.
However, this one got the furthest of all the methods. It ran for 6
hours and 24 minutes, and imported around 60 percent of the articles.
Something (probably PHP) has a memory leak, however, as it resulted in
Linux 2.6.8's Out-of-Memory killer kicking in repeatedly until it killed
the script in question. The machine has 448 MB of RAM, so it took a while
for the leak to consume all the memory.
Command line was:
bzip2 -dc /home/nickj/wikipedia/20051009_pages_articles.xml.bz2 | php
maintenance/importDump.php
But from the overnight system log we have:
Oct 21 03:05:01 ludo kernel: Out of Memory: Killed process 816 (apache).
Oct 21 03:13:04 ludo kernel: Out of Memory: Killed process 817 (apache).
Oct 21 03:20:41 ludo kernel: Out of Memory: Killed process 7677 (apache).
Oct 21 03:23:30 ludo kernel: Out of Memory: Killed process 946 (apache).
Oct 21 03:26:57 ludo kernel: Out of Memory: Killed process 7696 (apache).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 573 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 575 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 576 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 577 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 3111 (mysqld).
Oct 21 06:29:24 ludo kernel: Out of Memory: Killed process 7697 (apache).
Oct 21 06:29:25 ludo kernel: Out of Memory: Killed process 7699 (apache).
Oct 21 06:29:25 ludo kernel: Out of Memory: Killed process 3110 (php).
At that point importing stopped.
------------------------------------------------------------------------------------
Method 2:
Importing using importDump.php from MediaWiki 1.5.0 using a fresh PHP 4.4
STABLE CVS snapshot build (from really old to really new).
I thought this one would work, but it didn't:
ludo:/var/www/hosts/local-wikipedia/wiki# bzip2 -dc
/home/nickj/wikipedia/20051009_pages_articles.xml.bz2 |
~root/tmp/php-5.1-dev/php4-STABLE-200510201252/sapi/cli/php
maintenance/importDump.php
100 (22.802267296596 pages/sec 22.802267296596 revs/sec)
200 (20.961060430845 pages/sec 20.961060430845 revs/sec)
300 (20.006219254115 pages/sec 20.006219254115 revs/sec)
[...snip lots of progress lines...]
64000 (41.86646431353 pages/sec 41.86646431353 revs/sec)
64100 (41.87977053847 pages/sec 41.87977053847 revs/sec)
64200 (41.891992792767 pages/sec 41.891992792767 revs/sec)
64300 (41.902506473828 pages/sec 41.902506473828 revs/sec)
64400 (41.920741784615 pages/sec 41.920741784615 revs/sec)
64500 (41.937710744276 pages/sec 41.937710744276 revs/sec)
64600 (41.945053966443 pages/sec 41.945053966443 revs/sec)
64700 (41.95428629711 pages/sec 41.95428629711 revs/sec)
PHP Fatal error: Call to a member function on a non-object in
/var/www/hosts/local-wikipedia/wiki/includes/Article.php on line 934
ludo:/var/www/hosts/local-wikipedia/wiki#
I.e. it dies after 13 minutes, at around 4% of the articles.
------------------------------------------------------------------------------------
Method 3:
Using the latest mwdumper (from http://download.wikimedia.org/tools/
), plus the latest and greatest stable JRE (1.5.0_05), and converting
into 1.4 format, then importing that into MySQL:
/usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar
--format=sql:1.4 20051009_pages_articles.xml.bz2 | mysql enwiki
This ran without any errors, and looked really promising.
However, before this there were some 1.5 million articles (from a June
SQL dump, which was the last Wikipedia dump I was able to import
properly):
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
| 1535910 |
+----------+
1 row in set (0.00 sec)
# Then I cleared the table:
mysql> delete from cur;
Query OK, 0 rows affected (4.11 sec)
# Then the above mwdumper command ran for 53 minutes before finishing,
which seemed way too quick. Checking how many articles had been
imported showed there was something wrong:
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
| 29166 |
+----------+
1 row in set (0.00 sec)
I.e. less than 2% of the articles got imported.
------------------------------------------------------------------------------------
So, my question to the list is this:
What methods have you tried for importing the XML dumps? In particular,
what have you tried that actually _worked_? (By "working", I mean it runs
without a memory leak, runs without dying with an error message, and
imports all of the articles into the database.)
All the best,
Nick.
>>Semi-standard - it uses an X-Forwarded-For header and sometimes it
>>reverses the order of the octets (for no good reason).
>
> Nothing in MediaWiki should reverse the order of the octets, where have
> you seen that?
He was referring to the NTL proxies. Half of them reverse the order of
the octets of the IP addresses in their X-Forwarded-For header, and half
of them don't.
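For illustration, here is roughly what coping with that would take; this
is a hypothetical helper, not anything currently in MediaWiki:

  # Hypothetical sketch: given an IP taken from an NTL proxy's
  # X-Forwarded-For header, return both candidate readings, since half
  # of the proxies reverse the octets and half don't.
  function xffCandidates( $ip ) {
      $octets = explode( '.', $ip );
      if ( count( $octets ) != 4 ) {
          return array( $ip ); # not a dotted quad, leave it alone
      }
      return array( $ip, implode( '.', array_reverse( $octets ) ) );
  }
  # xffCandidates( '82.1.2.3' ) => array( '82.1.2.3', '3.2.1.82' )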
Brion Vibber wrote:
> One possibility is to embed the timestamp into the URL. So the goatse
> version might be:
> http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg
>
> and the reverted image would get a different URL, a few minutes later:
> http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg
An alternative but very similar idea would be to embed the revision
number in the URL, instead of the upload timestamp:
Example original:
http://upload.wikimedia.org/wikipedia/en/P/1/Puppy.jpg
Example revised:
http://upload.wikimedia.org/wikipedia/en/P/2/Puppy.jpg
Then internally there needs to be some translation/lookup table from
image name --> current revision number, as opposed to a lookup table
from image name --> upload date. (Integers are smaller than dates, so
perhaps a small memory saving.)
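As a sketch, building the URL once you have the current revision number
from that lookup table might look like this (the helper name is made up):

  # Hypothetical sketch: build the revision-numbered URL from an image
  # name and its current revision number, following the example layout.
  function imageRevisionUrl( $name, $rev ) {
      $initial = strtoupper( $name[0] ); # 'P' for 'Puppy.jpg'
      return "http://upload.wikimedia.org/wikipedia/en/" .
          "$initial/$rev/" . urlencode( $name );
  }
  # imageRevisionUrl( 'Puppy.jpg', 2 )
  #   => http://upload.wikimedia.org/wikipedia/en/P/2/Puppy.jpg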
Possibility of a very very very small bandwidth saving from slightly
shorter URLs.
Maybe it also helps if two people upload Puppy.jpg in the exact same
second (I'm not sure what happens in a timestamp system that's only
accurate to the second when this happens, but in a revision-number system
one is always going to be first, even if only by a few microseconds).
Lastly, it's easy for a human with the URL to see what revisions come
before/after by incrementing/decrementing the digit in the URL,
whereas the date and time of the upload of a previous revision cannot
be predicted just from the image name.
All other benefits as per timestamp system, I think.
All the best,
Nick.
Dear wikitech,
It's not that I'm giving up or anything; I just wonder, if my PC cannot
display the script properly in the edit box, what about other people's
PCs? (I only get boxes in the edit box.) Likewise, after I told them that
the Buginese wiki had just been created, there were suddenly something
like a hundred requests to have this wiki in roman script. The main
reason is that it will be easier for them to edit the articles. Moreover,
it just struck me that most of the readers are using public PCs, and thus
they usually aren't allowed to change the settings on those PCs (if I'm
not mistaken, the control panel on public PCs is locked). And the Lontara
font is not a Microsoft product, which means it doesn't come with the
Microsoft OS, so this script is not available on almost all public PCs.
This is totally my mistake. A thousand apologies.
Maybe you guys are kind of confused: when I said Unicode, I was referring
to the traditional Lontara script. But please don't delete the Lontara
font family; we'll be needing it in the future, for quoting old texts and
so on.
Again, I'm terribly, terribly sorry for this mess-up. Another thousand
apologies. I hope you guys can change the system to roman script as soon
as possible, so that I can continue the setup.
Zaid Zainuddin
Brion Vibber wrote:
>David Gerard wrote:
>> Brion, what would it take to get article rating switched on? Is there
>> any such feature you would allow in, or is it basically off the agenda
>> and I should stop asking?
>Well, it needs to be shown to work correctly on pages with thousands of
>revisions without bogging the server or otherwise exploding in
>interesting ways.
Ah, cool. What test results would convince you? (I'm thinking in terms
of setting up a box with MediaWiki 1.5 or CVS HEAD, loading a DB dump
with lotsa revisions and torturing it.)
- d.
Just thought I'd float this idea for comments before I try working on it...
Between multi-megapixel digital photographs and other wacky multimedia
fun, uploads are taking up an ever-huger amount of disk space,
bandwidth, etc. Our existing primary image fileserver is a bit sluggish;
a new one with a nice big drive array is on order but we still would
like to provide for better local and downstream caching.
It would make caching much easier if the file at a given URL was
immutable; that is, if a replacement image has a different URL from the
old one.
For an example of the problem with mutable images, take this scenario:
1) A featured article has a photo, say, [[Image:Puppy.jpg]]
2) Somebody uploads goatse.cx on top of it.
3) A visitor comes, and fetches the goatse image at:
http://upload.wikimedia.org/wikipedia/en/a/a1/Puppy.jpg
His ISP's transparent proxy caches the image.
4) An admin reverts the image back to the puppy and protects it.
5) Another visitor loads the article, and fetches the puppy image at
http://upload.wikimedia.org/wikipedia/en/a/a1/Puppy.jpg
He's from the same ISP, and the proxy returns the previously loaded
goatse image.
6) The visitor e-mails the Wikimedia board to complain about their *very
offensive* web site. ;)
One possibility is to embed the timestamp into the URL. So the goatse
version might be:
http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg
and the reverted image would get a different URL, a few minutes later:
http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg
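As a sketch of the mechanics (the helper name is made up, and it assumes
the upload time is available as a Unix timestamp):

  # Hypothetical sketch: build the timestamped URL from an image name
  # and its upload time, following the example layout above.
  function timestampedImageUrl( $name, $uploaded ) {
      $path = gmdate( 'Y/m/d/His', $uploaded ); # e.g. 2005/10/23/074223
      return "http://upload.wikimedia.org/wikipedia/en/$path/" .
          urlencode( $name );
  }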
(The article pages need to be rerendered with the new link, but this is
already necessary to accommodate changes in size, etc. Articles are
forced to be rechecked from end-clients and are only cached by proxies
we control and send explicit purges to, so that 'should' stay under
control.)
This scheme would allow outside proxy caches to cache a given image
file indefinitely without it becoming dangerously stale, and would allow
more permanent on-demand replicated image servers to distribute bandwidth
across our clusters without stealing squid cache space from articles.
A downside is that image URLs aren't predictable ahead of time; unless
you're in the database to check what the latest version of the image is,
you can't build the URL from just a file name. One could, though, concoct
a little special page or something to redirect to whatever the current
version is.
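A sketch of what that special page might do (entirely hypothetical; the
array here is a stand-in for a real database lookup):

  # Hypothetical sketch: given a stable image name, redirect to the
  # current versioned URL. The array stands in for the real lookup.
  function redirectToCurrentImage( $name ) {
      $current = array( 'Puppy.jpg' => '2005/10/23/074506' );
      if ( !isset( $current[$name] ) ) {
          return false;
      }
      $url = "http://upload.wikimedia.org/wikipedia/en/" .
          $current[$name] . '/' . urlencode( $name );
      header( "Location: $url" ); # sends a 302 by default
      return true;
  }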
Another benefit is that not having the "/a/ad/" cache directory will
allow people with badly written ad blockers to see the missing 1/256th
of our images again. ;)
-- brion vibber (brion @ pobox.com)