Dear wikitech developers,
First of all, we (my friends and I, who agreed to work on the Buginese
Wikipedia) are terribly sorry for this change. It should have been made
before the creation of the domain, but it turns out that most of our PCs
(including mine) cannot display the fonts in the edit box. Therefore, we'd
like to cancel the Unicode input/writing for our edit box as well as our
display, and change to a standard Wikipedia like the English/Malay/
Indonesian wikis (meaning everything will be in roman script, with no
Unicode, like the English wiki). This problem actually occurred before the
creation of the domain, but I was rather busy at the time and forgot to
tell wikitech. However, we'd like wikitech to keep the Lontara font
family; we will be needing it in the future. Again, we're terribly,
terribly sorry for this mess-up.
Secondly, I wish to talk to the person in charge of the Buginese
Wikipedia about setting things up: how to change titles like the special
pages, the sandbox, and categories from English to Buginese. I can't
connect to the freenode server due to a software abort (I don't even know
what that means), and because I'm the only one free most of the time, my
friends are depending on me to do the talking to the wiki.
Again, we're terribly, terribly sorry for this change. We wish we had
told wiki sooner.
Zaid Zainuddin
Dear wikitech,
Another thousand apologies. I think I just found out how this whole system
works. Darn!! I was so..... stupid. You guys can ignore my previous
message.
Thank you,
Muhammad Zaid Zainuddin
I've checked in a small experimental feature on REL1_5 and HEAD for
explicitly specifying the utf8 charset on tables and setting the
connection so it doesn't mangle the utf8 data on the wire.
As previously discussed, this is insufficient for Wikimedia, since it will
fail when you try to insert data with non-BMP Unicode characters in
various places, but those running their own sites and having to use
newer MySQL might appreciate more mundane text being stored correctly.
Activating this mode on an existing site could cause interesting
problems, however; use caution engaging it.
The server communications mode is controlled by $wgDBmysql5 (true to
send 'SET NAMES utf8' on connect); if selected at installation, table
definitions are pulled from maintenance/mysql5/tables.sql, which adds
'DEFAULT CHARSET=utf8' to the tables.
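For anyone wanting to experiment, here's a minimal sketch of what enabling
it looks like (the CREATE TABLE fragment is illustrative rather than
copied from tables.sql):

  # LocalSettings.php: enable the experimental MySQL 4.1/5.0 charset mode.
  # With this set, MediaWiki sends 'SET NAMES utf8' on connect so the
  # server doesn't mangle UTF-8 data on the wire.
  $wgDBmysql5 = true;

  # On a fresh install with this selected, table definitions come from
  # maintenance/mysql5/tables.sql, which tags each table roughly like:
  #   CREATE TABLE ... ( ... ) TYPE=InnoDB DEFAULT CHARSET=utf8;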
I've tested this briefly on 5.0.15, but I think it will work on 4.1 as well.
-- brion vibber (brion @ pobox.com)
Hi All,
Are other people having grief importing the new XML format database-dumps?
Today I tried three different methods of importing the EN
20051009_pages_articles.xml.bz2 dump, and not one of them seems to work
properly.
Incidentally, I have verified that the md5sum of the dump is correct,
so as to eliminate downloading problems:
ludo:/home/nickj/wikipedia# md5sum 20051009_pages_articles.xml.bz2
4d18ffa1550196f3a6a0abc9ebbd7d06 20051009_pages_articles.xml.bz2
------------------------------------------------------------------------------------
Method 1:
Importing using importDump.php from MediaWiki 1.5.0 running on PHP 4.1.2.
I knew this one might have problems, due to the age of that PHP version.
However, this one got the furthest of all the methods. It ran for 6
hours and 24 minutes, and imported around 60 percent of the articles.
Something (probably PHP) has a memory leak, however, as it resulted in
Linux 2.6.8's Out-of-Memory killer kicking in repeatedly until it killed
the script in question. The machine has 448 MB of RAM, so it took a while
for the leak to consume all the memory.
Command line was:
bzip2 -dc /home/nickj/wikipedia/20051009_pages_articles.xml.bz2 | php
maintenance/importDump.php
But from the overnight system log we have:
Oct 21 03:05:01 ludo kernel: Out of Memory: Killed process 816 (apache).
Oct 21 03:13:04 ludo kernel: Out of Memory: Killed process 817 (apache).
Oct 21 03:20:41 ludo kernel: Out of Memory: Killed process 7677 (apache).
Oct 21 03:23:30 ludo kernel: Out of Memory: Killed process 946 (apache).
Oct 21 03:26:57 ludo kernel: Out of Memory: Killed process 7696 (apache).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 573 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 575 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 576 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 577 (mysqld).
Oct 21 03:27:06 ludo kernel: Out of Memory: Killed process 3111 (mysqld).
Oct 21 06:29:24 ludo kernel: Out of Memory: Killed process 7697 (apache).
Oct 21 06:29:25 ludo kernel: Out of Memory: Killed process 7699 (apache).
Oct 21 06:29:25 ludo kernel: Out of Memory: Killed process 3110 (php).
At that point importing stopped.
------------------------------------------------------------------------------------
Method 2:
Importing using importDump.php from MediaWiki 1.5.0 using a fresh PHP 4.4
STABLE CVS snapshot build (from really old to really new).
I thought this one would work, but it didn't:
ludo:/var/www/hosts/local-wikipedia/wiki# bzip2 -dc
/home/nickj/wikipedia/20051009_pages_articles.xml.bz2 |
~root/tmp/php-5.1-dev/php4-STABLE-200510201252/sapi/cli/php
maintenance/importDump.php
100 (22.802267296596 pages/sec 22.802267296596 revs/sec)
200 (20.961060430845 pages/sec 20.961060430845 revs/sec)
300 (20.006219254115 pages/sec 20.006219254115 revs/sec)
[...snip lots of progress lines...]
64000 (41.86646431353 pages/sec 41.86646431353 revs/sec)
64100 (41.87977053847 pages/sec 41.87977053847 revs/sec)
64200 (41.891992792767 pages/sec 41.891992792767 revs/sec)
64300 (41.902506473828 pages/sec 41.902506473828 revs/sec)
64400 (41.920741784615 pages/sec 41.920741784615 revs/sec)
64500 (41.937710744276 pages/sec 41.937710744276 revs/sec)
64600 (41.945053966443 pages/sec 41.945053966443 revs/sec)
64700 (41.95428629711 pages/sec 41.95428629711 revs/sec)
PHP Fatal error: Call to a member function on a non-object in
/var/www/hosts/local-wikipedia/wiki/includes/Article.php on line 934
ludo:/var/www/hosts/local-wikipedia/wiki#
I.e. it dies after 13 minutes, at around 4% of the articles.
------------------------------------------------------------------------------------
Method 3:
Using the latest mwdumper (from http://download.wikimedia.org/tools/
), plus the latest and greatest stable JRE (1.5.0_05), and converting
into 1.4 format, then importing that into MySQL:
/usr/java/jre1.5.0_05/bin/java -server -jar mwdumper.jar
--format=sql:1.4 20051009_pages_articles.xml.bz2 | mysql enwiki
This ran without any errors, and looked really promising.
However, before this there were some 1.5 million articles (from a June
SQL dump, which was the last Wikipedia dump I was able to import
properly):
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
| 1535910 |
+----------+
1 row in set (0.00 sec)
# Then I cleared the table:
mysql> delete from cur;
Query OK, 0 rows affected (4.11 sec)
# Then the above mwdumper command ran for 53 minutes before finishing,
which seemed way too quick. Checking how many articles had been
imported showed there was something wrong:
mysql> select count(*) from cur;
+----------+
| count(*) |
+----------+
| 29166 |
+----------+
1 row in set (0.00 sec)
I.e. less than 2% of the articles got imported.
------------------------------------------------------------------------------------
So, my question to the list is this:
What methods have you tried for importing the XML dumps? In particular,
what have you tried that actually _worked_? (By "working", I mean it runs
without a memory leak, runs without dying with an error message, and
imports all of the articles into the database.)
All the best,
Nick.
>>Semi-standard - it uses an X-Forwarded-For header and sometimes it
>>reverses the order of the octets (for no good reason).
>
> Nothing in MediaWiki should reverse the order of the octets, where have
> you seen that?
He was referring to the NTL proxies. Half of them reverse the order of
the octets of the IP addresses in their X-Forwarded-For header, and half
of them don't.
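For illustration, here is roughly what coping with that would take; this
is a hypothetical helper, not anything currently in MediaWiki:

  # Hypothetical sketch: given an IP taken from an NTL proxy's
  # X-Forwarded-For header, return both candidate readings, since half
  # of the proxies reverse the octets and half don't.
  function xffCandidates( $ip ) {
      $octets = explode( '.', $ip );
      if ( count( $octets ) != 4 ) {
          return array( $ip ); # not a dotted quad, leave it alone
      }
      return array( $ip, implode( '.', array_reverse( $octets ) ) );
  }
  # xffCandidates( '82.1.2.3' ) => array( '82.1.2.3', '3.2.1.82' )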
Brion Vibber wrote:
> One possibility is to embed the timestamp into the URL. So the goatse
> version might be:
> http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg
>
> and the reverted image would get a different URL, a few minutes later:
> http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg
An alternative but very similar idea would be to embed the revision
number in the URL, instead of the upload timestamp:
Example original:
http://upload.wikimedia.org/wikipedia/en/P/1/Puppy.jpg
Example revised:
http://upload.wikimedia.org/wikipedia/en/P/2/Puppy.jpg
Then internally there needs to be some translation/lookup table from
image name --> current revision number, as opposed to a lookup table
from image name --> upload date. (Integers are smaller than dates, so
perhaps a small memory saving.)
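As a sketch, building the URL once you have the current revision number
from that lookup table might look like this (the helper name is made up):

  # Hypothetical sketch: build the revision-numbered URL from an image
  # name and its current revision number, following the example layout.
  function imageRevisionUrl( $name, $rev ) {
      $initial = strtoupper( $name[0] ); # 'P' for 'Puppy.jpg'
      return "http://upload.wikimedia.org/wikipedia/en/" .
          "$initial/$rev/" . urlencode( $name );
  }
  # imageRevisionUrl( 'Puppy.jpg', 2 )
  #   => http://upload.wikimedia.org/wikipedia/en/P/2/Puppy.jpg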
Possibility of a very very very small bandwidth saving from slightly
shorter URLs.
Maybe it also helps if two people upload Puppy.jpg in the exact same
second (I'm not sure what happens in a timestamp system that's only
accurate to the second when this happens, but in a revision-number system
one is always going to be first, even if only by a few microseconds).
Lastly, it's easy for a human with the URL to see what revisions come
before/after by incrementing/decrementing the digit in the URL,
whereas the date and time of the upload of a previous revision cannot
be predicted just from the image name.
All other benefits as per timestamp system, I think.
All the best,
Nick.
Dear wikitech,
It's not that I'm giving up or anything; I just wonder, if my PC cannot
display the script properly in the edit box, what about other people's
PCs? (I only get boxes in the edit box.) Likewise, after I told them that
the Buginese wiki had just been created, there were suddenly something
like a hundred requests to have this wiki in roman script. The main
reason is that it will be easier for them to edit the articles. Moreover,
it just struck me that most of the readers are using public PCs, and thus
they usually aren't allowed to change the settings on those PCs (if I'm
not mistaken, the control panel on public PCs is locked). And the Lontara
font is not a Microsoft product, which means it doesn't come with the
Microsoft OS, so this script is not available on almost all public PCs.
This is totally my mistake. A thousand apologies.
Maybe you guys are kind of confused: when I said Unicode, I was referring
to the traditional Lontara script. But please don't delete the Lontara
font family; we'll be needing it in the future, for quoting old texts and
so on.
Again, I'm terribly, terribly sorry for this mess-up. Another thousand
apologies. I hope you guys can change the system to roman script as soon
as possible, so that I can continue the setup.
Zaid Zainuddin
Brion Vibber wrote:
>David Gerard wrote:
>> Brion, what would it take to get article rating switched on? Is there
>> any such feature you would allow in, or is it basically off the agenda
>> and I should stop asking?
>Well, it needs to be shown to work correctly on pages with thousands of
>revisions without bogging the server or otherwise exploding in
>interesting ways.
Ah, cool. What test results would convince you? (I'm thinking in terms
of setting up a box with MediaWiki 1.5 or CVS HEAD, loading a DB dump
with lotsa revisions and torturing it.)
- d.
Just thought I'd float this idea for comments before I try working on it...
Between multi-megapixel digital photographs and other wacky multimedia
fun, uploads are taking up an ever-huger amount of disk space,
bandwidth, etc. Our existing primary image fileserver is a bit sluggish;
a new one with a nice big drive array is on order but we still would
like to provide for better local and downstream caching.
It would make caching much easier if the file at a given URL was
immutable; that is, if a replacement image has a different URL from the
old one.
For an example of the problem with mutable images, take this scenario:
1) A featured article has a photo, say, [[Image:Puppy.jpg]]
2) Somebody uploads goatse.cx on top of it.
3) A visitor comes, and fetches the goatse image at:
http://upload.wikimedia.org/wikipedia/en/a/a1/Puppy.jpg
His ISP's transparent proxy caches the image.
4) An admin reverts the image back to the puppy and protects it.
5) Another visitor loads the article, and fetches the puppy image at
http://upload.wikimedia.org/wikipedia/en/a/a1/Puppy.jpg
He's from the same ISP, and the proxy returns the previously loaded
goatse image.
6) The visitor e-mails the Wikimedia board to complain about their *very
offensive* web site. ;)
One possibility is to embed the timestamp into the URL. So the goatse
version might be:
http://upload.wikimedia.org/wikipedia/en/2005/10/23/074223/Puppy.jpg
and the reverted image would get a different URL, a few minutes later:
http://upload.wikimedia.org/wikipedia/en/2005/10/23/074506/Puppy.jpg
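As a sketch of the mechanics (the helper name is made up, and it assumes
the upload time is available as a Unix timestamp):

  # Hypothetical sketch: build the timestamped URL from an image name
  # and its upload time, following the example layout above.
  function timestampedImageUrl( $name, $uploaded ) {
      $path = gmdate( 'Y/m/d/His', $uploaded ); # e.g. 2005/10/23/074223
      return "http://upload.wikimedia.org/wikipedia/en/$path/" .
          urlencode( $name );
  }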
(The article pages need to be rerendered with the new link, but this is
already necessary to accommodate changes in size, etc. Articles are
forced to be rechecked from end-clients and are only cached by proxies
we control and send explicit purges to, so that 'should' stay under
control.)
This scheme would allow outside proxy caches to cache a given image
file indefinitely without it becoming dangerously stale, and would allow
more permanent on-demand replicated image servers to distribute bandwidth
across our clusters without stealing squid cache space from articles.
A downside is that image URLs aren't predictable ahead of time; unless
you're in the database to check what the latest version of the image is,
you can't build the URL from just a file name. One could, though, concoct
a little special page or something to redirect to whatever the current
version is.
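A sketch of what that special page might do (entirely hypothetical; the
array here is a stand-in for a real database lookup):

  # Hypothetical sketch: given a stable image name, redirect to the
  # current versioned URL. The array stands in for the real lookup.
  function redirectToCurrentImage( $name ) {
      $current = array( 'Puppy.jpg' => '2005/10/23/074506' );
      if ( !isset( $current[$name] ) ) {
          return false;
      }
      $url = "http://upload.wikimedia.org/wikipedia/en/" .
          $current[$name] . '/' . urlencode( $name );
      header( "Location: $url" ); # sends a 302 by default
      return true;
  }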
Another benefit is that not having the "/a/ad/" cache directory will
allow people with badly written ad blockers to see the missing 1/256th
of our images again. ;)
-- brion vibber (brion @ pobox.com)