We'd love to, but we need to either a) take it offline for a few days or b) invent a way to convert the database without data loss or damage while keeping it online.
-- brion vibber (brion @ pobox.com)
I suppose most of the time will be taken up by converting old. Shouldn't it be possible to convert only cur, while either leaving old unconverted or marking each entry in old as unconverted/still in ISO-8859-1, and then converting those entries when they are needed or via a very low-priority job? (Of course the software will need to handle the conversion flag when viewing an old version of an article, doing a diff, and so on.)
Is this doable or still too complex?
It's possible. We'd just need to change the software a bit :) Just add a UTF-8 flag the same way the software adds a gzip flag, and tell it to read the text as-is.
Shaihulud
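A minimal sketch of what that read path might look like, assuming the old table keeps its comma-separated old_flags column and that a hypothetical 'utf-8' flag is added next to the existing 'gzip' one; the function name and row layout here are illustrative, not actual MediaWiki code.

<?php
# Sketch: return the text of an old revision as UTF-8, honouring a
# hypothetical 'utf-8' entry in old_flags alongside the existing 'gzip' flag.
# $row is assumed to be an associative array with 'old_text' and 'old_flags'.
function getOldTextAsUtf8( $row ) {
	$flags = explode( ',', $row['old_flags'] );
	$text = $row['old_text'];

	# Unpack first if the revision was stored compressed.
	if ( in_array( 'gzip', $flags ) ) {
		$text = gzinflate( $text );
	}

	# If the revision is flagged as UTF-8 already, pass it through untouched;
	# otherwise it is still ISO-8859-1 and gets converted on the fly.
	if ( !in_array( 'utf-8', $flags ) ) {
		$text = utf8_encode( $text );   # ISO-8859-1 -> UTF-8
	}
	return $text;
}
?>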
I am thinking about an even simpler solution. Have a server-side script convert articles and their histories to UTF-8. Have a postprocessor (written in C) tell whether a page is in UTF-8 and change the appropriate meta tag if it is. It's vastly improbable that a page which is valid UTF-8 is not actually UTF-8; this could be checked on a database dump, and I don't believe any such page would be found. When all pages are converted, the site could be switched to UTF-8 and the postprocessor turned off.
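A rough sketch of what such a one-off conversion pass over cur could look like, assuming the 2004-era schema names (cur_id, cur_text), the old mysql_* API and placeholder connection details; it only illustrates the idea, not the script that was actually used. A Latin-1 row that happens to be valid UTF-8 would be skipped by the check, which is exactly the "vastly improbable" case mentioned above.

<?php
# Sketch: walk the cur table and convert to UTF-8 any row that does not
# already look like valid UTF-8.  Connection details are placeholders.
$db = mysql_connect( 'localhost', 'wikiuser', 'secret' );
mysql_select_db( 'wikidb', $db );

$res = mysql_query( 'SELECT cur_id, cur_text FROM cur', $db );
while ( $row = mysql_fetch_assoc( $res ) ) {
	# Rows that already pass the UTF-8 check are left alone.
	if ( mb_detect_encoding( $row['cur_text'], 'UTF-8,ISO-8859-1' ) == 'UTF-8' ) {
		continue;
	}
	$converted = utf8_encode( $row['cur_text'] );   # ISO-8859-1 -> UTF-8
	mysql_query( sprintf( "UPDATE cur SET cur_text='%s' WHERE cur_id=%d",
		mysql_real_escape_string( $converted, $db ), $row['cur_id'] ), $db );
}
?>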
This could even be done without a postprocessor; PHP's mb_detect_encoding function does exactly that.
Quick, dirty, and it seems to work. I know that buffering is slow, but this would only be a temporary solution. When hairs on your head settle down, let's talk about it :)
Index: OutputPage.php
===================================================================
RCS file: /cvsroot/wikipedia/phase3/includes/OutputPage.php,v
retrieving revision 1.143
diff -u -3 -p -r1.143 OutputPage.php
--- OutputPage.php	20 May 2004 12:46:31 -0000	1.143
+++ OutputPage.php	16 Jun 2004 12:31:10 -0000
@@ -333,7 +333,13 @@ class OutputPage {
 			setcookie( $name, $val, $exp, "/" );
 		}
 
+		ob_start();
 		$sk->outputPage( $this );
+		$output=ob_get_contents();
+		ob_end_clean();
+		if(mb_detect_encoding($output,"UTF-8,ISO-8859-1")=="UTF-8")
+			$output=preg_replace("/charset=iso-8859-1/","charset=utf-8",$output,1);
+		echo $output;
 
 		# flush();
 	}
Though I'm not a developer, I might make a suggestion on this topic.
Since I'm helping out with the Open Directory Project, I know they have been struggling with converting their databases to UTF-8 for quite some time now, and strange characters keep appearing here and there.
Maybe the MediaWiki devs could get some useful hints from those guys about what unforeseen problems might arise when converting to UTF-8.
Cheers Manfred
We have already converted the French wiki to UTF-8. Apart from some strange characters which didn't belong to Latin-1 (typically Windows code page characters), we didn't have problems as far as I remember. The problem with non-Latin-1 characters can even be corrected by a bot before the conversion. As the French wiki is relatively small compared to the English wiki, we just converted the dump. For bigger wikis I guess converting all tables except old, all at once, could be a good idea. Then, as proposed, set a « utf-8 » flag for old articles and convert them one by one, starting with the most recent ones. The program we used to convert to UTF-8 can be adapted to convert an article instead of a dump quite easily (and it should in fact be faster).
Cheers,
Med
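A sketch of what the low-priority converter for old could look like, combining the "most recent first" ordering above with the flag idea from earlier in the thread; the table and column names (old_id, old_text, old_flags) follow the schema being discussed, but the batch size, the flag name and the mysql_* calls are illustrative assumptions, not the actual tool.

<?php
# Sketch: convert a small batch of the newest not-yet-converted old revisions
# and tag them with a 'utf-8' flag, so the job can be re-run at low priority
# until nothing is left.  $db is an open mysql connection.
function convertOldBatch( $db, $batchSize = 100 ) {
	$res = mysql_query(
		"SELECT old_id, old_text, old_flags FROM old " .
		"WHERE NOT FIND_IN_SET('utf-8', old_flags) " .
		"ORDER BY old_id DESC LIMIT $batchSize", $db );

	while ( $row = mysql_fetch_assoc( $res ) ) {
		$flags = array_filter( explode( ',', $row['old_flags'] ) );
		$text = $row['old_text'];

		if ( in_array( 'gzip', $flags ) ) {
			$text = gzinflate( $text );        # unpack before converting
		}
		$text = utf8_encode( $text );          # ISO-8859-1 -> UTF-8
		if ( in_array( 'gzip', $flags ) ) {
			$text = gzdeflate( $text );        # repack the same way
		}
		$flags[] = 'utf-8';

		mysql_query( sprintf(
			"UPDATE old SET old_text='%s', old_flags='%s' WHERE old_id=%d",
			mysql_real_escape_string( $text, $db ),
			mysql_real_escape_string( implode( ',', $flags ), $db ),
			$row['old_id'] ), $db );
	}
}
?>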
When attempting to delete a page I get a database error. This is what MySQL complains about.
X-----snip
DELETE FROM linkscc WHERE lcc_pageid='717') from the function "". MySQL returned the error "1064: You have an error in your SQL syntax. Check the manual that corresponds to your MySQL server version for the right syntax to use near ')' at line 1".
x-----snap
The page was still deleted and it appears under the restore link for sysops.
I ran a repair command over my tables and I can't find any very obvious error anywhere, but the message above might still tell you something.
Manfred
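For what it's worth, the stray closing parenthesis at the end of the statement looks like the likely culprit; presumably the intended query was simply DELETE FROM linkscc WHERE lcc_pageid='717', without the trailing ')'.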
Medéric Boquien wrote:
We have already converted the French wiki to UTF-8. Apart from some strange characters which didn't belong to Latin-1 (typically Windows code page characters), we didn't have problems as far as I remember.
This is why you should have listened to me when I said "*DON'T* convert from ISO-8859-1 to UTF-8. Convert from Windows-1252 to UTF-8!" because I foresaw that problem.
Timwi
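The practical difference is only the 0x80-0x9F byte range, which is unused (C1 controls) in ISO-8859-1 but carries curly quotes, dashes, the euro sign and similar characters in Windows-1252. A hedged sketch of that conversion step, assuming the local iconv build accepts the 'CP1252' encoding name:

<?php
# Sketch: convert legacy text to UTF-8, treating it as Windows-1252 rather
# than ISO-8859-1 so that bytes in the 0x80-0x9F range map to the characters
# editors actually typed instead of to C1 control codes.
function legacyToUtf8( $text ) {
	$out = iconv( 'CP1252', 'UTF-8', $text );
	# Fall back to the plain Latin-1 interpretation if iconv rejects the input.
	return ( $out !== false ) ? $out : utf8_encode( $text );
}
?>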
Nikola Smolenski wrote:
Quick, dirty, and it seems to work. I know that buffering is slow, but this would only be a temporary solution. When hairs on your head settle down, let's talk about it :)
That won't work, as you have to be able to link from one page to another. Title management, pulling things from the database, and case folding are all dependent on the character set, long before we output anything.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
That won't work, as you have to be able to link from one page to another. Title management, pulling things from the database, and case folding are all dependent on the character set, long before we output anything.
If I understood him right, his suggestion was to take the server down briefly to convert everything but 'old' to UTF-8 atomically. In particular, this would convert 'cur', and we can use UTF-8 for everything including title management and case folding.
The only thing that will happen is that the history of all pages whose titles contain non-ASCII characters will temporarily disappear. I don't think that's too much of a problem: for as long as it's inaccessible, nobody can tamper with it, and its inaccessibility will not impede editing of the most recent versions; and we can convert the 'old' portion of those pages first so their history comes back as soon as possible. Then convert the rest and we're set. I think this is better than taking the server down for several days to convert everything.
Timwi
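A small sketch of how those pages could be picked out so their history gets converted first, assuming the old table's old_title column still holds the Latin-1 titles; the byte-range test is just one simple way to spot non-ASCII titles.

<?php
# Sketch: list the titles whose history should be converted first, i.e. the
# ones containing bytes outside the ASCII range.  $db is an open mysql link.
function getNonAsciiTitles( $db ) {
	$titles = array();
	$res = mysql_query( 'SELECT DISTINCT old_title FROM old', $db );
	while ( $row = mysql_fetch_assoc( $res ) ) {
		if ( preg_match( '/[\x80-\xff]/', $row['old_title'] ) ) {
			$titles[] = $row['old_title'];
		}
	}
	return $titles;
}
?>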
On Monday 21 June 2004 17:42, Timwi wrote:
If I understood him right, his suggestion was to take the server down briefly to convert everything but 'old' to UTF-8 atomically. In particular, this would convert 'cur', and we can use UTF-8 for everything including title management and case folding.
Well, no, I suggested that articles be converted one at a time, both cur and old; each article should be protected from editing during the conversion.
Pulling things from the database is not a problem; the database will be happy to think that it is still serving ISO-8859-1 and won't even notice the change.
Title management and case folding could pose a problem; a solution would be not to convert articles which link to articles that have high characters in their titles. I don't think there are that many of them.
And when everything is done, convert the titles and the rest of the articles.
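A tiny sketch of the skip test described above: scan an article's wikitext for [[...]] links and report whether any linked title contains high (non-ASCII) bytes; the link regexp is deliberately simplistic and only meant to illustrate the idea.

<?php
# Sketch: return true if $wikitext links to a title containing bytes outside
# the ASCII range, in which case the article would be left for the last pass.
function linksToHighCharTitles( $wikitext ) {
	if ( !preg_match_all( '/\[\[([^\]|]+)/', $wikitext, $matches ) ) {
		return false;   # no links at all
	}
	foreach ( $matches[1] as $target ) {
		if ( preg_match( '/[\x80-\xff]/', $target ) ) {
			return true;
		}
	}
	return false;
}
?>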