Staying so long with ISO 8859 was a mistake.
So I propose converting all Wikipedias that aren't using UTF-8 yet to UTF-8. The procedure would be like this:
1. prepare the new LanguageXX.php and put it under some name
2. make backups
3. create tables curutf8 and oldutf8
4. disable write access
5. convert all data - numeric HTML codes are going to be replaced by UTF-8 characters too
6. rename tables cur and old to cur88591 and old88591
7. rename tables curutf8 and oldutf8 to cur and old
8. replace the old LanguageXX.php with the UTF-8-enabled version
9. re-enable write access
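The core of step 5 is just re-encoding every text field. A minimal sketch in Python (the function name is hypothetical, not the actual conversion script):

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode one cur/old text field from ISO 8859-1 to UTF-8.

    ISO 8859-1 maps every byte 0x00-0xFF directly to U+0000-U+00FF,
    so the decode step can never fail on arbitrary input.
    """
    return raw.decode('iso-8859-1').encode('utf-8')
```

For example, the byte 0xEA (ê in ISO 8859-1) becomes the two-byte UTF-8 sequence 0xC3 0xAA.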
The conversion script should be tested on test.* Wikipedia first.
During step 5, Wikipedia is going to be read-only. It may take some time, especially with the English Wikipedia, so it's better to convert each Wikipedia separately. During steps 6-8, Wikipedia may not work at all, but that's going to take less than a minute.
Does anybody have any really good reason why I shouldn't proceed? These reasons aren't good enough:
* broken URLs - all old URLs are going to work after the upgrade
* size increase - size is going to stay about the same
* broken browsers - they should be upgraded; if someone has a browser so old that it doesn't grok UTF-8, it's not going to grok CSS, PNGs, and other things we're using either. Unless we want to remove all CSS and PNGs, there's no point in not using UTF-8.
* ISO 8859-N is good enough - no, it's not. Not if someone wants to write about people and places from countries where non-8859-1 Latin characters are used, or about linguistics, or math, etc.
On Nov 17, 2003, at 15:02, Tomasz Wegrzanowski wrote:
Staying so long with ISO 8859 was a mistake.
So I propose converting all Wikipedias that aren't using UTF-8 yet to UTF-8. Procedure should be like that:
[...]
How about we do the conversion when installing the new big database server? (Tentatively next week if they actually ship the machine on time.)
We'll have to go down to read-only mode while copying stuff over anyway, so this consolidates downtime. The conversion itself can be done by simply piping the database dump through iconv as it's being copied into the new db.
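That pipe stage is simple enough to sketch; a hypothetical equivalent of the iconv filter in Python (assuming the dump is pure ISO 8859-1):

```python
def recode_stream(src, dst, bufsize=1 << 16):
    """Copy src to dst, converting ISO 8859-1 bytes to UTF-8.

    Equivalent in spirit to:  mysqldump ... | iconv -f ISO-8859-1 -t UTF-8 | mysql ...
    ISO 8859-1 is a single-byte encoding, so fixed-size chunks can
    never split a character; no buffering across chunks is needed.
    """
    while True:
        chunk = src.read(bufsize)
        if not chunk:
            break
        dst.write(chunk.decode('iso-8859-1').encode('utf-8'))
```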
Easy as pie!
-- brion vibber (brion @ pobox.com)
On Mon, Nov 17, 2003 at 03:28:46PM -0800, Brion Vibber wrote:
On Nov 17, 2003, at 15:02, Tomasz Wegrzanowski wrote:
Staying so long with ISO 8859 was a mistake.
So I propose converting all Wikipedias that aren't using UTF-8 yet to UTF-8. Procedure should be like that:
[...]
How about we do the conversion when installing the new big database server? (Tentatively next week if they actually ship the machine on time.)
We'll have to go down to read-only mode while copying stuff over anyway, so this consolidates downtime. The conversion itself can be done by simply piping the database dump through iconv as it's being copied into the new db.
Easy as pie!
It would be better if numeric entities were converted too. The code is somewhere in the Phase1->Phase2 conversion script, and in the konwert program too.
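The entity pass itself is mechanical; a hedged sketch (covering decimal and hex numeric character references only, not named entities like &eacute;):

```python
import re

# Matches &#NNN; (decimal) and &#xHHH; (hexadecimal) references.
_NUMERIC_REF = re.compile(r'&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));')

def entities_to_utf8(text: str) -> str:
    """Replace numeric character references with the literal characters."""
    def repl(m):
        dec, hexa = m.groups()
        return chr(int(dec) if dec else int(hexa, 16))
    return _NUMERIC_REF.sub(repl, text)
```

For example, both &#322; and &#x142; become the literal character ł.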
Tomasz Wegrzanowski wrote:
It would be better if numeric entities were converted too. The code is somewhere in Phase1->Phase2 conversion script, and in konwert program too.
I'm not sure what other people's feelings on this are, but I personally would prefer using numeric entities on the English wikipedia for characters that aren't standard in English. Many of us don't have keyboard configurations that allow the literal characters to be typed, so keeping them as numeric entities makes editing them easier.
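The escaping direction is equally mechanical; a hypothetical sketch of turning non-ASCII characters back into decimal references for easier editing:

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with its decimal reference."""
    return ''.join(c if ord(c) < 128 else '&#%d;' % ord(c) for c in text)
```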
As for the UTF-8 switch itself, I can see some advantages, but I think you gloss over the disadvantages too quickly. On Mozilla Firebird on Linux, at least, the UTF-8 wikis look _much_ worse than the ISO-8859-x wikis, because they're forced to use a Unicode font, of which only fairly ugly ones with strange proportioning are Freely available.
-Mark
Tomasz-
- broken browsers - they should be upgraded, if someone has browser so old that it doesn't grok UTF-8, it's not going to grok CSS, PNGs, and other things we're using either. Unless we want to remove all CSS and PNGs, there's no point in not using UTF-8.
Is this true? All I know is that we had a *lot* of problems with broken special chars on the Meta-Wiki during the logo contest. I have no idea which browser broke them, but it seems to be a not totally uncommon one, perhaps in the 5% range. Given that a single edit by such a person will break an entire page, it might not be so wise to switch (but perhaps I'm missing something -- is Meta running UTF-8?).
Regards,
Erik
On Tue, Nov 18, 2003 at 01:24:00AM +0100, Erik Moeller wrote:
Tomasz-
- broken browsers - they should be upgraded, if someone has browser so old that it doesn't grok UTF-8, it's not going to grok CSS, PNGs, and other things we're using either. Unless we want to remove all CSS and PNGs, there's no point in not using UTF-8.
Is this true? All I know is that we had a *lot* of problems with broken special chars on the Meta-Wiki during the logo contest. I have no idea which browser broke them, but it seems to be a not totally uncommon one, perhaps in the 5% range. Given that a single edit by such a person will break an entire page, it might not be so wise to switch (but perhaps I'm missing something -- is Meta running UTF-8?).
It's nothing like 5% - the number is at least an order of magnitude smaller. On the Polish Wikipedia there weren't any serious problems with UTF-8-incompatible browsers.
Hello,
On 18-11-2003 you (Erik Moeller) wrote:
Tomasz-
- broken browsers - they should be upgraded, if someone has browser so
old that it doesn't grok UTF-8, it's not going to grok CSS, PNGs, and other things we're using either. Unless we want to remove all CSS and PNGs, there's no point in not using UTF-8.
Is this true?
No, it isn't. Though in some browsers these developments were in the same timeframe, in others they weren't. It also ignores two other obvious points:
- Without CSS a page should still be readable. Without PNG a page should still be readable. Mess up the encoding and the page becomes a rebus, at best.
- The gift of looking through the Ethernet wires and telephone cables into all the computer rooms in the world is sufficiently rare that one should never demand that a user upgrade. Feel free to tell the user you're too lazy to support their software, but blaming the break on them is quite insulting.
So, what are our plans of doing this in a way that will allow the W to adapt, so as to not drive away anyone?
EM> All I know is that we had a *lot* of problems with broken
EM> special chars on the Meta-Wiki during the logo contest. I have no idea
EM> which browser broke them, but it seems to be a not totally uncommon one,
EM> perhaps in the 5% range. Given that a single edit by such a person will
EM> break an entire page, it might not be so wise to switch (but perhaps I'm
EM> missing something -- is Meta running UTF-8?).
We, Fy:, are having this problem now, but here it appears to have something to do with our language files. One language file is OK (though its localisation is less than perfect); the next, improved version is not.
(There are also still some English-language strings that I can't seem to find in the language-file, but that's probably another matter.)
Sincerely,
wikipedia-l@lists.wikimedia.org