Staying so long with ISO 8859 was a mistake.
So I propose converting all Wikipedias that aren't using UTF-8 yet to UTF-8. The procedure would be roughly as follows:
1. Prepare a new UTF-8-enabled LanguageXX.php and put it in place under some temporary name.
2. Make backups.
3. Create tables curutf8 and oldutf8.
4. Disable write access.
5. Convert all data; numeric HTML character references are replaced by UTF-8 characters as well.
6. Rename tables cur and old to cur88591 and old88591.
7. Rename tables curutf8 and oldutf8 to cur and old.
8. Replace the old LanguageXX.php with the UTF-8-enabled version.
9. Re-enable write access.
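To make step 5 concrete, the heart of it would be something like this rough PHP sketch (not the final script; it assumes the usual cur_id/cur_text columns, placeholder database credentials, that curutf8 was created as a copy of cur, and an ISO-8859-1 wiki - other 8859-N wikis would pass their own charset to iconv):

    <?php
    # Rough sketch of step 5: rewrite the text in curutf8 as UTF-8.
    # Placeholder credentials; the old/oldutf8 tables get the same
    # treatment, and title columns need converting as well.
    $db = mysql_connect( 'localhost', 'wikiuser', 'secret' );
    mysql_select_db( 'wikidb', $db );

    $res = mysql_query( 'SELECT cur_id, cur_text FROM cur', $db );
    while ( $row = mysql_fetch_assoc( $res ) ) {
        # Charset conversion; for 8859-1 this is what utf8_encode() does.
        $text = iconv( 'ISO-8859-1', 'UTF-8', $row['cur_text'] );
        # (Decoding of numeric character references would slot in here too.)
        $text = mysql_escape_string( $text );
        mysql_query( "UPDATE curutf8 SET cur_text = '$text'
                      WHERE cur_id = " . intval( $row['cur_id'] ), $db );
    }
    ?>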
The conversion script should be tested on test.* Wikipedia first.
During step 5 the Wikipedia being converted is read-only. It may take some time, especially for the English Wikipedia, so it's better to convert each Wikipedia separately. During steps 6-8 the Wikipedia may not work at all, but that should take less than a minute.
Does anybody have a really good reason why I shouldn't proceed? These reasons aren't good enough:
* Broken URLs - all old URLs will keep working after the upgrade.
* Size increase - the size will stay about the same.
* Broken browsers - they should be upgraded. If someone has a browser so old that it doesn't grok UTF-8, it's not going to grok CSS, PNGs, and other things we're using either. Unless we want to remove all CSS and PNGs, there's no point in not using UTF-8.
* ISO 8859-N is good enough - no, it's not. Not if someone wants to write about people and places from countries where non-8859-1 Latin characters are used, or about linguistics, or math, etc.
On Nov 17, 2003, at 15:02, Tomasz Wegrzanowski wrote:
Staying so long with ISO 8859 was a mistake.
So I propose converting all Wikipedias that aren't using UTF-8 yet to UTF-8. Procedure should be like that:
[...]
How about we do the conversion when installing the new big database server? (Tentatively next week if they actually ship the machine on time.)
We'll have to go down to read-only mode while copying stuff over anyway, so this consolidates downtime. The conversion itself can be done by simply piping the database dump through iconv as it's being copied into the new db.
Easy as pie!
-- brion vibber (brion @ pobox.com)
On Mon, Nov 17, 2003 at 03:28:46PM -0800, Brion Vibber wrote:
On Nov 17, 2003, at 15:02, Tomasz Wegrzanowski wrote:
Staying so long with ISO 8859 was a mistake.
So I propose converting all Wikipedias that aren't using UTF-8 yet to UTF-8. Procedure should be like that:
[...]
How about we do the conversion when installing the new big database server? (Tentatively next week if they actually ship the machine on time.)
We'll have to go down to read-only mode while copying stuff over anyway, so this consolidates downtime. The conversion itself can be done by simply piping the database dump through iconv as it's being copied into the new db.
Easy as pie!
It would be better if numeric entities were converted too. The code is somewhere in the Phase1->Phase2 conversion script, and in the konwert program too.
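I don't have the exact code at hand, but the gist of it is a one-regexp job, roughly like this (a from-scratch PHP sketch with made-up function names, not the actual Phase1->Phase2 code):

    <?php
    # Replace numeric character references like &#321; or &#x141; with the
    # corresponding UTF-8 bytes. The pattern requires a literal '&#', so
    # &amp;-escaped forms are left alone.
    function codepointToUtf8( $cp ) {
        # Minimal encoder, enough for the Basic Multilingual Plane.
        if ( $cp < 0x80 )  return chr( $cp );
        if ( $cp < 0x800 ) return chr( 0xC0 | ( $cp >> 6 ) )
                                . chr( 0x80 | ( $cp & 0x3F ) );
        return chr( 0xE0 | ( $cp >> 12 ) )
             . chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
             . chr( 0x80 | ( $cp & 0x3F ) );
    }

    function numericEntityCallback( $m ) {
        $cp = ( isset( $m[2] ) && $m[2] !== '' ) ? hexdec( $m[2] ) : intval( $m[1] );
        return codepointToUtf8( $cp );
    }

    function decodeNumericEntities( $text ) {
        return preg_replace_callback( '/&#(?:(\d+)|[xX]([0-9A-Fa-f]+));/',
                                      'numericEntityCallback', $text );
    }
    ?>

Whether we also want to touch named entities like &eacute; is a separate question.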
On Tuesday 18 November 2003 00:57, Tomasz Wegrzanowski wrote:
On Mon, Nov 17, 2003 at 03:28:46PM -0800, Brion Vibber wrote:
We'll have to go down to read-only mode while copying stuff over anyway, so this consolidates downtime. The conversion itself can be done by simply piping the database dump through iconv as it's being copied into the new db.
It would be better if numeric entities were converted too. The code is somewhere in the Phase1->Phase2 conversion script, and in the konwert program too.
Note also that cases like &amp;#1071; must NOT be converted, as the intention there is for the entity to appear literally in the displayed text. Perhaps it's easiest not to convert & at all.
It would also be nice if whoever does the conversion told us the size of the DB dump before and after the conversion :) I'm quite sure there'll be a decrease.
Would it also make sense to convert [[sr:%D0...]] interlanguage links for a further decrease in size? I don't know whether there is a tool that could do it, but if not I could make a C program for it in five minutes.
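For what it's worth, the whole job is a single regexp pass over the text; here is a sketch of the idea in PHP (the real thing could just as well be those five minutes of C; the function names are made up and the language-prefix pattern is only illustrative, and it's only safe once the local wiki itself stores UTF-8):

    <?php
    # Sketch: decode percent-encoded interlanguage link targets, e.g.
    # [[sr:%D0%9F%D0%B0%D1%80%D0%B8%D0%B7]] -> [[sr:Париз]].
    # (Piped links would need a slightly fancier pattern.)
    function linkDecodeCallback( $m ) {
        return '[[' . $m[1] . ':' . rawurldecode( $m[2] ) . ']]';
    }

    function decodeInterlanguageLinks( $text ) {
        return preg_replace_callback(
            '/\[\[([a-z][a-z-]{1,10}):((?:%[0-9A-Fa-f]{2})[^\[\]|]*)\]\]/',
            'linkDecodeCallback', $text );
    }
    ?>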
Tomasz-
- broken browsers - they should be upgraded, if someone has browser so old that it doesn't grok UTF-8, it's not going to grok CSS, PNGs, and other things we're using either. Unless we want to remove all CSS and PNGs, there's no point in not using UTF-8.
Is this true? All I know is that we had a *lot* of problems with broken special chars on the Meta-Wiki during the logo contest. I have no idea which browser broke them, but it seems to be a not totally uncommon one, perhaps in the 5% range. Given that a single edit by such a person will break an entire page, it might not be so wise to switch (but perhaps I'm missing something -- is Meta running UTF-8?).
Regards,
Erik
Erik Moeller wrote:
perhaps in the 5% range. Given that a single edit by such a person will break an entire page, it might not be so wise to switch (but perhaps I'm missing something -- is Meta running UTF-8?).
Would it be possible to let the database run on UTF-8 internally, but to let the PHP script analyze and convert data to and from certain browsers? Perhaps the majority of users are using UTF-8-capable browsers, so the conversion would use a minimum of resources.
All I know is that MySQL has better UTF-8 support from version 4.1.x, as described in chapter 9, http://www.mysql.com/doc/en/Charset.html The same goes for Perl version 5.8, but what about PHP?
On Nov 17, 2003, at 20:06, Lars Aronsson wrote:
Erik Moeller wrote:
perhaps in the 5% range. Given that a single edit by such a person will break an entire page, it might not be so wise to switch (but perhaps I'm missing something -- is Meta running UTF-8?).
Would it be possible to let the database run on UTF-8 internally, but to let the PHP script analyze and convert data to and from certain browsers? Perhaps the majority of users are using UTF-8-capable browsers, so the conversion would use a minimum of resources.
Certainly possible, as long as care is taken to keep round-trips clean.
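A very rough idea of what such a shim could look like, assuming mbstring is available and that we key off the User-Agent (none of this exists yet; the function names are made up and the browser check is a toy):

    <?php
    # Sketch of a compatibility layer for browsers that can't edit UTF-8.
    function browserHandlesUtf8() {
        # Toy check: treat Netscape 4.x as the problem case. A real check
        # would use a proper blacklist.
        $ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';
        return !( strpos( $ua, 'Mozilla/4.' ) === 0
                  && strpos( $ua, 'MSIE' ) === false );
    }

    # Outgoing text: non-ASCII characters become character references that
    # an old browser can display and will send back unchanged.
    function textForOutput( $utf8 ) {
        return browserHandlesUtf8()
            ? $utf8
            : mb_convert_encoding( $utf8, 'HTML-ENTITIES', 'UTF-8' );
    }

    # Incoming edit: turn the references back into UTF-8 before saving.
    # Round-trip caveats: entities the user typed on purpose get decoded
    # too, and raw 8-bit input from the old browser would still need
    # charset sniffing. That's the "care" part.
    function textFromEdit( $posted ) {
        return browserHandlesUtf8()
            ? $posted
            : mb_convert_encoding( $posted, 'UTF-8', 'HTML-ENTITIES' );
    }
    ?>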
Another possibility is simply to 'blacklist' known problem browsers by printing a notice/link to better browsers on the edit page warning that they may have problems, as we now have a warning on long pages that some browsers may have problems. (Though in that case we aren't checking specific browsers.)
The main problem browser these days is Internet Explorer for Mac; it's years out of date and the most recent version still doesn't grok UTF-8 for editing. The most recent Macs ship with Safari as the default, but most existing Macs out there are going to have IE or (shudder) Netscape 4.x as the default browser.
All I know is that MySQL has better UTF-8 support from version 4.1.x, as described in chapter 9, http://www.mysql.com/doc/en/Charset.html The same goes for Perl version 5.8, but what about PHP?
PHP currently has pretty much no UTF-8 support aside from some conversion functions. Strings are treated as arbitrary-length byte sequences, and we've got some custom functions to deal with case changing and the like.
There are some multibyte character set support functions which may or may not be suitable for replacing the Utf8Case functions; that should get looked into.
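For example, something like this might be able to stand in for the custom case code if mbstring is compiled in (untested, just the shape of it; utf8Ucfirst is a made-up name, and whether the mappings match what Utf8Case does would need checking):

    <?php
    # Possible mbstring-based case handling for UTF-8 strings.
    mb_internal_encoding( 'UTF-8' );

    # Uppercase the first character only, as page titles need.
    function utf8Ucfirst( $str ) {
        return mb_strtoupper( mb_substr( $str, 0, 1 ) ) . mb_substr( $str, 1 );
    }

    echo mb_strtoupper( 'łódź' ), "\n";    # expect ŁÓDŹ
    echo mb_strtolower( 'ĄĆĘ' ), "\n";     # expect ąćę
    echo utf8Ucfirst( 'ćevapčići' ), "\n"; # expect Ćevapčići
    ?>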
-- brion vibber (brion @ pobox.com)
Brion-
Another possibility is simply to 'blacklist' known problem browsers by printing a notice/link to better browsers on the edit page warning that they may have problems,
For the record, I would support blocking problematic browsers from editing entirely. If UTF-8 has enough advantages (and many people seem to think it does), then telling 2% of the userbase that their browser is outdated and corrupts pages when editing seems acceptable. Perhaps we can do it in a smart way and provide links to alternatives for each blacklisted browser (e.g. Netscape 4->Mozilla).
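Something as simple as a lookup from User-Agent substring to advice would do; for instance (a sketch only - the variable and function names are made up, and the patterns and suggestions merely illustrate the idea):

    <?php
    # Sketch: match the User-Agent against a short blacklist and show an
    # upgrade suggestion on the edit page. Patterns are illustrative only.
    $blacklistAdvice = array(
        'Mozilla/4.7' => 'Netscape 4.x corrupts UTF-8 pages when editing; Mozilla is a good replacement.',
        'Mac_PowerPC' => 'Internet Explorer for Mac cannot edit UTF-8; Safari or Mozilla can.',
    );

    function utf8EditWarning( $userAgent, $advice ) {
        foreach ( $advice as $pattern => $message ) {
            if ( strpos( $userAgent, $pattern ) !== false ) {
                return $message;
            }
        }
        return '';
    }

    $warning = utf8EditWarning( $_SERVER['HTTP_USER_AGENT'], $blacklistAdvice );
    if ( $warning !== '' ) {
        echo '<div class="editwarning">' . htmlspecialchars( $warning ) . '</div>';
    }
    ?>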
On the other hand, I would not support any solution that still leads to corrupted pages.
Regards,
Erik
Here's a breakdown of browsers used to save edits over a day on de.wikipedia.org. I checked de because it's the largest non-English non-UTF-8-using Wikipedia.
Total posts in the day: 2861
Browsers known to work with editing UTF-8:
  MSIE/Win:   1146
  Gecko:       966  [549 Windows; 379 Linux/Unix; 31 Mac]
  Opera 6+:    331  [291 Windows; 40 Linux]
  Konqueror:   216
  Safari:      137
  total:      2796  (98%)

Browsers which can be problematic with UTF-8:
  old Netscape:  25
  ELinks:        20  (not tested, but it's a text-mode browser and these tend to have problems in this area)
  MSIE/Mac:      10
  Lynx:           4  (text-mode again)
  Opera <6:       2
  total:         61  (2%)
There's some margin of error; my regexps lost track of 4 hits or so somewhere in that mess.
I'm actually surprised that the Internet Explorer/Mac quotient is so low, though it's a pleasant surprise. :)
Anyway, adding a notice and/or doing some extra conversion for those 2% shouldn't be a huge server burden.
-- brion vibber (brion @ pobox.com)
On Tue, 18 Nov 2003, Brion Vibber wrote:
Here's a breakdown of browsers used to save edits over a day on de.wikipedia.org. I checked de because it's the largest non-English non-UTF-8-using Wikipedia.
Total posts in the day: 2861
Browsers known to work with editing UTF-8:
  MSIE/Win:   1146
  Gecko:       966  [549 Windows; 379 Linux/Unix; 31 Mac]
  Opera 6+:    331  [291 Windows; 40 Linux]
  Konqueror:   216
  Safari:      137
  total:      2796  (98%)

Browsers which can be problematic with UTF-8:
  old Netscape:  25
  ELinks:        20  (not tested, but it's a text-mode browser and these tend to have problems in this area)
  MSIE/Mac:      10
  Lynx:           4  (text-mode again)
  Opera <6:       2
  total:         61  (2%)
There's some margin of error; my regexps lost track of 4 hits or so somewhere in that mess.
I did the actual check for Lynx. Lynx does a commendable job at reading UTF-8 (mapping each character to the closest transcription into Latin-1), but editing with it will cause disasters (it will send the output in the same way, thus with all non-Latin-1 characters changed to their closest Latin-1 equivalent).
Although I do like having UTF-8 (recently I had to add dots that do not belong there to 'Hisarlik' in my nl: article on Schliemann), I also think that 2% of our userbase is too large a group to simply get rid of, or to allow to inadvertently destroy pages.
Andre Engels
Brion Vibber wrote:
Anyway, adding a notice and/or doing some extra conversion for those 2% shouldn't be a huge server burden.
Would that be only for posting, or will the 2% also be blocked from reading the site? 2% of Wikipedia's readers is a lot.
On the other hand, staying with Latin-1 might exclude some active contributors (Tomasz W.) who want to push the envelope. And wiki experience says it pays off to cater to your best contributors.
On Nov 18, 2003, at 20:31, Lars Aronsson wrote:
Brion Vibber wrote:
Anyway, adding a notice and/or doing some extra conversion for those 2% shouldn't be a huge server burden.
Would that be only for posting, or will the 2% also be blocked from reading the site? 2% of Wikipedia's readers is a lot.
Nobody's going to be blocked.
"NOTICE" and "CONVERSION" don't sound like blocking to me, and I hope they don't to anyone else.
You'll notice we *don't* block browsers that are known to damage long articles during editing; we warn about the problem and when things do go awry anyway people can fix it up.
Can we all drop the paranoia? We're not switching the existing Latin-1 phase3 wikis to UTF-8 until compatibility code is in place. That's what we decided over a year ago when the issue was first raised, no?
-- brion vibber (brion @ pobox.com)
Is there any way to find out which articles are visited most or how often a certain article was visited/read?
Jurriaan
The per-article counter in the database was disabled some months ago for performance reasons, and it is unlikely to come back. Perhaps some administrator can parse the HTTP log files, but that would be a one-off task unless it's automated by writing some programs or plugging in existing ones.
Ciao Alfio
On Wed, 19 Nov 2003, Jurriaan Schulman wrote:
Is there any way to find out which articles are visited most or how often a certain article was visited/read?
Jurriaan
On Tue, Nov 18, 2003 at 12:09:26AM -0800, Brion Vibber wrote:
Here's a breakdown of browsers used to save edits over a day on de.wikipedia.org. I checked de because it's the largest non-English non-UTF-8-using Wikipedia.
Total posts in the day: 2861
Browsers known to work with editing UTF-8:
  MSIE/Win:   1146
  Gecko:       966  [549 Windows; 379 Linux/Unix; 31 Mac]
  Opera 6+:    331  [291 Windows; 40 Linux]
  Konqueror:   216
  Safari:      137
  total:      2796  (98%)

Browsers which can be problematic with UTF-8:
  old Netscape:  25
  ELinks:        20  (not tested, but it's a text-mode browser and these tend to have problems in this area)
  MSIE/Mac:      10
  Lynx:           4  (text-mode again)
  Opera <6:       2
  total:         61  (2%)
There's some margin of error; my regexps lost track of 4 hits or so somewhere in that mess.
I'm actually surprised that the Internet Explorer/Mac quotient is so low, though it's a pleasant surprise. :)
Anyway, adding a notice and/or doing some extra conversion for those 2% shouldn't be a huge server burden.
Reasonably recent Lynx doesn't have problems with Unicode, so I don't expect ELinks to have them either. All the other problematic browsers are really ancient software, and there's nothing wrong with expecting people to upgrade. We're not the only website that doesn't support them.

Just placing a notice when a problematic browser is detected would suffice.
On Tue, Nov 18, 2003 at 01:24:00AM +0100, Erik Moeller wrote:
Tomasz-
- broken browsers - they should be upgraded, if someone has browser so old that it doesn't grok UTF-8, it's not going to grok CSS, PNGs, and other things we're using either. Unless we want to remove all CSS and PNGs, there's no point in not using UTF-8.
Is this true? All I know is that we had a *lot* of problems with broken special chars on the Meta-Wiki during the logo contest. I have no idea which browser broke them, but it seems to be a not totally uncommon one, perhaps in the 5% range. Given that a single edit by such a person will break an entire page, it might not be so wise to switch (but perhaps I'm missing something -- is Meta running UTF-8?).
It's nothing like 5% - it's at least an order of magnitude smaller. On the Polish Wikipedia there weren't any serious problems with UTF-8-incompatible browsers.
Hello,
On 18-11-2003 you (Erik Moeller) wrote:
Tomasz-
- broken browsers - they should be upgraded. If someone has a browser so old that it doesn't grok UTF-8, it's not going to grok CSS, PNGs, and other things we're using either. Unless we want to remove all CSS and PNGs, there's no point in not using UTF-8.
Is this true?
No, it isn't. Though in some browsers these developments happened in the same timeframe, in others they didn't. It also ignores two other obvious points:
- Without CSS a page should still be readable. Without PNG a page should still be readable. Mess up the encoding and the page becomes a rebus, at best.
- The gift of looking through the Ethernet wires and telephone cables into all the computer rooms in the world is sufficiently rare that one should never demand that a user upgrade. Feel free to tell the user you're too lazy to support their software, but blaming the breakage on them is quite insulting.
So, what are our plans for doing this in a way that will allow the W to adapt, so as not to drive anyone away?
EM> All I know is that we had a *lot* of problems with broken
EM> special chars on the Meta-Wiki during the logo contest. I have no idea
EM> which browser broke them, but it seems to be a not totally uncommon one,
EM> perhaps in the 5% range. Given that a single edit by such a person will
EM> break an entire page, it might not be so wise to switch (but perhaps I'm
EM> missing something -- is Meta running UTF-8?).
We (fy:) are having this problem now, but here it appears to have something to do with our language files. One language file is OK (though the localisation is less than perfect); the next, improved version is not.
(There are also still some English-language strings that I can't seem to find in the language-file, but that's probably another matter.)
Sincerely,
MediaWiki is user-supported, open-source software. If the software doesn't support a minority browser, it's because the people using that browser are 'too lazy' to add support for it. There should be no compulsion, IMHO, on anyone to write support for such browsers. If you want it so badly, write it yourself and submit a patch.
Russell
Aliter wrote:
never demand a user to upgrade. Feel free to tell the user you're too lazy to support their software but blaming the break on them is quite insulting.
Hi, maybe one of you can help... When going to http://en.wikipedia.org, why do I get the message:
Forbidden
You don't have permission to access / on this server.

Apache/1.3.28 Server at en.wikipedia.org Port 80
Is it a problem with my firewall, or is there something amiss with the server?
Thanks, Jay B.
Hi/Salute/Saluton, This was quick... actually I think my inbox just got your message as I downloaded from my ISP...
You asked...
On Nov 19, 2003, at 14:16, Jay Bowks wrote:
You don't have permission to access / on this server.

Does your firewall remove your browser's "user-agent" header?

-- brion vibber (brion @ pobox.com)
I disabled the personal firewall, which does have some privacy features included, but this didn't seem to help.
I still get
" Forbidden You don't have permission to access / on this server. Apache/1.3.28 Server at en.wikipedia.org Port 80"
Could my ISP's IP's have been blacklisted for some reason?
If you have some idea of what else I could try I'd appreciate it...
Thanks/Gratias/Dankon Jay B.
Does your firewall remove your browser's "user-agent" header? -- brion vibber (brion @ pobox.com)
After much tweaking of settings in Norton's and Windows XP's firewalls I was able to gain access to the Wikipedia sites. It seems my computer has been the target of buffer overflow attacks, according to the logs, which were blocked by Norton's firewall; the program went on the defensive, shutting down privacy/information access. From Brion's explanation of cookies with Safari I figured I'd take a look at the cookie settings in Norton's, and sure enough it was denying cookies. I set it back to eating these forbidden treats and it now seems to access Wikipedia OK. (And even though I mostly use XP at home, I also have a Linux-Mandrake box I use from time to time. I do use Safari at work, on an iMac with OS X, one of the 8 computers in my classroom. I will pass on the message about insecure cookies to the other teachers who also use it at the three schools I work in; they're Mac-inundated, although most of them are older iMacs running OS 9, while the newer eMacs all come with OS X.)
Ok, enough of my jubilant chat for now, thanks Brion for the suggestions, and for the patience of those on Wikitech-l.
Cheers! Jay B.