Re: UTF-8 for nl:wikipedia


Stephan Walter

17 Aug 2004

8:22 a.m.

Gerard Meijssen wrote:

...

some time ago to move to UTF-8. At the time it was not such a good idea as UTF-8 does take more room.

More room? UTF-8 does not use more memory, if that's what you mean. HTML entities (like &#1234;) use 5 up to 7 bytes, while a character in UTF-8 uses at most 4 bytes.
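As a rough check of these byte counts, here is a minimal Python sketch. The sample characters and the helper names `entity_len` and `utf8_len` are chosen here for illustration and are not from the thread; the sketch assumes decimal numeric character references.

```python
def entity_len(ch: str) -> int:
    """Bytes needed for the decimal numeric character reference of ch."""
    return len(f"&#{ord(ch)};".encode("ascii"))


def utf8_len(ch: str) -> int:
    """Bytes needed for ch when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Illustrative sample: Latin-1, Latin Extended, Cyrillic, a symbol, and a
# character outside the Basic Multilingual Plane.
samples = ["\u00e9", "\u0141", "\u04d2", "\u20ac", "\U00010348"]

for ch in samples:
    print(f"U+{ord(ch):04X}: entity {entity_len(ch)} bytes, UTF-8 {utf8_len(ch)} bytes")
```

For every character in this sample the UTF-8 form is shorter than the numeric entity; named entities such as `&eacute;` would need a separate lookup.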

Regards, Stephan

-- Stephan.Walter@epfl.ch -- http://lart.info/~stw/ -- PGP: B2421799


Ævar Arnfjörð Bjarmason

17 Aug 2004

10:22 a.m.

New subject: UTF-8 for nl:wikipedia

He is probably confusing it with UTF-16 or UTF-32.

On Tue, 17 Aug 2004 14:22:04 +0200, Stephan Walter stephan.walter@epfl.ch wrote:

...

Gerard Meijssen wrote:

...
some time ago to move to UTF-8. At the time it was not such a good idea as UTF-8 does take more room.

More room? UTF-8 does not use more memory, if that's what you mean. HTML entities (like &#1234;) use 5 up to 7 bytes, while a character in UTF-8 uses at most 4 bytes.

Regards, Stephan


Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Manuel G R

18 Aug 2004

7:11 p.m.

For ISO 8859-1 characters outside the ASCII range (e.g. á, é, ñ, ç, etc.), UTF-8 actually needs more space: 2 bytes instead of 1, I think. Nevertheless, nl: has few of these characters compared with es: or fr:.
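A minimal Python sketch of that overhead, assuming text that fits entirely in ISO 8859-1. The function name `growth` and the sample strings are made up for illustration:

```python
def growth(text: str) -> int:
    """Extra bytes needed when Latin-1 text is stored as UTF-8 instead."""
    return len(text.encode("utf-8")) - len(text.encode("latin-1"))


# Hypothetical sample strings, not taken from any wiki.
samples = ["plain ascii", "co\u00f6peratie", "espa\u00f1ol, se\u00f1or"]

for text in samples:
    print(f"{text!r}: +{growth(text)} byte(s) in UTF-8")
```

Each character in the range U+0080 to U+00FF costs exactly one extra byte, so the growth equals the number of such characters in the text.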

Ævar Arnfjörð Bjarmason wrote:

...

He is probably confusing it with UTF-16 or UTF-32

On Tue, 17 Aug 2004 14:22:04 +0200, Stephan Walter stephan.walter@epfl.ch wrote:

...
Gerard Meijssen wrote:

...
some time ago to move to UTF-8. At the time it was not such a good idea as UTF-8 does take more room.

More room? UTF-8 does not use more memory, if that's what you mean. HTML entities (like &#1234;) use 5 up to 7 bytes, while a character in UTF-8 uses at most 4 bytes.

Regards, Stephan


Kai F. Lahmann

19 Aug 2004

9:10 a.m.


On Thursday, 19 August 2004 01:11, Manuel G R wrote:

...

In the case of ISO 8859-1 characters outside the ASCII range (e.g. á, é, ñ, ç, etc) it actually needs more space, 1 byte versus 2 bytes, I think. Nevertheless nl: has few of these characters when compared with es: or fr:.

Those are 128 characters that each grow by 1 byte, yes. But many characters also get smaller: HTML entities are 6-8 bytes long, while the same character in UTF-8 takes only 3-4 bytes. And don't forget links like [[Lodz|Łódź]]: 28 bytes before, 14 bytes after (the exact numbers may differ slightly). The same goes for all interwiki links to non-Latin Wikipedias, and even for many Eastern European ones. On de: (known for its heavy use of ä, ö, ü and ß) we saw no visible growth after the conversion.
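The link example can be checked with a short Python sketch. It assumes that, before the conversion, every non-ASCII character in the link was written as a decimal numeric entity; the thread does not say which form was actually used, so this is only one plausible reading.

```python
# The piped link from the message, written with escapes so the source stays ASCII.
link = "[[Lodz|\u0141\u00f3d\u017a]]"

# Assumed pre-conversion form: every non-ASCII character as a &#NNN; entity.
before = link.encode("ascii", errors="xmlcharrefreplace")

# Post-conversion form: the same wikitext stored as UTF-8.
after = link.encode("utf-8")

print(before.decode("ascii"), len(before))
print(len(after))
```

Under this assumption the entity form comes to 28 bytes and the UTF-8 form to 16, close to the figures quoted in the message.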

-- Kai F. Lahmann

1zu160-Bahner http://www.1zu160.net


Timwi

11:26 a.m.

Kai F. Lahmann wrote:

...

...
In the case of ISO 8859-1 characters outside the ASCII range (e.g. á, é, ñ, ç, etc) it actually needs more space, 1 byte versus 2 bytes, I think. Nevertheless nl: has few of these characters when compared with es: or fr:.

those are 128 characters, which get 1 byte bigger, yes. Also there are many characters, which get smaller, as the HTML-Entities are 6-8 Byte long, the same in UTF-8 is only 3-4 Byte. And don't forget links like [[Lodz|Łódź]]: 28 byte before, 14 Byte after (maybe little different..). Same for all Interwiki-Links to non-latin Wikipedias and even for many east-european WPs. On DE: (known to have a massive use of ä, ö ü and ß) we had no visible grow with the conversion.

I really don't know why we are discussing here the space requirements of UTF-8 versus Latin-1. This is absolutely not a criterion, a problem, an issue, or a concern.

We want UTF-8 because it allows us to do things that Latin-1 cannot do (in particular, have article titles with proper characters). Talking about space requirements is pointless. Something that can do less will *obviously* require less space for the things it *can* do, but what use is that if it cannot do what we want.

Timwi

Kai F. Lahmann

2:19 p.m.


On Thursday, 19 August 2004 17:26, Timwi wrote:

...

We want UTF-8 because it allows us to do things that Latin-1 cannot do (in particular, have article titles with proper characters). Talking about space requirements is pointless. Something that can do less will *obviously* require less space for the things it *can* do, but what use is that if it cannot do what we want.

ok.

Three points you should know beforehand:

- Links with encodings such as %FC don't work after the conversion. They should be converted beforehand, especially interlanguage links, where you sometimes can't guess the right spelling.
- Sometimes there are several encodings for a single character; Ð, for example, has 4 or 5 of them.
- We had massive problems with non-ASCII usernames on de:
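The first two points can be sketched in Python. The link target `M%FCnchen` and the helper `latin1_percent_to_utf8` are hypothetical, and reading "several encodings" as entity spellings of the same character is an assumption, not something the message states. `%FC` is the Latin-1 byte for ü; a UTF-8 wiki expects `%C3%BC` instead.

```python
import html
from urllib.parse import quote, unquote


def latin1_percent_to_utf8(target: str) -> str:
    """Re-encode a percent-encoded link target from Latin-1 bytes to UTF-8 bytes."""
    decoded = unquote(target, encoding="latin-1")  # %FC is read as "ü"
    return quote(decoded, safe="/:_")  # quote() encodes as UTF-8 by default


old_target = "M%FCnchen"  # hypothetical pre-conversion link target

# Converted before the switch, the link keeps pointing at the same title.
print(latin1_percent_to_utf8(old_target))  # M%C3%BCnchen

# Read as UTF-8 without conversion, the byte 0xFC is invalid and is replaced
# by U+FFFD, so the title no longer matches.
print(repr(unquote(old_target)))

# Assumed reading of the second point: several spellings for one character.
spellings = ["&ETH;", "&#208;", "&#xD0;"]
print({html.unescape(s) for s in spellings})
```

All three entity spellings decode to the same character, so a conversion script would have to normalise each of them, plus the raw byte and the percent-encoded form.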

-- Kai F. Lahmann

1zu160-Bahner http://www.1zu160.net


Brion Vibber

3:51 p.m.

Kai F. Lahmann wrote:

...

Links with Encodings as %FC don't work after the conversion - they should be converted before, especially for Interlanguage-Links, where you sometimes can't guess the right spelling

That should work fine, can you give some examples?

-- brion vibber (brion @ pobox.com)

Kai F. Lahmann

20 Aug 2004

11:07 a.m.

On Thursday, 19 August 2004 21:51, Brion Vibber wrote:

...

That should work fine, can you give some examples?

See here: http://de.wikipedia.org/wiki/Wikipedia:Umstellung_auf_Unicode/Prozent-Links

-- Kai F. Lahmann 1zu160-Bahner http://www.1zu160.net

Brion Vibber

12:57 p.m.

Kai F. Lahmann wrote:

...

On Thursday, 19 August 2004 21:51, Brion Vibber wrote:

...
That should work fine, can you give some examples?

see there: http://de.wikipedia.org/wiki/Wikipedia:Umstellung_auf_Unicode/Prozent-Links

Those should only ever be used in interwiki links (or of course URLs when linking from elsewhere).

-- brion vibber (brion @ pobox.com)

Kai F. Lahmann

1:09 p.m.

On Friday, 20 August 2004 18:57, Brion Vibber wrote:

...

Those should only ever be used in interwiki links (or of course URLs when linking from elsewhere).

"should", yes ;)

-- Kai F. Lahmann 1zu160-Bahner http://www.1zu160.net

Andre Engels

19 Aug 2004

7:30 p.m.

On Thu, 19 Aug 2004 20:19:48 +0200 "Kai F. Lahmann" kl@3dots.de wrote:

...

3 Points you should know before:

Links with Encodings as %FC don't work after the conversion - they should be converted before, especially for Interlanguage-Links, where you sometimes can't guess the right spelling

I haven't met any of those links on nl:

...

sometimes there are two encodings for only 1 character - for example Ð has 4 or 5 of them.

Sounds more like a problem for having UTF-8 than for changing into it.

...

we had massive problems with non-ASCII-Usernames on de:

There will probably be far fewer of these, because non-ASCII characters are rarer in Dutch than in German. Out of a total of 5245 users, I counted 7 usernames with non-ASCII characters (I might have missed 1 or 2). Five of these have not made any edits, and a sixth was only a one-time interwiki visitor, so the only user who would actually have problems is Känsterle.
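A count like this could be automated along the following lines. This is only a sketch: the function `non_ascii_names` is made up here, and apart from Känsterle the usernames in the sample list are placeholders, not real nl: accounts.

```python
def non_ascii_names(usernames):
    """Return the usernames that contain at least one non-ASCII character."""
    return [name for name in usernames if not name.isascii()]


# Placeholder list; a real check would read the usernames from the user table.
sample = ["ExampleUser", "K\u00e4nsterle", "AnotherUser"]

affected = non_ascii_names(sample)
print(len(affected), affected)
```

`str.isascii()` needs Python 3.7 or later; on older versions, `all(ord(c) < 128 for c in name)` does the same job.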

Andre Engels

wikitech-l@lists.wikimedia.org


participants (7)

Andre Engels
Brion Vibber
Kai F. Lahmann
Manuel G R
Stephan Walter
Timwi
Ævar Arnfjörð Bjarmason