Gerard Meijssen wrote:
some time ago to move to UTF-8. At the time it was not such a good idea as UTF-8 does take more room.
More room? UTF-8 does not use more memory, if that's what you mean. HTML entities (like &#1234;) use 5 to 7 bytes, while a character in UTF-8 uses at most 4 bytes.
Regards, Stephan
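The comparison above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `entity_vs_utf8` and the sample entities are assumptions chosen for the example, not anything taken from the wiki software.

```python
import html
from typing import Tuple


def entity_vs_utf8(entity: str) -> Tuple[int, int]:
    """Return (bytes used by the written-out entity, bytes used by the same character in UTF-8)."""
    char = html.unescape(entity)          # turn the entity back into the character
    return len(entity.encode("ascii")), len(char.encode("utf-8"))


# Sample entities (hypothetical choice): two numeric, one named, one outside Latin/Cyrillic
for entity in ["&#1234;", "&#321;", "&auml;", "&#8364;"]:
    as_entity, as_utf8 = entity_vs_utf8(entity)
    print(f"{entity}: {as_entity} bytes as entity, {as_utf8} bytes in UTF-8")
```

For every entity in this sample, the UTF-8 form is shorter than the spelled-out entity.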
He is probably confusing it with UTF-16 or UTF-32
On Tue, 17 Aug 2004 14:22:04 +0200, Stephan Walter stephan.walter@epfl.ch wrote:
Gerard Meijssen wrote:
some time ago to move to UTF-8. At the time it was not such a good idea as UTF-8 does take more room.
More room? UTF-8 does not use more memory, if that's what you mean. HTML entities (like &#1234;) use 5 to 7 bytes, while a character in UTF-8 uses at most 4 bytes.
Regards, Stephan
-- Stephan.Walter@epfl.ch -- http://lart.info/~stw/ -- PGP: B2421799
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
In the case of ISO 8859-1 characters outside the ASCII range (e.g. á, é, ñ, ç, etc.), UTF-8 actually needs more space: 2 bytes versus 1 byte, I think. Nevertheless, nl: has few of these characters compared with es: or fr:.
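A quick way to see the size difference for accented Latin-1 characters is to encode the same text both ways. The function name `latin1_vs_utf8` and the sample words are assumptions made for this sketch.

```python
from typing import Tuple


def latin1_vs_utf8(text: str) -> Tuple[int, int]:
    """Return (size in ISO 8859-1 bytes, size in UTF-8 bytes) for the same text."""
    return len(text.encode("latin-1")), len(text.encode("utf-8"))


# Each accented letter costs one extra byte in UTF-8; plain ASCII costs nothing extra.
for word in ["ñ", "España", "français", "Nederland"]:
    latin1_size, utf8_size = latin1_vs_utf8(word)
    print(f"{word}: {latin1_size} bytes in ISO 8859-1, {utf8_size} bytes in UTF-8")
```

The overhead is one byte per non-ASCII character, so the total growth depends on how often such characters occur in the text.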
Ævar Arnfjörð Bjarmason wrote:
He is probably confusing it with UTF-16 or UTF-32
On Thursday, 19 August 2004 01:11, Manuel G R wrote:
In the case of ISO 8859-1 characters outside the ASCII range (e.g. á, é, ñ, ç, etc.), UTF-8 actually needs more space: 2 bytes versus 1 byte, I think. Nevertheless, nl: has few of these characters compared with es: or fr:.
Those are 128 characters, each of which gets 1 byte bigger, yes. But there are also many characters that get smaller: the HTML entities are 6-8 bytes long, while the same character in UTF-8 is only 3-4 bytes. And don't forget links like [[Lodz|Łódź]]: 28 bytes before, 14 bytes after (maybe a little different). The same goes for all interwiki links to non-Latin Wikipedias, and even for many East European WPs. On de: (known for its heavy use of ä, ö, ü and ß) we saw no visible growth from the conversion.
-- Kai F. Lahmann
1zu160-Bahner http://www.1zu160.net
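The link example can be measured in the same way. The sketch below rests on assumptions: it supposes the pre-conversion wikitext spelled Ł and ź as numeric entities and stored ó as a single Latin-1 byte, and that after conversion the link is written directly as [[Łódź]]. The exact byte counts depend on how the link was really written, so they need not match the figures quoted in the message.

```python
# Assumed pre-conversion spelling: numeric entities for Ł and ź, Latin-1 byte for ó
before = "[[Lodz|&#321;ód&#378;]]"
# Assumed post-conversion spellings: direct title, or the old piped form kept as-is
after_direct = "[[Łódź]]"
after_piped = "[[Lodz|Łódź]]"

before_bytes = len(before.encode("latin-1"))
after_direct_bytes = len(after_direct.encode("utf-8"))
after_piped_bytes = len(after_piped.encode("utf-8"))

print(f"before (Latin-1 + entities): {before_bytes} bytes")
print(f"after, direct title (UTF-8): {after_direct_bytes} bytes")
print(f"after, piped link (UTF-8):   {after_piped_bytes} bytes")
```

Under these assumptions the link shrinks to roughly half its size, which is in line with the ratio given above.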
Kai F. Lahmann wrote:
In the case of ISO 8859-1 characters outside the ASCII range (e.g. á, é, ñ, ç, etc.), UTF-8 actually needs more space: 2 bytes versus 1 byte, I think. Nevertheless, nl: has few of these characters compared with es: or fr:.
Those are 128 characters, each of which gets 1 byte bigger, yes. But there are also many characters that get smaller: the HTML entities are 6-8 bytes long, while the same character in UTF-8 is only 3-4 bytes. And don't forget links like [[Lodz|Łódź]]: 28 bytes before, 14 bytes after (maybe a little different). The same goes for all interwiki links to non-Latin Wikipedias, and even for many East European WPs. On de: (known for its heavy use of ä, ö, ü and ß) we saw no visible growth from the conversion.
I really don't know why we are discussing here the space requirements of UTF-8 versus Latin-1. This is absolutely not a criterion, a problem, an issue, or a concern.
We want UTF-8 because it allows us to do things that Latin-1 cannot do (in particular, have article titles with proper characters). Talking about space requirements is pointless. Something that can do less will *obviously* require less space for the things it *can* do, but what use is that if it cannot do what we want.
Timwi
On Thursday, 19 August 2004 17:26, Timwi wrote:
We want UTF-8 because it allows us to do things that Latin-1 cannot do (in particular, have article titles with proper characters). Talking about space requirements is pointless. Something that can do less will *obviously* require less space for the things it *can* do, but what use is that if it cannot do what we want.
ok.
Three points you should know beforehand:
- Links with percent-encodings such as %FC stop working after the conversion. They should be converted beforehand, especially interlanguage links, where you sometimes can't guess the right spelling.
- Sometimes there is more than one encoding for a single character; Ð, for example, has 4 or 5 of them.
- We had massive problems with non-ASCII usernames on de:
-- Kai F. Lahmann
1zu160-Bahner http://www.1zu160.net
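The first two points can be illustrated with Python's standard library. This is a sketch under assumptions, not the conversion actually used on de:. The helper names `recode_percent_link` and `normalize`, the sample link target, and the list of spellings for Ð are all hypothetical; the message does not say which 4 or 5 encodings were meant.

```python
import html
from urllib.parse import quote, unquote


def recode_percent_link(link: str) -> str:
    """Re-encode a Latin-1 percent-encoded link target as UTF-8 percent-encoding."""
    text = unquote(link, encoding="latin-1")   # %FC -> ü when read as Latin-1
    return quote(text, encoding="utf-8")       # ü -> %C3%BC when written as UTF-8


print(recode_percent_link("M%FCnchen"))

# Read as UTF-8, the lone byte 0xFC is invalid and is replaced by U+FFFD,
# which is why an unconverted %FC link no longer resolves to the right title.
print(unquote("M%FCnchen", encoding="utf-8", errors="replace"))


def normalize(spelling: str) -> str:
    """Reduce a percent-encoded or entity-encoded spelling to the plain character."""
    return html.unescape(unquote(spelling, encoding="latin-1"))


# Assumed set of pre-conversion spellings of the same character
spellings = ["Ð", "&ETH;", "&#208;", "&#xD0;", "%D0"]
print({normalize(s) for s in spellings})
```

All five spellings collapse to the single character Ð, which is the kind of normalization a pre-conversion pass would need to do.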
Kai F. Lahmann wrote:
- Links with Encodings as %FC don't work after the conversion - they should be
converted before, especially for Interlanguage-Links, where you sometimes can't guess the right spelling
That should work fine, can you give some examples?
-- brion vibber (brion @ pobox.com)
On Thursday, 19 August 2004 21:51, Brion Vibber wrote:
That should work fine, can you give some examples?
see there: http://de.wikipedia.org/wiki/Wikipedia:Umstellung_auf_Unicode/Prozent-Links
Kai F. Lahmann wrote:
On Thursday, 19 August 2004 21:51, Brion Vibber wrote:
That should work fine, can you give some examples?
see there: http://de.wikipedia.org/wiki/Wikipedia:Umstellung_auf_Unicode/Prozent-Links
Those should only ever be used in interwiki links (or of course URLs when linking from elsewhere).
-- brion vibber (brion @ pobox.com)
On Friday, 20 August 2004 18:57, Brion Vibber wrote:
Those should only ever be used in interwiki links (or of course URLs when linking from elsewhere).
"should", yes ;)
On Thu, 19 Aug 2004 20:19:48 +0200 "Kai F. Lahmann" kl@3dots.de wrote:
3 Points you should know before:
- Links with Encodings as %FC don't work after the conversion - they
should be converted before, especially for Interlanguage-Links, where you sometimes can't guess the right spelling
I haven't met any of those links on nl:
- sometimes there are two encodings for only 1 character - for example Ð
has 4 or 5 of them.
That sounds more like a problem with using UTF-8 than with converting to it.
- we had massive problems with non-ASCII-Usernames on de:
There will probably be far fewer of these, because non-ASCII characters are rarer in Dutch than in German. Out of a total of 5245 users, I counted 7 usernames with non-ASCII characters (I might have missed 1 or 2). Five of these have not made any edits, and a sixth was only a one-time interwiki visitor, so the only user who would actually have problems is Känsterle.
Andre Engels