On Mon, Nov 17, 2003 at 06:06:00PM +0100, Vincent Ramos wrote:
> Cons:
> - any text containing non-ASCII characters would increase its weight:
>   instead of one byte for a single <c with cedilla>, it would require
>   two; French uses lots of non-ASCII characters, such as é è ç à ù;
It's not obvious it's going to be bigger at all - in UTF-8 you need only 2-3 bytes instead of about 8 (&#x1234;) for characters not in Latin 1.
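To make the per-character arithmetic concrete, here is a minimal Python sketch. The helper name byte_costs is made up for illustration, and it assumes numeric entities are written in the hexadecimal &#x...; form; decimal entities would differ by a byte or so.

```python
def byte_costs(ch):
    """Return (latin1_bytes, utf8_bytes, entity_bytes) for one character.

    latin1_bytes is None when the character does not exist in Latin 1.
    byte_costs is a hypothetical helper, not part of any Wikipedia code.
    """
    try:
        latin1 = len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        latin1 = None
    utf8 = len(ch.encode("utf-8"))
    # Assumption: the entity is written in hex form, e.g. &#x1234;
    entity = len("&#x%X;" % ord(ch))
    return latin1, utf8, entity


# Plain ASCII, two French accented letters, and one character outside Latin 1.
for ch in "e\u00e9\u00e7\u1234":
    print(repr(ch), byte_costs(ch))
```

So an accented Latin 1 letter costs one extra byte in UTF-8, while a character outside Latin 1 drops from 8 bytes as an entity to 3 bytes.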
French main page:
  Latin 1 (as is):                 28035
  Naively converted to UTF-8:      28235 (0.7% bigger)
  With all &entities; UTF-8-ized:  27867 (0.6% smaller)
The last number is smaller than Latin 1, but some of it is cheating - the French main page contained &codes; for some characters that could be represented directly in Latin 1 (ü), and some of the conversions weren't completely legal (&amp; -> &).
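For anyone who wants to repeat this kind of measurement, here is a rough Python sketch of the three counts. It is not the script used for the numbers above; page_sizes and decode_entity are hypothetical names, and it assumes the page is available as a string containing only Latin 1 characters plus entities. Unlike the quick conversion described above, it leaves &amp;, &lt; and friends alone, since decoding them would change the markup.

```python
import html
import re

# Matches decimal, hexadecimal and named entities such as &#233; &#x1234; &uuml;
ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")
# Characters that must stay escaped to keep the markup valid.
MARKUP_CHARS = {"&", "<", ">", '"', "'"}


def decode_entity(match):
    """Replace one entity by its character, unless that would break markup."""
    decoded = html.unescape(match.group(0))
    return match.group(0) if decoded in MARKUP_CHARS else decoded


def page_sizes(page):
    """Return (latin1, naive_utf8, entities_decoded_utf8) sizes in bytes.

    Assumes `page` contains only Latin 1 characters plus entities.
    """
    latin1 = len(page.encode("latin-1"))
    naive_utf8 = len(page.encode("utf-8"))
    decoded_utf8 = len(ENTITY.sub(decode_entity, page).encode("utf-8"))
    return latin1, naive_utf8, decoded_utf8


sample = "Fran\u00e7ais &uuml; &#x1234; &amp; caf\u00e9"
print(page_sizes(sample))
```

The sample string is made up; on a real page the gap between the second and third number depends on how many entities the text actually uses.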
Still, there's no reason to believe UTF-8 is going to be significantly worse spacewise than ISO-8859-1. For English Wikipedia it's obviously going to be smaller, as it uses even fewer characters from the 128-255 range.