UTF8 Normalization


Jared Williams

22 Nov 2008

11:54 a.m.

Hi, I patched UtfNormal.php to use the new intl extension's normalization function, http://php.net/manual/en/normalizer.normalize.php

But running UtfNormalTest.php causes 200+ errors, whilst the PHP normalization routines are error free.

So I'm kind of at a standstill, wondering what the problem is. I guess WM uses utf8_normalize(), which is based on ICU like intl; does that have the same errors?

Jared

Show replies by date

Brion Vibber

24 Nov 2008

5:17 p.m.


Jared Williams wrote:


I patched UtfNormal.php to use the new intl extension's normalization function, http://php.net/manual/en/normalizer.normalize.php

But running UtfNormalTest.php causes 200+ errors, whilst the PHP normalization routines are error free.

So I'm kind of at a standstill, wondering what the problem is. I guess WM uses utf8_normalize(), which is based on ICU like intl; does that have the same errors?

It may be that intl is using an ICU library which supports an older version of Unicode. UtfNormal is currently built using the Unicode 5.1 files, as I recall, so will have the 5.1 test cases.

For the most part, this means that the newer version includes mapping rules for newly-added characters -- normalization is deliberately designed to be compatible and existing rules should never change. (Though there might be a few exceptions changed as errata, I forget.)

-- brion


Jared Williams

6:07 p.m.


-----Original Message----- From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Brion Vibber Sent: 24 November 2008 19:17 To: Wikimedia developers Subject: Re: [Wikitech-l] UTF8 Normalization


It may be that intl is using an ICU library which supports an older version of Unicode. UtfNormal is currently built using the Unicode 5.1 files, as I recall, so will have the 5.1 test cases.

For the most part, this means that the newer version includes mapping rules for newly-added characters -- normalization is deliberately designed to be compatible and existing rules should never change. (Though there might be a few exceptions changed as errata, I forget.)

Ah, the php.net-provided intl is built with ICU 3.6, which is Unicode 5.0.

Jared
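
The version mismatch discussed in this thread can be illustrated with a minimal sketch. It is written in Python against the standard unicodedata module, not PHP's intl extension or UtfNormal.php, so it is only an analogy: the helper names (unicode_version, failing_cases, unknown_code_points) and the tiny test set are hypothetical and were not part of the patch. The idea is that a conformance test built from a newer Unicode release can fail on a library built against an older one, because that library treats the newly added characters as unassigned.

```python
import unicodedata


def unicode_version():
    """Return the Unicode version of the library's database as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def failing_cases(cases, form="NFC"):
    """Return the (source, expected) pairs whose normalized form differs from expected."""
    return [(src, exp) for src, exp in cases
            if unicodedata.normalize(form, src) != exp]


def unknown_code_points(text):
    """Return characters the library reports as unassigned (general category Cn).

    A test case that contains such characters probably comes from a newer
    Unicode release than the one the library was built with.
    """
    return [ch for ch in text if unicodedata.category(ch) == "Cn"]


# Hypothetical mini test set in the spirit of the Unicode normalization tests.
CASES = [
    ("e\u0301", "\u00e9"),  # e + COMBINING ACUTE ACCENT composes to U+00E9
    ("\u212b", "\u00c5"),   # ANGSTROM SIGN maps to U+00C5
    ("A\u030a", "\u00c5"),  # A + COMBINING RING ABOVE composes to U+00C5
]

if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for src, exp in failing_cases(CASES):
        # Separate genuine failures from cases the library cannot know about.
        print(repr(src), "->", repr(exp), "unknown:", unknown_code_points(src + exp))
```

Under these assumptions, failures whose strings contain unassigned code points point to a data-version gap rather than a bug in the normalization code, which matches the ICU 3.6 (Unicode 5.0) versus Unicode 5.1 test data explanation above.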
