Hello,
I need to make sure a backend Java process is doing the same UTF normalization that is done for edit text. Grepping for 'normaliz' brings up a lot, and I'm not a PHP dev. Can someone point me to a key PHP module and/or function?
Thank you, Al
On Wed, Dec 26, 2012 at 10:20 PM, Al Johnson alj62888@yahoo.com wrote:
I guess (it really is a guess) start with includes/normal/UtfNormal.php and https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=includes...
On Wed, Dec 26, 2012 at 2:20 PM, Al Johnson wrote:
The PHP code for this is in includes/normal -- luckily you shouldn't have to replicate most of that code, which is nasty and low-level.
For the most part, you want to do two things:
* make sure the input is valid UTF-8
* normalize any composition character sequences to 'normalization form C'
Reading data in from a UTF-8 input stream into a Java string should already take care of making sure it's valid UTF-8. :) If you want to treat invalid input the same, make sure that invalid UTF-8 sequences get converted to the 'replacement character' U+FFFD rather than throwing an exception.
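A minimal sketch of that replacement behavior (the byte values here are just an illustrative invalid sequence): Java's `String` constructor already substitutes U+FFFD for malformed UTF-8 rather than throwing.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ReplaceDemo {
    public static void main(String[] args) {
        // 0xC3 starts a two-byte UTF-8 sequence, but 0x28 ('(') is not a
        // valid continuation byte, so this input is malformed UTF-8.
        byte[] bad = { (byte) 0xC3, 0x28 };

        // The String constructor silently replaces the malformed byte
        // with U+FFFD (the replacement character) instead of throwing.
        String decoded = new String(bad, StandardCharsets.UTF_8);
        System.out.println(decoded.charAt(0) == '\uFFFD'); // true
        System.out.println(decoded);                        // \uFFFD followed by '('
    }
}
```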
It looks like you should be able to use the java.text.Normalizer class to convert to NFC: http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
You might or might not prefer to use the Java version of the ICU library to do the same thing; it might be more up to date: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer.html
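A quick sketch of the java.text.Normalizer call (the sample string is illustrative): a decomposed "e + combining acute accent" composes to a single code point under NFC.

```java
import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        // "é" written as 'e' followed by U+0301 (combining acute accent),
        // i.e. the decomposed (NFD) form -- two code points.
        String decomposed = "e\u0301";

        // NFC composes the pair into the single code point U+00E9.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals("\u00E9")); // true
        System.out.println(nfc.length());         // 1
    }
}
```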
-- brion
Thanks for the Java API ref. But I'm curious as to how or where invalid UTF-8 sequences come about; is it primarily a hacker thing? I see the Java Character API has an isValidCodePoint() method. Do I just run each code point through that?
Thanks, al
From: Brion Vibber brion@pobox.com
To: Al Johnson alj62888@yahoo.com; MediaWiki announcements and site admin list mediawiki-l@lists.wikimedia.org
Sent: Wednesday, December 26, 2012 8:40 PM
Subject: Re: [MediaWiki-l] Normalization Code
On Wed, Dec 26, 2012 at 9:14 PM, Al Johnson alj62888@yahoo.com wrote:
Thanks for the Java API ref. But, I'm curious as to how or where invalid UTF-8 sequences come about; is it primarily a hacker thing?
Most frequently due to buggy bot tools or reaaaally old browsers that didn't support UTF-8 correctly.
I see the Java Character API has an isValidCodePoint() method. Do I just run each code point through that?
By the time your data is in Java String objects or 'char's it's already been decoded from UTF-8 (8-bit byte stream) into UTF-16 (16-bit character string). I don't remember offhand enough about Java I/O to tell you exactly what class in the input stack is doing that though. :)
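For what it's worth, the class doing that decoding in the standard I/O stack is InputStreamReader, which wraps a CharsetDecoder. A sketch, assuming you want the MediaWiki-style behavior (replace with U+FFFD rather than throw) made explicit -- the byte values are illustrative:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
    public static void main(String[] args) throws IOException {
        // 'a', then a malformed UTF-8 sequence (0xC3 0x28), then 'b'.
        byte[] bad = { 'a', (byte) 0xC3, 0x28, 'b' };

        // Configure the decoder explicitly so malformed input becomes
        // U+FFFD instead of raising MalformedInputException. (The
        // Charset-based InputStreamReader constructors default to
        // REPLACE anyway; a raw CharsetDecoder defaults to REPORT.)
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bad), decoder);
             BufferedReader br = new BufferedReader(r)) {
            String line = br.readLine();
            // The bad byte has become U+FFFD by the time it's a String.
            System.out.println(line.equals("a\uFFFD(b")); // true
        }
    }
}
```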
-- brion