Thanks for the Java API ref. But I'm curious how or where invalid UTF-8
sequences come about; is it primarily a hacker thing? I see the Java Character API has an
isValidCodePoint() method. Do I just run each code point through that?
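I.e., I'm imagining something along these lines, run over the already-decoded
string (just a sketch of what I mean; 'input' is the decoded text):

    // Walk the string one code point at a time and check each value.
    for (int i = 0; i < input.length(); ) {
        int cp = input.codePointAt(i);
        if (!Character.isValidCodePoint(cp)) {
            // reject or repair the text here?
        }
        i += Character.charCount(cp);
    }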
Thanks,
al
________________________________
From: Brion Vibber <brion(a)pobox.com>
To: Al Johnson <alj62888(a)yahoo.com>; MediaWiki announcements and site admin
list <mediawiki-l(a)lists.wikimedia.org>
Sent: Wednesday, December 26, 2012 8:40 PM
Subject: Re: [MediaWiki-l] Normalization Code
On Wed, Dec 26, 2012 at 2:20 PM, Al Johnson wrote:
I need to make sure a backend Java process is doing the same Unicode normalization that is
done for edit text. Grep'ing for 'normaliz' brings up a lot, and I'm not a
PHP dev. Can someone point me to a key PHP module and/or function?
The PHP code for this is in includes/normal -- luckily you shouldn't have to replicate
most of that code, which is nasty and low-level.
For the most part, you want to do two things:
* make sure the input is valid UTF-8
* normalize any composition character sequences to 'normalization form C'
Reading data from a UTF-8 input stream into a Java string should already take care of
making sure it's valid UTF-8. :) If you want to treat invalid input the same way
MediaWiki does, make sure that invalid UTF-8 sequences get converted to the replacement
character U+FFFD rather than throwing an exception.
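In Java you can make that policy explicit with a CharsetDecoder -- a minimal
sketch (the method name here is just for illustration):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Decode bytes as UTF-8, substituting U+FFFD for each malformed
    // sequence instead of throwing.
    static String decodeUtf8Lossy(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // With REPLACE set, decode() won't actually throw.
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

(new String(bytes, "UTF-8") happens to do the same replacement by default, but
the explicit decoder makes the intent visible.)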
It looks like you should be able to use the java.text.Normalizer class to convert to NFC:
<http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html>
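The call itself is a one-liner; something like:

    import java.text.Normalizer;

    // Convert to Normalization Form C. isNormalized() lets you skip
    // the work when the text is already in NFC.
    if (!Normalizer.isNormalized(input, Normalizer.Form.NFC)) {
        input = Normalizer.normalize(input, Normalizer.Form.NFC);
    }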
You might or might not prefer to use the Java version of the ICU library (ICU4J) to do
the same thing; it may be more up to date:
<http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Normalizer.html>
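With that class the equivalent call would be something like (untested):

    import com.ibm.icu.text.Normalizer;

    // ICU4J's Normalizer does the same NFC conversion, typically
    // tracking a newer Unicode version than the JDK.
    String nfc = Normalizer.normalize(input, Normalizer.NFC);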
-- brion