On Wed, Dec 26, 2012 at 9:14 PM, Al Johnson alj62888@yahoo.com wrote:
> Thanks for the Java API ref. But, I'm curious as to how or where invalid UTF-8 sequences come about; is it primarily a hacker thing?
Most frequently due to buggy bot tools or reaaaally old browsers that didn't support UTF-8 correctly.
> I see the Java Character API has an isValidCodePoint() method. Do I just run each code point through that?
By the time your data is in Java String objects or 'char's, it has already been decoded from UTF-8 (a stream of 8-bit bytes) into UTF-16 (a sequence of 16-bit code units), so checking code points at that stage can't tell you whether the original bytes were valid UTF-8. I don't remember offhand enough about Java I/O to tell you exactly what class in the input stack is doing that though. :)
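If you still have the raw bytes, one way to check them before they become a String is a strict java.nio.charset.CharsetDecoder. What follows is only a rough sketch, not a drop-in answer for your setup: the class and method names (Utf8Check, isValidUtf8) are made up for illustration, and it assumes the whole input fits in a byte array. Note that new String(bytes, charset) silently substitutes U+FFFD for malformed input, which is why the decoder here is told to REPORT instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for illustration only.
final class Utf8Check {

    // Returns true if the bytes decode as well-formed UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        // A fresh decoder per call; REPORT makes malformed input throw
        // instead of being replaced with U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] bad = {(byte) 0xC3, (byte) 0x28};
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

The check has to run on the bytes, before anything has turned them into a String; once decoding has happened, the invalid sequences are already gone or replaced.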
-- brion