On Wed, Dec 26, 2012 at 9:14 PM, Al Johnson <alj62888(a)yahoo.com> wrote:
Thanks for the Java API ref. But I'm curious as to how or where invalid
UTF-8 sequences come about; is it primarily a hacker thing?
Most frequently they're due to buggy bot tools or really old browsers that
didn't support UTF-8 correctly.
I see the Java Character API has an isValidCodePoint() method. Do I
just run each code point through that?
By the time your data is in Java String objects or 'char's, it has already
been decoded from UTF-8 (an 8-bit byte stream) into UTF-16 (a string of
16-bit code units), so any invalid UTF-8 has to be caught at the byte level,
before or during that decoding. I don't remember enough about Java I/O
offhand to tell you exactly which class in the input stack is doing that,
though. :)
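
For what it's worth, here is a rough sketch of checking at the byte level, assuming you can get at the raw bytes before anything turns them into a String. The class name Utf8Check and the method isValidUtf8 are made up for illustration; the only real API used is java.nio.charset (CharsetDecoder with CodingErrorAction.REPORT). Note that the convenience paths, such as new String(bytes, StandardCharsets.UTF_8), quietly substitute U+FFFD for malformed input rather than raising an error, which is why the sketch configures a decoder explicitly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any Java library.
class Utf8Check {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently
        // replacing bad input with U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] bad = {(byte) 0xC3, (byte) 0x28};
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

If you are reading from a stream rather than a byte array, the same configured decoder can be handed to an InputStreamReader via its (InputStream, CharsetDecoder) constructor, so that bad input surfaces as an exception while reading.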
-- brion