Gregory Maxwell wrote:
> One problem I seem to run into over and over again while working with data from our projects is invalid UTF-8 sequences that litter our output. I could go on about how doing this is a horrible offense against man, but my main concern is to simply get around these issues in my own code.
[snip]
> Simply stripping the bad characters on the serialized output frightens me because I worry that eventually I'll strip an adjacent delimiter and make the output unparsable. It would at least be useful to know all the places I could expect to find junk like this. :)
Hypothetically, almost anyplace that crops strings or otherwise does internal string manipulation other than with Language::truncate() could end up spitting out bad UTF-8. :P
Web POST and GET input is sanitized (normalized and bad characters stripped) at the WebRequest level, but internal processing is not always pure, and in most cases output is not sanitized either. Incorrect cropping of long values in limited-length database fields is another possibility.
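To illustrate the cropping case (a made-up snippet, not code from the tree; the field value and the 4-byte limit are just for show), a byte-based substr() can cut a multi-byte character in half, while a character-aware crop stays well-formed:

  <?php
  // "É" and "é" are two bytes each in UTF-8, so this string is 10 bytes long.
  $title = "Éléphant";

  // Byte-based crop to fit a 4-byte column: chops the accented "é" in half.
  $cropped = substr( $title, 0, 4 );
  var_dump( mb_check_encoding( $cropped, 'UTF-8' ) );  // bool(false)

  // Character-based crop keeps the value well-formed.
  $safe = mb_substr( $title, 0, 3, 'UTF-8' );          // "Élé"
  var_dump( mb_check_encoding( $safe, 'UTF-8' ) );     // bool(true)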
(Note that XML dump generation specifically runs a UTF-8 cleanup step on output; the XML dump output is thus guaranteed to be UTF-8-clean and NFC.)
Running UtfNormal::cleanUp() or similar should guarantee well-formed UTF-8 output. Any invalid bytes will be replaced with the Unicode replacement character U+FFFD, which is reserved for this purpose.
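Something like this (just an illustration; it assumes UtfNormal is already loaded, as it is inside MediaWiki):

  <?php
  // A stray Latin-1 byte masquerading as UTF-8:
  $dirty = "caf\xE9";
  $clean = UtfNormal::cleanUp( $dirty );
  // The lone 0xE9 byte comes back as the replacement character:
  var_dump( bin2hex( $clean ) );                      // "636166efbfbd" = "caf" + U+FFFD
  var_dump( mb_check_encoding( $clean, 'UTF-8' ) );   // bool(true)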
Thanks to UTF-8's structured value space, accidentally removing ASCII field delimiters in the cleanup step isn't possible -- an ASCII byte can never be mistaken for the trailing byte of a longer sequence -- so in theory it should be safe to run a single cleanup step at the end instead of many individual checks along the way.
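A tiny sanity check of that claim (again assuming UtfNormal is loaded; the '|' delimiter and field contents are arbitrary):

  <?php
  // A truncated two-byte sequence sits right next to the delimiter.
  $line  = "foo\xC3|bar";
  $clean = UtfNormal::cleanUp( $line );
  // The stray lead byte becomes U+FFFD; the ASCII '|' can't be eaten,
  // since it can never look like a trailing byte (0x80-0xBF).
  var_dump( explode( '|', $clean ) );   // still exactly two fields: "foo\xEF\xBF\xBD" and "bar"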
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)