One problem I seem to run into over and over again while working with data from our projects is invalid UTF-8 sequences that litter our output. I could go on about how doing this is a horrible offense against man, but my main concern is to simply get around these issues in my own code.
Does anyone have a document describing places where we're known to emit malformed unicode?
Tonight I encountered it in the file history output for a query.php request.
http://commons.wikimedia.org/w/query.php?what=categories%7Ctemplates%7Clinks...
You can see it on the site proper at: http://commons.wikimedia.org/w/index.php?title=Image:00022279.jpg&action...
Simply stripping the bad characters on the serialized output frightens me because I worry that eventually I'll strip an adjacent delimiter and make the output unparsable. It would at least be useful to know all the places I could expect to find junk like this. :)
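A minimal sketch, assuming Python on the consumer side, of one way to at least detect which responses contain malformed UTF-8 before deciding how to strip it; the sample byte string and offsets are purely illustrative:

    # Hypothetical detector: report whether a response contains malformed UTF-8
    # and where, without modifying the bytes themselves.
    def find_bad_utf8(raw: bytes):
        try:
            raw.decode('utf-8', errors='strict')
            return None                                 # well-formed
        except UnicodeDecodeError as e:
            return (e.start, raw[e.start:e.start + 8])  # offset and a byte sample

    print(find_bad_utf8(b'clean ASCII or UTF-8 text'))      # None
    print(find_bad_utf8(b'File history: G\xfcnther.jpg'))   # (15, b'\xfcnther.') -- a bare Latin-1 byte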
Gregory Maxwell wrote:
One problem I seem to run into over and over again while working with data from our projects is invalid UTF-8 sequences that litter our output. I could go on about how doing this is a horrible offense against man, but my main concern is to simply get around these issues in my own code.
Does anyone have a document describing places where we're known to emit malformed unicode?
I have only seen this in some older dumps, and not for a while. I parse them out of the dumps.
Jeff
Gregory Maxwell wrote:
Simply stripping the bad characters on the serialized output frightens me because I worry that eventually I'll strip an adjacent delimiter and make the output unparsable.
Due to the way UTF-8 is designed, it should in fact be safe to do this. For details, see http://en.wikipedia.org/wiki/UTF-8.
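To make that concrete: every byte of a multi-byte UTF-8 sequence has its high bit set (0x80 or above), so an ASCII delimiter such as '|' (0x7C) can never be mistaken for part of one, and replacing the bad bytes cannot eat it. A minimal sketch, assuming Python on the consumer side and a made-up pipe-delimited record (only the filename is taken from the Commons URL above):

    # A made-up record with a stray Latin-1 byte (0xA9, an unencoded (c) sign).
    raw = b'Image:00022279.jpg|\xa9 2005 Example|uploader'

    # Replace anything that is not valid UTF-8 with U+FFFD.  The ASCII '|'
    # delimiters (0x7C) cannot be part of a multi-byte sequence, so the record
    # still splits into the same three fields afterwards.
    clean = raw.decode('utf-8', errors='replace')
    fields = clean.split('|')
    assert len(fields) == 3
    print(fields)   # ['Image:00022279.jpg', '\ufffd 2005 Example', 'uploader']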
Gregory Maxwell wrote:
One problem I seem to run into over and over again while working with data from our projects is invalid UTF-8 sequences that litter our output. I could go on about how doing this is a horrible offense against man, but my main concern is to simply get around these issues in my own code.
[snip]
Simply stripping the bad characters on the serialized output frightens me because I worry that eventually I'll strip an adjacent delimiter and make the output unparsable. It would at least be useful to know all the places I could expect to find junk like this. :)
Hypothetically, almost anyplace that crops strings or otherwise does internal string manipulation other than with Language::truncate() could end up spitting out bad UTF-8. :P
Web POST and GET input is sanitized (normalized and bad characters stripped) at the WebRequest level, but internal processing is not always pure, and in most cases output is not sanitized either. Incorrect cropping of long values in limited-length database fields is another possibility.
(Note that XML dump generation specifically runs a UTF-8 cleanup step on output; the XML dump output is thus guaranteed to be UTF-8-clean and NFC.)
Running UtfNormal::cleanUp() or similar should guarantee well-formed UTF-8 output. Any invalid bytes will be replaced with the Unicode replacement character U+FFFD, which is reserved for this purpose.
Thanks to UTF-8's structured value space, accidentally removing ASCII field delimiters in the cleanup step isn't possible -- an ASCII byte can never be confused with a continuation byte of a longer sequence -- so in theory it should be safe to run a single cleanup step at the end instead of many individual cleanup checks.
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
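A rough consumer-side analogue of that single final step, as a sketch only (Python; UtfNormal::cleanUp() itself is PHP inside MediaWiki, and the sample record here is made up): decode with replacement and normalize to NFC, matching what the thread says the dump cleanup guarantees.

    import unicodedata

    # Sketch of a one-shot cleanup pass over the whole serialized response,
    # run once right before parsing -- not the UtfNormal implementation.
    def clean_utf8(raw: bytes) -> str:
        # Invalid sequences become U+FFFD; well-formed sequences, including
        # every ASCII delimiter byte, pass through unchanged.  NFC is applied
        # because the dump cleanup described above also guarantees NFC.
        return unicodedata.normalize('NFC', raw.decode('utf-8', errors='replace'))

    raw_response = b'title=Caf\xe9|user=Example'   # made-up record with a bare 0xE9
    print(clean_utf8(raw_response).split('|'))     # ['title=Caf\ufffd', 'user=Example']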
Brion Vibber wrote:
Hypothetically, almost anyplace that crops strings or otherwise does internal string manipulation other than with Language::truncate() could end up spitting out bad UTF-8. :P
Web POST and GET input is sanitized (normalized and bad characters stripped) at the WebRequest level, but internal processing is not always pure, and in most cases output is not sanitized either. Incorrect cropping of long values in limited-length database fields is another possibility.
(Note that XML dump generation specifically runs a UTF-8 cleanup step on output; the XML dump output is thus guaranteed to be UTF-8-clean and NFC.)
This statement is true. I have never seen a Wikipedia XML dump with Unicode errors.
Jeff