One problem I run into over and over again while working with data from our projects is invalid UTF-8 sequences littering our output. I could go on about how emitting them is a horrible offense against man, but my main concern is simply to get around these issues in my own code.
Does anyone have a document describing the places where we're known to emit malformed Unicode?
Tonight I encountered it in the file history output for a query.php request.
http://commons.wikimedia.org/w/query.php?what=categories%7Ctemplates%7Clinks...
You can see it on the site proper at: http://commons.wikimedia.org/w/index.php?title=Image:00022279.jpg&action...
Simply stripping the bad characters from the serialized output frightens me because I worry that eventually I'll strip an adjacent delimiter and make the output unparsable. It would at least be useful to know all the places I could expect to find junk like this. :)
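
For what it's worth, here is a rough sketch of the kind of workaround I have in mind, in Python. It assumes the response is available as raw bytes; the function names (find_invalid_spans, sanitize_utf8) are my own and not part of any of our tools. Rather than deleting bytes, it swaps each malformed sequence for U+FFFD and leaves every well-formed byte alone, so an ASCII delimiter sitting right after a truncated sequence should survive. It also reports the byte offsets of the bad spans, which might help with cataloguing where the junk shows up.

```python
def find_invalid_spans(raw: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of each malformed UTF-8 span in raw."""
    spans = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the remainder is well-formed
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice we decoded
            spans.append((pos + err.start, pos + err.end))
            pos += err.end
    return spans


def sanitize_utf8(raw: bytes) -> bytes:
    """Replace malformed sequences with U+FFFD instead of stripping them.

    Well-formed bytes, including ASCII delimiters, pass through untouched.
    """
    return raw.decode("utf-8", errors="replace").encode("utf-8")


if __name__ == "__main__":
    # Hypothetical sample: a truncated 3-byte sequence right before a quote.
    sample = b'ab\xe2\x82"cd'
    print(find_invalid_spans(sample))
    print(sanitize_utf8(sample))
```

This only papers over the problem on the consumer side, of course. The replacement character still marks the spot where data was lost, which seems safer than silently dropping bytes.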