[Wikitech-l] Bad UTF-8

26 Feb 2007


One problem I seem to run into over and over again while working with
data from our projects is invalid UTF-8 sequences that litter our
output. I could go on about how doing this is a horrible offense
against man, but my main concern is to simply get around these issues
in my own code.
Does anyone have a document describing places where we're known to
emit malformed Unicode?
Tonight I encountered it in the file history output for a query.php request.
http://commons.wikimedia.org/w/query.php?what=categories%7Ctemplates%7Clinks...
You can see it on the site proper at:
http://commons.wikimedia.org/w/index.php?title=Image:00022279.jpg&action...
Simply stripping the bad characters from the serialized output frightens
me because I worry that eventually I'll strip an adjacent delimiter
and make the output unparsable.  It would at least be useful to know
all the places I could expect to find junk like this. :)
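
For what it's worth, here is a minimal sketch in Python of one way to avoid
hand-stripping bytes. It assumes the response is available as raw bytes; the
helper names sanitize_utf8 and find_invalid_utf8 are made up for
illustration and are not part of query.php or MediaWiki. The idea is to let a
standard decoder replace each malformed sequence with U+FFFD, which leaves
ASCII delimiters next to the junk untouched, and to report byte offsets so
the bad spots can be tracked down.

```python
def sanitize_utf8(raw):
    """Decode bytes as UTF-8, substituting U+FFFD for malformed sequences.

    The decoder only replaces the bytes it rejects; a valid ASCII byte
    that follows a truncated sequence is decoded normally.
    """
    return raw.decode("utf-8", errors="replace")


def find_invalid_utf8(raw):
    """Return a list of (byte_offset, offending_bytes) pairs.

    Repeatedly decodes the remainder of the input and records where each
    UnicodeDecodeError occurs. Slicing makes this quadratic in the worst
    case, which is fine for a diagnostic pass over one response.
    """
    problems = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the rest of the input is clean
        except UnicodeDecodeError as exc:
            start = pos + exc.start
            end = pos + exc.end
            problems.append((start, raw[start:end]))
            pos = end  # resume just past the rejected bytes
    return problems


if __name__ == "__main__":
    # Hypothetical sample: a truncated three-byte sequence right before a
    # "|" delimiter.
    sample = b"a\xe2\x82|b"
    print(repr(sanitize_utf8(sample)))
    print(find_invalid_utf8(sample))
```

Replacing rather than deleting also keeps a visible marker in the output, so
it stays obvious that the upstream data was damaged.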
