Brion Vibber wrote:
Hypothetically, almost anyplace that crops strings or otherwise does internal string manipulation other than with Language::truncate() could end up spitting out bad UTF-8. :P
Web POST and GET input is sanitized (normalized and bad characters stripped) at the WebRequest level, but internal processing is not always pure, and in most cases output is not sanitized either. Incorrect cropping of long values in limited-length database fields is another possibility.
(Note that XML dump generation specifically runs a UTF-8 cleanup step on output; the XML dump output is thus guaranteed to be UTF-8-clean and NFC.)
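[The cropping hazard Brion describes comes down to cutting a byte string in the middle of a multibyte UTF-8 sequence. A minimal Python sketch of both the failure and the boundary-aware fix; `utf8_safe_crop` is an illustrative helper, not MediaWiki's `Language::truncate()`:]

```python
text = "naïve café"          # contains multibyte UTF-8 characters
raw = text.encode("utf-8")   # 'ï' encodes as two bytes: 0xC3 0xAF

# Naive byte-level crop can split a multibyte sequence,
# yielding invalid UTF-8 (here the cut lands inside 'ï').
try:
    raw[:3].decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8 after naive crop")

def utf8_safe_crop(b: bytes, limit: int) -> bytes:
    """Crop to at most `limit` bytes, backing up to a character
    boundary so the result is always valid UTF-8."""
    b = b[:limit]
    while b:
        try:
            b.decode("utf-8")
            return b
        except UnicodeDecodeError:
            b = b[:-1]   # drop trailing bytes of a split character
    return b

print(utf8_safe_crop(raw, 3).decode("utf-8"))  # 'na' — the split 'ï' is dropped whole
```

A truncate helper for a limited-length database field has to do something like this backing-up step, or the stored value ends up as the mojibake Brian describes.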
This statement is true. I have never seen a Wikipedia XML dump with Unicode errors.
Jeff
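[The NFC half of the dump guarantee mentioned above is also easy to spot-check from the outside. A hedged Python sketch using only the standard library; this is not MediaWiki's actual cleanup code, just what an NFC check amounts to:]

```python
import unicodedata

# 'é' in decomposed (NFD) form: 'e' followed by a combining acute accent.
decomposed = "e\u0301"
print(unicodedata.is_normalized("NFC", decomposed))   # False — not NFC

# An NFC cleanup step rewrites it as the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == "\u00e9")                           # True
```

Running such a check over dump text should find no un-normalized sequences if the cleanup step is doing its job.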