[Mediawiki-l] Tidy up invalid multibyte chars in xml-dumps

Rolf Lampa rolf.lampa at rilnet.com
Fri Jan 12 22:12:09 UTC 2007

Brion Vibber wrote:
> Rolf Lampa wrote:
>> Is there any cheap trick to make a "bit- or byte-wise wash" of the xml 
>> dump-files? Or is there any tool out there capable of streaming through 
>> huge files tidying them up? (I can make my own "bitwise wash/mask" with 
>> Delphi-code if I just knew for sure what general bit-pattern (if any) to 
>> apply on the file).
> There's already such verification run on the dumps as they're produced,
> so it shouldn't be possible to get a file with invalid UTF-8 characters.
> Can you be specific about where in the file they occur?

The file I had problems with was "svwiki-20061208-pages-articles.xml". 
I have now downloaded it again, and this time Xml2Sql runs on it just 
fine. Sorry, I should have tried that first, of course.

Evidently one of my own programs modified the first copy of the file. I 
use EditPad Pro and PowerGREP for some regex tricks, but some other tool 
could be the cause as well. I have some homework to do to find out what 
actually happened.
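
For narrowing down where a damaged copy went wrong, one option is a 
streaming scan that reports the byte offsets of invalid UTF-8 sequences 
without loading the whole dump into memory. What follows is only a 
minimal sketch in Python (not the Delphi mentioned above); the function 
name find_invalid_utf8 and its parameters are made up for illustration 
and are not part of Xml2Sql or any MediaWiki tool. It checks UTF-8 
validity only, not whether the characters are legal in XML.

```python
import io


def find_invalid_utf8(stream, chunk_size=1 << 20, max_hits=10):
    """Return file offsets of the first invalid UTF-8 sequences.

    `stream` is any binary file object, read in chunks of `chunk_size`
    bytes. Scanning stops after `max_hits` offsets have been collected.
    """
    hits = []
    base = 0      # file offset of buf[0]
    buf = b""     # unprocessed bytes carried over between chunks
    while len(hits) < max_hits:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buf += chunk
        pos = 0
        while pos < len(buf) and len(hits) < max_hits:
            try:
                buf[pos:].decode("utf-8")   # strict decoding
                pos = len(buf)
            except UnicodeDecodeError as err:
                start = pos + err.start
                # A UTF-8 sequence is at most 4 bytes long. If the error
                # is that close to the end of the buffer, it may just be
                # a character split across chunks: wait for more data.
                if not at_eof and len(buf) - start < 4:
                    pos = start
                    break
                hits.append(base + start)
                pos += err.end              # skip the rejected bytes
        base += pos
        buf = buf[pos:]
        if at_eof:
            break
    return hits


# Small in-memory demo: 0xFF is never valid in UTF-8, and 0xE4 (Latin-1
# "a umlaut") is not followed by continuation bytes here.
sample = io.BytesIO(b"abc\xff def \xc3\xb6 \xe4x")
print(find_invalid_utf8(sample))  # prints [3, 12]
```

For a real dump, the same function can be given a file opened in binary 
mode, for example open("svwiki-20061208-pages-articles.xml", "rb"). 
Comparing the reported offsets between the edited copy and a fresh 
download should show which regions a tool has touched.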


// Rolf Lampa
