[Mediawiki-l] Tidy up invalid multibyte chars in xml-dumps
rolf.lampa at rilnet.com
Fri Jan 12 22:12:09 UTC 2007
Brion Vibber wrote:
> Rolf Lampa wrote:
>> Is there any cheap trick to make a "bit- or byte-wise wash" of the xml
>> dump-files? Or is there any tool out there capable of streaming through
>> huge files tidying them up? (I can make my own "bitwise wash/mask" with
>> Delphi-code if I just knew for sure what general bit-pattern (if any) to
>> apply on the file).
> There's already such verification run on the dumps as they're produced,
> so it shouldn't be possible to get a file with invalid UTF-8 characters.
> Can you be specific about where in the file they occur?
The file I had problems with was "svwiki-20061208-pages-articles.xml",
but I have now downloaded it again, and Xml2Sql runs on the fresh copy
just fine! ... Sorry, I should have tried that before.
Obviously one of my programs has done something to the file. I'm using
EditPad Pro and PowerGREP for some regex tricks, but some other tool
could be causing this too, hm. I guess I have some homework to do to
find out what actually happened... ahum.
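
For that homework, a streaming check that reports where the first bad
byte sits is probably enough. Below is a minimal sketch in Python (not
Delphi), assuming the dump is supposed to be plain UTF-8. The function
name find_invalid_utf8 and the chunk size are made up for illustration;
only the standard codecs module is used.

```python
import codecs


def find_invalid_utf8(path, chunk_size=1 << 20):
    """Return the byte offset of the first invalid UTF-8 sequence in the
    file at `path`, or None if the whole file decodes cleanly.

    The file is read in chunks, so memory use stays flat on huge dumps.
    (Hypothetical helper, not part of MediaWiki or Xml2Sql.)
    """
    decoder = codecs.getincrementaldecoder("utf-8")("strict")
    consumed = 0  # bytes handed to the decoder in earlier iterations
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            final = not chunk  # an empty read means end of file
            # Bytes of an incomplete sequence carried over from the
            # previous chunk; error positions are relative to these
            # bytes plus the new chunk.
            pending = len(decoder.getstate()[0])
            try:
                decoder.decode(chunk, final)
            except UnicodeDecodeError as err:
                return consumed - pending + err.start
            consumed += len(chunk)
            if final:
                return None


if __name__ == "__main__":
    import sys

    offset = find_invalid_utf8(sys.argv[1])
    if offset is None:
        print("file is valid UTF-8")
    else:
        print("first invalid byte at offset", offset)
```

Running it on both the edited copy and the fresh download, then looking
at the reported offset in a hex viewer, should help narrow down which
tool touched the file.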
// Rolf Lampa