Brion Vibber wrote:
Rolf Lampa wrote:
Is there any cheap trick to make a "bit- or byte-wise wash" of the xml dump-files? Or is there any tool out there capable of streaming through huge files tidying them up? (I can make my own "bitwise wash/mask" with Delphi-code if I just knew for sure what general bit-pattern (if any) to apply on the file).
There's already such verification run on the dumps as they're produced, so it shouldn't be possible to get a file with invalid UTF-8 characters.
Can you be specific about where in the file they occur?
The file I had problems with was "svwiki-20061208-pages-articles.xml", but I have now downloaded the file again, and Xml2Sql runs on it just fine! ... Sorry, I should have tried that first, of course...
Obviously one of my programs has done something to the file. I'm using EditPad Pro and PowerGREP for some regex tricks, but some other tool could be the cause too, hm. I guess I have some homework to do to find out what actually happened... ahum.
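For the "where in the file" question, something like the following Python sketch could stream through a large dump and report the byte offsets at which UTF-8 decoding fails. It is only a rough illustration, not a tool used on the dumps: the function name find_invalid_utf8 and the parameters chunk_size and max_errors are made up for this example.

```python
def find_invalid_utf8(path, chunk_size=1 << 20, max_errors=10):
    """Return absolute byte offsets where the file is not valid UTF-8.

    Hypothetical helper: reads the file in chunks so that very large
    dumps never have to fit in memory.
    """
    offsets = []
    pending = b""   # undecoded tail carried over from the previous chunk
    base = 0        # absolute file offset of pending[0]
    with open(path, "rb") as f:
        while len(offsets) < max_errors:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buf = pending + chunk
            pos = 0
            while pos < len(buf) and len(offsets) < max_errors:
                try:
                    buf[pos:].decode("utf-8")
                    pos = len(buf)          # everything left was valid
                except UnicodeDecodeError as e:
                    bad = pos + e.start     # position of the failure in buf
                    remaining = len(buf) - pos
                    # A failure that reaches the end of the buffer may just be
                    # a sequence cut by the chunk boundary: carry the tail
                    # (at most 3 bytes) over and re-check it with more data.
                    if not at_eof and e.end == remaining and len(buf) - bad < 4:
                        pos = bad
                        break
                    offsets.append(base + bad)
                    pos = bad + 1           # skip the bad byte and continue
            pending = buf[pos:]
            base += pos
            if at_eof:
                break
    return offsets
```

Calling find_invalid_utf8("svwiki-20061208-pages-articles.xml") on a damaged copy would list the first few bad offsets, which can then be compared against a fresh download to see which tool touched those bytes.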
Regards,
// Rolf Lampa