[WikiEN-l] Old Wikipedia backups discovered

Tim Starling tstarling at wikimedia.org
Wed Dec 22 00:53:56 UTC 2010


On 22/12/10 11:43, Anthony wrote:
> On Tue, Dec 21, 2010 at 10:02 AM, Tim Starling <tstarling at wikimedia.org> wrote:
>> I've uploaded my latest attempt at converting the backup to XML:
>>
>> http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z
>>
>> The archive contains an invalid XML file, with control characters
>> preserved, and a valid XML file, with control characters filtered.
> Which control characters?  Aren't control characters allowed in XML 1.1?

I did: tr -d '\000-\010\013-\037'
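
For anyone who would rather do the filtering in a script, here is a rough Python sketch of the same byte-level deletion. The function name strip_control_bytes is made up for illustration; it is not part of any MediaWiki tool. Note that the octal range 013-037 includes 015 (carriage return), so CR is dropped along with the other control bytes, while tab (011) and line feed (012) are kept.

```python
# Bytes deleted by: tr -d '\000-\010\013-\037'
# Octal 000-010 and 013-037, i.e. decimal 0-8 and 11-31.
# Tab (011) and line feed (012) fall outside both ranges and survive.
DELETE_BYTES = bytes(range(0o000, 0o010 + 1)) + bytes(range(0o013, 0o037 + 1))


def strip_control_bytes(data: bytes) -> bytes:
    """Hypothetical helper: remove the same bytes the tr command removes."""
    # With table=None, bytes.translate only deletes the listed bytes.
    return data.translate(None, DELETE_BYTES)
```

This works on raw bytes, as tr does, so it makes no assumption about the character encoding of the dump.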

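XML 1.1 does allow most of them (not NUL), but only as character
references, not as raw bytes. The definition in XML 1.0 is: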

"Definition: A character is an atomic unit of text as specified by
ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage
return, line feed, and the legal characters of Unicode and ISO/IEC 10646.

"Char       ::=       #x9 | #xA | #xD | [#x20-#xD7FF] |
[#xE000-#xFFFD] | [#x10000-#x10FFFF]    /* any Unicode character,
excluding the surrogate blocks, FFFE, and FFFF. */"
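
As an illustration only, the Char production quoted above can be turned into a small Python check. The names is_xml_char and first_illegal are hypothetical, and this is a sketch of the rule as quoted, not what importDump.php or any XML parser actually runs.

```python
from typing import Optional


def is_xml_char(cp: int) -> bool:
    """True if code point cp matches the quoted Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the C0 controls below space
        or 0xE000 <= cp <= 0xFFFD      # skips surrogates, FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def first_illegal(text: str) -> Optional[int]:
    """Index of the first character not allowed by Char, or None."""
    for i, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return i
    return None
```

Running first_illegal over the decoded text of a dump would point at the first character a strict parser is likely to reject.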

Without this filtering, importDump.php gives a fatal error.

-- Tim Starling



