[WikiEN-l] Old Wikipedia backups discovered
Tim Starling
tstarling at wikimedia.org
Wed Dec 22 00:53:56 UTC 2010
On 22/12/10 11:43, Anthony wrote:
> On Tue, Dec 21, 2010 at 10:02 AM, Tim Starling <tstarling at wikimedia.org> wrote:
>> I've uploaded my latest attempt at converting the backup to XML:
>>
>> http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z
>>
>> The archive contains an invalid XML file, with control characters
>> preserved, and a valid XML file, with control characters filtered.
> Which control characters? Aren't control characters allowed in XML 1.1?
I did: tr -d '\000-\010\013-\037'
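
For readers without tr to hand, here is a rough Python equivalent of that filter. It is only a sketch: the names DELETE and strip_control_bytes are made up for illustration and are not part of the conversion script. The octal ranges cover bytes 0x00-0x08 and 0x0B-0x1F, so tab (0x09) and line feed (0x0A) survive, while carriage return (octal 015) falls inside the second range and is deleted too.

```python
# Bytes removed by: tr -d '\000-\010\013-\037'
# Octal ranges, i.e. 0x00-0x08 and 0x0B-0x1F.
DELETE = bytes(range(0o000, 0o010 + 1)) + bytes(range(0o013, 0o037 + 1))


def strip_control_bytes(data: bytes) -> bytes:
    """Drop the same bytes the tr command deletes.

    Tab (0x09) and line feed (0x0A) are kept; carriage return (0x0D)
    is inside \\013-\\037 and is therefore removed.
    """
    # table=None means "no remapping"; the second argument lists bytes to delete.
    return data.translate(None, DELETE)


if __name__ == "__main__":
    sample = b"a\x00b\tc\nd\x1fe\r\n"
    print(strip_control_bytes(sample))  # b'ab\tc\nde\n'
```

Working on bytes rather than decoded text mirrors what tr does, and avoids assuming anything about the encoding of the old backup.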
In XML 1.1:
"Definition: A character is an atomic unit of text as specified by
ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage
return, line feed, and the legal characters of Unicode and ISO/IEC 10646.
"Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */"
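
To see which characters in a decoded string fall outside the Char production quoted above, something like the following Python sketch could be used. The names ILLEGAL and illegal_chars are hypothetical, and the character class is simply a transcription of the quoted production, not code taken from importDump.php.

```python
import re

# Negated character class transcribed from the quoted production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def illegal_chars(text: str) -> list:
    """Return every character in text that the quoted production does not allow."""
    return ILLEGAL.findall(text)


if __name__ == "__main__":
    print(illegal_chars("ok\tline\n"))          # []
    print(illegal_chars("a\x00b\x0bc\ufffe"))   # ['\x00', '\x0b', '\ufffe']
```

An empty result means the text passes this particular check; it says nothing about well-formedness of the XML as a whole.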
Without this change, importDump.php gives a fatal error.
-- Tim Starling