Hi,
I tried using mwdumper (latest SVN revision 57818)
to import jawiki-20090927-pages-articles.xml [1]
into MySQL, but I got an error:
Data too long for column 'rev_comment'
The problem is that the XML file contains a revision
comment that is 257 bytes long, but the rev_comment
column accepts at most 255 bytes.
At first I was stumped as to how this could happen,
but then I found that on the Wikipedia page, the
comment ends with the byte 'e3', while in the
XML file it ends with the bytes 'ef bf bd'. See [2] for details.
I think the cause is something like this:
- Comments are truncated to 255 bytes when they
are stored.
- In this case, a three-byte UTF-8 sequence is cut
off after its first byte (hex value e3), so the stored
comment ends with a lone lead byte, which is not a
valid UTF-8 sequence.
- The dump process has to generate valid UTF-8
(otherwise, most XML parsers would reject the
file), so it replaces the invalid byte with the
'replacement character' U+FFFD, which is encoded
in UTF-8 as the three bytes 'ef bf bd'. See [3].
- One byte is thus replaced by three, so the comment
grows from 255 bytes to 257 bytes.
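
The steps above can be reproduced with a short Python sketch.
The sample comment is made up (it is not the actual jawiki
comment), and the byte-wise cut is only my assumption about
what happens when the comment is stored:

```python
# Made-up comment: 84 three-byte characters (252 bytes), then "ab"
# (2 bytes), then more three-byte characters. "あ" is e3 81 82 in UTF-8.
comment = "あ" * 84 + "ab" + "あ" * 5
raw = comment.encode("utf-8")

# Assumed storage step: cut at 255 bytes without regard to characters.
stored = raw[:255]

# Assumed dump step: decode, replacing invalid bytes with U+FFFD,
# then write the result out as UTF-8 again.
dumped = stored.decode("utf-8", errors="replace").encode("utf-8")

print(len(stored), stored[-1:].hex())   # length and last byte as stored
print(len(dumped), dumped[-3:].hex())   # length and last three bytes as dumped
```

The stored value ends with the lone byte e3, and the dumped
value ends with ef bf bd and is two bytes longer.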
How can this be fixed? I think MediaWiki should make sure
that a comment contains only valid UTF-8 sequences,
even when it is truncated. This may mean that it
has to be truncated to fewer than 255 bytes.
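
A rough sketch of such a truncation in Python (MediaWiki itself
is written in PHP, so this only illustrates the idea). The
function name truncate_utf8 is hypothetical, and the input is
assumed to be valid UTF-8:

```python
def truncate_utf8(data: bytes, max_bytes: int = 255) -> bytes:
    """Cut valid UTF-8 data to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes look like 10xxxxxx. If the first dropped byte is
    # one, the cut falls inside a character, so step back to its lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# Made-up comment whose 255th byte is the lead byte of a three-byte character.
sample = ("あ" * 84 + "ab" + "あ" * 5).encode("utf-8")
safe = truncate_utf8(sample)
print(len(safe))
```

For this sample the result is 254 bytes long, one byte short of
the limit, and it still decodes as valid UTF-8.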
Alternatively, the dump process could drop invalid
UTF-8 sequences instead of replacing them.
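
For comparison, this is what dropping would look like in Python,
using the "ignore" error handler instead of "replace". The stored
value is again made up, and I don't know how the dump process
actually decodes its input:

```python
# Made-up stored comment: 254 valid bytes followed by a lone lead byte e3.
stored = ("あ" * 84 + "ab").encode("utf-8") + b"\xe3"

# Drop invalid bytes instead of substituting U+FFFD for them.
cleaned = stored.decode("utf-8", errors="ignore").encode("utf-8")

print(len(stored), len(cleaned))
```

The cleaned comment is shorter than the stored one, so it can
never exceed the column limit.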
Yet another fix: mwdumper should make sure
that a comment is at most 255 bytes long and
truncate it if necessary.
More details can be found at [2].
Bye,
Christopher
[1] http://download.wikimedia.org/jawiki/20090927/jawiki-20090927-pages-article…
[2] http://en.wikipedia.org/wiki/User:Chrisahn/CommentTooLong
[3] http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65520