Hi,
I tried using mwdumper (latest SVN revision 57818) to import jawiki-20090927-pages-articles.xml [1] into MySQL, but I got an error:
Data too long for column 'rev_comment'
The problem is that the XML file contains a revision comment that is 257 bytes long, but the column accepts at most 255 bytes.
At first I was stumped as to how this could happen, but then I found that on the Wikipedia page the comment ends with the byte 'e3', while in the XML file it ends with 'ef bf bd'. See [2] for details.
I think the cause is something like this:
- Comments are truncated to 255 bytes when they are stored.
- In this case, this means that a three-byte UTF-8 sequence is cut off after its first byte (hex value e3), so the comment ends with an invalid one-byte UTF-8 sequence.
- The dump process has to generate valid UTF-8 (otherwise, most XML parsers wouldn't accept the file), so it replaces the invalid one-byte sequence with the 'replacement character' U+FFFD, which is encoded in UTF-8 as the three bytes 'ef bf bd'. See [3].
- As a result, the comment grows from 255 bytes to 257 bytes: the 254 valid bytes plus the three-byte replacement character.
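
The steps above can be reproduced in a few lines of Python. This is only a sketch: the sample comment is made up (it is not the actual jawiki comment), and decoding with errors="replace" merely stands in for whatever the dump process really does.

```python
# Hypothetical comment: 254 ASCII bytes followed by one three-byte character.
comment = "a" * 254 + "\u3042"    # U+3042 is encoded as e3 81 82
raw = comment.encode("utf-8")     # 257 bytes in total

# Byte-wise truncation to 255 bytes keeps only the first byte (e3)
# of the final three-byte sequence.
stored = raw[:255]

# Replacing the invalid trailing byte with U+FFFD (ef bf bd), as a dump
# that must emit valid UTF-8 might do.
dumped = stored.decode("utf-8", errors="replace").encode("utf-8")

print(len(stored), stored[-1:].hex())   # length and last byte as stored
print(len(dumped), dumped[-3:].hex())   # length and last three bytes as dumped
```

With this input the stored value is 255 bytes ending in e3, and the dumped value is 257 bytes ending in ef bf bd.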
How can this be fixed? I think MediaWiki should make sure that a comment contains only valid UTF-8 sequences, even when it is truncated. This may mean that it has to be truncated to fewer than 255 bytes.
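
To illustrate what I mean by truncating without breaking a sequence, here is a rough Python sketch. The function name truncate_utf8 is my own invention and this is not MediaWiki code; it also assumes the input is a well-formed string to begin with.

```python
def truncate_utf8(text: str, max_bytes: int = 255) -> str:
    """Shorten text so that its UTF-8 encoding fits in max_bytes
    without cutting a multi-byte character in half."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    # The input was valid UTF-8, so the only invalid bytes after the cut
    # are an incomplete sequence at the very end; "ignore" drops exactly those.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


# Hypothetical comment whose last character straddles the 255-byte limit.
shortened = truncate_utf8("a" * 254 + "\u3042")
print(len(shortened.encode("utf-8")))
```

For the sample comment the result is 254 bytes long, i.e. one byte less than the limit, because the whole three-byte character is dropped.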
Alternatively, the dump process could drop invalid UTF-8 sequences instead of replacing them.
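
A sketch of that alternative in Python, assuming the dump process sees the comment as raw bytes exactly as stored. The helper name comment_for_dump and the sample value are hypothetical.

```python
def comment_for_dump(stored: bytes) -> str:
    """Decode a stored comment, silently dropping invalid UTF-8 bytes
    instead of substituting U+FFFD for them."""
    return stored.decode("utf-8", errors="ignore")


# Hypothetical stored comment: 254 ASCII bytes plus a lone lead byte e3.
stored_comment = b"a" * 254 + b"\xe3"
cleaned = comment_for_dump(stored_comment)
print(len(cleaned.encode("utf-8")))
```

Since bytes are only ever removed, the dumped comment can never be longer than the stored one.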
Yet another fix: mwdumper should make sure that a comment is at most 255 bytes long and truncate it if necessary.
More details can be found at [2].
Bye, Christopher
[1] http://download.wikimedia.org/jawiki/20090927/jawiki-20090927-pages-articles...
[2] http://en.wikipedia.org/wiki/User:Chrisahn/CommentTooLong
[3] http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65520