Hi,
I tried using mwdumper (latest SVN revision 57818) to import jawiki-20090927-pages-articles.xml [1] into MySQL, but I got an error:
Data too long for column 'rev_comment'
The problem is that the xml file contains a revision comment that is 257 bytes long, but the column accepts at most 255 bytes.
First I was stumped as to how this could happen, but then I found that on the Wikipedia page, the comment ends with the byte 'e3', while in the xml file it ends with 'ef bf bd'. See [2] for details.
I think the cause is something like this:
- Comments are truncated to 255 bytes when they are stored.
- In this case, this means that a three-byte UTF-8 sequence is cut off after its first byte (hex value e3), so the comment ends with an invalid one-byte UTF-8 sequence.
- The dump process has to generate valid UTF-8 (otherwise, most XML parsers wouldn't accept the file), so it replaces the invalid one-byte UTF-8 sequence by the 'replacement character' U+FFFD, which has the three-byte UTF-8 sequence 'ef bf bd'. See [3].
- In this case, the comment grows from 255 bytes to 257 bytes.
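To illustrate the growth step, here is a minimal Java sketch (illustration only, not actual MediaWiki or mwdumper code): a Japanese character such as U+3042 encodes to three bytes (e3 81 82) in UTF-8; cutting after the first byte leaves a stranded lead byte, which a lenient decoder replaces with U+FFFD, and U+FFFD re-encodes to three bytes.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementGrowthDemo {
    public static void main(String[] args) {
        // U+3042 is three bytes in UTF-8: e3 81 82.
        byte[] full = "\u3042".getBytes(StandardCharsets.UTF_8);
        // Simulate the column truncating after the first byte (0xe3).
        byte[] truncated = new byte[]{full[0]};
        // Java's String(byte[], Charset) constructor replaces malformed
        // input with U+FFFD rather than throwing.
        String decoded = new String(truncated, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\uFFFD")); // true
        System.out.println(reencoded.length);         // 3: one invalid byte became three
    }
}
```

So a comment truncated to exactly 255 bytes mid-sequence comes back out of the dump at 257 bytes, which is just what the error message shows.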
How to fix this? I think MediaWiki should make sure that a comment contains only valid UTF-8 sequences, even when it is truncated. This may mean that it has to be truncated to less than 255 bytes.
Alternatively, the dump process could drop invalid UTF-8 sequences instead of replacing them.
Yet another fix: mwdumper should make sure that a comment is at most 255 bytes long and truncate it if necessary.
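The last fix could look something like this (a hypothetical Java helper, not the actual mwdumper patch): truncate the UTF-8 byte sequence to the limit, then back up past any continuation bytes (which have the bit pattern 10xxxxxx) so the cut never splits a multi-byte character.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Truncate {
    /**
     * Returns s truncated so that its UTF-8 encoding is at most
     * maxBytes long, never ending in a partial multi-byte sequence.
     */
    static String truncateUtf8(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) return s;
        int end = maxBytes;
        // Continuation bytes look like 10xxxxxx. If the byte at the cut
        // point is one, the sequence it belongs to is split; back up to
        // that sequence's lead byte and drop the whole sequence.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) end--;
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }
}
```

For example, truncating a two-character Japanese string (6 bytes) to 4 bytes yields one whole character (3 bytes) rather than a character and a stranded lead byte.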
More details can be found at [2].
Bye, Christopher
[1] http://download.wikimedia.org/jawiki/20090927/jawiki-20090927-pages-articles...
[2] http://en.wikipedia.org/wiki/User:Chrisahn/CommentTooLong
[3] http://www.utf8-chartable.de/unicode-utf8-table.pl?start=65520
On Fri, Oct 16, 2009 at 3:25 PM, Jona Christopher Sahnwaldt jcsahnwaldt@gmail.com wrote:
How to fix this? I think MediaWiki should make sure that a comment contains only valid UTF-8 sequences, even when it is truncated. This may mean that it has to be truncated to less than 255 bytes.
Alternatively, the dump process could drop invalid UTF-8 sequences instead of replacing them.
Yet another fix: mwdumper should make sure that a comment is at most 255 bytes long and truncate it if necessary.
The silent truncation thing is ridiculous on a lot of levels:
* It creates invalid UTF-8 if MySQL isn't using a utf8 collation (which has its own problems, and Wikimedia doesn't use it).
* How many characters can be stored depends on how many bytes per character the relevant language's writing system needs in UTF-8, so Chinese/Arabic/Hebrew/Greek/Russian/etc. users get <150 characters. (Unless you use the utf8 collation, which has its own problems, and Wikimedia doesn't use it.)
* It will cause a fatal error if MySQL is in strict mode.
* It makes it difficult to impossible for a specific wiki to decide to allow longer edit summaries. (Personally, I find 255 characters is often too short. I think Citizendium is hacked to allow more.)
* The limit in MySQL is counted in bytes (unless you use the utf8 collation, etc.), but HTML maxlength is counted in characters, so we have no way to effectively limit things client-side without JavaScript. Currently we fake it by setting a maxlength of 200 characters and hoping that winds up being less than 255 bytes. That leaves enough breathing room that languages like French don't overrun, but I assume speakers of Chinese/Arabic/Hebrew/Greek/Russian/etc. languages are just resigned to the fact that their edit summaries get unpredictably truncated. It's also unnecessarily small for English, where 255 characters would usually fit -- in fact enwiki has a Gadget that hacks this up, and I've sometimes edited up the maxlength manually when I wanted a little more space.
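The bytes-vs-characters mismatch is easy to demonstrate (a small Java illustration with made-up sample strings): String.length() counts what maxlength counts, while the UTF-8 byte length is what the column limit actually enforces.

```java
import java.nio.charset.StandardCharsets;

public class ByteVsCharDemo {
    public static void main(String[] args) {
        String latin = "r\u00e9sum\u00e9";                     // "résumé", 6 characters
        String japanese = "\u3053\u3093\u306b\u3061\u306f";    // "こんにちは", 5 characters
        // maxlength counts characters:
        System.out.println(latin.length());    // 6
        System.out.println(japanese.length()); // 5
        // the column limit counts UTF-8 bytes:
        System.out.println(latin.getBytes(StandardCharsets.UTF_8).length);    // 8
        System.out.println(japanese.getBytes(StandardCharsets.UTF_8).length); // 15
    }
}
```

At three bytes per character, a 255-byte column holds at most 85 Japanese characters, so a 200-character maxlength overshoots the real limit by more than a factor of two.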
The correct fix is just to make the field TEXT/BLOB so the length limit is enforced purely in the application. The same goes for log_comment, where last I checked we weren't even doing the maxlength=200 hack. Are there any objections to finally doing this? What's the procedure these days for schema changes? I'd check in something right now, in fact, except that I have to go in like ten minutes.
(erm, end rant)
Aryeh Gregor wrote:
The correct fix is just to make the field TEXT/BLOB so the length limit is enforced purely in the application. The same goes for log_comment, where last I checked we weren't even doing the maxlength=200 hack. Are there any objections to finally doing this? What's the procedure these days for schema changes? I'd check in something right now, in fact, except that I have to go in like ten minutes.
There's long been a desire to lengthen the edit summary[1] (and log reasons too) - why not do that while the patient is open on the operating table?
-Mike
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=4714
On Fri, Oct 16, 2009 at 7:08 PM, Mike.lifeguard mike.lifeguard@gmail.com wrote:
There's long been a desire to lengthen the edit summary[1] (and log reasons too) - why not do that while the patient is open on the operating table?
Yes, of course. That would be trivial with the schema update in place, and impossible without.
On Fri, Oct 16, 2009 at 21:25, Jona Christopher Sahnwaldt jcsahnwaldt@gmail.com wrote:
Yet another fix: mwdumper should make sure that a comment is at most 255 bytes long and truncate it if necessary.
I implemented this fix / hack and checked it in at
http://dbpedia.svn.sourceforge.net/viewvc/dbpedia?view=rev&revision=1771
Seems to fix that problem for me. Feel free to copy that code back to mediawiki if you want.
Thanks for mwdumper, by the way! Nice little piece of code. Quite clean and modular.
Christopher
Oops, I just realized that this is a known bug:
https://bugzilla.wikimedia.org/show_bug.cgi?id=13721
Should have checked there first.
Anyway, allowing longer comments sounds like a good idea.