On Fri, Oct 16, 2009 at 3:25 PM, Jona Christopher Sahnwaldt <jcsahnwaldt@gmail.com> wrote:
How to fix this? I think MediaWiki should make sure that a comment contains only valid UTF-8 sequences, even when it is truncated. This may mean that it has to be truncated to less than 255 bytes.
Alternatively, the dump process could drop invalid UTF-8 sequences instead of replacing them.
Yet another fix: mwdumper should make sure that a comment is at most 255 bytes long and truncate it if necessary.
The silent truncation thing is ridiculous on a lot of levels:
* It creates invalid UTF-8 if MySQL isn't using a utf8 collation (which has its own problems, and Wikimedia doesn't use it).
* How many characters can be stored depends on how many bytes per character the relevant language's writing system happens to take in UTF-8, so Chinese/Arabic/Hebrew/Greek/Russian/etc. users get <150 characters. (Unless you use the utf8 collation, which has its own problems, and Wikimedia doesn't use it.)
* It will cause a fatal error if MySQL is in strict mode.
* It makes it difficult to impossible for a specific wiki to decide to allow longer edit summaries. (Personally, I find 255 characters is often too short. I think Citizendium is hacked to allow more.)
* The limit in MySQL is counted in bytes (unless you use the utf8 collation, etc.), but HTML maxlength is counted in characters, so we have no way to effectively limit things client-side without JavaScript. Currently we fake it by setting a maxlength of 200 characters and hoping that winds up being less than 255 bytes. That leaves enough breathing room that languages like French don't overrun, but I assume speakers of Chinese/Arabic/Hebrew/Greek/Russian/etc. are just resigned to the fact that their edit summaries get unpredictably truncated. Also, it's unnecessarily small for English, where 255 characters would usually fit -- in fact enwiki has a Gadget that hacks this up, and I've sometimes edited up the maxlength manually when I found I wanted a little more space.
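To put rough numbers on the bytes-versus-characters mismatch, here is a small Python sketch. The sample strings are made up for illustration (200 repetitions of a single character per script, which is a worst case for French and not a realistic summary), and the helper names are hypothetical:

```python
# Illustrative only: how many UTF-8 bytes a 200-character summary takes,
# and how many characters of each kind fit into a 255-byte column.

BYTE_LIMIT = 255  # size of the comment column, counted in bytes

# Hypothetical samples: 200 copies of one representative character.
samples = {
    "English": "a" * 200,
    "French": "é" * 200,   # worst case: every character accented
    "Russian": "я" * 200,
    "Chinese": "中" * 200,
}


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def max_chars(char: str, limit: int = BYTE_LIMIT) -> int:
    """How many copies of `char` fit into `limit` bytes of UTF-8."""
    return limit // utf8_len(char)


byte_lengths = {name: utf8_len(text) for name, text in samples.items()}

for name, text in samples.items():
    print(name, byte_lengths[name], "bytes;", max_chars(text[0]), "chars fit")
```

A maxlength of 200 characters only guarantees staying under 255 bytes when nearly every character is a single byte, which is roughly true for English and not at all for scripts that take two or three bytes per character.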
The correct fix is just to make the field TEXT/BLOB so the length limit is enforced purely in the application. The same goes for log_comment, where last I checked we weren't even doing the maxlength=200 hack. Are there any objections to finally doing this? What's the procedure these days for schema changes? I'd check in something right now, in fact, except that I have to go in like ten minutes.
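For illustration, an application-side limit might look something like the following Python sketch. This is not MediaWiki code; the function names and the 255 default are assumptions made for the example. It shows both a limit counted in characters and, if a byte limit has to stay, a truncation that never cuts through a multi-byte sequence:

```python
# A minimal sketch of enforcing the summary limit in the application
# rather than relying on the database to truncate silently.
# Function names and defaults are hypothetical.


def limit_chars(text: str, max_chars: int = 255) -> str:
    """Limit by code points, which is what HTML maxlength roughly counts."""
    return text[:max_chars]


def truncate_utf8(text: str, max_bytes: int = 255) -> str:
    """Limit by UTF-8 bytes without leaving a partial character at the end."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The input is valid UTF-8, so only the tail of the cut can be an
    # incomplete sequence; errors="ignore" drops exactly that fragment.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


russian = truncate_utf8("я" * 200)
print(len(russian), len(russian.encode("utf-8")))
```

Either way the stored value is always valid UTF-8. Note that cutting at a code point can still split a combining sequence, so a careful implementation might want to truncate at grapheme boundaries instead.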
(erm, end rant)