I was implementing a configuration option to allow changes to the maximum length of edit summaries and log reasons, which I've often felt are inadequate. In doing so, I discovered an issue that needs to be resolved before I proceed. Basically, when you type stuff in the form, the input box's maxlength parameter is a maximum number of *characters*. But when stuff gets validated and put in the database, it's generally a number of *bytes* that things are truncated to. I've verified that Firefox will actually permit 200 multibyte characters to be submitted as an edit summary, when they cannot possibly fit into the database field. Now, we actually have two different ways of dealing with this right now:
* Log reasons just ignore the issue and pretend maxlength (which is 255 for them) is in bytes. Chances are good that's because whoever wrote up that code thought they *were*. If the log reason happens to end up being more than 255 bytes, as far as I can tell it will get sent to the database as-is. That part should definitely be fixed: MySQL in strict mode, and presumably PostgreSQL and Oracle, will return a fatal error if this occurs. MySQL in non-strict mode will silently truncate it without regard to character boundaries, which is only slightly better (and I guess users of non-English Wikipedias have gotten used to this behavior).
* Edit summaries do a sort of hacky workaround. They specify maxlength=200, but then truncate the summary (using a nice Unicode-aware truncation function) to 250 bytes, and then add an ellipsis on the end if necessary. This probably works great for Latin-based languages that just have the occasional two-byte character with diacritic, but isn't much of a win for speakers of Hebrew or Greek or Chinese. For English speakers, it's just annoying, since it artificially limits the edit summary length. Using Firebug, I was able to delete the maxlength parameter and submit a 250-character summary on the English Wikipedia (a trick I'll remember for the future if this doesn't get changed soon ;) ).
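To make the character/byte mismatch concrete, here is a rough sketch in Python (not the actual MediaWiki code, which is PHP). The function name `truncate_to_bytes` is made up for illustration, and I'm assuming the ellipsis is counted inside the byte limit, which may not match what the real truncation function does.

```python
def truncate_to_bytes(text: str, max_bytes: int, ellipsis: str = "...") -> str:
    """Cut text so its UTF-8 encoding fits in max_bytes without splitting a character.

    Hypothetical helper for illustration only; assumes the ellipsis
    counts toward the byte limit.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text  # already fits, nothing to do
    budget = max_bytes - len(ellipsis.encode("utf-8"))
    if budget < 0:
        raise ValueError("max_bytes is too small to hold the ellipsis")
    # Slicing bytes may cut a multibyte character in half; decoding with
    # errors="ignore" drops that trailing partial sequence.
    return encoded[:budget].decode("utf-8", errors="ignore") + ellipsis


# 200 characters that pass a maxlength=200 check but need 400 bytes.
summary = "é" * 200
print(len(summary), len(summary.encode("utf-8")))

shortened = truncate_to_bytes(summary, 250)
print(len(shortened), len(shortened.encode("utf-8")))
```

Note that this only respects code point boundaries. It can still separate a base character from a following combining mark, so it's a sketch of the idea and not a complete answer.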
The clean way to fix this, it seems to me, is as follows:
* For users of the MySQL UTF-8 schema, there should theoretically be no problem: all the database sizes are in characters already. The only change needed for these wikis is to up the edit summary length to the length of the database field and change the existing truncation logic to work in characters. I don't think we have any clean way to detect this scenario in the code at present, however, and anyway they're not important, they aren't Wikipedia. ;)
* For everyone else, the client-side maxlength limit needs to be changed to a byte count. This should be possible using JavaScript, and is probably not otherwise possible (although if it were that would be great). In the event of a client-side overrun (say, because the user doesn't have JavaScript), the server should truncate the provided string and return it in an error message for the user to manually adjust. It should not silently truncate it. In this case as well, the maxlength for edit summaries should be upped to 255 bytes.
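The server-side half of the second bullet could look roughly like the following. This is a Python sketch of the logic only, not a proposal for the real PHP code; `ReasonCheck`, `check_reason` and `MAX_REASON_BYTES` are names I made up, and the error wording is a placeholder. The client-side byte counting would still have to be done in JavaScript and isn't shown here.

```python
from dataclasses import dataclass
from typing import Optional

MAX_REASON_BYTES = 255  # assumed field size in bytes, per the proposal above


@dataclass
class ReasonCheck:
    ok: bool                     # True if the reason can be stored as-is
    reason: str                  # accepted reason, or a truncated suggestion
    error: Optional[str] = None  # message to show the user when ok is False


def check_reason(reason: str, max_bytes: int = MAX_REASON_BYTES) -> ReasonCheck:
    """Reject over-long reasons and hand back a truncated version to edit.

    Hypothetical helper: nothing is silently truncated and stored.
    """
    encoded = reason.encode("utf-8")
    if len(encoded) <= max_bytes:
        return ReasonCheck(ok=True, reason=reason)
    # Cut at the byte limit, dropping any partial trailing character.
    suggestion = encoded[:max_bytes].decode("utf-8", errors="ignore")
    message = (
        f"Reason is {len(encoded)} bytes; the limit is {max_bytes}. "
        "Please shorten it and resubmit."
    )
    return ReasonCheck(ok=False, reason=suggestion, error=message)


result = check_reason("я" * 200)  # 200 characters, 400 bytes
if not result.ok:
    print(result.error)
    print(len(result.reason.encode("utf-8")))
```

The caller gets the truncated text back alongside the error, so each entry point can redisplay its form with the suggestion filled in instead of saving it.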
What does everyone else think about this? Returning a truncated reason in an error message will be a pain to write up, because there are so many entry points to this logic that will need to handle the error in their own strange and idiosyncratic ways, but I think it's the only correct thing to do here.