I was implementing a configuration option to allow changes to the
maximum length of edit summaries and log reasons, which I've often
felt are inadequate. In doing so, I discovered an issue that needs to
be resolved before I proceed. Basically, when you type stuff in the
form, the input box's maxlength attribute is a maximum number of
*characters*. But when stuff gets validated and put in the database,
it's generally a number of *bytes* that things are truncated to. I've
verified that Firefox will actually permit 200 multibyte characters
(at least 400 bytes in UTF-8) to be submitted as an edit summary, when
they cannot possibly fit into the database field. We currently have
two different ways of dealing with this:
* Log reasons just ignore the issue and pretend maxlength (which is
255 for them) is in bytes. Chances are good that's because whoever
wrote up that code thought it *was*. If the log reason happens to
end up being more than 255 bytes, as far as I can tell it will get
sent to the database as-is. That part should definitely be fixed:
MySQL in strict mode, and presumably PostgreSQL and Oracle, will
return a fatal error if this occurs. MySQL in non-strict mode will
silently truncate it without regard to character boundaries, which is
only slightly better (and I guess users of non-English Wikipedias have
gotten used to this behavior).
* Edit summaries do a sort of hacky workaround. They specify
maxlength=200, but then truncate the summary (using a nice
Unicode-aware truncation function) to 250 bytes, and then add an
ellipsis on the end if necessary. This probably works great for
Latin-based languages that just have the occasional two-byte character
with diacritic, but isn't much of a win for speakers of Hebrew or
Greek or Chinese. For English speakers, it's just annoying, since it
artificially limits the edit summary length. Using Firebug, I was
able to delete the maxlength attribute and submit a 250-character
summary on the English Wikipedia (a trick I'll remember for the future
if this doesn't get changed soon ;) ).
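To make the character/byte mismatch concrete, here is a rough sketch in
Python. It is not MediaWiki's code: the name truncate_utf8 is made up
for illustration, and unlike the current edit summary logic (truncate
to 250 bytes, then append the ellipsis) it reserves room for the
ellipsis inside the byte limit, which is my own assumption about what
is wanted.

```python
def truncate_utf8(text: str, max_bytes: int, ellipsis: str = "...") -> str:
    """Shorten text so its UTF-8 encoding fits in max_bytes.

    Hypothetical helper: it cuts on a character boundary and appends an
    ellipsis only when something was actually removed.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text  # already fits, leave it alone

    budget = max_bytes - len(ellipsis.encode("utf-8"))
    if budget < 0:
        raise ValueError("max_bytes is too small to hold the ellipsis")

    # Slicing bytes may leave half a character at the end. Since the input
    # was valid UTF-8, errors="ignore" drops only that trailing fragment.
    head = encoded[:budget].decode("utf-8", errors="ignore")
    return head + ellipsis


# 200 characters is what maxlength=200 allows in the form.
latin = "a" * 200    # 200 bytes in UTF-8
greek = "α" * 200    # 400 bytes in UTF-8 (2 bytes per character)
chinese = "中" * 200  # 600 bytes in UTF-8 (3 bytes per character)

for sample in (latin, greek, chinese):
    shortened = truncate_utf8(sample, 255)
    print(len(sample), len(sample.encode("utf-8")),
          len(shortened), len(shortened.encode("utf-8")))
```

All three samples pass a 200-character client-side check, but only the
Latin one fits a 255-byte field untouched. Cutting the raw bytes at a
fixed offset instead (what non-strict MySQL effectively does) can leave
a broken character at the end.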
The clean way to fix this, it seems to me, is as follows:
* For users of the MySQL UTF-8 schema, there should theoretically be
no problem: all the database sizes are in characters already. The
only change needed for these wikis is to up the edit summary length to
the length of the database field and change the existing truncation
logic to work in characters. I don't think we have any clean way to
detect this scenario in the code at present, however, and anyway
they're not important: they aren't Wikipedia. ;)
* For everyone else, the client-side maxlength limit needs to be
changed to a byte count. This should be possible using JavaScript,
and is probably not possible otherwise (although if it were, that would
be great). In the event of a client-side overrun (say, because the
user doesn't have JavaScript), the server should truncate the provided
string and return it in an error message for the user to manually
adjust. It should not silently truncate it. In this case as well,
the maxlength for edit summaries should be upped to 255 bytes.
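A minimal sketch of the server-side half, again in Python rather than
real MediaWiki code. The names ReasonCheck and check_reason and the
wording of the message are invented for illustration; the 255-byte
default is the field size discussed above. The point is only that an
overlong reason comes back truncated together with an error, instead of
being silently shortened and saved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReasonCheck:
    ok: bool                     # True if the reason can be stored as-is
    value: str                   # the original text, or a truncated copy
    error: Optional[str] = None  # message to show the user when not ok


def check_reason(text: str, max_bytes: int = 255) -> ReasonCheck:
    """Validate a summary/log reason against a byte limit (hypothetical)."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return ReasonCheck(ok=True, value=text)

    # Cut on a character boundary: the input is valid UTF-8, so
    # errors="ignore" only discards a partial character at the very end.
    truncated = encoded[:max_bytes].decode("utf-8", errors="ignore")
    message = (
        f"Your text is {len(encoded)} bytes long, but the limit is "
        f"{max_bytes} bytes. It has been shortened below; please review "
        f"it and submit again."
    )
    return ReasonCheck(ok=False, value=truncated, error=message)


result = check_reason("α" * 200)
if not result.ok:
    # A caller would redisplay the form with result.value in the input box
    # and result.error above it, rather than saving anything.
    print(result.error)
```

Each entry point would still have to decide how to redisplay its own
form, which is the painful part.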
What does everyone else think about this? Returning a truncated
reason in an error message will be a pain to write up, because there
are so many entry points to this logic that will need to handle the
error in their own strange and idiosyncratic ways, but I think it's
the only correct thing to do here.