I was implementing a configuration option to allow changes to the maximum length of edit summaries and log reasons, which I've often felt are inadequate. In doing so, I discovered an issue that needs to be resolved before I proceed. Basically, when you type stuff in the form, the input box's maxlength parameter is a maximum number of *characters*. But when stuff gets validated and put in the database, it's generally a number of *bytes* that things are truncated to. I've verified that Firefox will actually permit 200 multibyte characters to be submitted as an edit summary, when they cannot possibly fit into the database field. Now, we actually have two different ways of dealing with this right now:
* Log reasons just ignore the issue and pretend maxlength (which is 255 for them) is in bytes. Chances are good that's because whoever wrote up that code thought they *were*. If the log reason happens to end up being more than 255 bytes, as far as I can tell it will get sent to the database as-is. That part should definitely be fixed: MySQL in strict mode, and presumably PostgreSQL and Oracle, will return a fatal error if this occurs. MySQL in non-strict mode will silently truncate it without regard to character boundaries, which is only slightly better (and I guess users of non-English Wikipedias have gotten used to this behavior).
* Edit summaries do a sort of hacky workaround. They specify maxlength=200, but then truncate the summary (using a nice Unicode-aware truncation function) to 250 bytes, and then add an ellipsis on the end if necessary. This probably works great for Latin-based languages that just have the occasional two-byte character with diacritic, but isn't much of a win for speakers of Hebrew or Greek or Chinese. For English speakers, it's just annoying, since it artificially limits the edit summary length. Using Firebug, I was able to delete the maxlength parameter and submit a 250-character summary on the English Wikipedia (a trick I'll remember for the future if this doesn't get changed soon ;) ).
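To make the character/byte mismatch concrete, here is a minimal JavaScript sketch. It is not MediaWiki's actual code: the names `utf8ByteLength` and `truncateToBytes` are made up for illustration, and it assumes an environment that provides the standard `TextEncoder` API (modern browsers, Node.js).

```javascript
// Hypothetical helpers for illustration; not part of MediaWiki.
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Number of bytes the string occupies when stored as UTF-8.
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

// Keep as many whole characters (code points) as fit in maxBytes,
// so a multibyte character is never cut in half.
function truncateToBytes(text, maxBytes) {
  let used = 0;
  let result = '';
  for (const ch of text) { // for...of iterates by code point
    const size = utf8ByteLength(ch);
    if (used + size > maxBytes) break;
    used += size;
    result += ch;
  }
  return result;
}

const summary = 'א'.repeat(200); // 200 Hebrew letters, 2 bytes each in UTF-8
console.log(summary.length);                        // 200, so maxlength=200 lets it through
console.log(utf8ByteLength(summary));               // 400, far more than a 255-byte field holds
console.log(truncateToBytes(summary, 255).length);  // 127 characters (254 bytes) survive
```

The same 200-character limit therefore means 200 bytes of English but only about 127 characters of Hebrew or Greek, and fewer still for Chinese.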
The clean way to fix this, it seems to me, is as follows:
* For users of the MySQL UTF-8 schema, there should theoretically be no problem: all the database sizes are in characters already. The only change needed for these wikis is to up the edit summary length to the length of the database field and change the existing truncation logic to work in characters. I don't think we have any clean way to detect this scenario in the code at present, however, and anyway they're not important, they aren't Wikipedia. ;)
* For everyone else, the client-side maxlength limit needs to be changed to a byte count. This should be possible using JavaScript, and is probably not otherwise possible (although if it were that would be great). In the event of a client-side overrun (say, because the user doesn't have JavaScript), the server should truncate the provided string and return it in an error message for the user to manually adjust. It should not silently truncate it. In this case as well, the maxlength for edit summaries should be upped to 255 bytes.
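A client-side byte limit along these lines might look roughly like the sketch below. Everything here is an assumption for illustration: the function names are invented, the element ID `wpSummary` is assumed to be the edit summary box, and `TextEncoder` is assumed to be available. The server still has to check the length, since the script may never run.

```javascript
// Hypothetical client-side sketch; names and the element ID are assumptions.
const encoder = new TextEncoder(); // encodes to UTF-8

function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

// Trim field.value so it never exceeds maxBytes of UTF-8.
// Works on any object with a string `value` property, such as an <input>.
// Returns true if the value had to be shortened.
function enforceByteLimit(field, maxBytes) {
  if (utf8ByteLength(field.value) <= maxBytes) return false;
  let used = 0;
  let kept = '';
  for (const ch of field.value) { // iterate by code point
    const size = utf8ByteLength(ch);
    if (used + size > maxBytes) break;
    used += size;
    kept += ch;
  }
  field.value = kept;
  return true;
}

// Browser hookup; skipped when no DOM is present (for example under Node.js).
if (typeof document !== 'undefined') {
  const input = document.getElementById('wpSummary'); // assumed ID of the summary box
  if (input) {
    input.removeAttribute('maxlength'); // drop the character-based limit
    input.addEventListener('input', () => enforceByteLimit(input, 255));
  }
}
```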
What does everyone else think about this? Returning a truncated reason in an error message will be a pain to write up, because there are so many entry points to this logic that will need to handle the error in their own strange and idiosyncratic ways, but I think it's the only correct thing to do here.
Simetrical wrote:
What does everyone else think about this? Returning a truncated reason in an error message will be a pain to write up, because there are so many entry points to this logic that will need to handle the error in their own strange and idiosyncratic ways, but I think it's the only correct thing to do here.
How about some general summary processing function that does that?
Roan Kattouw (Catrope)
On Sun, Mar 2, 2008 at 12:00 PM, Roan Kattouw roan.kattouw@home.nl wrote:
Simetrical wrote:
What does everyone else think about this? Returning a truncated reason in an error message will be a pain to write up, because there are so many entry points to this logic that will need to handle the error in their own strange and idiosyncratic ways, but I think it's the only correct thing to do here.
How about some general summary processing function that does that?
That will generate the error, but it needs to be handled individually by each caller. There's no generic "generate an error message for this page that allows the user to resubmit the form" function.
That will generate the error, but it needs to be handled individually by each caller. There's no generic "generate an error message for this page that allows the user to resubmit the form" function.
They don't need to resubmit the whole form, just the summary. Could a generic function give them the chance to change the summary and just pass all the rest of the fields through untouched?
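One way to picture such a generic function is sketched below in JavaScript. It is purely illustrative: `buildResubmitForm` and `escapeHtml` are hypothetical names, and the field names in the usage note are only examples. It carries every submitted field through as a hidden input and exposes only the summary for editing.

```javascript
// Hypothetical sketch of a generic "fix your summary" form builder.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// params: plain object of submitted field names to values.
// summaryName: the one field the user is allowed to change.
// truncatedSummary: the server's shortened version, offered for editing.
function buildResubmitForm(action, params, summaryName, truncatedSummary) {
  const hidden = Object.entries(params)
    .filter(([name]) => name !== summaryName) // the summary gets a visible box instead
    .map(([name, value]) =>
      `<input type="hidden" name="${escapeHtml(name)}" value="${escapeHtml(value)}">`);
  return [
    `<form method="post" action="${escapeHtml(action)}">`,
    ...hidden,
    `<input type="text" name="${escapeHtml(summaryName)}" value="${escapeHtml(truncatedSummary)}">`,
    '<input type="submit">',
    '</form>',
  ].join('\n');
}
```

A caller would pass in whatever it received, for example `buildResubmitForm('/submit', { wpTextbox1: 'page text', wpSummary: 'too long' }, 'wpSummary', 'too')`, without needing to know what the other fields mean.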
On Sun, Mar 2, 2008 at 12:12 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
They don't need to resubmit the whole form, just the summary. Could a generic function give them the chance to change the summary and just pass all the rest of the fields through untouched?
An interesting thought.
Simetrical wrote:
- For everyone else, the client-side maxlength limit needs to be changed to a byte count. This should be possible using JavaScript, and is probably not otherwise possible (although if it were that would be great). In the event of a client-side overrun (say, because the user doesn't have JavaScript), the server should truncate the provided string and return it in an error message for the user to manually adjust. It should not silently truncate it. In this case as well, the maxlength for edit summaries should be upped to 255 bytes.
I once wrote some code that limits the summary length in bytes by simply preventing the user from typing more bytes than will fit, but I didn't test it thoroughly, so it was never enabled: http://he.wikipedia.org/wiki/MediaWiki:Summarylength.js (this code requires the last function in http://he.wikipedia.org/wiki/MediaWiki:Functions.js , String.prototype.getByteLength).
Simetrical wrote:
I was implementing a configuration option to allow changes to the maximum length of edit summaries and log reasons, which I've often felt are inadequate. In doing so, I discovered an issue that needs to be resolved before I proceed. Basically, when you type stuff in the form, the input box's maxlength parameter is a maximum number of *characters*. But when stuff gets validated and put in the database, it's generally a number of *bytes* that things are truncated to. I've verified that Firefox will actually permit 200 multibyte characters to be submitted as an edit summary, when they cannot possibly fit into the database field.
The current mix of limits between the HTML form field length and DB field length is known to not be an exact match; it's an approximate compromise.
There would however *be* no strict need to limit the summary length if the DB field is expanded -- going from VARCHAR or TINYBLOB out to BLOB will make pretty much arbitrarily large text possible.
If a limit is still desired, the most sensible thing would simply be to trim it at whatever huge number of characters you like. This can be done with the $wgContLang->trim() function on submission.
-- brion vibber (brion @ wikimedia.org)
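Trimming by character count rather than byte count can be sketched as follows. This is a JavaScript illustration under stated assumptions, not the behavior of any MediaWiki function: `truncateToChars` is a hypothetical name, and "character" here means a Unicode code point, so a base letter plus a combining mark counts as two.

```javascript
// Hypothetical sketch: limit a string to maxChars Unicode code points.
function truncateToChars(text, maxChars) {
  const chars = Array.from(text); // splits by code point, not UTF-16 unit
  if (chars.length <= maxChars) return text;
  return chars.slice(0, maxChars).join('');
}

// The same limit now means the same thing in every script:
console.log(Array.from(truncateToChars('a'.repeat(600), 500)).length);  // 500
console.log(Array.from(truncateToChars('中'.repeat(600), 500)).length); // 500
```

Because the limit no longer depends on how many bytes each character takes, the HTML maxlength and the server-side check can share one number.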
So... In line with your comments, the sane thing to do would be to:
- Wait for the fields in the database to be changed to BLOB for their usage.
- Set up a config variable with an array of what character size to limit certain summaries to.
- Use that character size limit inside the HTML generation (likely the size= parameter of the XML:: function) rather than the current hardcoded one.
- Call $wgContLang->trim(); to limit comments to a universal character count, rather than a byte count.
- And in certain cases have the user resubmit with a new summary.
Now for my notes:
- Resubmission for certain forms like the edit form should probably be handled the same way other errors are treated. (We could probably just stick in a little extra code, since we're already detecting blank edit summaries for users.)
- It would probably be best if we had a user preference (disabled by default) that tells the server to just silently truncate their comments without warning them, because some users may not want to resubmit the large number of forms they submit all the time.
- As for the generic function, it should be possible:
  -- We could make it another one of those error page generation functions inside of OutputPage:: and just call it through $wgOut.
  -- Collecting the form information should be possible by using $wgRequest->wasPosted() ? 'POST' : 'GET' to first determine the type of form.
  -- However, there is currently no function that returns an array of everything that was passed, and using $_GET and $_POST is deprecated. We'd have to add a function for that to WebRequest. Additionally, it may be a good idea to separate the logic for what is GET and what is POST, as it's possible to GET a few of the values while POSTing the rest.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
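The per-type limit array and the silent-truncation preference described above could fit together roughly as in this JavaScript sketch. All names and numbers are hypothetical (`summaryLimits`, `applySummaryLimit`, the value 500); nothing here is an existing MediaWiki setting.

```javascript
// Hypothetical configuration: character limits per kind of summary.
const summaryLimits = new Map([
  ['edit', 500],
  ['log', 500],
]);

// Returns { ok, summary, error? }. With silentTruncate (the proposed user
// preference) an over-long summary is shortened without complaint; otherwise
// the caller gets the shortened text back together with an error to show.
function applySummaryLimit(type, text, silentTruncate = false) {
  if (!summaryLimits.has(type)) {
    throw new Error(`No limit configured for summary type "${type}"`);
  }
  const limit = summaryLimits.get(type);
  const chars = Array.from(text); // count code points, not bytes
  if (chars.length <= limit) return { ok: true, summary: text };
  const summary = chars.slice(0, limit).join('');
  if (silentTruncate) return { ok: true, summary };
  return {
    ok: false,
    summary,
    error: `Summary is ${chars.length} characters long; the limit is ${limit}.`,
  };
}
```

Each entry point would then only need to check `ok` and either save `summary` or show `error` alongside it.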
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Sun, Mar 2, 2008 at 5:13 PM, DanTMan dan_the_man@telus.net wrote:
- And in certain cases have the user resubmit with a new summary.
No, the whole point is that should be unnecessary. If we're limiting only by character count, we can use character-based limits on both client and server side. That was only going to be necessary if the client had to use character limits while the server used byte limits. This removes a whole group of issues and also is more reasonable -- unless Greek/Russian/etc. speakers are just naturally more concise than English speakers, which I doubt. (In the case of abjads like Hebrew and Arabic, or especially logographic writing systems like Chinese, a lower character limit actually makes sense if the goal is to avoid rambling summaries, but that's more of a random side-effect than an intended advantage.)
On 02/03/2008, Simetrical Simetrical+wikilist@gmail.com wrote:
On Sun, Mar 2, 2008 at 5:13 PM, DanTMan dan_the_man@telus.net wrote:
- And in certain cases have the user resubmit with a new summary.
No, the whole point is that should be unnecessary.
It could happen in the case of bots or any other time when people aren't typing the summary into an actual textbox. How you get a bot to re-submit a summary, I don't know - just truncating it is the only option.
On Sun, Mar 2, 2008 at 6:23 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
It could happen in the case of bots or any other time when people aren't typing the summary into an actual textbox. How you get a bot to re-submit a summary, I don't know - just truncating it is the only option.
If we provide no interface to allow long summary submissions then we can safely truncate it in the event some screwy client does send one, IMO.
On 02/03/2008, Simetrical Simetrical+wikilist@gmail.com wrote:
On Sun, Mar 2, 2008 at 6:23 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
It could happen in the case of bots or any other time when people aren't typing the summary into an actual textbox. How you get a bot to re-submit a summary, I don't know - just truncating it is the only option.
If we provide no interface to allow long summary submissions then we can safely truncate it in the event some screwy client does send one, IMO.
Safely or not, I don't see an alternative.
On Sun, Mar 2, 2008 at 3:56 PM, Brion Vibber brion@wikimedia.org wrote:
The current mix of limits between the HTML form field length and DB field length is known to not be an exact match; it's an approximate compromise.
There would however *be* no strict need to limit the summary length if the DB field is expanded -- going from VARCHAR or TINYBLOB out to BLOB will make pretty much arbitrarily large text possible.
If a limit is still desired, the most sensible thing would simply be to trim it at whatever huge number of characters you like. This can be done with the $wgContLang->trim() function on submission.
That would be . . . a dramatically simpler and cleverer way of doing things. I like it. :D Are you (or someone) willing to do the schema change? I'll commit some code to truncate log comments by byte count now, so that the status quo doesn't dramatically change if some bot commits really long log summaries and relies on current behavior, and the schema is changed to allow long log summaries. If the schema changes are okay with you, I'd be happy to write up the code to do the configurable client- and server-side limiting.
Simetrical wrote:
On Sun, Mar 2, 2008 at 3:56 PM, Brion Vibber brion@wikimedia.org wrote:
The current mix of limits between the HTML form field length and DB field length is known to not be an exact match; it's an approximate compromise.
There would however *be* no strict need to limit the summary length if the DB field is expanded -- going from VARCHAR or TINYBLOB out to BLOB will make pretty much arbitrarily large text possible.
If a limit is still desired, the most sensible thing would simply be to trim it at whatever huge number of characters you like. This can be done with the $wgContLang->trim() function on submission.
That would be . . . a dramatically simpler and cleverer way of doing things. I like it. :D Are you (or someone) willing to do the schema change?
I think it's one of those that's been on the agenda for some time. Might as well do it. :)
-- brion