On Wed, 04-08-2010, at 04:17 +0000, MZMcBride wrote:
Aryeh Gregor wrote:
The same could be said of practically any user-visible change. I mean, maybe if we add a new special page we'll break some script that was screen-scraping Special:SpecialPages. We can either freeze MediaWiki and never change anything for fear that we'll break something, or we can evaluate each potential change on the basis of how likely it is to break anything. I can't see anything breaking too badly if rev_len is reported in characters instead of bytes -- the only place it's likely to be useful is in heuristics, and by their nature, those won't break too badly if the numbers they're based on change somewhat.
This is problematic logic for a few reasons. I see a change to the rev_len logic as being similar to a change in article count logic. The same arguments work in both places, specifically the "step problem" that will cause nasty jumps in graphs.[1]
In some cases, as you've noted, we're talking about a change by a factor of three. Plenty of scripts rely on hard-coded values to determine size thresholds for certain behaviors. While these scripts may not have the best implementations, I don't think it's fair to say that they're worth breaking.
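To make the factor of three concrete, here is a minimal sketch, assuming the text is stored as UTF-8 and that "characters" means Unicode code points. The helper names and the threshold value of 500 are made up for illustration and do not come from any real script.

```python
def byte_len(text: str) -> int:
    """Length in bytes when the text is encoded as UTF-8."""
    return len(text.encode("utf-8"))


def char_len(text: str) -> int:
    """Length in Unicode code points."""
    return len(text)


# Hypothetical hard-coded threshold of the kind a script might use.
SIZE_THRESHOLD = 500


def is_below_threshold(length: int) -> bool:
    """True if a revision of this length would be treated as 'small'."""
    return length < SIZE_THRESHOLD


english = "a" * 200        # ASCII: 1 byte per character in UTF-8
japanese = "\u6539" * 200  # CJK ideograph: 3 bytes per character in UTF-8

for label, text in (("English", english), ("Japanese", japanese)):
    print(label, char_len(text), byte_len(text),
          is_below_threshold(char_len(text)),
          is_below_threshold(byte_len(text)))
```

For the ASCII text both measures agree, so the threshold behaves the same either way. For the CJK text the byte count is three times the character count, so the same hard-coded threshold gives a different answer depending on which unit rev_len uses.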
The comparison to screen-scraping seems pretty spurious as well. The reason it's acceptable to break screen-scraping scripts is that there's a functioning API alternative that is designed for bots and scripts. One of the design principles is consistency. Altering a metric by up to a factor of three (and even worse, doing so in an unpredictable manner) breaks this consistency needlessly.
Is it worth the cost to add 300 million+ rows to easily have character count? I don't know. Personally, I don't mind rev_len being in bytes; it makes more sense from a database and technical perspective to me. Admittedly, though, I deal mostly with English sites.
I'm all for the change, but it would have to be announced well in advance of rollout and coordinated with other folks. For example, when writing out XML dumps I have a check against rev_len (in bytes), in order to avoid the errors where rev id and rev content are out of sync that we have run into multiple times in the past. That code would need to be changed to count the characters, instead of the bytes, of the text being used for prefetch.
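Roughly, the check would change along these lines. This is only a Python sketch of the idea, not the actual dump code; the function name and the `unit` parameter are hypothetical, and it assumes UTF-8 text with characters counted as Unicode code points.

```python
def prefetch_matches(rev_len: int, prefetched_text: str, unit: str = "bytes") -> bool:
    """Return True if the prefetched text has the length recorded in rev_len.

    unit="bytes" models the current behaviour (rev_len is a byte count);
    unit="chars" models the proposed behaviour (rev_len is a character count).
    """
    if unit == "bytes":
        actual = len(prefetched_text.encode("utf-8"))
    elif unit == "chars":
        actual = len(prefetched_text)
    else:
        raise ValueError(f"unknown unit: {unit!r}")
    return actual == rev_len


text = "\u03b1\u03b2\u03b3"  # three Greek letters, two bytes each in UTF-8

# With rev_len in bytes, the stored value for this text would be 6.
print(prefetch_matches(6, text, unit="bytes"))
# After the change the stored value would be 3, and an unchanged
# byte-based check would wrongly reject the prefetched text.
print(prefetch_matches(3, text, unit="bytes"))
print(prefetch_matches(3, text, unit="chars"))
```

The point of the sketch is that a byte-based check left in place after the change would reject every revision containing non-ASCII text, which is why the rollout would need to be coordinated.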
Ariel