On Wed, 04-08-2010, at 04:17 +0000, MZMcBride wrote:
Aryeh Gregor wrote:
The same could be said of practically any user-visible change. I
mean, maybe if we add a new special page we'll break some script that
was screen-scraping Special:SpecialPages. We can either freeze
MediaWiki and never change anything for fear that we'll break
something, or we can evaluate each potential change on the basis of
how likely it is to break anything. I can't see anything breaking too
badly if rev_len is reported in characters instead of bytes -- the
only place it's likely to be useful is in heuristics, and by their
nature, those won't break too badly if the numbers they're based on
change somewhat.
This is problematic logic for a few reasons. I see a change to the rev_len logic
as being similar to a change in article count logic. The same arguments work in
both places, specifically the "step problem" that will cause nasty jumps in
graphs.[1]
In some cases, as you've noted, we're talking about a change by a factor of
three. Plenty of scripts rely on hard-coded values to determine size thresholds
for certain behaviors. While these scripts may not have the best
implementations, I don't think it's fair to say that they're worth breaking.
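To make the factor-of-three point concrete, here is a minimal Python sketch comparing the UTF-8 byte length of a string with its character count. It is not MediaWiki code: SIZE_THRESHOLD and is_large are hypothetical names standing in for the kind of hard-coded size check a script might contain, and the sample strings are made up for illustration.

```python
# Illustrative only: not MediaWiki code. SIZE_THRESHOLD and is_large are
# hypothetical stand-ins for a script's hard-coded size check.

SIZE_THRESHOLD = 8  # hypothetical cutoff a script might hard-code


def byte_len(text: str) -> int:
    """Length in bytes of the UTF-8 encoding of the text."""
    return len(text.encode("utf-8"))


def char_len(text: str) -> int:
    """Length in Unicode code points (the proposed character count)."""
    return len(text)


def is_large(length: int) -> bool:
    """Apply the hard-coded threshold to whatever length it is given."""
    return length >= SIZE_THRESHOLD


# ASCII text is 1 byte per character, Greek is 2, CJK is typically 3.
for sample in ("wiki", "βικι", "日本語"):
    b, c = byte_len(sample), char_len(sample)
    print(sample, "bytes:", b, "chars:", c,
          "large by bytes:", is_large(b), "large by chars:", is_large(c))
```

For ASCII text the two lengths agree, so English-only scripts would see no change. For the Greek sample the same threshold gives a different answer depending on the unit, and for the CJK sample the byte length is three times the character count.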
The comparison to screen-scraping seems pretty spurious as well. The reason it's
acceptable to break screen-scraping scripts is that there's a functioning API
alternative that is designed for bots and scripts. One of the design principles
is consistency. Altering a metric by up to a factor of three (and even worse,
doing so in an unpredictable manner) breaks this consistency needlessly.
Is it worth the cost to add 300 million+ rows to easily have character count? I
don't know. Personally, I don't mind rev_len being in bytes; it makes more sense
from a database and technical perspective to me. Admittedly, though, I deal
mostly with English sites.
I'm all for the change, but it would have to be announced well in
advance of rollout and coordinated with other folks. For example, I
have a check against rev_len (in bytes) when writing out XML dumps, in
order to avoid errors where rev id and rev content are out of sync, which
we have run into multiple times in the past. That code would need to be changed
to count characters of the text being used for prefetch instead of
bytes.
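
As a rough illustration of the kind of check described above, here is a minimal Python sketch. The real dump code is not shown in this thread, so everything here is an assumption: the function names text_length and prefetch_is_consistent, the unit parameter, and the idea that the check is a simple equality test against the stored length.

```python
# Hypothetical sketch; the actual dump code is not shown in this thread.
# All names and the exact comparison are assumptions for illustration.


def text_length(text: str, unit: str) -> int:
    """Measure text in the same unit that rev_len is stored in."""
    if unit == "bytes":
        return len(text.encode("utf-8"))
    if unit == "chars":
        return len(text)
    raise ValueError("unit must be 'bytes' or 'chars'")


def prefetch_is_consistent(prefetched_text: str, rev_len: int,
                           unit: str = "bytes") -> bool:
    """Return True if the prefetched text matches the stored length.

    A mismatch suggests the rev id and the text have drifted out of
    sync, so the caller should fetch the text again instead of reusing it.
    """
    return text_length(prefetched_text, unit) == rev_len


text = "日本語"  # 3 characters, 9 bytes in UTF-8

# Stored length in bytes, checked in bytes: consistent.
print(prefetch_is_consistent(text, 9, unit="bytes"))

# Stored length switched to characters, check still counting bytes:
# a correct revision would be rejected.
print(prefetch_is_consistent(text, 3, unit="bytes"))

# Both switched to characters: consistent again.
print(prefetch_is_consistent(text, 3, unit="chars"))
```

The point of the sketch is that the stored unit and the checking code have to change together; if only one side changes, every revision containing non-ASCII text would fail the check.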
Ariel