On Wed, Aug 4, 2010 at 12:17 AM, MZMcBride <z@mzmcbride.com> wrote:
> This is problematic logic for a few reasons. I see a change to the
> rev_len logic as being similar to a change in article count logic. The
> same arguments work in both places, specifically the "step problem"
> that will cause nasty jumps in graphs.[1]
We'd presumably change the rev_len values that are already in the
database, so the charts would just have to be regenerated using a new
dump. Article count is different, because we don't store historical
article counts in a way that would let us retroactively change how
they're computed.
> In some cases, as you've noted, we're talking about a change by a
> factor of three. Plenty of scripts rely on hard-coded values to
> determine size thresholds for certain behaviors. While these scripts
> may not have the best implementations, I don't think it's fair to say
> that they're worth breaking.
They wouldn't break, though. They'd just work a bit differently (the
cutoff being somewhat lower than expected).
> The comparison to screen-scraping seems pretty spurious as well. The
> reason it's acceptable to break screen-scraping scripts is that
> there's a functioning API alternative that is designed for bots and
> scripts.
It's not acceptable to break screen-scraping bots, actually.
Otherwise we'd probably just stop emitting well-formed XML. But we
don't worry about them where we don't have specific reason to suspect
there's an actual problem.
On Wed, Aug 4, 2010 at 1:29 AM, Ariel T. Glenn <ariel@wikimedia.org> wrote:
> For example, I have a check against rev_len (in bytes) when writing
> out XML dumps, in order to avoid rev id and rev content out of sync
> errors that we have run into multiple times in the past. That code
> would need to be changed to count characters of the text being used
> for prefetch instead of bytes.
That's an interesting use-case. Okay, so it looks like people are
really relying on the current semantics, and we'd have to be careful
changing them.
On Wed, Aug 4, 2010 at 1:38 AM, MZMcBride <z@mzmcbride.com> wrote:
> Are character counts between programming languages generally
> consistent?
If you equate "character" to "code point in NFC", then yes.
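To illustrate why NFC matters here, a small example (Python, as an
illustration of the Unicode semantics rather than anything
MediaWiki-specific): the same visible text can be a different number of
code points depending on normalization form, but NFC pins down one
answer.

```python
# "café" written with a combining acute accent: 5 code points raw,
# 4 code points after NFC (the accent composes into a single "é"),
# and 5 UTF-8 bytes either way is not guaranteed in general -- here
# NFC "café" happens to be 5 bytes because "é" encodes as 2 bytes.
import unicodedata

text = "cafe\u0301"                      # e + combining acute accent
nfc = unicodedata.normalize("NFC", text)

print(len(text))                 # 5 code points before normalization
print(len(nfc))                  # 4 code points in NFC
print(len(nfc.encode("utf-8")))  # 5 UTF-8 bytes
```

Any language that normalizes to NFC and counts code points will agree
on the "character" count; languages that count UTF-16 code units (e.g.
JavaScript's `String.length`) or raw bytes will not.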
> And is there a performance concern with counting characters vs.
> counting bytes?
No, not realistically. Counting characters is a lot slower,
relatively speaking, but counting bytes is so ridiculously fast in
absolute terms that this makes no difference in a practical sense for
our purposes. If you're dealing with so many articles that the
strlen()s add up to a lot in absolute terms, as dump processes might,
you'll be bottlenecked by disk reads anyway, so it will make no
difference. You can do the strlen() on the current pages while you
wait for the next ones to be read off disk, and you lose no time even
if the strlen() takes a hundred times longer. As noted in the other
thread, I just found that our home-brewed mb_strlen() takes ~100,000
times as long as the native one for at least some sample input, and
it's still a trivial amount when applied to things the length of
actual articles.
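A rough way to see the relative cost (a Python sketch, not the PHP
strlen()/mb_strlen() comparison from the other thread): byte length of
a stored blob is O(1), while getting a code-point count means walking
the UTF-8, yet both are tiny next to a disk read of the same data.

```python
# Compare the cost of a byte count vs. a code-point count on an
# article-sized string. Absolute timings vary by machine; the point is
# that both are negligible relative to I/O.
import timeit

article = "Some wiki text with a few non-ASCII chars: àéîõü. " * 2000
raw = article.encode("utf-8")            # what a dump would hold on disk

byte_time = timeit.timeit(lambda: len(raw), number=1000)
char_time = timeit.timeit(lambda: len(raw.decode("utf-8")), number=1000)
print(f"bytes: {byte_time:.6f}s  chars: {char_time:.6f}s")
```

Decoding (the character count) is orders of magnitude slower than the
constant-time byte count, but still microseconds per article, which is
exactly the "trivial in absolute terms" argument above.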