On Wed, Aug 4, 2010 at 12:17 AM, MZMcBride z@mzmcbride.com wrote:
This is problematic logic for a few reasons. I see a change to the rev_len logic as being similar to a change in article count logic. The same arguments work in both places, specifically the "step problem" that will cause nasty jumps in graphs.[1]
We'd presumably change the rev_len values that are already in the database, so the charts would just have to be regenerated from a new dump. Article count is different, because we don't store historical article counts in a way that would let us retroactively change how they're computed.
In some cases, as you've noted, we're talking about a change by a factor of three. Plenty of scripts rely on hard-coded values to determine size thresholds for certain behaviors. While these scripts may not have the best implementations, I don't think it's fair to say that they're worth breaking.
They wouldn't break, though. They'd just work a bit differently (the cutoff being somewhat lower than expected).
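To make the "work a bit differently" point concrete, here's a sketch in Python. The threshold and variable names are hypothetical, purely for illustration; no real bot is being quoted. It shows why a hard-coded byte cutoff becomes effectively ~3x lower for CJK text if rev_len switches to characters:

```python
# Hypothetical bot check that treats rev_len as a byte count.
STUB_THRESHOLD = 500  # hard-coded: "pages under 500 are stubs"

text = "日本語" * 100  # 300 characters of CJK text

byte_len = len(text.encode("utf-8"))  # 900 bytes (3 bytes per character)
char_len = len(text)                  # 300 characters

# Under byte semantics this page clears the threshold; under character
# semantics it does not. The script keeps running either way -- its
# cutoff is just lower than the author expected.
assert byte_len >= STUB_THRESHOLD
assert char_len < STUB_THRESHOLD
```

The script doesn't crash in either case, which is the point: the failure mode is a silently shifted threshold, not an error.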
The comparison to screen-scraping seems pretty spurious as well. The reason it's acceptable to break screen-scraping scripts is that there's a functioning API alternative that is designed for bots and scripts.
It's not acceptable to break screen-scraping bots, actually. Otherwise we'd probably just stop emitting well-formed XML. But we don't worry about them where we don't have specific reason to suspect there's an actual problem.
On Wed, Aug 4, 2010 at 1:29 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
For example, I have a check against rev_len (in bytes) when writing out XML dumps, in order to avoid the rev id / rev content out-of-sync errors that we have run into multiple times in the past. That code would need to be changed to count characters of the text being used for prefetch, instead of bytes.
That's an interesting use-case. Okay, so it looks like people are really relying on the current semantics, and we'd have to be careful changing them.
On Wed, Aug 4, 2010 at 1:38 AM, MZMcBride z@mzmcbride.com wrote:
Are character counts between programming languages generally consistent?
If you equate "character" to "code point in NFC", then yes.
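In Python terms (as a stand-in for PHP's mb_strlen(), since both count Unicode code points), "code point in NFC" means you normalize first and then count. The distinction matters for combining sequences:

```python
import unicodedata

# "café" spelled with a combining acute accent: 5 code points before
# normalization; NFC composes "e" + U+0301 into the single code point "é".
decomposed = "cafe\u0301"
nfc = unicodedata.normalize("NFC", decomposed)

assert len(decomposed) == 5  # raw code points
assert len(nfc) == 4         # code points after NFC
assert nfc == "café"
```

Any two languages that normalize to NFC before counting code points will agree on the result; languages that count raw code points (or UTF-16 code units) can disagree on strings like the one above.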
And is there a performance concern with counting characters vs. counting bytes?
No, not realistically. Counting characters is a lot slower, relatively speaking, but counting bytes is so ridiculously fast in absolute terms that the difference doesn't matter in practice for our purposes.

If you're dealing with so many articles that the strlen() calls add up to a lot in absolute terms, as dump processes might, you'll be bottlenecked by disk reads anyway, so it will make no difference. You can do the strlen() on the current pages while you wait for the next ones to be read off disk, and you lose no time even if the strlen() takes a hundred times longer. As noted in the other thread, I just found that our home-brewed mb_strlen() takes ~100,000 times as long as the native one for at least some sample input, and it's still a trivial amount of time when applied to things the length of actual articles.
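A rough way to see the "fast in absolute terms" claim, as a Python sketch: len() on a bytes object plays the role of strlen() (byte count), and decoding UTF-8 then taking len() plays the role of character counting. The exact figures will vary by machine; the point is only that even the slow path stays trivial at article-sized inputs:

```python
import timeit

# Roughly article-sized text (~20 KB) with some non-ASCII mixed in.
article = ("Some article text with non-ASCII: café, 日本語. " * 400).encode("utf-8")

# Byte count: O(1) here, analogous to a native strlen().
byte_time = timeit.timeit(lambda: len(article), number=1000)

# Character count: O(n) decode + count, analogous to mb_strlen().
char_time = timeit.timeit(lambda: len(article.decode("utf-8")), number=1000)

print(f"1000 byte counts:      {byte_time:.6f}s")
print(f"1000 character counts: {char_time:.6f}s")
```

The relative ratio is large, but a thousand character counts over article-sized text still completes in a small fraction of a second, far below the cost of reading those articles off disk in the first place.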