On Thu, Jan 6, 2011 at 11:38 AM, Brion Vibber <brion@pobox.com> wrote:
On Thu, Jan 6, 2011 at 11:01 AM, Jay Ashworth <jra@baylink.com> wrote:
From: "George Herbert" <george.herbert@gmail.com>
I suspect that diffs are relatively rare events in the day-to-day WMF processing, though non-trivial.
Every single time you make an edit, unless I badly misunderstand the current architecture: that's how it's possible for multiple people editing the same article not to collide unless their edits actually collide at the paragraph level.
Not to mention pulling old versions.
Can someone who knows the current code better than me confirm or deny?
There are a few separate issues mixed up here, I think.
First: diffs for viewing and the external diff3 merging for resolving edit conflicts are actually unrelated code paths and use separate diff engines. (Nor does diff3 get used at all unless there actually is a conflict to resolve -- if nobody else edited since your change, it's not called.)
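(For the curious, here's a minimal sketch of what that external three-way merge step looks like -- just illustrative Python around the stock diff3 tool, not the actual MediaWiki wiring; the temp-file handling and function name are made up:)

    # Sketch only: hand the common ancestor and the two competing texts
    # to the external diff3 tool and see whether it can merge them cleanly.
    import subprocess
    import tempfile
    import os

    def three_way_merge(mine_text, base_text, theirs_text):
        """Return (merged_text, clean); clean is False if diff3 left conflict markers."""
        paths = []
        try:
            # diff3 expects: MYFILE OLDFILE YOURFILE
            for text in (mine_text, base_text, theirs_text):
                f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt")
                f.write(text)
                f.close()
                paths.append(f.name)
            # -m prints a merged file; exit status 1 means conflicts remain.
            result = subprocess.run(["diff3", "-m"] + paths,
                                    capture_output=True, text=True)
            return result.stdout, result.returncode == 0
        finally:
            for p in paths:
                os.unlink(p)

And again, none of that runs unless someone else actually saved in between; the common case is just storing the new text.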
Second: the notion that diffing a structured document must inherently be very slow is, I think, not right.
A well-structured document should be pretty diff-friendly actually; our diffs are already working on two separate levels (paragraphs as a whole, then words within matched paragraphs). In the most common cases, the diffing might actually work pretty much the same -- look for nodes that match, then move on to nodes that don't; within changed nodes, look for sub-nodes that can be highlighted. Comparisons between nodes may be slower than straight strings, but the basic algorithms don't need to be hugely different, and the implementation can be in heavily-optimized C++ just like our text diffs are today.
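To make the two-level idea concrete, here's a rough Python sketch (plain difflib, nothing like the optimized C++ engine we actually run) of diffing paragraphs first and then words within the changed pairs; for a structured document you'd compare nodes instead of strings, but the shape is the same:

    # Sketch only: two-level diff -- match paragraphs first, then diff
    # words inside paragraphs that changed.
    from difflib import SequenceMatcher

    def two_level_diff(old_text, new_text):
        old_paras = old_text.split("\n\n")
        new_paras = new_text.split("\n\n")
        changes = []
        for op, a1, a2, b1, b2 in SequenceMatcher(
                None, old_paras, new_paras).get_opcodes():
            if op == "equal":
                continue
            if op == "replace" and a2 - a1 == b2 - b1:
                # Paragraphs matched up pairwise: descend to word level.
                for old_p, new_p in zip(old_paras[a1:a2], new_paras[b1:b2]):
                    word_ops = SequenceMatcher(
                        None, old_p.split(), new_p.split()).get_opcodes()
                    changes.append(("changed", old_p, new_p,
                                    [o for o in word_ops if o[0] != "equal"]))
            else:
                # Pure insertions/deletions of whole paragraphs.
                changes.append((op, old_paras[a1:a2], new_paras[b1:b2], None))
        return changes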
Third: the most common diff view cases are likely adjacent revisions of recent edits, which smells like cache. :) Heck, these could be made once and then simply *stored*, never needing to be recalculated again.
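Roughly what that caching could look like -- the cache interface and helper names here are invented for illustration, not existing MediaWiki code; the point is just that a diff between two fixed revision IDs never changes, so it keys perfectly:

    # Illustrative only: the rendered diff between two specific revisions is
    # immutable, so it can be keyed on the revision-id pair and kept until
    # evicted, with no invalidation logic needed.
    def get_diff_html(cache, render_diff, old_rev_id, new_rev_id):
        key = "diff:%d:%d" % (old_rev_id, new_rev_id)
        html = cache.get(key)
        if html is None:
            html = render_diff(old_rev_id, new_rev_id)   # the expensive part
            cache.set(key, html)                         # revisions never change
        return html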
Fourth: the notion that diffing structured documents would be overwhelming for the entire Wikimedia infrastructure... even if we assume such diffs are much slower, I think this is not really an issue compared to the huge CPU savings that it could bring elsewhere.
The biggest user of CPU has long been parsing and re-parsing of wikitext. Every time someone comes along with different view preferences, we have to parse again. Every time a template or image changes, we have to parse again. Every time there's an edit, we have to parse again. Every time something falls out of cache, we have to parse again.
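A rough sketch of the pattern (not the real ParserCache API) showing why each of those events costs a full re-parse -- the cached HTML is only good for one combination of rendering preferences, and only while nothing the page depends on has been touched:

    import time

    # Hypothetical names throughout; this just mirrors the caching pattern.
    def get_rendered_html(cache, parse, page, option_hash):
        key = "parsed:%s:%s" % (page.id, option_hash)    # different prefs -> different key
        entry = cache.get(key)
        if entry is not None and entry["rendered_at"] >= page.last_touched:
            return entry["html"]                          # cache hit: no parse needed
        # An edit, a template/image change (which bumps page.last_touched),
        # a preference combination we haven't seen, or plain eviction
        # all land here and cost a full, expensive parse.
        html = parse(page.wikitext, option_hash)
        cache.set(key, {"html": html, "rendered_at": time.time()})
        return html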
And that parsing is *really expensive* on large, complex pages. Much of the history of MediaWiki's parser development has been in figuring out how to avoid parsing quite as much, or setting limits to keep the worst corner cases from bringing down the server farm.
We parse *way*, *wayyyyy* more than we diff. [...]
Even if we diff on average 2-3x per edit, we're only doing order ten edits a second across the projects, right? Not going to dig up the current stats, but that's what I remember from last time I looked.
So: the priority remains parser cleanup and cleanup of the actually-used syntax, from a sanity point of view (being able to describe the syntax usefully, and in a way that allows multiple parsers to be written), with diff management a distant, low-impact priority...