On Thu, Jan 6, 2011 at 11:01 AM, Jay Ashworth <jra(a)baylink.com> wrote:
>> From: "George Herbert" <george.herbert(a)gmail.com>
>>
>> I suspect that diffs are relatively rare events in the day to day WMF
>> processing, though non-trivial.
>
> Every single time you make an edit, unless I badly misunderstand the
> current architecture; that's how it's possible for multiple people
> editing the same article not to collide unless their edits actually
> collide at the paragraph level.
>
> Not to mention pulling old versions.
>
> Can someone who knows the current code better than me confirm or deny?
There are a few separate issues mixed up here, I think.
First: diffs for viewing and the external diff3 merging for resolving edit
conflicts are actually unrelated code paths and use separate diff engines.
(Nor does diff3 get used at all unless there actually is a conflict to
resolve -- if nobody else edited since your change, it's not called.)
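The conflict gate described above can be sketched in a few lines. This is an illustrative model, not MediaWiki's actual API: the function and field names are invented, and the real code shells out to external diff3 only on the slow path.

```python
# Sketch of the edit-conflict gate: the three-way merge (MediaWiki uses
# external diff3 for this) is only invoked when someone else has saved
# a revision since the editor loaded the page. All names here are
# hypothetical, for illustration only.

def save_edit(page, base_rev_id, new_text):
    head = page["head_rev_id"]
    if head == base_rev_id:
        # Fast path: nobody edited in between, diff3 is never called.
        page["text"] = new_text
        page["head_rev_id"] += 1
        return "saved"
    # Slow path: an intervening edit exists; only now would we run a
    # three-way merge of (base text, their text, our text).
    return "merge-needed"

page = {"text": "old", "head_rev_id": 5}
print(save_edit(page, 5, "new text"))      # prints "saved"
print(save_edit(page, 5, "stale base"))    # prints "merge-needed"
```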
Second: the notion that diffing a structured document must inherently be
very slow is, I think, not right.
A well-structured document should actually be pretty diff-friendly; our
diffs already work on two separate levels (paragraphs as a whole, then
words within matched paragraphs). In the most common cases, the diffing
might work much the same -- look for nodes that match, then move on to
nodes that don't; within changed nodes, look for sub-nodes that can be
highlighted. Comparisons between nodes may be slower than straight string
comparisons, but the basic algorithms don't need to be hugely different,
and the implementation can be in heavily-optimized C++, just as our text
diffs are today.
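The two-level scheme above can be sketched with Python's stdlib difflib as a stand-in diff engine (the production code is different; this just shows the paragraph-then-word structure):

```python
import difflib

def two_level_diff(old, new):
    """Paragraph-level pass first, then a word-level pass inside
    paragraphs that changed -- the two levels described above.
    A toy sketch, not MediaWiki's actual diff engine."""
    old_paras, new_paras = old.split("\n\n"), new.split("\n\n")
    out = []
    paras = difflib.SequenceMatcher(a=old_paras, b=new_paras)
    for tag, i1, i2, j1, j2 in paras.get_opcodes():
        if tag == "equal":
            continue  # matched paragraphs: skip, nothing to highlight
        for a, b in zip(old_paras[i1:i2], new_paras[j1:j2]):
            # Second level: word-by-word within a changed paragraph.
            words = difflib.SequenceMatcher(a=a.split(), b=b.split())
            changed = [op for op in words.get_opcodes() if op[0] != "equal"]
            out.append((tag, changed))
    return out
```

The same shape carries over to a structured document: match nodes at the top level, then recurse into the sub-nodes of the ones that differ.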
Third: the most common diff view cases are likely adjacent revisions of
recent edits, which smells like cache. :) Heck, these could be made once and
then simply *stored*, never needing to be recalculated again.
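That "compute once, store forever" property holds because a diff between two fixed revisions is immutable: the cache key (old revision id, new revision id) never needs invalidation. A toy memoized version, with an invented revision store and a deliberately naive word-compare standing in for the real diff:

```python
import functools

REVISIONS = {1: "a b c", 2: "a B c"}  # illustrative revision store
calls = []  # tracks how many times the diff is actually computed

@functools.lru_cache(maxsize=None)
def cached_diff(old_id, new_id):
    # Revisions are immutable, so (old_id, new_id) fully determines
    # the result: no invalidation logic is ever needed.
    calls.append((old_id, new_id))
    old, new = REVISIONS[old_id].split(), REVISIONS[new_id].split()
    return tuple((i, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b)

cached_diff(1, 2)
cached_diff(1, 2)  # second call is a cache hit; no recomputation
```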
Fourth: the notion that diffing structured documents would be overwhelming
for the entire Wikimedia infrastructure... even if we assume such diffs are
much slower, I think this is not really an issue compared to the huge CPU
savings that it could bring elsewhere.
The biggest user of CPU has long been parsing and re-parsing of wikitext.
Every time someone comes along with different view preferences, we have to
parse again. Every time a template or image changes, we have to parse again.
Every time there's an edit, we have to parse again. Every time something
fell out of cache, we have to parse again.
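The re-parse triggers listed above amount to a parser cache keyed on the page, the viewer's preferences, and a timestamp that bumps whenever the page or anything it transcludes changes. A minimal sketch, with field names that are illustrative rather than MediaWiki's actual cache schema:

```python
# Hypothetical parser-cache sketch: any change to prefs or to the
# page's "touched" timestamp (edit, template change, image change)
# produces a new key, forcing a re-parse; eviction does the same.

def parser_cache_key(page_id, prefs, page_touched):
    return (page_id, tuple(sorted(prefs.items())), page_touched)

cache = {}

def render(page_id, prefs, page_touched, parse):
    key = parser_cache_key(page_id, prefs, page_touched)
    if key not in cache:
        cache[key] = parse()  # the expensive step we try to avoid
    return cache[key]
```

Every distinct preference set and every "touched" bump is a miss, which is why parse load dwarfs everything else.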
And that parsing is *really expensive* on large, complex pages. Much of the
history of MediaWiki's parser development has been in figuring out how to
avoid parsing quite as much, or setting limits to keep the worst corner
cases from bringing down the server farm.
We parse *way*, *wayyyyy* more than we diff.
[...]
Even if we diff on average 2-3x per edit, we're only doing on the order
of ten edits a second across the projects, right? I'm not going to dig up
the current stats, but that's what I remember from the last time I looked.
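The back-of-envelope arithmetic behind that claim, using the figures quoted above (both are rough recollections, not measured stats):

```python
# ~10 edits/sec project-wide, 2-3 diffs per edit: the total diff
# rate stays tiny compared to a parse that fires on every edit,
# preference variant, template change, and cache eviction.
edits_per_sec = 10
diffs_per_edit = (2, 3)
diff_rate = tuple(edits_per_sec * d for d in diffs_per_edit)
print(diff_rate)  # prints (20, 30) -- roughly 20-30 diffs/sec
```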
So: the priority remains cleanup of the parser and of the syntax actually
in use, from a sanity point of view (being able to describe the syntax
usefully, and in a way that allows multiple parsers to be written), with
diff management as a distant, low-impact priority...
--
-george william herbert
george.herbert(a)gmail.com