On Thu, Jan 6, 2011 at 11:01 AM, Jay Ashworth <jra(a)baylink.com> wrote:
----- Original Message -----
From: "George Herbert" <george.herbert(a)gmail.com>
A text-based diff of XML sucks, but how about a DOM-based (structural) diff?
Sure, but how much more processor horsepower is that going to take? Scale is a driver in MediaWiki, for obvious reasons.
I suspect that diffs are relatively rare events in day-to-day WMF processing, though non-trivial.
Every single time you make an edit, unless I badly misunderstand the current architecture: that's how it's possible for multiple people editing the same article not to collide unless their edits actually collide at the paragraph level. Not to mention pulling old versions.
Can someone who knows the current code better than me confirm or deny?
There are a few separate issues mixed up here, I think.
First: diffs for viewing and the external diff3 merging for resolving edit
conflicts are actually unrelated code paths and use separate diff engines.
(Nor does diff3 get used at all unless there actually is a conflict to
resolve -- if nobody else edited since your change, it's not called.)
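The edit-conflict path is essentially a three-way merge in the spirit of diff3. A minimal paragraph-level sketch of the idea (hypothetical code, not MediaWiki's actual implementation; it assumes the paragraph lists align one-to-one, whereas real diff3 also handles insertions and deletions):

```python
def merge_paragraphs(base, mine, theirs):
    """Three-way merge at paragraph granularity: only paragraphs
    changed on *both* sides (differently) become conflicts."""
    merged, conflicts = [], []
    for b, m, t in zip(base, mine, theirs):
        if m == t or t == b:      # both made the same change, or only I changed it
            merged.append(m)
        elif m == b:              # only they changed it
            merged.append(t)
        else:                     # both changed the same paragraph differently
            merged.append(None)
            conflicts.append((b, m, t))
    return merged, conflicts
```

Two people editing different paragraphs of the same article merge cleanly; the conflict path is only taken when the same paragraph diverges on both sides.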
Second: the notion that diffing a structured document must inherently be
very slow is, I think, not right.
A well-structured document should be pretty diff-friendly actually; our
diffs are already working on two separate levels (paragraphs as a whole,
then words within matched paragraphs). In the most common cases, the diffing
might actually work pretty much the same -- look for nodes that match, then
move on to nodes that don't; within changed nodes, look for sub-nodes that
can be highlighted. Comparisons between nodes may be slower than straight
strings, but the basic algorithms don't need to be hugely different, and the
implementation can be in heavily-optimized C++ just like our text diffs are
today.
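The two-level scheme described above (paragraphs as wholes, then words within matched paragraphs) can be sketched in a few lines with Python's standard `difflib`; this is an illustration of the approach, not the production diff engine:

```python
import difflib

def two_level_diff(old_text, new_text):
    """Diff paragraphs as wholes first; descend to word level only
    within paragraph pairs that matched up but changed."""
    old_paras = old_text.split("\n\n")
    new_paras = new_text.split("\n\n")
    ops = []
    matcher = difflib.SequenceMatcher(a=old_paras, b=new_paras)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Matched-but-changed paragraphs: word-level pass.
            for old_p, new_p in zip(old_paras[i1:i2], new_paras[j1:j2]):
                ow, nw = old_p.split(), new_p.split()
                inner = difflib.SequenceMatcher(a=ow, b=nw)
                word_ops = [(t, ow[a1:a2], nw[b1:b2])
                            for t, a1, a2, b1, b2 in inner.get_opcodes()
                            if t != "equal"]
                ops.append(("changed", word_ops))
        else:
            # Whole-paragraph insert/delete/replace.
            ops.append((tag, old_paras[i1:i2], new_paras[j1:j2]))
    return ops
```

The structural version would run the same shape of algorithm over document nodes instead of split strings: match whole nodes first, then recurse only into the ones that differ.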
Third: the most common diff view cases are likely adjacent revisions of
recent edits, which smells like cache. :) Heck, these could be made once and
then simply *stored*, never needing to be recalculated again.
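Because revision text is immutable, a diff between two revision IDs never changes, so the compute-once-and-store idea needs no invalidation logic at all. A toy sketch (the store and `fetch_text` callback are hypothetical, standing in for memcached/parser-cache-style storage):

```python
import difflib

_diff_store = {}  # (old_rev_id, new_rev_id) -> rendered diff, computed once

def get_diff(old_rev_id, new_rev_id, fetch_text):
    """Return the diff between two revisions, computing it at most once.

    Revisions are immutable, so entries never need invalidating."""
    key = (old_rev_id, new_rev_id)
    if key not in _diff_store:
        diff = "\n".join(difflib.unified_diff(
            fetch_text(old_rev_id).splitlines(),
            fetch_text(new_rev_id).splitlines(),
            lineterm=""))
        _diff_store[key] = diff
    return _diff_store[key]
```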
Fourth: the notion that diffing structured documents would be overwhelming
for the entire Wikimedia infrastructure... even if we assume such diffs are
much slower, I think this is not really an issue compared to the huge CPU
savings that it could bring elsewhere.
The biggest user of CPU has long been parsing and re-parsing of wikitext.
Every time someone comes along with different view preferences, we have to
parse again. Every time a template or image changes, we have to parse again.
Every time there's an edit, we have to parse again. Every time something falls out of cache, we have to parse again.
And that parsing is *really expensive* on large, complex pages. Much of the
history of MediaWiki's parser development has been in figuring out how to
avoid parsing quite as much, or setting limits to keep the worst corner
cases from bringing down the server farm.
We parse *way*, *wayyyyy* more than we diff.
Part of what makes these things slow is that we have to do a lot of work
from scratch every time, and we have to do it in slow PHP code, and we have
to keep going back and fetching more stuff halfway through. Expanding
templates can change the document structure at the next parsing level, so
referenced files and templates have to be fetched or recalculated, often one
at a time because it's hard to batch up a list of everything we need at
once.
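One mitigation is to batch per expansion level: scan the current level for every template it references, fetch them in a single round trip, substitute, and repeat on whatever the expansion exposes. A hypothetical sketch (the `('template', name)` / `('text', s)` node shape and `fetch_many` callback are illustrative, not MediaWiki's actual structures):

```python
def expand_level(doc_nodes, fetch_many):
    """Expand one level of template references with a single batched fetch.

    doc_nodes: list of ('text', s) or ('template', name) tuples.
    fetch_many: callback taking a set of names, returning {name: body}.
    """
    names = {value for kind, value in doc_nodes if kind == "template"}
    bodies = fetch_many(names)  # one round trip instead of one per template
    return "".join(bodies[value] if kind == "template" else value
                   for kind, value in doc_nodes)
```

Expansion can still uncover new references at the next level, but each level costs one fetch rather than one per template.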
I think there would be some very valuable savings to using a document model
that can be stored in a machine-readable way up front. A data structure that
can be described as JSON or XML (for examples) allows leaving the low-level
"how do I turn a string into a structure" details to highly-tuned native C
code. A document model that is easily traversed and mapped to/from
hierarchical HTML allows code to process just the parts of the document it
needs at any given time, and would make it easier to share intermediate data
between variants if that's still needed.
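To make the point concrete: once the document is a machine-readable tree, `json.loads` (highly tuned native code) does the string-to-structure work, and rendering or partial processing becomes a simple traversal. A toy sketch, with an invented node shape (`type` plus `children`) purely for illustration:

```python
import json
from html import escape

def render(node):
    """Recursively render a JSON document-model node to HTML.

    Strings are text leaves; objects carry a tag name and child nodes."""
    if isinstance(node, str):
        return escape(node)
    children = "".join(render(child) for child in node.get("children", []))
    return f"<{node['type']}>{children}</{node['type']}>"

# The expensive "turn a string into a structure" step is native C code:
doc = json.loads('{"type": "p", "children": ["Hello & welcome"]}')
```

A consumer that only needs, say, the first paragraph can walk to that node and render it alone, instead of re-parsing the whole page.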
In some cases, work that is today done in the 'parser' could even be done by
client-side JavaScript (on supporting user-agents), moving little bits of
work from the server farm (where CPU time is vast but sharply limited) to
end-user browsers (where there's often a local surplus -- CPU's not doing
much while it's waiting on the network to transfer big JPEG images).
It may be easier to prototype a lot of this outside of MediaWiki, though, or
in specific areas such as media or interactive extensions, before we all go
trying to redo the full core.
-- brion