On Thu, Jan 6, 2011 at 11:01 AM, Jay Ashworth <jra@baylink.com> wrote:
> ----- Original Message -----
>> From: "George Herbert" <george.herbert@gmail.com>
>>>> A text-based diff of XML sucks, but how about a DOM based (structural) diff?
>>> Sure, but how much more processor horsepower is that going to take.
>>> Scale is a driver in Mediawiki, for obvious reasons.
>> I suspect that diffs are relatively rare events in the day to day WMF processing, though non-trivial.
> Every single time you make an edit, unless I badly misunderstand the current architecture; that's how it's possible for multiple people editing the same article not to collide unless their edits actually collide at the paragraph level.
> Not to mention pulling old versions.
> Can someone who knows the current code better than me confirm or deny?
There are a few separate issues mixed up here, I think.
First: diffs for viewing and the external diff3 merging for resolving edit conflicts are actually unrelated code paths and use separate diff engines. (Nor does diff3 get used at all unless there actually is a conflict to resolve -- if nobody else edited since your change, it's not called.)
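To make that concrete, here's a rough sketch of the save path as described (in Python, with made-up names for the revision texts; this isn't MediaWiki's actual code, but the real thing shells out to GNU diff3 in much the same spirit): the external merge only runs when somebody else actually edited since the revision you started from, and a clean merge means the two edits touched different parts of the page.

# Sketch of the edit-save path (hypothetical names throughout); the external
# merge tool is only invoked when there is actually something to reconcile.
import subprocess
import tempfile


def save_edit(base_rev_text, latest_rev_text, submitted_text):
    """Return the text to store, merging only if someone edited in between."""
    if latest_rev_text == base_rev_text:
        # Nobody else edited since this user loaded the page: no merge needed.
        return submitted_text

    # Someone else edited in the meantime: attempt a three-way merge with
    # GNU diff3 (diff3 -m MYFILE OLDFILE YOURFILE writes the merged result).
    with tempfile.NamedTemporaryFile("w", suffix=".mine") as mine, \
         tempfile.NamedTemporaryFile("w", suffix=".base") as base, \
         tempfile.NamedTemporaryFile("w", suffix=".theirs") as theirs:
        mine.write(submitted_text)
        mine.flush()
        base.write(base_rev_text)
        base.flush()
        theirs.write(latest_rev_text)
        theirs.flush()
        result = subprocess.run(
            ["diff3", "-m", mine.name, base.name, theirs.name],
            capture_output=True, text=True)

    if result.returncode == 0:
        return result.stdout  # clean merge: edits touched different paragraphs
    raise RuntimeError("edit conflict: overlapping changes need manual resolution")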
Second: the notion that diffing a structured document must inherently be very slow is, I think, not right.
A well-structured document should be pretty diff-friendly actually; our diffs are already working on two separate levels (paragraphs as a whole, then words within matched paragraphs). In the most common cases, the diffing might actually work pretty much the same -- look for nodes that match, then move on to nodes that don't; within changed nodes, look for sub-nodes that can be highlighted. Comparisons between nodes may be slower than straight strings, but the basic algorithms don't need to be hugely different, and the implementation can be in heavily-optimized C++ just like our text diffs are today.
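As a rough illustration of that two-level idea (nothing to do with the real diff engine's internals), here's a sketch using Python's difflib: match paragraphs first, then re-compare only the changed ones word by word. A node-tree version has the same shape -- match siblings at one level, then recurse into the ones that differ.

# Sketch of a two-level diff: whole paragraphs first, then words within the
# paragraphs that changed.  Illustrative only, not MediaWiki's diff engine.
import difflib


def two_level_diff(old_text, new_text):
    old_paras = old_text.split("\n\n")
    new_paras = new_text.split("\n\n")
    changes = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(
            a=old_paras, b=new_paras).get_opcodes():
        if op == "equal":
            continue                                  # matched paragraphs: no more work
        if op == "replace":
            for old_p, new_p in zip(old_paras[i1:i2], new_paras[j1:j2]):
                word_ops = difflib.SequenceMatcher(
                    a=old_p.split(), b=new_p.split()).get_opcodes()
                changes.append(("changed", old_p, new_p, word_ops))
        elif op == "delete":
            changes += [("removed", p, None, None) for p in old_paras[i1:i2]]
        elif op == "insert":
            changes += [("added", None, p, None) for p in new_paras[j1:j2]]
    return changes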
Third: the most common diff view cases are likely adjacent revisions of recent edits, which smells like cache. :) Heck, these could be made once and then simply *stored*, never needing to be recalculated again.
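Something like this, conceptually -- old revisions never change, so a rendered diff keyed on the revision pair never has to be invalidated (names here are made up, and the store would be memcached or a table rather than a dict):

# Hypothetical sketch: compute a rendered diff once per (old, new) revision
# pair and keep it; revisions are immutable, so the entry never goes stale.
_diff_store = {}                       # stand-in for memcached / a DB table


def get_rendered_diff(old_rev_id, new_rev_id, render_diff):
    key = (old_rev_id, new_rev_id)
    if key not in _diff_store:
        _diff_store[key] = render_diff(old_rev_id, new_rev_id)  # the expensive part
    return _diff_store[key]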
Fourth: the notion that diffing structured documents would be overwhelming for the entire Wikimedia infrastructure... even if we assume such diffs are much slower, I think this is not really an issue compared to the huge CPU savings that a structured document model could bring elsewhere.
The biggest user of CPU has long been parsing and re-parsing of wikitext. Every time someone comes along with different view preferences, we have to parse again. Every time a template or image changes, we have to parse again. Every time there's an edit, we have to parse again. Every time something falls out of cache, we have to parse again.
And that parsing is *really expensive* on large, complex pages. Much of the history of MediaWiki's parser development has been in figuring out how to avoid parsing quite as much, or setting limits to keep the worst corner cases from bringing down the server farm.
We parse *way*, *wayyyyy* more than we diff.
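One way to picture the "different view preferences" problem: the cached parser output is only valid for one particular combination of options, so the cache key has to carry them, and an edit or template change bumps the page's touched timestamp and invalidates the lot. A made-up sketch (the real parser cache differs in detail):

# Hypothetical sketch of a parser cache whose key includes the options that
# affect rendering -- different preferences mean different cache entries, and
# any edit or template change (via the touched timestamp) invalidates them all.
_parser_cache = {}


def get_parsed(page, options, parse):
    key = (page.id, page.touched, options.language, options.thumb_size,
           options.stub_threshold)
    if key not in _parser_cache:
        _parser_cache[key] = parse(page.text, options)   # the expensive step
    return _parser_cache[key]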
Part of what makes these things slow is that we have to do a lot of work from scratch every time, and we have to do it in slow PHP code, and we have to keep going back and fetching more stuff halfway through. Expanding templates can change the document structure at the next parsing level, so referenced files and templates have to be fetched or recalculated, often one at a time because it's hard to batch up a list of everything we need at once.
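Roughly the difference between these two shapes (all names made up), where today's code tends toward the first because each level of expansion only discovers what it needs as it goes:

# Hypothetical sketch: fetching template dependencies one at a time as they're
# discovered, versus batching one fetch per level of expansion.

def expand_one_at_a_time(doc, fetch_template):
    for name in doc.find_template_calls():         # discovered during parsing
        doc.expand(name, fetch_template(name))      # one round-trip per template


def expand_batched_by_level(doc, fetch_templates):
    pending = set(doc.find_template_calls())
    while pending:
        fetched = fetch_templates(pending)          # one round-trip per level
        for name, template in fetched.items():
            doc.expand(name, template)
        # expansion can reveal new template calls at the next level down
        pending = set(doc.find_template_calls()) - set(fetched)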
I think there would be some very valuable savings to using a document model that can be stored in a machine-readable way up front. A data structure that can be described as JSON or XML (for examples) allows leaving the low-level "how do I turn a string into a structure" details to highly-tuned native C code. A document model that is easily traversed and mapped to/from hierarchical HTML allows code to process just the parts of the document it needs at any given time, and would make it easier to share intermediate data between variants if that's still needed.
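As a toy example of what "machine-readable up front" buys (the node shape here is invented purely for illustration, not a proposed format): the structural parse is handled by json.loads in tuned native code, and from there code can walk just the nodes it cares about, or map the tree straight onto HTML.

# Toy document model: a tree of {"type": ..., "children": [...]} nodes stored
# as JSON.  The shape is invented for illustration only.
import json

stored = '''{"type": "document", "children": [
  {"type": "heading", "level": 2, "children": [{"type": "text", "value": "History"}]},
  {"type": "paragraph", "children": [{"type": "text", "value": "Founded in 1903."}]},
  {"type": "template", "name": "Infobox", "children": []}]}'''

doc = json.loads(stored)          # structural parse handled by tuned native code


def find_nodes(node, node_type):
    """Walk only as much of the tree as needed for one kind of node."""
    if node.get("type") == node_type:
        yield node
    for child in node.get("children", []):
        yield from find_nodes(child, node_type)


def to_html(node):
    """Map the tree straight onto hierarchical HTML (no escaping: toy code)."""
    kind = node.get("type")
    inner = "".join(to_html(c) for c in node.get("children", []))
    if kind == "text":
        return node["value"]
    if kind == "heading":
        return "<h%d>%s</h%d>" % (node["level"], inner, node["level"])
    if kind == "paragraph":
        return "<p>%s</p>" % inner
    return inner                                  # templates etc. handled elsewhere


# e.g. list just the template calls without touching anything else:
templates = [n["name"] for n in find_nodes(doc, "template")]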
In some cases, work that is today done in the 'parser' could even be done by client-side JavaScript (on supporting user-agents), moving little bits of work from the server farm (where CPU time is vast but sharply limited) to end-user browsers (where there's often a local surplus -- CPU's not doing much while it's waiting on the network to transfer big JPEG images).
It may be easier to prototype a lot of this outside of MediaWiki, though, or in specific areas such as media or interactive extensions, before we all go trying to redo the full core.
-- brion