On Thu, Nov 20, 2014 at 10:59 AM, James Forrester jforrester@wikimedia.org wrote:
A paragraph-level diff means that you only get an edit conflict if two people change the same paragraph. A character-level diff would mean, then, that you only get a conflict if they change the same character? That sounds a bit excessive. (Stupid example: if I change "sixty-three" to "sixty-five" and someone else changes it to "seventy-three", that should probably be a conflict, but a character-level diff would happily merge them into "seventy-five".)
Sure, but wikitext "paragraphs" are significantly more extensive and diverse than the NLP concept; to give an example:
Original wikitext:
There are six [[alpaca]] shearers on [[Sunningdale Acers|the farm]].
My changes:
There are six [[*Alpaca fiber|*alpaca]] shearers on [[Sunningdale Acr*e*s|the farm]].
Their changes:
There are six [[alpaca]] shearers on [[Sunningdale Acers|the farm*stead*]].
Merging these two changes requires character-level merging (or something that natively understands wikitext at a subtle level). The first change would go through as a word-level diff (but not at sentence-level); the second wouldn't go through even then. Of course, we could prompt people to review the diff after saving if we're auto-merging, but that might be something we should be doing even now?
I don't think this is particularly unique to wikitext, but sure, a character-level (or even word-level) diff would often bring better results than the current algorithm. My point is that paragraph-based (and maybe even sentence-based) diffing makes unwanted results rare enough that it can just be applied without any oversight from the user, while the same definitely would not be true of the finer-grained algorithms. They could be applied with some sort of user review, or 3-way merge interface, and those would be cool features in general, but more complex than just tweaking the diff algorithm, I would think.
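To make the two examples above concrete, here is a minimal sketch of a 3-way merge that can run at character, word or paragraph granularity. It is not what MediaWiki actually does; it is an illustration built on Python's difflib, and the names merge3, edits and TOKENIZERS are made up for this mail. It also makes simplifying assumptions: "words" are whitespace-separated runs, "paragraphs" are separated by blank lines, and edits that merely touch each other are still merged.

```python
import re
from difflib import SequenceMatcher

# Hypothetical tokenizers; each one is lossless, so "".join(tokens) == text.
TOKENIZERS = {
    "char": list,
    "word": lambda text: re.findall(r"\S+|\s+", text),
    "paragraph": lambda text: re.split(r"(\n\s*\n)", text),
}


def edits(base, other):
    """Describe `other` as (start, end, replacement) edits against `base`."""
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    return [
        (i1, i2, other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def merge3(base_text, mine_text, theirs_text, level):
    """Return the merged text, or None if the two sides conflict."""
    tokenize = TOKENIZERS[level]
    base = tokenize(base_text)
    mine = edits(base, tokenize(mine_text))
    # An edit made identically on both sides is applied only once.
    theirs = [e for e in edits(base, tokenize(theirs_text)) if e not in mine]

    merged, pos, prev_start = [], 0, -1
    for start, end, replacement in sorted(mine + theirs, key=lambda e: e[:2]):
        # Conflict: this edit overlaps the previous one, or both sides
        # changed or inserted something at the very same position.
        if start < pos or start == prev_start:
            return None
        merged += base[pos:start] + replacement
        pos, prev_start = end, start
    merged += base[pos:]
    return "".join(merged)


if __name__ == "__main__":
    for level in ("char", "word", "paragraph"):
        print(level, merge3("sixty-three", "sixty-five", "seventy-three", level))
```

With this sketch, the character-level run on the number example merges silently into "seventy-five", while the word-level and paragraph-level runs report a conflict (None). Feeding it the alpaca sentence gives the opposite experience: the paragraph-level run conflicts, and the character-level run merges all three changes. That is the trade-off in a nutshell: the granularity that rescues the wikitext example is the same one that produces the nonsense number.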
...which made me wonder: are we logging enough information about edit conflicts that we could just replay them with an alternative algorithm and see how well it performs? None of the EventLogging schemas which look relevant (Edit [1], EditConflict [2], EditDebugging [3]) seem to store the text which could not be saved, and while EditDebugging saves the ids of both old revisions for a successful automatic merge, I'm not sure if those can be connected with the id of the new revision.
[1] https://meta.wikimedia.org/wiki/Schema:Edit
[2] https://meta.wikimedia.org/wiki/Schema:EditConflict
[3] https://meta.wikimedia.org/wiki/Schema:EditDebugging