On Thu, Nov 20, 2014 at 10:59 AM, James Forrester <jforrester(a)wikimedia.org>
wrote:
A paragraph-level diff means that you only get an edit conflict if two
people change the same paragraph. A character-level diff would mean, then,
that you only get a conflict if they change the same character? That sounds
a bit excessive. (Stupid example: if I change "sixty-three" to "sixty-five"
and someone else changes it to "seventy-three", that should probably be a
conflict, but a character-level diff would happily merge them into
"seventy-five".)
Sure, but wikitext "paragraphs" are significantly more extensive and
diverse than the NLP concept; to give an example:
Original wikitext:
There are six [[alpaca]] shearers on [[Sunningdale Acers|the farm]].
My changes:
There are six [[*Alpaca fiber|*alpaca]] shearers on [[Sunningdale Acr*e*s|the farm]].
Their changes:
There are six [[alpaca]] shearers on [[Sunningdale Acers|the farm*stead*]].
Merging these two changes requires character-level merging (or something
that natively understands wikitext at a subtle level). The first change would
go through as a word-level diff (but not at sentence-level); the second
wouldn't go through even then. Of course, we could prompt people to review
the diff after saving if we're auto-merging, but that might be something we
should be doing even now?
I don't think this is particularly unique to wikitext, but sure, a
character-level (or even word-level) diff would often bring better results
than the current algorithm. My point is that paragraph-based (and maybe
even sentence-based) diffing makes unwanted results rare enough that it can
just be applied without any oversight from the user, while the same
definitely would not be true of the finer-grained algorithms. They could be
applied with some sort of user review, or 3-way merge interface, and those
would be cool features in general, but more complex than just tweaking the
diff algorithm, I would think.
...which made me wonder: are we logging enough information about edit
conflicts that we could just replay them with an alternative algorithm and
see how well it performs? None of the EventLogging schemas which look
relevant (Edit [1], EditConflict [2], EditDebugging [3]) seem to store the
text which could not be saved, and while EditDebugging saves the ids of
both old revisions for a successful automatic merge, I'm not sure if those
can be connected with the id of the new revision.
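
For what it's worth, here is a rough sketch of what such a replay could look
like if the texts were available, using only Python's difflib. All of it is
made up for illustration: merge3, the tokenizers, replay and the sample
triples are hypothetical names, not MediaWiki's actual merge code, and the
rule that edits which overlap or merely touch count as a conflict is my own
assumption. The samples are the two examples from this thread.

```python
import difflib
import re


def edits(base, other):
    """Non-equal opcodes as (start, end, replacement) over base tokens."""
    sm = difflib.SequenceMatcher(None, base, other, autojunk=False)
    return [(i1, i2, other[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]


def merge3(base, mine, theirs, tokenize):
    """Return the merged text, or None if the two sides conflict."""
    b, m, t = tokenize(base), tokenize(mine), tokenize(theirs)
    em, et = edits(b, m), edits(b, t)
    # Assumption: edits that overlap or merely touch count as a conflict.
    if any(s1 <= e2 and s2 <= e1 for s1, e1, _ in em for s2, e2, _ in et):
        return None
    out = list(b)
    # Apply right to left so that earlier indices stay valid.
    for start, end, repl in sorted(em + et, key=lambda e: e[0], reverse=True):
        out[start:end] = repl
    return "".join(out)


def by_char(s):
    return list(s)


def by_word(s):
    # Each word keeps its trailing whitespace so that joining is lossless.
    return re.findall(r"\S+\s*|\s+", s)


def by_paragraph(s):
    # The samples are single paragraphs, so the whole text is one token.
    return [s]


def replay(triples, tokenize):
    """Count how many (base, mine, theirs) triples merge without conflict."""
    return sum(merge3(b, m, t, tokenize) is not None for b, m, t in triples)


# Hypothetical "logged" conflicts: the two examples from this thread.
samples = [
    ("sixty-three", "sixty-five", "seventy-three"),
    ("There are six [[alpaca]] shearers on [[Sunningdale Acers|the farm]].",
     "There are six [[Alpaca fiber|alpaca]] shearers on "
     "[[Sunningdale Acres|the farm]].",
     "There are six [[alpaca]] shearers on "
     "[[Sunningdale Acers|the farmstead]]."),
]

for name, tok in [("paragraph", by_paragraph), ("word", by_word),
                  ("char", by_char)]:
    print(name, replay(samples, tok))
```

With these two samples the paragraph and word tokenizers merge nothing, while
the character tokenizer merges both, including the unwanted "seventy-five".
Counting clean merges says nothing about whether a merge was wanted, so a
real replay would also need some way to judge the merged result.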
[1] https://meta.wikimedia.org/wiki/Schema:Edit
[2] https://meta.wikimedia.org/wiki/Schema:EditConflict
[3] https://meta.wikimedia.org/wiki/Schema:EditDebugging