On Mon, Jan 17, 2011 at 10:40 AM, Alex Brollo alex.brollo@gmail.com wrote:
2011/1/17 Bryan Tong Minh bryan.tongminh@gmail.com
Difficult, but doable. Jan-Paul's sentence-level editing tool is able to make the distinction. It would perhaps be possible to use that as a framework for sentence-level diffs.
Difficult, but the diff between versions of a page already does it. Looking at the diffs, I had simply assumed that only the changed paragraphs were stored, so that the page was rebuilt from updated diff segments. I had no idea how this could be done; it all seemed like "magic"!
Paragraphs are much easier to recognize than sentences, as wikitext has a paragraph delimiter - a blank line. To truly recognize sentences, you basically have to engage in natural language processing, though you can probably get it right 90% of the time without too much effort.
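To make the contrast concrete, here is a rough sketch in Python. It is not how MediaWiki or any existing tool does it; the function names are made up for illustration, and the sentence rule (break after ".", "!" or "?" when followed by whitespace and a capital letter) is just one naive heuristic I am assuming.

```python
import re


def split_paragraphs(wikitext):
    """Split wikitext into paragraphs on one or more blank lines."""
    parts = re.split(r"\n\s*\n", wikitext.strip())
    return [p.strip() for p in parts if p.strip()]


def split_sentences_naive(paragraph):
    """Naive heuristic (an assumption, not real NLP): break after
    '.', '!' or '?' when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)


if __name__ == "__main__":
    text = "First para.\nStill the first.\n\nSecond para."
    print(split_paragraphs(text))
    # The abbreviation "Dr." is wrongly treated as a sentence end.
    print(split_sentences_naive("It works. Dr. Smith agrees. Done!"))
```

The paragraph splitter is reliable because the delimiter is unambiguous. The sentence splitter handles ordinary prose but breaks "Dr. Smith agrees." into two pieces, which is the kind of error that takes real language processing to avoid.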
And recognizing what's going on when a sentence changes *and* is moved from one paragraph to another requires an even greater level of natural language understanding. Again though, you can probably get it right most of the time without too much effort.
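One cheap way to get it right "most of the time" is plain string similarity rather than any understanding of the language. The sketch below uses Python's standard difflib for that. The function name and the 0.6 threshold are my own assumptions, not taken from any existing diff tool.

```python
from difflib import SequenceMatcher


def best_match(sentence, candidates, threshold=0.6):
    """Return (index, ratio) of the candidate most similar to `sentence`,
    or None if nothing reaches `threshold` (an arbitrary cut-off)."""
    best_index, best_ratio = None, 0.0
    for i, candidate in enumerate(candidates):
        ratio = SequenceMatcher(None, sentence, candidate).ratio()
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    if best_index is None or best_ratio < threshold:
        return None
    return best_index, best_ratio


if __name__ == "__main__":
    old_sentence = "The cat sat on the mat."
    # Sentences of the new revision, gathered from all paragraphs.
    new_sentences = [
        "Something unrelated entirely here.",
        "The cat sat on the red mat.",
    ]
    print(best_match(old_sentence, new_sentences))
```

Here the old sentence is matched to the second new sentence even though it was edited. A heavily rewritten sentence would fall below the threshold and show up as a deletion plus an insertion, which is where a human would still do better.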
Wikitext actually makes it easier for the most part, as you can use tricks such as the fact that the periods in [[I.M. Someone]] don't represent sentence delimiters, since they are contained in square brackets. But not all periods which occur in the middle of a sentence are contained in square brackets, and not all sentences end with a period.
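The square-bracket trick can be sketched as a small scanner that keeps track of whether it is inside a [[...]] link and only treats ".", "!" or "?" as a sentence end outside of one. This is an illustration under my own assumptions (the function name is invented, and a delimiter only counts when followed by whitespace or the end of the text), not code from any existing parser.

```python
def split_sentences_wikitext(text):
    """Split on '.', '!' or '?' followed by whitespace or end of text,
    ignoring any such characters inside [[...]] links."""
    sentences = []
    depth = 0   # nesting level of [[ ... ]]
    start = 0   # index where the current sentence begins
    i = 0
    n = len(text)
    while i < n:
        if text.startswith("[[", i):
            depth += 1
            i += 2
            continue
        if depth > 0 and text.startswith("]]", i):
            depth -= 1
            i += 2
            continue
        at_end = i + 1 == n
        if depth == 0 and text[i] in ".!?" and (at_end or text[i + 1].isspace()):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
        i += 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    print(split_sentences_wikitext("He met [[I.M. Someone]] today. It rained."))
    # Without the brackets, the initials are split in the wrong place.
    print(split_sentences_wikitext("He met I.M. Someone today."))
```

The first call keeps "He met [[I.M. Someone]] today." in one piece. The second call, with the same name outside a link, is cut after "I.M.", which is the case the brackets cannot help with.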
I'd say "difficult but doable" is quite accurate, with the caveat that even the state-of-the-art tools available today are probably going to make mistakes that would be obvious to a human. I'm sure there are tools for this, and there are probably some decent ones that are open source. But it's not as simple as just adding an index.