On Mon, Jan 17, 2011 at 10:40 AM, Alex Brollo <alex.brollo(a)gmail.com> wrote:
> 2011/1/17 Bryan Tong Minh <bryan.tongminh(a)gmail.com>
>> Difficult, but doable. Jan-Paul's sentence-level editing tool is able
>> to make the distinction. It would perhaps be possible to use that as a
>> framework for sentence-level diffs.
>
> Difficult, but diff between versions of a page does it. Looking at diff
> between pages, I simply thought firmly that only diff paragraphs were
> stored, so that the page was built as updated diff segments. I had no idea
> how this could be done, but all was "magic"!
Paragraphs are much easier to recognize than sentences, as wikitext
has a paragraph delimiter - a blank line. To truly recognize
sentences, you basically have to engage in natural language
processing, though you can probably get it right 90% of the time
without too much effort.
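
To make the contrast concrete, here is a rough sketch in Python. The function names are made up for illustration and this is not how MediaWiki does anything; it only shows that the paragraph rule is one regex, while the "90%" sentence rule is a heuristic that breaks on ordinary abbreviations.

```python
import re

def split_paragraphs(wikitext):
    # Paragraph rule: a blank line (possibly containing only whitespace)
    # separates paragraphs.
    parts = re.split(r"\n\s*\n", wikitext)
    return [p.strip() for p in parts if p.strip()]

def naive_sentences(paragraph):
    # Sentence heuristic (an assumption, not a real rule): split after
    # '.', '!' or '?' when followed by whitespace and a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)

text = "First para.\nStill first.\n\nSecond para."
print(split_paragraphs(text))
# ['First para.\nStill first.', 'Second para.']

print(naive_sentences("It was late. Dr. Smith left. He was tired."))
# ['It was late.', 'Dr.', 'Smith left.', 'He was tired.']
```

The second result shows the kind of mistake that is obvious to a human: "Dr." is treated as a sentence of its own.
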
And recognizing what's going on when a sentence changes *and* is
moved from one paragraph to another requires an even greater level of
natural language understanding. Again though, you can probably get it
right most of the time without too much effort.
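
A cheap approximation of the "most of the time" case, again only as a sketch: pair each new sentence with its most similar old sentence, ignoring position, using difflib from the Python standard library. The function name and the 0.6 threshold are arbitrary choices of mine, and string similarity is of course not language understanding.

```python
from difflib import SequenceMatcher

def match_sentences(old, new, threshold=0.6):
    """Pair each new sentence with its most similar old sentence.

    Position is ignored, so a sentence that was both edited and moved
    can still be paired. Returns (new_sentence, old_sentence_or_None).
    """
    pairs = []
    for n in new:
        best, best_score = None, 0.0
        for o in old:
            score = SequenceMatcher(None, o, n).ratio()
            if score > best_score:
                best, best_score = o, score
        # Below the threshold we treat the sentence as newly written.
        pairs.append((n, best if best_score >= threshold else None))
    return pairs

old = ["The cat sat on the mat.", "Dogs bark loudly."]
new = ["Dogs bark loudly.", "The cat sat on a mat."]
for new_sentence, old_sentence in match_sentences(old, new):
    print(repr(new_sentence), "<-", repr(old_sentence))
```

This compares every new sentence against every old one, so it is quadratic in the number of sentences, and a heavily rewritten sentence will fall under the threshold and look like a brand new one.
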
Wikitext actually makes it easier for the most part, as you can use
tricks such as the fact that the periods in [[I.M. Someone]] don't
represent sentence delimiters, since they are contained in square
brackets. But not all periods which occur in the middle of a sentence
are contained in square brackets, and not all sentences end with a
period.
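
The square-bracket trick could look something like the following. This is a sketch under my own assumptions (links are non-nested [[...]], and a NUL character never appears in the text); it hides the periods inside links before splitting and restores them afterwards.

```python
import re

LINK = re.compile(r"\[\[.*?\]\]")   # non-nested [[...]] links only
PLACEHOLDER = "\x00"                # assumed never to occur in wikitext

def split_sentences(paragraph):
    # Mask periods inside [[...]] so the splitter cannot see them.
    masked = LINK.sub(
        lambda m: m.group(0).replace(".", PLACEHOLDER), paragraph
    )
    # Split after sentence-ending punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", masked)
    # Restore the masked periods.
    return [p.replace(PLACEHOLDER, ".") for p in parts if p]

print(split_sentences("He met [[I.M. Someone]] in Paris. They talked."))
# ['He met [[I.M. Someone]] in Paris.', 'They talked.']

print(split_sentences("He met Dr. Jones. They talked."))
# ['He met Dr.', 'Jones.', 'They talked.']
```

The first call gets the link right; the second shows a period in the middle of a sentence that no bracket protects.
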
I'd say "difficult but doable" is quite accurate, although with the
caveat that even the state-of-the-art tools available today are
probably going to make mistakes that would be obvious to a human. I'm
sure there are tools for this, and there are probably some decent ones
that are open source. But it's not as simple as just adding an index.