On 06/10/2014 01:27 PM, Federico Leva (Nemo) wrote:
Gabriel Wicke, 10/06/2014 20:08:
Working on Parsoid HTML can be just an easier way to manipulate wikitext.
Still, wikitext markup is the anchor to recognise similar paragraphs; not HTML. (I mean, when I migrate old translations manually.) The peculiarities telling me two paragraphs are from the same source may not even produce any HTML difference, or have wildly different output.[1] Does the HTML5 DOM tell the *whole* story about the original wikitext? Specs don't say so, AFAICS.
If you are more interested in the structure and less interested in syntactical detail, then doing the comparison directly at the HTML level could actually be very useful.
Still, from reading the specs I don't see how one could easily extract (a representation of) the original markup or the linguistic elements.
Each HTML element is also annotated with its source range, so you can easily get the wikitext that corresponded to some element in HTML. The other way to get the wikitext for a part is to ask Parsoid to create it.
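A rough sketch of the source-range lookup, in Python. It assumes the range is stored in a data-parsoid attribute as JSON with a "dsr" array whose first two entries are the start and end offsets into the wikitext, and that those offsets index characters of the string. The sample HTML is hand-written for illustration, not real Parsoid output, and the helper names are made up.

```python
import json
from html.parser import HTMLParser


class SourceRangeCollector(HTMLParser):
    """Collect (tag, start, end) for every element that carries a source range."""

    def __init__(self):
        super().__init__()
        self.ranges = []

    def handle_starttag(self, tag, attrs):
        # Assumption: the range lives in data-parsoid as {"dsr": [start, end, ...]}.
        raw = dict(attrs).get("data-parsoid")
        if not raw:
            return
        dsr = json.loads(raw).get("dsr")
        if dsr and dsr[0] is not None and dsr[1] is not None:
            self.ranges.append((tag, dsr[0], dsr[1]))


def wikitext_for_elements(wikitext, html):
    """Return (tag, wikitext slice) pairs for each annotated element."""
    collector = SourceRangeCollector()
    collector.feed(html)
    collector.close()
    return [(tag, wikitext[start:end]) for tag, start, end in collector.ranges]


# Hand-written sample, shaped like the assumed annotation format.
wikitext = "[[foo]] [[API:Query|bar]]"
html = (
    '<p data-parsoid=\'{"dsr":[0,25,0,0]}\'>'
    '<a rel="mw:WikiLink" href="./Foo" data-parsoid=\'{"dsr":[0,7,2,2]}\'>foo</a> '
    '<a rel="mw:WikiLink" href="./API:Query" '
    'data-parsoid=\'{"dsr":[8,25,12,2]}\'>bar</a>'
    "</p>"
)

for tag, source in wikitext_for_elements(wikitext, html):
    print(tag, repr(source))
```

If the offsets turn out to count something other than Python string characters (for example UTF-16 code units), the slices would need adjusting for text outside the basic plane.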
One could perhaps remove all the innermost content of tags, a series of attributes like about and typeof, all the {"wt":"unused value"} etc. and then watch for the noise of additional markup when comparing two wikitexts. It's not any easier than action=parse or custom regexes, unless there is already some tool doing it.
You should be able to do something like
[[foo]] [[API:Query|bar]] [http://www.example.com/ baz] -> [[]] [[API:Query|]] [http://www.example.com/ ]
As an example, I pasted your example line into http://parsoid-lb.eqiad.wikimedia.org/_wikitext/
Then I removed all text content, and fed that to http://parsoid-lb.eqiad.wikimedia.org/_html/
Result: [[foo|<nowiki/>]] [[API:Query|<nowiki/>]] [http://www.example.com/]
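The "remove all text content" step could be scripted along these lines. This is only a sketch using Python's standard html.parser: it drops every text node that contains non-whitespace characters, keeps tags and attributes, and leaves whitespace-only text alone so that separators between links survive. The class and function names are made up, and the sample input is hand-written rather than real Parsoid output. Sending the result back to Parsoid for serialization is left out.

```python
from html import escape
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Re-emit the markup while dropping non-whitespace text nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    @staticmethod
    def _open(tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)  # bare attribute without a value
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append(self._open(tag, attrs) + ">")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._open(tag, attrs) + "/>")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Keep whitespace-only runs (e.g. the space between two links).
        if data.isspace():
            self.out.append(data)


def strip_text(html):
    """Return html with all non-whitespace text content removed."""
    stripper = TextStripper()
    stripper.feed(html)
    stripper.close()  # flush any buffered trailing text
    return "".join(stripper.out)


sample = (
    '<p><a rel="mw:WikiLink" href="./Foo">foo</a> '
    '<a rel="mw:ExtLink" href="http://www.example.com/">baz</a></p>'
)
print(strip_text(sample))
```

Comments and doctype declarations are silently dropped by this sketch, which seems acceptable for a skeleton comparison but would matter if the output had to round-trip exactly.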
You can also use Parsoid to further normalize the formatting of wikitext, which might help you to pick up similarity more easily.
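Once two paragraphs have been reduced to normalized skeletons, one possible way to score their similarity is Python's difflib. This is a swapped-in generic technique, not something Parsoid provides, and the function name and sample strings below are purely illustrative.

```python
from difflib import SequenceMatcher


def skeleton_similarity(a, b):
    """Return a ratio between 0.0 and 1.0 for two markup skeletons."""
    return SequenceMatcher(None, a, b).ratio()


# Illustrative skeletons: the second one lacks the external link.
old = "[[]] [[API:Query|]] [http://www.example.com/ ]"
new = "[[]] [[API:Query|]]"

print(skeleton_similarity(old, old))
print(skeleton_similarity(old, new))
```

A threshold for "same source paragraph" would have to be tuned against real translations; nothing here suggests a particular value.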
Gabriel