Re: [Wikitext-l] Markup cleansing by clearing all linguistic elements

10 Jun 2014


On 06/10/2014 01:27 PM, Federico Leva (Nemo) wrote:
...
Gabriel Wicke, 10/06/2014 20:08:
...
Working on Parsoid HTML can be just an easier way to manipulate wikitext.
Still, wikitext markup is the anchor to recognise similar paragraphs; not
HTML. (I mean, when I migrate old translations manually.) The peculiarities
telling me two paragraphs are from the same source may not even produce any
HTML difference, or may produce wildly different output.[1] Does the HTML5 DOM tell
the *whole* story about the original wikitext? Specs don't say so, AFAICS.
If you are more interested in the structure and less interested in
syntactical detail, then doing the comparison directly at the HTML level
could actually be very useful.
...
Still, by reading the specs I don't see how one could easily extract
(a representation of) the original markup or the linguistic elements.
Each HTML element is also annotated with its source range, so you can easily
get the wikitext that corresponded to some element in HTML. The other way to
get the wikitext for a part is to ask Parsoid to create it.
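
The source-range lookup can be sketched roughly as follows. This is a minimal sketch, assuming each element carries a data-parsoid attribute whose JSON holds a "dsr" array that begins with the start and end offsets into the original wikitext. The sample HTML, the class DsrCollector and the helper source_ranges are hypothetical, and the offsets are treated as plain Python string indices, which may not hold for non-ASCII text.

```python
import json
from html.parser import HTMLParser


class DsrCollector(HTMLParser):
    """Collect (tag, start, end) for every element with a source range."""

    def __init__(self):
        super().__init__()
        self.ranges = []

    def handle_starttag(self, tag, attrs):
        raw = dict(attrs).get("data-parsoid")
        if not raw:
            return
        try:
            data = json.loads(raw)
        except ValueError:
            return  # not JSON, ignore this element
        # Assumption: "dsr" starts with [start, end, ...] offsets.
        dsr = data.get("dsr") if isinstance(data, dict) else None
        if dsr and dsr[0] is not None and dsr[1] is not None:
            self.ranges.append((tag, dsr[0], dsr[1]))


def source_ranges(html, wikitext):
    """Return (tag, wikitext slice) pairs in document order."""
    collector = DsrCollector()
    collector.feed(html)
    collector.close()
    return [(tag, wikitext[start:end]) for tag, start, end in collector.ranges]


# Hand-written sample, not real Parsoid output.
wikitext = "[[foo]] [[API:Query|bar]]"
html = (
    '<p data-parsoid=\'{"dsr":[0,25,0,0]}\'>'
    '<a rel="mw:WikiLink" href="./Foo" '
    'data-parsoid=\'{"dsr":[0,7,2,2]}\'>foo</a> '
    '<a rel="mw:WikiLink" href="./API:Query" '
    'data-parsoid=\'{"dsr":[8,25,12,2]}\'>bar</a>'
    '</p>'
)

for tag, source in source_ranges(html, wikitext):
    print(tag, source)
```

For the sample above, the p element is paired with the whole line and each a element with its own link markup.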
...
One could perhaps remove all the innermost content of tags, a series of
attributes like about and typeof, all the {"wt":"unused value"} etc. and
then watch for the noise of additional markup when comparing two wikitexts.
It's not any easier than action=parse or custom regexes, unless there is
already some tool doing it.
You should be able to do something like
[[foo]] [[API:Query|bar]] [http://www.example.com/ baz]
-> [[]] [[API:Query|]] [http://www.example.com/ ]
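
The transformation above can also be approximated directly on the wikitext. This is a minimal sketch with regular expressions, not a wikitext parser: the function name clear_link_text and both patterns are illustrative assumptions, and only simple internal links and http/https external links are handled (no nested links, templates or other protocols).

```python
import re

# [[target]] or [[target|label]], without nested brackets.
INTERNAL_LINK = re.compile(r"\[\[([^\[\]|]*)(\|[^\[\]]*)?\]\]")
# [http://url label]; links without a label are left untouched.
EXTERNAL_LINK = re.compile(r"\[(https?://[^\s\]]+)\s+[^\]]*\]")


def clear_link_text(wikitext):
    """Blank the human-readable part of links, keeping the markup."""

    def blank_internal(match):
        target, label = match.group(1), match.group(2)
        if label is not None:
            # Piped link: keep the target, drop the label.
            return f"[[{target}|]]"
        # Unpiped link: the target doubles as the label, so drop it.
        return "[[]]"

    wikitext = INTERNAL_LINK.sub(blank_internal, wikitext)
    return EXTERNAL_LINK.sub(r"[\1 ]", wikitext)


line = "[[foo]] [[API:Query|bar]] [http://www.example.com/ baz]"
print(clear_link_text(line))
```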
As an example, I pasted your example line into
http://parsoid-lb.eqiad.wikimedia.org/_wikitext/
Then I removed all text content, and fed that to
http://parsoid-lb.eqiad.wikimedia.org/_html/
Result:
[[foo|<nowiki/>]] [[API:Query|<nowiki/>]] [http://www.example.com/]
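
The "remove all text content" step could be scripted along these lines. This is a minimal sketch using only Python's standard html.parser; the class TextStripper, the helper strip_text and the sample HTML are hypothetical, and the sketch does not talk to Parsoid itself. It keeps whitespace-only text so that elements stay separated, which is a design choice of the sketch rather than something the steps above prescribe. Comments and entities in text are dropped along with the text.

```python
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Re-emit tags unchanged while dropping non-whitespace text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written,
        # so attributes are preserved verbatim.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, not as a pair.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep whitespace-only runs so elements stay separated.
        if data.isspace():
            self.parts.append(data)


def strip_text(html):
    """Return the HTML with all non-whitespace text content removed."""
    stripper = TextStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)


# Hand-written sample, not real Parsoid output.
sample = (
    '<p><a rel="mw:WikiLink" href="./Foo">foo</a> '
    '<a rel="mw:ExtLink" href="http://www.example.com/">baz</a></p>'
)
print(strip_text(sample))
```

The stripped HTML is what would then be handed back to Parsoid for conversion to wikitext.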
You can also use Parsoid to further normalize the formatting of wikitext,
which might help you to pick up similarity more easily.
Gabriel
