Hi,
I am working on the mass migration tools project https://www.mediawiki.org/wiki/Extension:Translate/Mass_migration_tools as part of Google Summer of Code. One part of the project is to import old translations into the Translate extension.
We have completed a basic import that splits the old pages on double newlines (\n\n), plus some additional alignment based on h2 headers. We are now thinking about how to improve the alignment.
Has any work been done on this subject? For each unit, what I would like to do is strip out all the linguistic elements and keep only the bare markup. Then I could compare the markup of the source and target units and align them accordingly.
Are there any APIs available which already do this? Any guidance on how to accomplish this task would be appreciated.
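To make our current approach concrete, here is a rough sketch of the splitting and alignment (illustrative Python only, not our actual code; markup_only() is just a crude placeholder for the "strip the linguistic elements" step):

    import difflib
    import re

    def split_units(wikitext):
        # Split a page into candidate units on blank lines.
        return [u.strip() for u in re.split(r'\n{2,}', wikitext) if u.strip()]

    def markup_only(unit):
        # Crude stand-in for "strip the linguistic elements":
        # drop word characters and keep the remaining markup noise.
        return re.sub(r'\w+', '', unit)

    def align(source_units, target_units):
        # Align the two unit lists by the similarity of their markup.
        matcher = difflib.SequenceMatcher(
            a=[markup_only(u) for u in source_units],
            b=[markup_only(u) for u in target_units])
        return matcher.get_matching_blocks()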
Hi Pratik!
On 06/09/2014 10:37 AM, Pratik Lahoti wrote:
Are there any APIs available which already do this? Any guidance on how to accomplish this task would be appreciated.
If you haven't heard of it, then https://www.mediawiki.org/wiki/Parsoid might be useful. It lets you work on HTML instead of wikitext, and can convert that HTML back to wikitext.
I'm also curious how this work will interact with https://www.mediawiki.org/wiki/Content_translation, which is also based on Parsoid.
Gabriel
Gabriel Wicke, 10/06/2014 02:30:
If you haven't heard of it, then https://www.mediawiki.org/wiki/Parsoid might be useful. It lets you work on HTML instead of wikitext, and can convert that HTML back to wikitext.
I'm also curious how this work will interact with https://www.mediawiki.org/wiki/Content_translation, which is also based on Parsoid.
There is no interaction because PageMigration doesn't need to manipulate HTML. :)
The question might have been unclear: what would be interesting (if easily available) is the ability to input some wikitext and get as output *only* the wikitext "markup", i.e. everything except the "linguistic" plain text (with some approximation). So, the example at https://www.mediawiki.org/wiki/API:Parsing_wikitext#Example_2 would become
[[foo]] [[API:Query|bar]] [http://www.example.com/ baz] -> [[]] [[API:Query|]] [http://www.example.com/ ]
or something like that.
AFAIK there are solutions to get the plain text (e.g. people often want to look up the text of a Wiktionary entry from the API, with varying degrees of success), but I'm not sure if there is something available to do the opposite, or if one would need to build it on top of those existing tools, by "subtraction".
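For the link examples above, a first approximation by "subtraction" could be a few regular expressions (a sketch; real wikitext has many more constructs, e.g. templates and tags, each needing its own rule):

    import re

    def strip_linguistic(wikitext):
        # [[Target|label]] -> [[Target|]]  (keep the target, drop the label)
        wikitext = re.sub(r'\[\[([^\]|]*)\|[^\]]*\]\]', r'[[\1|]]', wikitext)
        # [[foo]] -> [[]]  (for plain links the target is also the text)
        wikitext = re.sub(r'\[\[[^\]|]*\]\]', '[[]]', wikitext)
        # [http://url label] -> [http://url ]  (keep the URL, drop the label)
        wikitext = re.sub(r'\[(https?://\S+) [^\]]*\]', r'[\1 ]', wikitext)
        return wikitext

    print(strip_linguistic('[[foo]] [[API:Query|bar]] [http://www.example.com/ baz]'))
    # [[]] [[API:Query|]] [http://www.example.com/ ]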
Nemo
On 06/10/2014 07:24 AM, Federico Leva (Nemo) wrote:
Gabriel Wicke, 10/06/2014 02:30:
I'm also curious how this work will interact with https://www.mediawiki.org/wiki/Content_translation, which is also based on Parsoid.
There is no interaction because PageMigration doesn't need to manipulate HTML. :)
Working on Parsoid HTML can be just an easier way to manipulate wikitext.
AFAIK there are solutions to get the plain text (e.g. people often want to look up the text of a Wiktionary entry from the API, with varying degrees of success), but I'm not sure if there is something available to do the opposite, or if one would need to build it on top of those existing tools, by "subtraction".
You could of course remove the text content from Parsoid HTML & convert that back to wikitext.
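A minimal sketch of that with lxml (getting the HTML from Parsoid and posting it back for serialisation is left to whichever endpoint you use):

    import lxml.html

    def blank_text_nodes(parsoid_html):
        # Drop every text node but keep the element structure, so that
        # serialising back to wikitext yields only the markup.
        doc = lxml.html.fromstring(parsoid_html)
        for el in doc.iter('*'):  # elements only, skip comments
            el.text = None  # text inside the element
            el.tail = None  # text between this element and the next
        return lxml.html.tostring(doc, encoding='unicode')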
I'm still wondering what the longer-term plan for <translate> is once content translation becomes available. To me it seems that there will be a good amount of overlap in their functionality. If content translation basically replaces the <translate> functionality, that would make things much easier for Parsoid and visual editing.
Gabriel
Gabriel Wicke, 10/06/2014 20:08:
Working on Parsoid HTML can be just an easier way to manipulate wikitext.
Still, wikitext markup, not HTML, is the anchor for recognising similar paragraphs. (I mean, when I migrate old translations manually.) The peculiarities telling me two paragraphs are from the same source may not even produce any HTML difference, or may have wildly different output.[1] Does the HTML5 DOM tell the *whole* story about the original wikitext? The specs don't say so, AFAICS.
Still, from reading the specs I don't see how one could easily extract a representation of the original markup, or of the linguistic elements. One could perhaps remove all the innermost content of tags, a series of attributes like about and typeof, all the {"wt":"unused value"} etc., and then watch for the noise of additional markup when comparing two wikitexts. It's not any easier than action=parse or custom regexes, unless there is already some tool doing it.
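For concreteness, the kind of normalisation I mean would look something like this (a sketch assuming lxml and Parsoid's inline data-parsoid attributes; the attribute list is only a guess at what counts as noise):

    import lxml.html

    NOISE_ATTRS = ('about', 'typeof', 'data-parsoid')

    def skeleton(parsoid_html):
        # Reduce Parsoid HTML to a bare structural skeleton that two
        # revisions can be compared on.
        doc = lxml.html.fromstring(parsoid_html)
        for el in doc.iter('*'):
            el.text = None
            el.tail = None
            for attr in NOISE_ATTRS:
                el.attrib.pop(attr, None)
        return lxml.html.tostring(doc, encoding='unicode')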
Nemo
[1] As an imperfect example, if I find
ตัวดำเนินการที่ใช้ได้จะแสดงไว้ทางด้านขวา ตามลำดับ ดูที่ {{mediawiki|m:Help:Calculation|Help:Calculation}} สำหรับรายละเอียดเพิ่มเติม ของตัวดำเนินการแต่ละอย่าง, ความถูกต้องและรูปแบบของผลลัพธ์ที่คืนค่ามาอาจจะแตกต่างกันไป ขึ้นอยู่กับระบบปฏิบัติการของเซิร์ฟเวอร์ที่ซอฟท์แวร์มีเดียวิกิรันอยู่ และการจัดรูปแบบตัวเลขของภาษา ที่เซิร์ฟเวอร์ใช้
in https://www.mediawiki.org/?oldid=544536&action=edit, I'm pretty sure that's the same paragraph as the one containing {{mediawiki}} in the source. What the output of {{mediawiki}} is here doesn't matter much.
On 06/10/2014 01:27 PM, Federico Leva (Nemo) wrote:
Gabriel Wicke, 10/06/2014 20:08:
Working on Parsoid HTML can be just an easier way to manipulate wikitext.
Still, wikitext markup, not HTML, is the anchor for recognising similar paragraphs. (I mean, when I migrate old translations manually.) The peculiarities telling me two paragraphs are from the same source may not even produce any HTML difference, or may have wildly different output.[1] Does the HTML5 DOM tell the *whole* story about the original wikitext? The specs don't say so, AFAICS.
If you are more interested in the structure and less interested in syntactical detail, then doing the comparison directly at the HTML level could actually be very useful.
Still, from reading the specs I don't see how one could easily extract a representation of the original markup, or of the linguistic elements.
Each HTML element is also annotated with its source range, so you can easily recover the wikitext that corresponds to a given element in the HTML. The other way to get the wikitext for a part of the page is to ask Parsoid to serialise it back to wikitext.
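For example, with inline data-parsoid attributes (a sketch; the "dsr" field holds [start, end, open-tag width, close-tag width] offsets into the original source):

    import json
    import lxml.html

    def wikitext_for(element, page_wikitext):
        # Parsoid stores the element's source range in data-parsoid's
        # "dsr" field; the first two entries are offsets into the
        # original wikitext of the page.
        dp = json.loads(element.get('data-parsoid', '{}'))
        start, end = dp['dsr'][:2]
        return page_wikitext[start:end]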
One could perhaps remove all the innermost content of tags, a series of attributes like about and typeof, all the {"wt":"unused value"} etc., and then watch for the noise of additional markup when comparing two wikitexts. It's not any easier than action=parse or custom regexes, unless there is already some tool doing it.
You should be able to do something like
[[foo]] [[API:Query|bar]] [http://www.example.com/ baz] -> [[]] [[API:Query|]] [http://www.example.com/ ]
As an example, I pasted your example line into http://parsoid-lb.eqiad.wikimedia.org/_wikitext/
Then I removed all text content, and fed that to http://parsoid-lb.eqiad.wikimedia.org/_html/
Result: [[foo|<nowiki/>]] [[API:Query|<nowiki/>]] [http://www.example.com/]
You can also use Parsoid to further normalize the formatting of wikitext, which might help you to pick up similarity more easily.
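For instance (wikitext_to_html() and html_to_wikitext() here are hypothetical wrappers for whichever Parsoid endpoint you have access to; the exact request format depends on the deployment):

    # wikitext_to_html() / html_to_wikitext() are placeholders for
    # your Parsoid client; see the endpoints mentioned above.
    def normalize(wikitext):
        # A wikitext -> HTML -> wikitext round trip through Parsoid
        # normalises formatting variants, making similar paragraphs
        # easier to match.
        return html_to_wikitext(wikitext_to_html(wikitext))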
Gabriel