Re: [Wikitext-l] Markup cleansing by clearing all linguistic elements

10 Jun 2014

Gabriel Wicke, 10/06/2014 20:08:
...
Working on Parsoid HTML can be just an easier way to
manipulate wikitext.

Still, the wikitext markup, not the HTML, is the anchor for recognising 
similar paragraphs (I mean, when I migrate old translations manually). The 
peculiarities that tell me two paragraphs come from the same source may 
produce no difference in the HTML at all, or may produce wildly different 
output.[1] 
Does the HTML5 DOM tell the *whole* story about the original wikitext? 
Specs don't say so, AFAICS.

Still, from reading the specs I don't see how one could easily extract the 
(representation of the) original markup or the linguistic elements. One 
could perhaps remove all the innermost content of tags, a series of 
attributes like about and typeof, all the {"wt":"unused value"} etc., and 
then watch for the noise of additional markup when comparing two 
wikitexts. That's not any easier than action=parse or custom regexes, 
unless there is already some tool doing it.
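
To make the stripping idea concrete, here is a rough Python sketch using only the standard library. It is an illustration under assumptions, not an existing tool: the names NOISE_ATTRS, SkeletonParser, skeleton and similarity are made up for this example, the list of attributes to drop is a guess, and dropping data-mw wholesale is a simplification of removing the individual {"wt": ...} values.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumption: attributes treated as noise. The list is illustrative only.
NOISE_ATTRS = {"about", "typeof", "data-mw", "data-parsoid", "id"}


class SkeletonParser(HTMLParser):
    """Collect tags and the names of remaining attributes, ignoring all text."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep attribute names only; values often carry language-specific text.
        kept = sorted(name for name, _ in attrs if name not in NOISE_ATTRS)
        self.tokens.append("<" + " ".join([tag] + kept) + ">")

    def handle_endtag(self, tag):
        self.tokens.append("</" + tag + ">")


def skeleton(html):
    """Return the markup-only token list for an HTML fragment."""
    parser = SkeletonParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


def similarity(html_a, html_b):
    """Ratio in [0, 1] comparing the two markup skeletons."""
    return SequenceMatcher(None, skeleton(html_a), skeleton(html_b)).ratio()
```

Two fragments whose text differs but whose tags match would get a ratio of 1.0; any extra markup in one of them shows up as the "noise" mentioned above. Whether that is more useful than comparing the wikitext directly is exactly the open question.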

Nemo

[1] As an imperfect example, if I find

ตัวดำเนินการที่ใช้ได้จะแสดงไว้ทางด้านขวา ตามลำดับ ดูที่ 
{{mediawiki|m:Help:Calculation|Help:Calculation}} สำหรับรายละเอียดเพิ่มเติม 
ของตัวดำเนินการแต่ละอย่าง, ความถูกต้องและรูปแบบของผลลัพธ์ที่คืนค่ามาอาจจะแตกต่างกันไป 
ขึ้นอยู่กับระบบปฏิบัติการของเซิร์ฟเวอร์ที่ซอฟท์แวร์มีเดียวิกิรันอยู่
และการจัดรูปแบบตัวเลขของภาษา 
ที่เซิร์ฟเวอร์ใช้

in https://www.mediawiki.org/?oldid=544536&action=edit I'm pretty sure 
that's the same paragraph as the one containing {{mediawiki}} in the 
source. What the output of {{mediawiki}} is here doesn't matter much.
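
The "custom regexes" route for this footnote could look roughly like the Python sketch below. It is only an assumption of how one might do it: TEMPLATE_NAME, template_names and shares_template are hypothetical names, nested templates and parser functions get no special handling, and lowercasing the whole name is a simplification.

```python
import re

# Captures the name of a template call such as {{mediawiki|...}}.
# Assumption: a plain regex is good enough; this is not a wikitext parser.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^{}|]+?)\s*(?:\||\}\})")


def template_names(wikitext):
    """Return the lowercased template names found in a paragraph."""
    return [name.lower() for name in TEMPLATE_NAME.findall(wikitext)]


def shares_template(paragraph_a, paragraph_b):
    """True if both paragraphs call at least one template with the same name."""
    return bool(set(template_names(paragraph_a)) & set(template_names(paragraph_b)))
```

Applied to the Thai paragraph above, template_names picks out the mediawiki template regardless of the surrounding language, which is the kind of anchor I use when matching paragraphs by hand.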
