Re: [Wikitext-l] Markup cleansing by clearing all linguistic elements

10 Jun 2014


      Gabriel Wicke, 10/06/2014 20:08:
...
Working on Parsoid HTML can be just an easier way to manipulate wikitext.
Still, wikitext markup, not HTML, is the anchor I use to recognise 
similar paragraphs (I mean, when I migrate old translations manually). 
The peculiarities telling me two paragraphs are from the same source may 
not produce any HTML difference at all, or may produce wildly different 
output.[1] 
Does the HTML5 DOM tell the *whole* story about the original wikitext? 
Specs don't say so, AFAICS.
Still, by reading the specs I don't see how one could easily extract (a 
representation of) the original markup or the linguistic elements. One 
could perhaps remove all the innermost content of tags, a series of 
attributes like about and typeof, all the {"wt":"unused value"} etc. and 
then watch for the noise of additional markup when comparing two 
wikitexts. It's not any easier than action=parse or custom regexes, 
unless there is already some tool doing it.
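
A minimal sketch of the stripping idea above, not an existing tool. It uses only the Python standard library and assumes the input is Parsoid-style HTML. The names SkeletonParser, skeleton and similarity are hypothetical, and so is the exact set of dropped attributes: about and typeof come from the text above, while data-mw is assumed to be where the {"wt": ...} values live. All text content is discarded, so only the markup "noise" is compared.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Attributes treated as noise. This set is an assumption for illustration.
DROPPED_ATTRS = {"about", "typeof", "data-mw"}


class SkeletonParser(HTMLParser):
    """Collect tags and kept attributes, discarding all text content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Keep attributes in a stable order so equal markup compares equal.
        kept = sorted(
            ((k, v) for k, v in attrs if k not in DROPPED_ATTRS),
            key=lambda kv: kv[0],
        )
        self.tokens.append(("open", tag, tuple(kept)))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    # handle_data is deliberately not overridden: text nodes are ignored.


def skeleton(html):
    """Return the markup-only token sequence of an HTML fragment."""
    parser = SkeletonParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


def similarity(html_a, html_b):
    """Ratio in [0, 1] of how alike the two markup skeletons are."""
    return SequenceMatcher(None, skeleton(html_a), skeleton(html_b)).ratio()
```

Two paragraphs in different languages with the same links and templates should then score close to 1, and the remaining difference is the additional markup to look at. Whether this is more reliable than comparing the wikitext directly is exactly the open question.
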
Nemo
[1] As an imperfect example, if I find
ตัวดำเนินการที่ใช้ได้จะแสดงไว้ทางด้านขวา ตามลำดับ ดูที่ 
{{mediawiki|m:Help:Calculation|Help:Calculation}} สำหรับรายละเอียดเพิ่มเติม 
ของตัวดำเนินการแต่ละอย่าง, ความถูกต้องและรูปแบบของผลลัพธ์ที่คืนค่ามาอาจจะแตกต่างกันไป 
ขึ้นอยู่กับระบบปฏิบัติการของเซิร์ฟเวอร์ที่ซอฟท์แวร์มีเดียวิกิรันอยู่ และการจัดรูปแบบตัวเลขของภาษา 
ที่เซิร์ฟเวอร์ใช้
in https://www.mediawiki.org/?oldid=544536&action=edit I'm pretty sure 
that's the same paragraph as the one containing {{mediawiki}} in the 
source. What the output of {{mediawiki}} is here doesn't matter much.
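
The matching described in [1] can be sketched with a custom regex on the wikitext itself. This is only an illustration under stated assumptions: markup_signature and match_paragraphs are hypothetical names, only non-nested {{...}} template calls are considered, and paragraphs are assumed to be already split into lists of strings.

```python
import re

# Matches non-nested template calls such as {{name|arg1|arg2}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}]+)\}\}")


def markup_signature(paragraph):
    """Return the template calls of a paragraph, in order, whitespace collapsed."""
    return tuple(
        " ".join(m.group(1).split()) for m in TEMPLATE_RE.finditer(paragraph)
    )


def match_paragraphs(source_paragraphs, translated_paragraphs):
    """Map each translated paragraph index to the source indexes sharing its signature."""
    by_signature = {}
    for i, para in enumerate(source_paragraphs):
        by_signature.setdefault(markup_signature(para), []).append(i)

    matches = {}
    for j, para in enumerate(translated_paragraphs):
        sig = markup_signature(para)
        if sig:  # a paragraph without markup offers no anchor
            matches[j] = by_signature.get(sig, [])
    return matches
```

A translated paragraph mapped to exactly one source index is a confident match; an empty or multi-element list still needs a manual look.
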
