Re: [Wikitech-l] [REQ] A mixed xml/html parser for skinning

10 Feb 2011

* Daniel Friesen <lists(a)nadir-seen-fire.com> [Thu, 10 Feb 2011 01:37:18 -0800]:
 I've been experimenting with a mixed xml/html based template syntax for
 skinning[1].
 However, I've been having issues with the parsing of it.

 - DOMDocument::loadHTML throws warnings, and when I output the result it
 strips out namespaces, turning <mw:foo> into <foo...

 - SimpleHTMLDOM was the most promising; in fact, my current experiments
 got very far with it. However, when I got to the need to insert a node
 before/after an element it completely messed up. I'm also not optimistic
 about its performance, since there are no dom operations and its "insert"
 is essentially "concatenate some html with the outertext and set
 outertext to it".
 - html5lib choked on namespaces other than, presumably, its built-in
 handling of things like svg:.
 - phpQuery is just a wrapper around DOMDocument
 - tidy's plugin is supposed to support dom parsing, but it is not
 deployed on every server, and even people using tidy through mw might
 not be using the plugin since we support the executable as well. Not to
 mention tidy seemed to share the issues of stripping or choking on
 <mw:...... tags when it came to my editsection stuff. So even the idea
 of piping through tidy and then using loadXML on it is out.
 - wiseparser, well I couldn't even get that to execute.
 - XML_HTMLSax is so old and unmaintained I couldn't really get into
 looking at it.

 The requirements ideally are that it should support the normal html
 parsing we already have (ie: boolean attributes and quoteless attributes,
 <div foo bar=baz>, perhaps the simple implicitly closed tags like <br>),
 but also support parsing tags and attributes with mw: in them, in other
 words XML namespaces.

 Is there anyone willing to help out building a parser for it?
 Possibilities could be custom parsing directly to dom, custom parsing
 and calling a SAX-like api, or at its simplest a light parser that
 parses the html and outputs xml we can parse with loadXML instead (I
 believe the issue in DOMDocument is its html processing, not issues with
 namespaces). That would end up being a potential tidy replacement. Tidy
 can't be used in this case because it too messes up namespaced stuff.
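
For illustration only, here is a rough sketch of the "light parser that
parses the html and outputs xml" option. It is written in Python rather
than PHP so that it can be run standalone, and it is not the skinning
system's actual implementation. The names HtmlToXml, html_to_xml and
VOID_TAGS, and the namespace URI bound to mw:, are all assumptions made
up for this example.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Assumption: a small, non-exhaustive set of implicitly closed tags.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# Assumption: placeholder URI; the real namespace for mw: is not given.
MW_NS = "http://example.org/mw"


class HtmlToXml(HTMLParser):
    """Re-serialise loose HTML as well-formed XML, keeping prefixed names."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _open(self, tag, attrs, self_close):
        parts = [tag]
        for name, value in attrs:
            # Boolean attributes arrive as None; XML requires a value,
            # so repeat the attribute name (foo -> foo="foo").
            parts.append("%s=%s" % (name, quoteattr(name if value is None else value)))
        self.out.append("<%s%s>" % (" ".join(parts), "/" if self_close else ""))

    def handle_starttag(self, tag, attrs):
        # <br> and friends have no end tag in HTML, so self-close them.
        self._open(tag, attrs, tag in VOID_TAGS)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, True)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data))


def html_to_xml(fragment):
    """Return an XML string with the mw: prefix declared on a wrapper root."""
    parser = HtmlToXml()
    parser.feed(fragment)
    parser.close()
    return '<root xmlns:mw="%s">%s</root>' % (MW_NS, "".join(parser.out))


if __name__ == "__main__":
    xml_text = html_to_xml(
        '<div foo bar=baz>Hi<br><mw:editsection mw:page="Main">edit</mw:editsection></div>'
    )
    # The output is now acceptable to a strict, namespace-aware XML parser.
    print(ET.fromstring(xml_text).find("div").attrib)
```

The sketch only covers the three HTML-isms listed above (boolean
attributes, quoteless attributes, implicitly closed tags). It does not
repair unbalanced markup or preserve comments, so it is a starting point
for discussion rather than a tidy replacement.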

 [1]: http://www.mediawiki.org/wiki/User:Dantman/Skinning_system#xml.2Fhtml_templ…

Why not just use XMLReader / XMLWriter, as WikiImporter does? Are there
performance concerns? It uses libxml; shouldn't that be good enough?
Dmitriy
