I've been experimenting with a mixed xml/html-based template syntax for skinning[1]. However, I've been having issues with parsing it.
- DOMDocument::loadHTML throws warnings, and when I output the result it strips out namespaces, turning <mw:foo> into <foo>.
- SimpleHTMLDOM was the most promising; in fact my current experiments got very far with it. However, when I needed to insert a node before/after an element it completely messed up. I'm also not optimistic about its performance, since there are no DOM operations and its "insert" is essentially "concatenate some html with the outertext and set outertext to it".
- html5lib choked on namespaces, other than (presumably) its built-in handling of things like svg:.
- phpQuery is just a wrapper around DOMDocument.
- tidy's plugin is supposed to support DOM parsing, but it is not deployed on every server, and even people using tidy through mw might not be using the plugin since we support the executable as well. Not to mention tidy seemed to share the issues of stripping or choking on mw:... tags when it came to my editsection stuff. So even the idea of piping through tidy and then using loadXML on the result is out.
- wiseparser, well, I couldn't even get that to execute.
- XML_HTMLSax is so old and unmaintained I couldn't really get into looking at it.
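For comparison, the before/after insertion that tripped up SimpleHTMLDOM is an ordinary DOM operation once the template is well-formed XML. The following is only a minimal sketch: the namespace URI and the sample markup are hypothetical (the thread only specifies the mw: prefix), and it assumes the input can already be loaded with loadXML.

```php
<?php
// Hypothetical namespace URI; the thread only specifies the "mw:" prefix.
$mwNs = 'http://example.org/mw-skin';

// Hypothetical sample template, already well-formed XML.
$xml = '<div xmlns:mw="' . $mwNs . '"><mw:editsection page="Foo"/><p>Body</p></div>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// Locate the namespaced element by namespace URI and local name.
$target = $doc->getElementsByTagNameNS($mwNs, 'editsection')->item(0);

$before = $doc->createElement('span', 'before');
$after  = $doc->createElement('span', 'after');

// Insert a sibling before the target.
$target->parentNode->insertBefore($before, $target);

// DOM has no insertAfter(); inserting before the next sibling is equivalent
// (a null reference node appends at the end of the parent).
$target->parentNode->insertBefore($after, $target->nextSibling);

// Serialise only the root element, without the XML declaration.
$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
```

The mw: prefix survives serialisation here because the element was parsed as XML with the prefix declared, which is the behaviour loadHTML does not provide.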
Ideally, the parser should support the normal HTML parsing we already have (i.e. boolean attributes and quoteless attributes such as <div foo bar=baz>, and perhaps the simple implicitly closed tags like <br>), but also support parsing tags and attributes with mw: in them, in other words XML namespaces.
Is there anyone willing to help out building a parser for it? Possibilities could be custom parsing directly to DOM, custom parsing that calls a SAX-like API, or at its simplest a light parser that parses the HTML and outputs XML we can parse with loadXML instead (I believe the issue in DOMDocument is its HTML processing, not issues with namespaces). That would end up being a potential tidy replacement. Tidy can't be used in this case because it too messes up namespaced stuff.
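The "light parser" option could look roughly like the sketch below. It is not an existing library: looseHtmlToXml, MW_NS, VOID_TAGS, ATTR_RE and the sample template are all hypothetical names chosen for illustration. It only rewrites opening tags (quoting bare attribute values, giving boolean attributes a value, and self-closing void tags) and deliberately ignores text content, comments, script/style blocks and unclosed non-void tags, all of which a real implementation would have to handle.

```php
<?php
// Hypothetical namespace URI; the thread only specifies the "mw:" prefix.
const MW_NS = 'http://example.org/mw-skin';

// Tags that HTML closes implicitly (a subset of the HTML void elements).
const VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                   'input', 'link', 'meta', 'source', 'wbr'];

// One attribute: a name, optionally followed by "=" and a quoted or bare value.
const ATTR_RE = '([^\s"\'>/=]+)(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+)))?';

function looseHtmlToXml(string $html): string
{
    // Opening tag: a name, then everything up to ">" (quoted values may contain ">").
    $tagRe = '~<([A-Za-z][\w:.-]*)((?:"[^"]*"|\'[^\']*\'|[^"\'>])*)>~';

    return preg_replace_callback($tagRe, function (array $m): string {
        $name = $m[1];
        $attrText = rtrim($m[2] ?? '');

        // An explicit trailing "/" marks a self-closing tag.
        $selfClosed = substr($attrText, -1) === '/';
        if ($selfClosed) {
            $attrText = substr($attrText, 0, -1);
        }

        $out = '<' . $name;
        preg_match_all('~' . ATTR_RE . '~', $attrText, $attrs, PREG_SET_ORDER);
        foreach ($attrs as $a) {
            if (strpos($a[0], '=') === false) {
                // Boolean attribute: XML requires a value, so repeat the name.
                $value = $a[1];
            } else {
                // At most one of the three value groups is non-empty.
                $value = ($a[2] ?? '') . ($a[3] ?? '') . ($a[4] ?? '');
            }
            // Normalise HTML entities, then escape for an XML attribute.
            $value = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
            $out .= ' ' . $a[1] . '="'
                . htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8') . '"';
        }

        $isVoid = in_array(strtolower($name), VOID_TAGS, true);
        return $out . (($selfClosed || $isVoid) ? '/>' : '>');
    }, $html);
}

// Hypothetical template mixing lenient HTML with an mw: element.
$template = '<div id=content hidden><mw:editsection page="Main Page" /><br><input disabled></div>';
$xml = looseHtmlToXml($template);
echo $xml, "\n";

// Declare the mw: prefix on a wrapper element so loadXML accepts it.
$doc = new DOMDocument();
$doc->loadXML('<root xmlns:mw="' . MW_NS . '">' . $xml . '</root>');
$edit = $doc->getElementsByTagNameNS(MW_NS, 'editsection')->item(0);
echo $edit->getAttribute('page'), "\n";
```

Wrapping the output in a root element that declares the prefix is one way to satisfy loadXML; a template format could equally require the declaration in the template itself.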
[1]: http://www.mediawiki.org/wiki/User:Dantman/Skinning_system#xml.2Fhtml_templa...
* In reply to Daniel Friesen lists@nadir-seen-fire.com [Thu, 10 Feb 2011 01:37:18 -0800]:
Why not just use XMLReader / XMLWriter, as WikiImporter does? Performance concerns? It uses libxml, so shouldn't that be good enough?
Dmitriy
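As a rough illustration of the XMLReader suggestion, the sketch below streams over a template and collects the elements in the mw namespace. The namespace URI, the sample markup and the msg element are hypothetical. XMLReader expects well-formed XML, so the lenient HTML forms mentioned earlier in the thread (bare attribute values, unclosed <br>) would presumably need converting first.

```php
<?php
// Hypothetical namespace URI and template; only the "mw:" prefix comes from the thread.
$mwNs = 'http://example.org/mw-skin';
$xml  = '<div xmlns:mw="' . $mwNs . '">'
      . '<mw:editsection page="Foo"/><p>Body <mw:msg key="hello"/></p></div>';

$reader = new XMLReader();
$reader->XML($xml);

// Walk the document node by node and record elements in the mw namespace.
$found = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->namespaceURI === $mwNs) {
        $found[] = $reader->localName;
    }
}
$reader->close();

echo implode(', ', $found), "\n";
```

Because XMLReader is a forward-only pull parser, it suits a single pass over the template; inserting nodes before or after an element would still need a DOM or an XMLWriter output stage.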
wikitech-l@lists.wikimedia.org