phpQuery itself builds on the DOM module already in PHP, so be aware that using it for this purpose is equivalent to using the DOM & XPath functions already available.
For one thing, this means that the HTML will have to be run through the libxml2 HTML parser (which I have found is very sketchy with perfectly legal implied close tags and such). In addition to the memory and performance concerns of parsing the whole document into a DOM tree and reserializing it, you might not get back the structure you put in... hopefully no surprises, but keep an eye out.
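As a rough illustration of that equivalence, here is a minimal sketch that uses only PHP's built-in DOMDocument and DOMXPath classes to do what a jQuery-style selector such as "#content p a" would do. The HTML fragment and variable names are made up for the example and are not taken from MediaWiki.

```php
<?php
// Hypothetical fragment; not real MediaWiki output.
$html = '<div id="content"><p>Hello <a href="/wiki/Foo">Foo</a></p><p>Second</p></div>';

$doc = new DOMDocument();
// Collect libxml2 parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Roughly the XPath counterpart of the CSS selector "#content p a".
$hrefs = [];
foreach ($xpath->query('//div[@id="content"]//p//a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($hrefs);
```

A selector library mainly saves writing the XPath expression by hand; the parsing and the tree underneath are the same.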
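A minimal sketch of the round trip described above, assuming a stock PHP build with the DOM extension. The input string is a hypothetical fragment that relies on implied close tags. The sketch only checks that the serialized result differs from the input; the exact output depends on the libxml2 version.

```php
<?php
// Hypothetical input: legal HTML that relies on implied </p> close tags.
$input = '<p>one<p>two';

$doc = new DOMDocument();
// Collect libxml2 parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($input);
libxml_clear_errors();

// saveHTML() reserializes the whole tree, including whatever the parser added.
$output = $doc->saveHTML();

// The fragment does not come back byte-for-byte.
$changed = ($output !== $input);
// Without extra options, loadHTML() wraps a fragment in html/body elements.
$wrapped = (strpos($output, '<body>') !== false);

echo $output;
```

Comparing $input and $output like this on real page output is one way to keep an eye out for the surprises mentioned above.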
-- brion
On Jan 3, 2011 1:49 AM, "Philip Tzou" <philip.npc@gmail.com> wrote:
According to its website, "phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library."
I feel it would be very convenient to introduce such jQuery-like tools into MediaWiki, since we do need to parse HTML text. For example, I could replace the awful regex part of LanguageConverter::autoConvert with phpQuery.
So I want to ask: is it possible to introduce phpQuery into MediaWiki?
Sincerely,
Philip Tzou
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Mon, Jan 3, 2011 at 11:59 AM, Brion Vibber <brion@pobox.com> wrote:
phpQuery itself builds on the DOM module already in PHP, so be aware that using it for this purpose is equivalent to using the DOM & XPath functions already available.
For one thing, this means that the HTML will have to be run through the libxml2 HTML parser (which I have found is very sketchy with perfectly legal implied close tags and such). In addition to the memory and performance concerns of parsing the whole document into a DOM tree and reserializing it, you might not get back the structure you put in... hopefully no surprises, but keep an eye out.
In theory, this problem should go away in a few years when everyone converges on HTML5 parsing. I think you can get a PHP HTML5 parser, which is compatible with browser parsing, but the performance probably isn't so good, and I don't know how well-maintained it is. ("Compatible with browser parsing" means "identical to Firefox 4 and WebKit nightly parsing, and compatible enough with how they used to parse things that no appreciable number of sites have broken in the new browser versions".)
That said, we do generally output well-formed XML or something quite close to it, so the cases where PHP's DOM library will do something unexpected should be reasonably limited.
I thought we had compatibility problems with users who didn't have the DOM module installed, including the default RHEL5 configuration, IIRC? Or was that something else?
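If missing DOM support is a concern, code that depends on it could check at runtime before taking the DOM path. This is only a sketch: hasDomSupport() is a hypothetical helper name, not an existing MediaWiki function, and the fallback branch is a placeholder.

```php
<?php
// Hypothetical helper: true when PHP's DOM extension is loaded and usable.
function hasDomSupport() {
    return extension_loaded('dom') && class_exists('DOMDocument');
}

if (hasDomSupport()) {
    echo "DOM extension available\n";
} else {
    // Placeholder: keep using the existing regex-based code path here.
    echo "DOM extension missing\n";
}
```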