On 07/23/2015 01:07 PM, Ricordisamoa wrote:
On 23/07/2015 15:28, Antoine Musso wrote:
On 23/07/2015 08:15, Ricordisamoa wrote:
Are there any stable APIs for an application to get a parse tree in machine-readable format, manipulate it and send the result back without touching HTML? I'm sorry if this question doesn't make any sense.
You might want to explain what you are trying to do and which wall you have hit when attempting to use Parsoid :-)
For example, adding a template transclusion as a new parameter in another template. XHTML5+RDFa is the wall :-( Can't Parsoid's deserialization be caught at some point to get a higher-level structure like mwparserfromhell's (https://github.com/earwig/mwparserfromhell)?
Parsoid and mwparserfromhell have different design goals and hence do things differently.
Parsoid is meant to support HTML editing and hence provides semantic information as annotations over the HTML document. It effectively maintains a bidirectional/reversible mapping between segments of wikitext and DOM trees. You can manipulate the DOM trees and get back wikitext that represents the edited tree. As for useless and duplicate information -- I think if you look at the Parsoid DOM spec [1], you will know what to look for and what to manipulate. The information on the DOM is meant to (a) render accurately, (b) support the various bots / clients / gadgets that look for specific kinds of information, and (c) be editable easily. If that spec has holes or needs updates or fixes, we are happy to address that. Do let us know.
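To make that concrete, here is a rough sketch (Python with BeautifulSoup; the HTML fragment and the exact data-mw layout are my approximation of what the spec [1] describes, so treat the details as illustrative) of adding a parameter, whose value is itself a transclusion, by editing a transclusion node's annotation:

import json
from bs4 import BeautifulSoup

# Hand-written approximation of a Parsoid transclusion node; see [1]
# for the authoritative shape of typeof / data-mw.
html = (
    "<span typeof=\"mw:Transclusion\" data-mw='"
    '{"parts":[{"template":{"target":{"wt":"Infobox"},'
    '"params":{"name":{"wt":"Example"}},"i":0}}]}\'>rendered text</span>'
)

soup = BeautifulSoup(html, "html.parser")
node = soup.find(attrs={"typeof": "mw:Transclusion"})

data_mw = json.loads(node["data-mw"])
params = data_mw["parts"][0]["template"]["params"]
# New parameter whose value is itself a transclusion, expressed as wikitext.
params["extra"] = {"wt": "{{Some other template}}"}
node["data-mw"] = json.dumps(data_mw)

print(soup)  # edited DOM, ready to be handed back to Parsoid for serialization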
mwparserfromhell is an entirely wikitext-centric library as far as I can tell. It is meant to manipulate wikitext directly. It is a neat library that provides a lot of utilities and makes it easy to do wikitext transformations. It doesn't know or care about HTML because it doesn't need to. It also seems to effectively give you some kind of wikitext-centric AST. These are all impressions based on a quick scan of its docs -- so pardon any misunderstandings.
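For comparison, the same kind of edit in mwparserfromhell's wikitext-centric view looks roughly like this (based on its documented API; the template and parameter names are made up):

import mwparserfromhell

wikitext = "{{Infobox|name=Example}}"
code = mwparserfromhell.parse(wikitext)

# filter_templates() returns Template nodes from the wikitext parse tree.
infobox = code.filter_templates(matches="Infobox")[0]
# Add a parameter whose value is another transclusion.
infobox.add("extra", "{{Some other template}}")

print(code)  # {{Infobox|name=Example|extra={{Some other template}}}}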
Parsoid does not provide you a wikitext AST directly since it doesn't construct one. All wikitext information shows up indirectly as DOM annotations (either attributes or JSON information in attributes). As Scott showed, you can still do document ("wikitext") manipulations using DOM libraries, CSS-style queries, or directly by walking the DOM. There are lots of ways you can edit mediawiki pages without knowing about wikitext and using the vast array of HTML libraries. That happens to be our tagline: "we deal with wikitext so you don't have to".
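As a sketch of that round trip over HTTP (the REST endpoint paths below are from memory of the public REST API, so double-check them before relying on this):

import requests

API = "https://en.wikipedia.org/api/rest_v1"
title = "Sandbox"

# 1. Fetch the Parsoid HTML for a page.
html = requests.get(API + "/page/html/" + title).text

# 2. Edit the HTML/DOM however you like (a plain string replace here just
#    to keep the sketch short; a real client would use a DOM library).
html = html.replace("old phrase", "new phrase")

# 3. Ask the service to serialize the edited HTML back to wikitext.
resp = requests.post(API + "/transform/html/to/wikitext/" + title,
                     data={"html": html})
print(resp.text)  # wikitext reflecting the HTML edit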
But, you are right. It can indeed seem cumbersome if you want to manipulate wikitext directly without the DOM getting in between or having to deal with DOM libraries. But that is not the use case we target. There are vastly more libraries (and developers), in all kinds of languages, that know about HTML and can render, handle, and manipulate it easily than there are that know how to (or want to) manipulate wikitext programmatically. It is kind of like the difference between the wikitext editor and the visual editor. They each have their constituencies and roles.
All that said, as Scott noted, it is possible to develop an mwparserfromhell-like layer on top of the Parsoid DOM annotations if you want a wikitext-centric view (as opposed to the DOM-centric view that most editing clients seem to want). But, since that is not a use case we target, it hasn't been on our radar. If someone does want to take that on, and thinks it would be useful, we are happy to provide assistance. It should not be too difficult.
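If someone wanted to sketch such a layer, it could be as thin as something like this (class and method names are invented for illustration, and it only handles the simplest transclusion case):

import json
from bs4 import BeautifulSoup

class TemplateView:
    """Toy wikitext-style wrapper over Parsoid transclusion annotations.
    Names and scope here are invented for illustration only."""

    def __init__(self, parsoid_html):
        self.soup = BeautifulSoup(parsoid_html, "html.parser")

    def add_param(self, template_name, param, wikitext_value):
        # Walk mw:Transclusion nodes and patch the data-mw JSON in place.
        for node in self.soup.find_all(attrs={"typeof": "mw:Transclusion"}):
            data_mw = json.loads(node["data-mw"])
            for part in data_mw["parts"]:
                tpl = part.get("template") if isinstance(part, dict) else None
                if tpl and tpl["target"]["wt"].strip() == template_name:
                    tpl["params"][param] = {"wt": wikitext_value}
                    node["data-mw"] = json.dumps(data_mw)

    def html(self):
        # Hand this back to Parsoid to get the edited wikitext.
        return str(self.soup)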
Does that help summarize this issue and clarify the differences and approaches of these two tools? I am "on vacation" :-) so responses will be delayed.
Subbu.
[1] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec