On 07/23/2015 01:07 PM, Ricordisamoa wrote:
On 23/07/2015 15:28, Antoine Musso wrote:
On 23/07/2015 08:15, Ricordisamoa wrote:
Are there any stable APIs for an application to
get a parse tree in
machine-readable format, manipulate it and send the result back without
touching HTML?
I'm sorry if this question doesn't make any sense.
You might want to
explain what you are trying to do and which wall you
have hit when attempting to use Parsoid :-)
For example, adding a template transclusion as a new parameter in
another template.
XHTML5+RDFa is the wall :-(
Can't Parsoid's deserialization be caught at some point to get a
higher-level structure like mwparserfromhell
<https://github.com/earwig/mwparserfromhell>'s?
Parsoid and mwparserfromhell have different design goals and hence do
things differently.
Parsoid is meant to support HTML editing and hence provides semantic
information as annotations over the HTML document. It effectively
maintains a bidirectional/reversible mapping between segments of
wikitext and DOM trees. You can manipulate the DOM trees and get back
wikitext that represents the edited tree. As for seemingly useless or
duplicate information -- I think if you look at the Parsoid DOM spec
[1], you will know what to look for and what to manipulate. The
information on the DOM is meant to (a) render accurately, (b) support
the various bots / clients / gadgets that look for specific kinds of
information, and (c) be easily editable. If that spec has holes or
needs updates or fixing, we are happy to do that. Do let us know.
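To make this concrete, here is a minimal sketch of the use case raised
above (adding a template transclusion as a new parameter of another
template) done purely by editing the JSON annotation. It uses only the
Python standard library on a hand-written fragment that follows the
data-mw shape described in the DOM spec; the template name "echo" and
the parameter names are invented for illustration:

```python
import json
import xml.etree.ElementTree as ET

# A minimal Parsoid-style fragment: a transclusion is marked with
# typeof="mw:Transclusion" and carries its wikitext arguments as JSON
# in the data-mw attribute (structure per the Parsoid DOM spec).
fragment = (
    '<p><span typeof="mw:Transclusion" data-mw=\''
    '{"parts": [{"template": {"target": {"wt": "echo"}, '
    '"params": {"1": {"wt": "hi"}}, "i": 0}}]}\'>hi</span></p>'
)

root = ET.fromstring(fragment)
span = root.find('.//span')
data_mw = json.loads(span.get('data-mw'))

# Add a new parameter whose value is itself a template transclusion,
# expressed purely as wikitext in the "wt" field -- no HTML editing
# of the rendered content is needed.
params = data_mw['parts'][0]['template']['params']
params['note'] = {'wt': '{{inner|x}}'}

span.set('data-mw', json.dumps(data_mw))
edited = ET.tostring(root, encoding='unicode')
```

Feeding the edited HTML back through Parsoid's HTML-to-wikitext
serialization would then regenerate the corresponding wikitext.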
mwparserfromhell is an entirely wikitext-centric library as far as I can
tell. It is meant to manipulate wikitext directly. It is a neat library
which provides a lot of utilities and makes it easy to do wikitext
transformations. It doesn't know about or care about HTML because it
doesn't need to. It also seems to effectively give you some kind of
wikitext-centric AST. These are all impressions based on a quick scan
of its docs -- so pardon any misunderstandings.
Parsoid does not provide you a wikitext AST directly since it doesn't
construct one. All wikitext information shows up indirectly as DOM
annotations (either attributes or JSON information in attributes). As
Scott showed, you can still do document ("wikitext") manipulations using
DOM libraries, CSS-style queries, or directly by walking the DOM. There
are lots of ways you can edit MediaWiki pages without knowing about
wikitext, using the vast array of HTML libraries. That happens to be
our tagline: "we deal with wikitext so you don't have to".
But, you are right. It can indeed seem cumbersome if you want to
directly manipulate wikitext without the DOM getting in between or
having to deal with DOM libraries. But that is not the use case we
target. There are vastly more libraries (and developers), in all kinds
of languages, that can render, handle, and manipulate HTML easily than
know how to (or want to) manipulate wikitext programmatically. It is
kind of like the difference between the wikitext editor and the visual
editor. They each have their constituencies and roles.
All that said, as Scott noted, it is possible to develop an
mwparserfromhell-like layer on top of the Parsoid DOM annotations if you
want a wikitext-centric view (as opposed to a DOM-centric view that most
editing clients seem to want). But, since that is not a use case that we
target, that hasn't been on our radar. If someone does want to take that
on, and thinks it would be useful, we are happy to provide assistance.
It should not be too difficult.
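As a rough illustration of what such a layer might look like --
everything below is a hypothetical sketch, not Parsoid code, with
invented class and method names -- one could hide the data-mw
annotations behind a small template-centric API built on the standard
library:

```python
import json
import xml.etree.ElementTree as ET

class ParsoidTemplates:
    """Hypothetical mwparserfromhell-like view over Parsoid's
    transclusion annotations: iterate templates and edit their
    parameters while the underlying document stays HTML."""

    def __init__(self, html):
        self.root = ET.fromstring(html)

    def templates(self):
        # Yield (element, data-mw dict, template dict) triples for
        # every transclusion annotation in the document.
        for el in self.root.iter():
            if el.get('typeof') == 'mw:Transclusion':
                data_mw = json.loads(el.get('data-mw'))
                for part in data_mw['parts']:
                    if isinstance(part, dict) and 'template' in part:
                        yield el, data_mw, part['template']

    def set_param(self, name, key, wt):
        # Set a parameter (as wikitext) on every template named `name`,
        # writing the updated JSON back into the data-mw attribute.
        for el, data_mw, tpl in self.templates():
            if tpl['target']['wt'] == name:
                tpl['params'][key] = {'wt': wt}
                el.set('data-mw', json.dumps(data_mw))

    def tostring(self):
        return ET.tostring(self.root, encoding='unicode')

doc = ParsoidTemplates(
    '<p><span typeof="mw:Transclusion" data-mw=\''
    '{"parts": [{"template": {"target": {"wt": "cite"}, '
    '"params": {}, "i": 0}}]}\'>x</span></p>'
)
doc.set_param('cite', 'url', 'http://example.org')
```

A client using such a wrapper would never touch the HTML or RDFa
directly, which is essentially the wikitext-centric view asked about.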
Does that help summarize this issue and clarify the differences and
approaches of these two tools? I am "on vacation" :-) so responses will
be delayed.
Subbu.
[1]
http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec