On 9/7/20 10:15 AM, Roy Smith wrote:
Joaquin,
Thanks for your reply.
Regarding the data-parsoid route, I can't reproduce the trouble I was having. I suspect I was just getting the /revision/tid part wrong.
Taking a step back, I think part of the problem was that I apparently had an incorrect mental model of how Parsoid works. I was envisioning something that took wikitext, parsed it into a semantic parse tree (kind of like mwparserfromhell does), and then converted that parse tree to HTML. What I was trying to get at was the intermediate parse tree. Looking at https://www.mediawiki.org/wiki/Parsoid/API, this appeared to be the pagebundle format, and I was groping around trying to find the API which exposed that. I looked at the /html routes and thought to myself, "No, that's not what I want. That's the HTML. I want the parse tree".
Parsoid doesn't produce any intermediate parse tree. Parsoid's output (HTML / DOM) is the canonical representation that captures the wikitext information for you, and you can reliably get most of the information you want by inspecting that HTML according to the HTML spec Parsoid adheres to (see https://www.mediawiki.org/wiki/Specs/HTML). One caveat is that Parsoid doesn't give you detailed information about nested templates when templates are parsed, but most use cases don't need that.
So, if you parse Parsoid's HTML into a DOM, you get the "parse tree" that you want. You can then modify the HTML appropriately and, as long as your output conforms to Parsoid's HTML spec, you can post that HTML to Parsoid and have it converted to wikitext.
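To make the "HTML is the parse tree" idea concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not an official client: the HTML fragment is hand-written in the style of Parsoid output rather than fetched from the API, the names ParsoidScanner and scan are hypothetical, and the markers it looks for (typeof="mw:Transclusion" with a JSON data-mw attribute for templates, rel="mw:WikiLink" for links) follow the HTML spec linked above as I understand it.

```python
import html
import json
from html.parser import HTMLParser

# Hand-written sample in the style of Parsoid output (an assumption,
# not real API output). data-mw carries the template call as JSON.
_DATA_MW = {
    "parts": [
        {
            "template": {
                "target": {"wt": "cite web", "href": "./Template:Cite_web"},
                "params": {"url": {"wt": "https://example.org"}},
                "i": 0,
            }
        }
    ]
}
SAMPLE_HTML = (
    '<p>See <a rel="mw:WikiLink" href="./Parsoid">Parsoid</a>.</p>'
    '<span typeof="mw:Transclusion" about="#mwt1" data-mw="'
    + html.escape(json.dumps(_DATA_MW), quote=True)
    + '">rendered citation</span>'
)


class ParsoidScanner(HTMLParser):
    """Collects template calls and wikilinks from Parsoid-style HTML."""

    def __init__(self):
        super().__init__()
        self.templates = []  # list of (template name, {param: wikitext})
        self.wikilinks = []  # list of href values

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Template output is marked on its first node via typeof.
        if "mw:Transclusion" in (attr.get("typeof") or "").split():
            data = json.loads(attr.get("data-mw") or "{}")
            for part in data.get("parts", []):
                # Parts may also be plain strings (literal wikitext); skip those.
                if isinstance(part, dict) and "template" in part:
                    tpl = part["template"]
                    name = tpl["target"]["wt"].strip()
                    params = {k: v["wt"] for k, v in tpl.get("params", {}).items()}
                    self.templates.append((name, params))
        # Internal links are marked via rel.
        if tag == "a" and "mw:WikiLink" in (attr.get("rel") or "").split():
            self.wikilinks.append(attr.get("href"))


def scan(html_text):
    """Parse the HTML and return the populated scanner."""
    scanner = ParsoidScanner()
    scanner.feed(html_text)
    return scanner


if __name__ == "__main__":
    result = scan(SAMPLE_HTML)
    print(result.templates)
    print(result.wikilinks)
```

This only reads the tree. For edits you would want a real DOM library so that you can serialize the modified HTML and post it back for conversion to wikitext.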
For example, https://github.com/wikimedia/parsoid-jsapi is a library (now defunct since Parsoid/JS is not going to be maintained) that uses Parsoid's DOM as the wikitext parse tree and replicates mwparserfromhell functionality.
We haven't built anything equivalent for the PHP version of Parsoid yet.
However, Kunal (@legoktm) has built a Rust version of this. See https://docs.rs/parsoid/0.2.0/parsoid/ ... So, if Rust is your thing, you can use that library to manipulate wikitext similar to mwparserfromhell. But if not, for now, you will still have to work with a DOM to replicate mwparserfromhell functionality.
Eventually, hopefully, other language implementations will show up, and we expect much of the functionality provided by mwparserfromhell to be available. But mwparserfromhell is usable on dumps, which you currently cannot use Parsoid for. If you really wanted to, you could with a whole bunch of additional work, but for all practical purposes it is non-trivial. So that use case is not something we have targeted for now.
I think the biggest thing that could be done to improve the documentation is to update https://www.mediawiki.org/wiki/Parsoid/API. That's the page you get to most directly when searching for parsoid documentation.
As I indicated in my previous response, the information on that page is accurate. Given the responses in this thread, what would be most helpful wrt updating that page to eliminate some of the confusion around Parsoid vs. RESTBase? Feel free to edit the page directly, email me privately, or respond on this thread and we'll tweak it appropriately.
Thanks,
Subbu.