Are there any stable APIs for an application to get a parse tree in machine-readable format, manipulate it and send the result back without touching HTML? I'm sorry if this question doesn't make any sense.
On 23/07/2015 08:15, Ricordisamoa wrote:
Are there any stable APIs for an application to get a parse tree in machine-readable format, manipulate it and send the result back without touching HTML? I'm sorry if this question doesn't make any sense.
You might want to explain what you are trying to do and which wall you have hit when attempting to use Parsoid :-)
On 23/07/2015 15:28, Antoine Musso wrote:
On 23/07/2015 08:15, Ricordisamoa wrote:
Are there any stable APIs for an application to get a parse tree in machine-readable format, manipulate it and send the result back without touching HTML? I'm sorry if this question doesn't make any sense.
You might want to explain what you are trying to do and which wall you have hit when attempting to use Parsoid :-)
For example, adding a template transclusion as a new parameter in another template. XHTML5+RDFa is the wall :-( Can't Parsoid's deserialization be caught at some point to get a higher-level structure like mwparserfromhell's (https://github.com/earwig/mwparserfromhell)?
HTML5+RDFa is a machine-readable format. But I think what you are asking for is either better documentation of the template-related stuff (did you read through the slides in https://phabricator.wikimedia.org/T105175 ?) or HTML template parameter support (https://phabricator.wikimedia.org/T52587) which is in the codebase but not enabled by default in production. --scott
The slides are interesting, but for now it seems VisualEditor-focused and not nearly as powerful as mwparserfromhell. I don't care about presentation. I don't want HTML. And I hate getting all edits tagged as "VisualEditor".
On 23/07/2015 22:02, C. Scott Ananian wrote:
HTML5+RDFa is a machine-readable format. But I think what you are asking for is either better documentation of the template-related stuff (did you read through the slides in https://phabricator.wikimedia.org/T105175 ?) or HTML template parameter support (https://phabricator.wikimedia.org/T52587) which is in the codebase but not enabled by default in production. --scott _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Well, it's really just a different way of thinking about things. Instead of:
```
import mwparserfromhell
text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
wikicode = mwparserfromhell.parse(text)
templates = wikicode.filter_templates()
```
you would write:
```
js> Parsoid = require('parsoid');
js> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?";
js> Parsoid.parse(text, { document: true }).then(function(res) {
      templates = res.out.querySelectorAll('[typeof~="mw:Transclusion"]');
      console.log(templates);
    }).done();
```
That said, it wouldn't be hard to clone the API of http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html and that would probably be a great addition to the parsoid package API.
HTML is just a tree structured data representation. Think of it as XML if it makes you happier. It just happens to come with well-defined semantics and lots of manipulation libraries.
I don't know about edits tagged as "VisualEditor". That seems like that should only be done by VE. I take it you would like an easy work flow to fetch a page, make edits, and then write the new revision back? mwparserfromhell doesn't actually seem to have that functionality, but it would also be nice to facilitate that use case if we can. --scott
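For readers who would rather stay in Python, here is a minimal sketch of the same idea using only the standard library. It assumes that Parsoid marks transclusions with `typeof="mw:Transclusion"` and stores the template name and parameters as JSON in a `data-mw` attribute; the sample HTML is hand-written for illustration, and the helper name `filter_templates` is borrowed from mwparserfromhell, not from any Parsoid API.

```python
import html
import json
from html.parser import HTMLParser

# Hand-written stand-in for Parsoid output (an assumption, not real server output).
DATA_MW = {
    "parts": [{
        "template": {
            "target": {"wt": "foo"},
            "params": {"1": {"wt": "bar"}, "2": {"wt": "baz"}, "eggs": {"wt": "spam"}},
            "i": 0,
        }
    }]
}
SAMPLE_HTML = (
    '<p>I has a template! <span about="#mwt1" typeof="mw:Transclusion" '
    'data-mw="%s">...</span> See it?</p>' % html.escape(json.dumps(DATA_MW), quote=True)
)


class TransclusionFinder(HTMLParser):
    """Collects the template records found in data-mw attributes."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # typeof is a space-separated token list, which is what the
        # [typeof~="mw:Transclusion"] selector matches on.
        if "mw:Transclusion" not in (attrs.get("typeof") or "").split():
            return
        data_mw = json.loads(attrs.get("data-mw") or "{}")
        for part in data_mw.get("parts", []):
            # Parts may also be plain strings (interleaved wikitext); skip those.
            if isinstance(part, dict) and "template" in part:
                self.templates.append(part["template"])


def filter_templates(parsoid_html):
    """Hypothetical helper: return the template records in an HTML string."""
    finder = TransclusionFinder()
    finder.feed(parsoid_html)
    finder.close()
    return finder.templates


templates = filter_templates(SAMPLE_HTML)
for template in templates:
    params = {name: value["wt"] for name, value in template["params"].items()}
    print(template["target"]["wt"], params)
```

The point of the sketch is only that the template structure is recoverable from attributes; a real client would want a proper HTML5 parser rather than `html.parser`.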
On 24/07/2015 06:35, C. Scott Ananian wrote:
Well, it's really just a different way of thinking about things. Instead of:
>>> import mwparserfromhell
>>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
>>> wikicode = mwparserfromhell.parse(text)
>>> templates = wikicode.filter_templates()
you would write:
js> Parsoid = require('parsoid');
js> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?";
js> Parsoid.parse(text, { document: true }).then(function(res) {
      templates = res.out.querySelectorAll('[typeof~="mw:Transclusion"]');
      console.log(templates);
    }).done();
That said, it wouldn't be hard to clone the API of http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html
Parsoid's expressiveness seems to convey useless information, overlook important details, or duplicate them in different places. If I want to resize an image, am I supposed to change "data-file-width" and "data-file-height"? "width" and "height"? Or "src"? I think what I'm looking for is sort of an 'enhanced wikitext' rather than 'annotated HTML'.
and that would probably be a great addition to the parsoid package API.
HTML is just a tree structured data representation. Think of it as XML if it makes you happier. It just happens to come with well-defined semantics and lots of manipulation libraries.
I don't know about edits tagged as "VisualEditor". That seems like that should only be done by VE.
All edits made via visualeditoredit https://www.mediawiki.org/w/api.php?action=help&modules=visualeditoredit are tagged.
I take it you would like an easy work flow to fetch a page, make edits, and then write the new revision back?
Right.
mwparserfromhell doesn't actually seem to have that functionality
It is actually pretty easy to do with Pywikibot. But since Parsoid happens to work server-side, it makes sense to request and send back the structured tree directly.
, but it would also be nice to facilitate that use case if we can. --scott
Thanks for your time.
On 24 July 2015 at 07:34, Ricordisamoa ricordisamoa@openmailbox.org wrote:
On 24/07/2015 06:35, C. Scott Ananian wrote:
Well, it's really just a different way of thinking about things. Instead of:
```
>>> import mwparserfromhell
>>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
>>> wikicode = mwparserfromhell.parse(text)
>>> templates = wikicode.filter_templates()
```
you would write:
```
js> Parsoid = require('parsoid');
js> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?";
js> Parsoid.parse(text, { document: true }).then(function(res) {
      templates = res.out.querySelectorAll('[typeof~="mw:Transclusion"]');
      console.log(templates);
    }).done();
```
That said, it wouldn't be hard to clone the API of http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html
Parsoid's expressiveness seems to convey useless information, overlook important details, or duplicate them in different places. If I want to resize an image, am I supposed to change "data-file-width" and "data-file-height"? "width" and "height"? Or "src"? I think what I'm looking for is sort of an 'enhanced wikitext' rather than 'annotated HTML'.
and that would probably be a great addition to the parsoid package API.
HTML is just a tree structured data representation. Think of it as XML if it makes you happier. It just happens to come with well-defined semantics and lots of manipulation libraries.
I don't know about edits tagged as "VisualEditor". That seems like that should only be done by VE.
All edits made via visualeditoredit < https://www.mediawiki.org/w/api.php?action=help&modules=visualeditoredit... are tagged.
I take it you would like an easy work flow to fetch a page, make edits, and then write the new revision back?
Right.
RESTBase could help you there. With one API call, you can get the (stored) latest HTML revision of a page in Parsoid format [1], but without the need to wait for Parsoid to parse it (if the latest revision is in RESTBase's storage). There is also section API support (you can get individual HTML fragments of a page by ID, and send only those back for transformation into wikitext [2]). There is also support for page editing (aka saving), but these endpoints have not yet been enabled for WMF wikis in production due to security concerns.
[1] https://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/page_html__title__g...
[2] https://en.wikipedia.org/api/rest_v1/?doc#!/Transforms/transform_sections_to...
Cheers, Marko
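As a rough illustration of the "one API call" mentioned above, the following sketch builds the URL for the stored Parsoid HTML of a page and fetches it with the standard library. The path layout (`/page/html/{title}` with an optional revision) is my reading of the REST API listing linked in [1] and should be checked against the live documentation; the function names and the User-Agent string are made up for this example.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

REST_BASE = "https://en.wikipedia.org/api/rest_v1"


def page_html_url(title, revision=None):
    """Build the URL for a page's Parsoid HTML (path layout is an assumption)."""
    # Titles use underscores instead of spaces; "/" must be percent-encoded
    # so that subpage titles are not mistaken for extra path segments.
    path = quote(title.replace(" ", "_"), safe="")
    url = f"{REST_BASE}/page/html/{path}"
    if revision is not None:
        url += f"/{int(revision)}"
    return url


def fetch_page_html(title, revision=None,
                    user_agent="example-bot/0.1 (hypothetical contact)"):
    """Fetch the HTML as text. Performs a network request when called."""
    request = Request(page_html_url(title, revision),
                      headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


print(page_html_url("Main Page"))
```

Only the URL builder runs here; `fetch_page_html("Main Page")` would return the HTML string that the DOM-based examples in this thread operate on.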
mwparserfromhell doesn't actually seem to have that functionality
It is actually pretty easy to do with Pywikibot. But since Parsoid happens to work server-side, it makes sense to request and send back the structured tree directly.
, but it would also be nice to facilitate that use case if we can. --scott
Thanks for your time.
Thanks Marko. Replies inline
On 24/07/2015 15:07, Marko Obrovac wrote:
On 24 July 2015 at 07:34, Ricordisamoa ricordisamoa@openmailbox.org wrote:
On 24/07/2015 06:35, C. Scott Ananian wrote:
Well, it's really just a different way of thinking about things. Instead of:
```
>>> import mwparserfromhell
>>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
>>> wikicode = mwparserfromhell.parse(text)
>>> templates = wikicode.filter_templates()
```
you would write:
```
js> Parsoid = require('parsoid');
js> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?";
js> Parsoid.parse(text, { document: true }).then(function(res) {
      templates = res.out.querySelectorAll('[typeof~="mw:Transclusion"]');
      console.log(templates);
    }).done();
```
That said, it wouldn't be hard to clone the API of http://mwparserfromhell.readthedocs.org/en/latest/api/mwparserfromhell.html
Parsoid's expressiveness seems to convey useless information, overlook important details, or duplicate them in different places. If I want to resize an image, am I supposed to change "data-file-width" and "data-file-height"? "width" and "height"? Or "src"? I think what I'm looking for is sort of an 'enhanced wikitext' rather than 'annotated HTML'.
and that would probably be a great addition to the parsoid package API.
HTML is just a tree structured data representation. Think of it as XML if it makes you happier. It just happens to come with well-defined semantics and lots of manipulation libraries.
I don't know about edits tagged as "VisualEditor". That seems like that should only be done by VE.
All edits made via visualeditoredit < https://www.mediawiki.org/w/api.php?action=help&modules=visualeditoredit... are tagged.
I take it you would like an easy work flow to fetch a page, make edits, and then write the new revision back?
Right.
RESTBase could help you there. With one API call, you can get the (stored) latest HTML revision of a page in Parsoid format [1], but without the need to wait for Parsoid to parse it (if the latest revision is in RESTBase's storage).
What if it isn't?
There is also section API support (you can get individual HTML fragments of a page by ID, and send only those back for transformation into wikitext [2]). There is also support for page editing (aka saving), but these endpoints have not yet been enabled for WMF wikis in production due to security concerns.
Then I guess HTML would have to be converted into wikitext before saving? +1 API call
[1] https://en.wikipedia.org/api/rest_v1/?doc#!/Page_content/page_html__title__g...
[2] https://en.wikipedia.org/api/rest_v1/?doc#!/Transforms/transform_sections_to...
Cheers, Marko
mwparserfromhell doesn't actually seem to have that functionality
It is actually pretty easy to do with Pywikibot. But since Parsoid happens to work server-side, it makes sense to request and send back the structured tree directly.
, but it would also be nice to facilitate that use case if we can. --scott
Thanks for your time.
On Fri, Jul 24, 2015 at 10:58 AM, Ricordisamoa <ricordisamoa@openmailbox.org> wrote:
RESTBase could help you there. With one API call, you can get the (stored) latest HTML revision of a page in Parsoid format [1], but without the need to wait for Parsoid to parse it (if the latest revision is in RESTBase's storage).
What if it isn't?
If it is not in storage, then it will be generated transparently. This should only sometimes happen when you request a revision less than a handful of seconds after it was saved.
There is also section API support (you can get individual HTML fragments of a page by ID, and send only those back for transformation into wikitext [2]). There is also support for page editing (aka saving), but these endpoints have not yet been enabled for WMF wikis in production due to security concerns.
Then I guess HTML would have to be converted into wikitext before saving? +1 API call
As Marko mentioned, the HTML save end point is not yet enabled in production. Once it is, you will be able to directly POST modified HTML to save it, without adding a VisualEditor tag or having to perform extra API requests.
Gabriel
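Until the HTML save endpoint is enabled, the "+1 API call" route discussed above could look roughly like the sketch below: one request to turn edited HTML back into wikitext, then a normal `action=edit` request. Nothing is sent over the network here; the functions only build the URL and form body. The transform path (`/transform/html/to/wikitext/{title}` with an `html` form field) is my assumption about the RESTBase API and should be verified against its documentation, and both function names are hypothetical.

```python
from urllib.parse import quote, urlencode

REST_BASE = "https://en.wikipedia.org/api/rest_v1"
ACTION_API = "https://en.wikipedia.org/w/api.php"


def html_to_wikitext_request(title, html):
    """Build (url, body) for the HTML-to-wikitext transform (assumed endpoint)."""
    path = quote(title.replace(" ", "_"), safe="")
    url = f"{REST_BASE}/transform/html/to/wikitext/{path}"
    body = urlencode({"html": html}).encode("utf-8")
    return url, body


def edit_request(title, wikitext, csrf_token, summary=""):
    """Build (url, body) for saving wikitext through the standard action API."""
    body = urlencode({
        "action": "edit",
        "format": "json",
        "title": title,
        "text": wikitext,
        "summary": summary,
        # A CSRF token must be obtained separately from the action API.
        "token": csrf_token,
    }).encode("utf-8")
    return ACTION_API, body


url, body = html_to_wikitext_request("Sandbox", "<p>hi</p>")
print(url)
print(body)
```

Each pair can be passed to `urllib.request.Request(url, data=body)` to issue the POST. Saving this way goes through the ordinary edit API rather than `visualeditoredit`, so the VisualEditor tag should not be applied.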
On Fri, Jul 24, 2015 at 12:34 AM, Ricordisamoa <ricordisamoa@openmailbox.org> wrote:
Parsoid's expressiveness seems to convey useless information, overlook important details, or duplicate them in different places. If I want to resize an image, am I supposed to change "data-file-width" and "data-file-height"? "width" and "height"? Or "src"?
These are great points, and reports from folks like you will help to improve our documentation. My goal for Parsoid's DOM[1] is that every bit of information from the wikitext is represented exactly *once* in the result.
In your example, `data-file-width` and `data-file-height` represent the *unscaled* size of the *source* image. Many image scaling operations want to know this, so we include it in the DOM. It is ignored when you convert back to wikitext.
The `width` and `height` attributes are what you should modify if you want to resize an image, just like you would do for any naive html editor.
The `src` attribute is again mostly ignored (sigh); the 'resource' attribute specifies the url of the unscaled image. Of course if 'resource' is missing we'll try to make do with `src`; we really try hard to do something reasonable with whatever we're given. --scott
[1] There is a tension between "don't repeat yourself" and the use of Parsoid DOM for read views. Certain attributes (like "alt" and "title") get duplicated by default by the PHP parser. So far I think we've been mostly successful in not letting this sort of thing infect the Parsoid DOM, but there may be corner cases we accommodate for the sake of ease-of-use for viewers.
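To make the attribute roles above concrete, here is a small sketch that resizes an image by editing only `width` and `height`, using `data-file-width` and `data-file-height` to keep the aspect ratio. The sample markup is hand-written and simplified (real Parsoid output carries more attributes), and it is parsed with `xml.etree.ElementTree` on the assumption that the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for Parsoid image markup.
SAMPLE = (
    '<figure typeof="mw:Image">'
    '<a href="./File:Foo.jpg">'
    '<img resource="./File:Foo.jpg" src="//upload.example/thumb/220px-Foo.jpg" '
    'data-file-width="1000" data-file-height="500" width="220" height="110"/>'
    '</a></figure>'
)


def resize_image(img, new_width):
    """Set width/height on an <img>, preserving the source aspect ratio."""
    # data-file-* describe the unscaled source image and are read-only here.
    full_width = int(img.get("data-file-width"))
    full_height = int(img.get("data-file-height"))
    img.set("width", str(new_width))
    img.set("height", str(round(new_width * full_height / full_width)))


root = ET.fromstring(SAMPLE)
img = root.find(".//img")
resize_image(img, 300)
print(ET.tostring(root, encoding="unicode"))
```

Note that `src` is left pointing at the old thumbnail; per the explanation above it is mostly ignored on the way back to wikitext, with `resource` identifying the unscaled image.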
On 24/07/2015 15:56, C. Scott Ananian wrote:
On Fri, Jul 24, 2015 at 12:34 AM, Ricordisamoa <ricordisamoa@openmailbox.org> wrote:
Parsoid's expressiveness seems to convey useless information, overlook important details, or duplicate them in different places. If I want to resize an image, am I supposed to change "data-file-width" and "data-file-height"? "width" and "height"? Or "src"?
These are great points, and reports from folks like you will help to improve our documentation. My goal for Parsoid's DOM[1] is that every bit of information from the wikitext is represented exactly *once* in the result.
Be it so!
In your example, `data-file-width` and `data-file-height` represent the *unscaled* size of the *source* image. Many image scaling operations want to know this, so we include it in the DOM. It is ignored when you convert back to wikitext.
The `width` and `height` attributes are what you should modify if you want to resize an image, just like you would do for any naive html editor.
AFAICS there's still no way to know exactly how an image's size was specified in the original wikitext.
The `src` attribute is again mostly ignored (sigh); the 'resource' attribute specifies the url of the unscaled image. Of course if 'resource' is missing we'll try to make do with `src`; we really try hard to do something reasonable with whatever we're given. --scott
[1] There is a tension between "don't repeat yourself" and the use of Parsoid DOM for read views. Certain attributes (like "alt" and "title") get duplicated by default by the PHP parser. So far I think we've been mostly successful in not letting this sort of thing infect the Parsoid DOM, but there may be corner cases we accommodate for the sake of ease-of-use for viewers.
On 23 July 2015 at 22:34, Ricordisamoa ricordisamoa@openmailbox.org wrote:
On 24/07/2015 06:35, C. Scott Ananian wrote:
I don't know about edits tagged as "VisualEditor". That seems like that should only be done by VE.
All edits made via visualeditoredit < https://www.mediawiki.org/w/api.php?action=help&modules=visualeditoredit... are tagged.
Yes. That's because that is the *private* API for VisualEditor. It absolutely should not ever be used by anyone else. It's not like any of the 'real' APIs in MediaWiki – it is designed for exactly one use case (VisualEditor), makes huge assumptions about the world and what is needed (like tagging edits), and we make breaking changes all the time. Unfortunately, the request to badge internal APIs got turned into flagging it and similar APIs in MediaWiki as "This module is internal or unstable.", which isn't strong enough on just how bad an idea it is to use it. I would extremely strongly suggest that you do not use it, ever.
As Marko, Subbu and Scott point out, we have actual public APIs for this kind of stuff, in the forms of RESTBase and Parsoid, and that's what you should use.
Yours,
As a proof of concept, I started to build a `mwparserfromhell`-like interface to the Parsoid DOM.
You can see it at https://gerrit.wikimedia.org/r/226734
I started by translating the template examples from the mwparserfromhell documentation, which means I'm really jumping in at the deep end. Most non-template manipulations should be much easier! --scott
On 24/07/2015 17:18, James Forrester wrote:
On 23 July 2015 at 22:34, Ricordisamoa ricordisamoa@openmailbox.org wrote:
On 24/07/2015 06:35, C. Scott Ananian wrote:
I don't know about edits tagged as "VisualEditor". That seems like that should only be done by VE.
All edits made via visualeditoredit < https://www.mediawiki.org/w/api.php?action=help&modules=visualeditoredit... are tagged.
Yes. That's because that is the *private* API for VisualEditor. It absolutely should not ever be used by anyone else. It's not like any of the 'real' APIs in MediaWiki – it is designed for exactly one use case (VisualEditor), makes huge assumptions about the world and what is needed (like tagging edits), and we make breaking changes all the time. Unfortunately, the request to badge internal APIs got turned into flagging it and similar APIs in MediaWiki as "This module is internal or unstable.", which isn't strong enough on just how bad an idea it is to use it. I would extremely strongly suggest that you do not use it, ever.
Oops. https://test.wikipedia.org/w/index.php?title=Tablez&action=history
As Marko, Subbu and Scott point out, we have actual public APIs for this kind of stuff, in the forms of RESTBase and Parsoid, and that's what you should use.
Yours,
Stephen Niedzielski: "it seems like, as soon as you get the HTML the first thing you want to do, perhaps a little bit ironically because it's called Parsoid, it's parse the output a little bit more" https://www.youtube.com/watch?v=3WJID_WC7BQ&t=35m14s
On 23/07/2015 22:02, C. Scott Ananian wrote:
HTML5+RDFa is a machine-readable format. But I think what you are asking for is either better documentation of the template-related stuff (did you read through the slides in https://phabricator.wikimedia.org/T105175 ?) or HTML template parameter support (https://phabricator.wikimedia.org/T52587) which is in the codebase but not enabled by default in production. --scott
And as I responded there, if I gave you a JSON string instead, the first thing you'd need to do is parse the JSON to turn it into something you can use.
The difference is that JSON and HTML5 parsers are standard components in every programming language. HTML5 even has a standard object representation and manipulation library (DOM), which is available in every major programming language. --scott
On Sep 17, 2015 8:45 PM, "Ricordisamoa" ricordisamoa@openmailbox.org wrote:
Stephen Niedzielski: "it seems like, as soon as you get the HTML the first thing you want to do, perhaps a little bit ironically because it's called Parsoid, it's parse the output a little bit more" https://www.youtube.com/watch?v=3WJID_WC7BQ&t=35m14s
On 23/07/2015 22:02, C. Scott Ananian wrote:
HTML5+RDFa is a machine-readable format. But I think what you are asking for is either better documentation of the template-related stuff (did you read through the slides in https://phabricator.wikimedia.org/T105175 ?) or HTML template parameter support (https://phabricator.wikimedia.org/T52587) which is in the codebase but not enabled by default in production. --scott
On 09/17/2015 07:44 PM, Ricordisamoa wrote:
Stephen Niedzielski: "it seems like, as soon as you get the HTML the first thing you want to do, perhaps a little bit ironically because it's called Parsoid, it's parse the output a little bit more" https://www.youtube.com/watch?v=3WJID_WC7BQ&t=35m14s
That is somewhat of a misunderstanding that Scott clarified.
Parsing is involved whenever you want to convert a string format to an object format.
So, unless you want to figure out a way to transfer DOM objects between server and client, you are going to continue transferring HTML strings which you then parse in the browser to rebuild the DOM representation.
Subbu.
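A tiny sketch of the point made above: whichever string format comes over the wire, the client's first step is to parse it into objects. Both payloads below are invented for illustration; the JSON shape in particular is hypothetical and not something Parsoid emits.

```python
import json
from html.parser import HTMLParser

# Two invented payloads carrying roughly the same information.
json_payload = '{"template": "foo", "params": ["bar", "baz"]}'
html_payload = '<span typeof="mw:Transclusion" about="#mwt1">foo</span>'

# JSON: string -> dict via a parser.
as_object = json.loads(json_payload)


class AttrCollector(HTMLParser):
    """Records each start tag with its attributes as a dict."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


# HTML: string -> (tag, attributes) records, also via a parser.
collector = AttrCollector()
collector.feed(html_payload)
collector.close()

print(as_object)
print(collector.elements)
```

Neither format spares the client a parsing step; the difference lies only in which standard parser is used.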
On 07/23/2015 01:07 PM, Ricordisamoa wrote:
On 23/07/2015 15:28, Antoine Musso wrote:
On 23/07/2015 08:15, Ricordisamoa wrote:
Are there any stable APIs for an application to get a parse tree in machine-readable format, manipulate it and send the result back without touching HTML? I'm sorry if this question doesn't make any sense.
You might want to explain what you are trying to do and which wall you have hit when attempting to use Parsoid :-)
For example, adding a template transclusion as a new parameter in another template. XHTML5+RDFa is the wall :-( Can't Parsoid's deserialization be caught at some point to get a higher-level structure like mwparserfromhell's (https://github.com/earwig/mwparserfromhell)?
Parsoid and mwparserfromhell have different design goals and hence do things differently.
Parsoid is meant to support HTML editing and hence provides semantic information as annotations over the HTML document. It effectively maintains a bidirectional/reversible mapping between segments of wikitext and DOM trees. You can manipulate the DOM trees and get back wikitext that represents the edited tree. As for useless information and duplicate information -- I think if you look at the Parsoid DOM spec [1], you will know what to look for and what to manipulate. The information on the DOM is meant to (a) render accurately, (b) support the various bots / clients / gadgets that look for specific kinds of information, and (c) be editable easily. If that spec has holes or needs updates or fixing, we are happy to do that. Do let us know.
mwparserfromhell is an entirely wikitext-centric library as far as I can tell. It is meant to manipulate wikitext directly. It is a neat library which provides a lot of utilities and makes it easy to do wikitext transformations. It doesn't know about or care about HTML because it doesn't need to. It also seems to effectively give you some kind of wikitext-centric AST. These are all impressions based on a quick scan of its docs -- so pardon any misunderstandings.
Parsoid does not provide you a wikitext AST directly since it doesn't construct one. All wikitext information shows up indirectly as DOM annotations (either attributes or JSON information in attributes). As Scott showed, you can still do document ("wikitext") manipulations using DOM libraries, CSS-style queries, or directly by walking the DOM. There are lots of ways you can edit mediawiki pages without knowing about wikitext and using the vast array of HTML libraries. That happens to be our tagline: "we deal with wikitext so you don't have to".
But, you are right. It can indeed seem cumbersome if you want to directly manipulate wikitext without the DOM getting in between or having to deal with DOM libraries. But that is not the use case we target. There are a vastly greater number of libraries in all kinds of languages (and developers) that know about HTML and can render, handle, and manipulate HTML easily than know how to (or want to) manipulate wikitext programmatically. Kind of the difference between the wikitext editor and the visual editor. They each have their constituencies and roles.
All that said, as Scott noted, it is possible to develop a mwparserfromhell-like layer on top of the Parsoid DOM annotations if you want a wikitext-centric view (as opposed to a DOM-centric view that most editing clients seem to want). But, since that is not a use case that we target, that hasn't been on our radar. If someone does want to take that on, and thinks it would be useful, we are happy to provide assistance. It should not be too difficult.
Does that help summarize this issue and clarify the differences and approaches of these two tools? I am "on vacation" :-) so responses will be delayed.
Subbu.
[1] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
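As a minimal sketch of what such a mwparserfromhell-like layer might look like, the wrapper below exposes a template's name and parameters and lets you add a parameter whose value is itself a template transclusion (the use case raised at the start of this thread). The `Template` class and `filter_templates` helper are hypothetical, not part of Parsoid. The code assumes the `data-mw` layout described in the DOM spec (`parts` → `template` → `target` / `params`, each value under a `wt` key), that the first part is the template itself, and that Parsoid serializes a `wt` value back as wikitext.

```python
import json
import xml.etree.ElementTree as ET


class Template:
    """Tiny, hypothetical wrapper over one mw:Transclusion element."""

    def __init__(self, element):
        self._element = element
        self._data = json.loads(element.get("data-mw"))
        # Assumption: the first part is the template (no leading text part).
        self._tpl = self._data["parts"][0]["template"]

    @property
    def name(self):
        return self._tpl["target"]["wt"].strip()

    def get(self, param):
        return self._tpl["params"][str(param)]["wt"]

    def add(self, param, wikitext):
        # Update the JSON annotation and write it back to the attribute,
        # so the change survives when the DOM is serialized again.
        self._tpl["params"][str(param)] = {"wt": wikitext}
        self._element.set("data-mw", json.dumps(self._data))


def filter_templates(root):
    """Wrap every element whose typeof token list has mw:Transclusion."""
    return [Template(el) for el in root.iter()
            if "mw:Transclusion" in (el.get("typeof") or "").split()]


# Build a small stand-in DOM instead of fetching real Parsoid output.
data_mw = {"parts": [{"template": {"target": {"wt": "foo"},
                                   "params": {"1": {"wt": "bar"}}, "i": 0}}]}
root = ET.Element("p")
span = ET.SubElement(root, "span", {"typeof": "mw:Transclusion",
                                    "about": "#mwt1",
                                    "data-mw": json.dumps(data_mw)})

tpl = filter_templates(root)[0]
tpl.add("eggs", "{{spam|ham}}")
print(tpl.name, tpl.get("eggs"))
```

The wrapper keeps the DOM as the single source of truth and only offers a wikitext-centric view over it, which is the layering suggested above.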
On 24/07/2015 15:53, Subramanya Sastry wrote:
On 07/23/2015 01:07 PM, Ricordisamoa wrote:
On 23/07/2015 15:28, Antoine Musso wrote:
On 23/07/2015 08:15, Ricordisamoa wrote:
Are there any stable APIs for an application to get a parse tree in machine-readable format, manipulate it and send the result back without touching HTML? I'm sorry if this question doesn't make any sense.
You might want to explain what you are trying to do and which wall you have hit when attempting to use Parsoid :-)
For example, adding a template transclusion as a new parameter in another template. XHTML5+RDFa is the wall :-( Can't Parsoid's deserialization be caught at some point to get a higher-level structure like mwparserfromhell's (https://github.com/earwig/mwparserfromhell)?
Parsoid and mwparserfromhell have different design goals and hence do things differently.
Parsoid is meant to support HTML editing and hence provides semantic information as annotations over the HTML document. It effectively maintains a bidirectional/reversible mapping between segments of wikitext and DOM trees. You can manipulate the DOM trees and get back wikitext that represents the edited tree. As for useless information and duplicate information -- I think if you look at the Parsoid DOM spec [1], you will know what to look for and what to manipulate. The information on the DOM is meant to (a) render accurately, (b) support the various bots / clients / gadgets that look for specific kinds of information, and (c) be editable easily. If that spec has holes or needs updates or fixing, we are happy to do that. Do let us know.
mwparserfromhell is an entirely wikitext-centric library as far as I can tell. It is meant to manipulate wikitext directly. It is a neat library which provides a lot of utilities and makes it easy to do wikitext transformations. It doesn't know about or care about HTML because it doesn't need to. It also seems to effectively give you some kind of wikitext-centric AST. These are all impressions based on a quick scan of its docs -- so pardon any misunderstandings.
Parsoid does not provide you a wikitext AST directly since it doesn't construct one. All wikitext information shows up indirectly as DOM annotations (either attributes or JSON information in attributes). As Scott showed, you can still do document ("wikitext") manipulations using DOM libraries, CSS-style queries, or directly by walking the DOM. There are lots of ways you can edit mediawiki pages without knowing about wikitext and using the vast array of HTML libraries. That happens to be our tagline: "we deal with wikitext so you don't have to".
But, you are right. It can indeed seem cumbersome if you want to directly manipulate wikitext without the DOM getting in between or having to deal with DOM libraries. But that is not the use case we target. There are a vastly greater number of libraries in all kinds of languages (and developers) that know about HTML and can render, handle, and manipulate HTML easily than know how to (or want to) manipulate wikitext programmatically. Kind of the difference between the wikitext editor and the visual editor. They each have their constituencies and roles.
All that said, as Scott noted, it is possible to develop a mwparserfromhell-like layer on top of the Parsoid DOM annotations if you want a wikitext-centric view (as opposed to a DOM-centric view that most editing clients seem to want). But, since that is not a use case that we target, that hasn't been on our radar. If someone does want to take that on, and thinks it would be useful, we are happy to provide assistance. It should not be too difficult.
Does that help summarize this issue and clarify the differences and approaches of these two tools? I am "on vacation" :-) so responses will be delayed.
Subbu.
[1] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
Hi Subbu, thank you for this thoughtful insight. HTML is not a barrier by itself. The problem seems to be that Parsoid was built primarily with VisualEditor in mind. It is not clear to me how a single DOM serving both view and edit modes can avoid redundancy. I see huge demand for alternative wikignome-style editors. The more predictable, concise and documented Parsoid's DOM is, the more users you get. I hope we can meet in the middle :-)
I agree that we have not (to date) spent a lot of time on APIs supporting direct editing of the Parsoid DOM. I tend to do things directly using the low-level DOM methods myself (and that's how I presented my Parsoid tutorial at Wikimania this year) but I can see the attractiveness of the `mwparserfromhell` API in abstracting some of the details of the representation.
Thankfully you can have it both ways! Over the past week I've cloned the `mwparserfromhell` API, built on top of the Parsoid DOM. The initial patches have been merged, but there's a little work to do to get the API docs up on docs.wikimedia.org properly. Once that's done I'll post here with pointers.
Eventually I'd like to put the pieces together and implement something like a `pywikibot` clone based on this API and using the RESTBase APIs for read/write access to the wiki. As has been mentioned, the RESTBase API for saving edits is not yet quite complete ( https://phabricator.wikimedia.org/T101501); once that is done there should be no problem connecting the dots. (In the meantime you can use the API I just implemented to reserialize the wikitext and then use the standard PHP APIs, but that's a little bit clunky.) --scott
On 31/07/2015 21:08, C. Scott Ananian wrote:
I agree that we have not (to date) spent a lot of time on APIs supporting direct editing of the Parsoid DOM. I tend to do things directly using the low-level DOM methods myself (and that's how I presented my Parsoid tutorial at Wikimania this year) but I can see the attractiveness of the `mwparserfromhell` API in abstracting some of the details of the representation.
Thankfully you can have it both ways! Over the past week I've cloned the `mwparserfromhell` API, built on top of the Parsoid DOM. The initial patches have been merged, but there's a little work to do to get the API docs up on docs.wikimedia.org properly. Once that's done I'll post here with pointers.
Thanks! Unfortunately, that still requires using Node.js and depending on the parsoid package. Were the mwparserfromhell-like 'AST' exposed by RESTBase directly, there'd easily be lots of thin manipulation libraries in different programming languages.
Clearly you're just trying to bait me into porting my code to python. I assure you there is nothing JavaScript-specific about this; there are HTML DOM-manipulation libraries available in all major programming languages. HTML *is* an AST (in this case, at least). --scott
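To illustrate the point that the Parsoid HTML can be walked like an AST from any language, here is a minimal Python sketch using only the standard library. The sample markup is hand-written in the style of the MediaWiki DOM spec (an assumption, not real Parsoid output), and `TransclusionFinder` and `find_templates` are hypothetical names, not part of any existing library.

```python
import html
import json
from html.parser import HTMLParser

# Hand-written sample in the style of the MediaWiki DOM spec (assumed shape).
_DATA_MW = {"parts": [{"template": {"target": {"wt": "echo"},
                                    "params": {"1": {"wt": "foo"}},
                                    "i": 0}}]}
SAMPLE = (
    '<p>Hello <span about="#mwt1" typeof="mw:Transclusion" data-mw="%s">foo</span></p>'
    % html.escape(json.dumps(_DATA_MW))
)


class TransclusionFinder(HTMLParser):
    """Collect (template name, {param: wikitext}) pairs from data-mw attributes."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # typeof is a whitespace-separated token list, so compare tokens.
        if "mw:Transclusion" not in (attrs.get("typeof") or "").split():
            return
        data_mw = json.loads(attrs.get("data-mw") or "{}")
        for part in data_mw.get("parts", []):
            # Parts may also be plain strings of wikitext between templates.
            if isinstance(part, dict) and "template" in part:
                tpl = part["template"]
                params = {k: v.get("wt") for k, v in tpl.get("params", {}).items()}
                self.templates.append((tpl["target"]["wt"], params))


def find_templates(html_text):
    finder = TransclusionFinder()
    finder.feed(html_text)
    finder.close()
    return finder.templates
```

Calling `find_templates(SAMPLE)` yields the template name and its raw parameters, which is roughly the kind of view `mwparserfromhell` gives over wikitext.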
I'm baiting you into exposing a mwparserfromhell-like AST from RESTBase. Then I can deal with a Python client, a PHP one, etc. :-)
Good news: https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi now documents the new friendlier API for Parsoid. --scott
This is awesome, thank you Scott
DJ
Great! As a further improvement, it should be a separate package using the service's REST API.
On 07/31/2015 12:55 PM, Ricordisamoa wrote:
Hi Subbu, thank you for this thoughtful insight.
And thank you for starting this thread. :-)
HTML is not a barrier by itself. The problem seems to be Parsoid being built primarily with VisualEditor in mind.
While we want the DOM to be VE-friendly, we definitely don't want the DOM to be VE-centric and that has been the intention from the very beginning. Flow, CX also use the Parsoid DOM for their functionality. There are other users too [1]. We definitely want Parsoid's output to be useful and usable more broadly as the canonical output representation of wikitext and are open to fixing whatever prevents that.
As Scott noted in the other email on the thread, inspired (and maybe challenged by :-) ) by mwparserfromhell's utilities, he has already whipped out a layer that provides an easier interface for manipulating the DOM.
It is not clear to me how can a single DOM serving both view and edit modes avoid redundancy.
You are right that there are some redundancies in information representation (because of having to serve multiple needs), but as far as I know, it is mostly around image attributes. If there is anything else specific (beyond image attributes) that is bothering you, can you flag that?
I see huge demand for alternative wikignome-style editors. The more Parsoid's DOM is predictable, concise and documented, the more users you get.
I think Parsoid's DOM is predictable :-) but, can you say more about what prompted you to say that? As for documentation, we document the DOM we generate and its semantics here [2]. As for size, I just looked at the Barack Obama page and here are some size numbers.
1540407 /tmp/Barack_Obama.parsoid.html
1197318 /tmp/Barack_Obama.parsoid.no-data-mw.html
1045161 /tmp/Barack_Obama.php-parser.output.footer-stripped.html
Right now, because we inline template and other editable information (as inline JSON attributes of the DOM), it is a bit bulky. However, we have always had plans to move the data-mw attribute into its own bucket, which we might do at some point, in which case the size will be closer to the current PHP parser output. If we moved page properties and other metadata out, it would shrink a little bit more.
For views that don't need to support editing or any other manipulation or analyses, we can more aggressively strip more from the HTML without affecting the rendering and get close to or even shrink the size below the PHP parser output size (there might be use cases where that might be the appropriate thing to do). I could get this down to under 1M by stripping rel attributes, element ids, and about ids for identifying template output.
But, for editing (not just in VE) use cases, because of additional markup in place on the page (element ids, other markup for transclusions, extensions, links, etc.), the output will probably be somewhat larger than the corresponding PHP parser output. If we can keep it under 1.1x of php parser output size, I think we are good.
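For reference, the arithmetic behind these size figures can be worked through in a few lines of Python. The byte counts are the three quoted in this message; the dictionary keys and the helper name `ratio_to_php` are only illustrative.

```python
# Byte counts quoted in the message for the Barack Obama page.
SIZES = {
    "parsoid": 1540407,             # full Parsoid HTML
    "parsoid_no_data_mw": 1197318,  # Parsoid HTML with data-mw stripped
    "php_parser": 1045161,          # PHP parser output, footer stripped
}


def ratio_to_php(name):
    """Size of the named output relative to the PHP parser output."""
    return SIZES[name] / SIZES["php_parser"]


# Fraction of the full Parsoid output taken up by data-mw attributes.
data_mw_share = (SIZES["parsoid"] - SIZES["parsoid_no_data_mw"]) / SIZES["parsoid"]

# The "under 1.1x of PHP parser output" target, expressed in bytes.
target_bytes = 1.1 * SIZES["php_parser"]

print(f"full Parsoid output: {ratio_to_php('parsoid'):.2f}x")
print(f"without data-mw:     {ratio_to_php('parsoid_no_data_mw'):.2f}x")
print(f"data-mw share:       {data_mw_share:.1%}")
```

On these numbers the full output is about 1.47x the PHP parser output, data-mw accounts for roughly 22% of it, and the data-mw-free output (about 1.15x) is still slightly above the 1.1x target.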
I hope we can meet in the middle :-)
Please file bugs and continue to report things that get in the way of using Parsoid.
Subbu.
[1] https://www.mediawiki.org/wiki/Parsoid/Users
[2] http://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
On 01/08/2015 01:20, Subramanya Sastry wrote:
On 07/31/2015 12:55 PM, Ricordisamoa wrote:
Hi Subbu, thank you for this thoughtful insight.
And thank you for starting this thread. :-)
HTML is not a barrier by itself. The problem seems to be Parsoid being built primarily with VisualEditor in mind.
While we want the DOM to be VE-friendly, we definitely don't want the DOM to be VE-centric and that has been the intention from the very beginning. Flow, CX also use the Parsoid DOM for their functionality. There are other users too [1].
VE, Flow, CX all take advantage of HTML. And I can't make any sense out of editProtectedHelper.js https://en.wikipedia.org/wiki/User:Jackmcbarn/editProtectedHelper.js :'(
We definitely want Parsoid's output to be useful and usable more broadly as the canonical output representation of wikitext and are open to fixing whatever prevents that.
As Scott noted in the other email on the thread, inspired (and maybe challenged by :-) ) by mwparserfromhell's utilities, he has already whipped out a layer that provides an easier interface for manipulating the DOM.
It is not clear to me how can a single DOM serving both view and edit modes avoid redundancy.
You are right that there are some redundancies in information representation (because of having to serve multiple needs), but as far as I know, it is mostly around image attributes. If there is anything else specific (beyond image attributes) that is bothering you, can you flag that?
https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Transclusion_conte... All template parameters are in data-mw but not parsed. Parameters ending up in the 'final' wikitext are parsed separately.
I see huge demand for alternative wikignome-style editors. The more Parsoid's DOM is predictable, concise and documented, the more users you get.
I think Parsoid's DOM is predictable :-) but, can you say more about what prompted you to say that?
For example, to find images I have to search elements where typeof is one of mw:Image, mw:Image/Thumb, mw:Image/Frame, mw:Image/Frameless, then see if it's a figure or a span, and expect either a <figcaption> or data-mw accordingly. Add that the img tag's parent can be <a> or <span>... Instead, this is what I'd expect a proper structure to look like:
Image
  .src = title, internal or external link?
  .repository?
  .page = number or null
  .language = code or null
  .format = thumb etc.
  .caption = wikitext parsed recursively
  .link = internal or external link or null
  .size
    .original
      .width = 1234
      .height = 4321
    .specified
      .width = 2468
    .computed
      .width = 2468
      .height = 8642
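As a rough illustration only, the structure wished for above could be written down as Python dataclasses. The class names (`Dimensions`, `ImageSize`, `Image`) and the field types are assumptions made for this sketch; nothing like this is exposed by Parsoid or RESTBase.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dimensions:
    width: Optional[int] = None
    height: Optional[int] = None


@dataclass
class ImageSize:
    original: Dimensions = field(default_factory=Dimensions)   # size of the file itself
    specified: Dimensions = field(default_factory=Dimensions)  # size given in the wikitext
    computed: Dimensions = field(default_factory=Dimensions)   # size actually rendered


@dataclass
class Image:
    src: str                        # title, internal or external link
    page: Optional[int] = None      # page number, or None
    language: Optional[str] = None  # language code, or None
    format: Optional[str] = None    # "thumb" etc.
    caption: Optional[str] = None   # would hold the recursively parsed caption
    link: Optional[str] = None      # internal or external link, or None
    size: ImageSize = field(default_factory=ImageSize)


# The numbers from the message: only a width was specified in the wikitext.
example = Image(
    src="File:Example.jpg",
    format="thumb",
    size=ImageSize(
        original=Dimensions(1234, 4321),
        specified=Dimensions(width=2468),
        computed=Dimensions(2468, 8642),
    ),
)
```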
As for documentation, we document the DOM we generate and its semantics here [2].
It seems that some sections need updates, e.g. noinclude / includeonly / onlyinclude https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#noinclude_.2F_includeonly_.2F_onlyinclude
As for size, I just looked at the Barack Obama page and here are some size numbers.
By "concise" I meant an antonym for redundant, not lengthy :-)
For views that don't need to support editing or any other manipulation or analyses, we can more aggressively strip more from the HTML without affecting the rendering
Stripping HTML altogether would be a huge step forward. :-)
and get close to or even shrink the size below the PHP parser output size (there might be use cases where that might be appropriate thing to do). I could get this down to under 1M by stripping rel attributes, element ids, and about ids for identifying template output.
On Sat, Aug 1, 2015 at 3:39 AM, Ricordisamoa ricordisamoa@openmailbox.org wrote:
You are right that there are some redundancies in information representation (because of having to serve multiple needs), but as far as I know, it is mostly around image attributes. If there is anything else specific (beyond image attributes) that is bothering you, can you flag that?
https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Transclusion_conte... All template parameters are in data-mw but not parsed. Parameters ending up in the 'final' wikitext are parsed separately.
Parsed template parameters require the `addHTMLTemplateParameters` parameter to Parsoid. We are actively discussing how to expose this sort of functionality via the Parsoid API. But it's not strictly required: you can just recursively invoke Parsoid on the template arguments. I'll try to whip up an example of this soon.
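A rough sketch of that idea, not the promised example: given one template entry from a `data-mw` attribute, hand each parameter's wikitext to a caller-supplied parser. The entry shape (`params` mapping names to `{"wt": ...}`) follows the MediaWiki DOM spec as far as I understand it, and `parse_wikitext` is a hypothetical callback standing in for whatever actually invokes Parsoid; no real endpoint is called here.

```python
from typing import Callable, Dict


def parse_template_params(template: dict,
                          parse_wikitext: Callable[[str], str]) -> Dict[str, str]:
    """Return {parameter name: parsed result} for one data-mw template entry.

    `parse_wikitext` is supplied by the caller (hypothetical); in practice it
    would send the wikitext to Parsoid and return the resulting HTML.
    """
    return {
        name: parse_wikitext(value.get("wt", ""))
        for name, value in template.get("params", {}).items()
    }


def fake_parse(wikitext: str) -> str:
    """Stand-in parser for demonstration only: wraps the text in a paragraph."""
    return "<p>%s</p>" % wikitext


template = {
    "target": {"wt": "Infobox"},
    "params": {"name": {"wt": "'''Foo'''"}, "1": {"wt": "{{bar}}"}},
}
parsed = parse_template_params(template, fake_parse)
```

If the HTML returned for a parameter itself contains transclusions, the same function can be applied again to their `data-mw` entries, which is where the recursion comes in.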
I see huge demand for alternative wikignome-style editors. The more Parsoid's DOM is predictable, concise and documented, the more users you get.
I think Parsoid's DOM is predictable :-) but, can you say more about what prompted you to say that?
For example, to find images I have to search elements where typeof is one of mw:Image, mw:Image/Thumb, mw:Image/Frame, mw:Image/Frameless, then see if it's a figure or a span, and expect either a <figcaption> or data-mw accordingly. Add that the img tag's parent can be <a> or <span>... Instead, this is what I'd expect a proper structure to look like:
The CSS selector `figure, [typeof~="mw:Image"]` will capture all of the image elements. Similarly, `figure > *:last-child, [typeof~="mw:Image"] > *:last-child` will always capture the caption element (more or less). The structure is actually pretty locked down. (And my mwparserfromhell clone has some image-related helpers to make it even easier.)
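For readers without a CSS selector engine at hand, the first selector can be emulated with Python's standard library. This is a minimal sketch: the sample markup is hand-written (an assumption about what the image markup looks like, not real Parsoid output) and `ImageFinder` and `find_images` are hypothetical names.

```python
from html.parser import HTMLParser

# Hand-written sample: one block image, one inline image, one non-image element.
SAMPLE = (
    '<figure typeof="mw:Image/Thumb"><a href="./File:A.jpg"><img src="a.jpg"></a>'
    '<figcaption>A caption</figcaption></figure>'
    '<p><span typeof="mw:Image"><a href="./File:B.jpg"><img src="b.jpg"></a></span></p>'
    '<p><span typeof="mw:Transclusion">not an image</span></p>'
)


class ImageFinder(HTMLParser):
    """Emulate the selector `figure, [typeof~="mw:Image"]`."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # `~=` matches one whole token of a whitespace-separated attribute value.
        tokens = (dict(attrs).get("typeof") or "").split()
        if tag == "figure" or "mw:Image" in tokens:
            self.matches.append((tag, " ".join(tokens)))


def find_images(html_text):
    finder = ImageFinder()
    finder.feed(html_text)
    finder.close()
    return finder.matches
```

Note that `~=` compares whole tokens, so `mw:Image/Thumb` is picked up here through the `figure` half of the selector rather than through the `typeof` half.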
Part of the problem here is that media-related markup in wikitext is quite fiendishly complicated, with lots of interlocking parts. The presence of one sort of option can completely change the meaning of others. The Parsoid DOM is designed to try to simplify this complexity, rather than directly mirror the wikitext craziness. --scott