Can Parsoid be used to convert arbitrary HTML to wikitext? It's not clear to me whether it will only work with Parsoid's HTML+RDFa. I'm wondering if I could take snippets of HTML from non-MediaWiki webpages and convert them into wikitext.
Thanks, James
On Fri, Nov 6, 2015 at 10:18 AM, James Montalvo jamesmontalvo3@gmail.com wrote:
Can Parsoid be used to convert arbitrary HTML to wikitext? It's not clear to me whether it will only work with Parsoid's HTML+RDFa. I'm wondering if I could take snippets of HTML from non-MediaWiki webpages and convert them into wikitext.
That is possible, yes. For example (via RESTBase):
curl -X POST --header "Content-Type: application/x-www-form-urlencoded" --header "Accept: text/plain; profile="mediawiki.org/specs/wikitext/1.0.0"" -d 'html=<h1>Heading</h1><p>Hello world</p>' " https://en.wikipedia.org/api/rest_v1/transform/html/to/wikitext"
Thanks for the quick response. Is there a simple way to do this without RESTBase?
On 11/06/2015 10:18 AM, James Montalvo wrote:
Can Parsoid be used to convert arbitrary HTML to wikitext? It's not clear to me whether it will only work with Parsoid's HTML+RDFa. I'm wondering if I could take snippets of HTML from non-MediaWiki webpages and convert them into wikitext.
The right answer is: "It depends" :-)
As Eric noted in his reply, Parsoid does convert some kinds of arbitrary HTML to clean wikitext. See some additional examples at the end of this email.
However, if you really threw arbitrary HTML at it (e.g. <em>..</em> or <strong>..</strong>), Parsoid wouldn't know that it could potentially use '' or ''' for those tags. Or, if you gave it input with all kinds of CSS and other inline attributes, you wouldn't necessarily get the best wikitext from it.
So, if you tried to convert HTML that you got from, say, Google Docs, OpenOffice, Word, or other HTML-generating tools, the wikitext you get may not be very pretty.
We do want to keep improving Parsoid's abilities here, but it has not been a high priority for us. It would be a great GSoC or volunteer project if someone wants to play with this and improve the feature, since we are always playing catch-up with everything else we need to get done.
On the other hand, if the HTML isn't truly arbitrary, you can get reasonable-looking wikitext out of it even without the markers. Things like images, templates, and extensions, though, obviously require the additional attributes for Parsoid to generate canonical wikitext for them.
Hope this helps.
Subbu.
-------------------------------------------------------------------------------------------
Some html -> wt examples:
[subbu@earth bin] echo "<h2>foo</h2><p>a</p><p>b</p>" | node parse --html2wt == foo == a
b [subbu@earth bin] echo "<a href='http://en.wikipedia.org/wiki/Hampi'>Hampi</a>" | node parse --html2wt [[Hampi]]
[subbu@earth bin] echo "<a href='http://it.wikipedia.org/wiki/Luna'>Luna</a>" | node parse --html2wt [[:it:Luna|Luna]]
[subbu@earth bin] echo "<a href='http://it.wikipedia.org/wiki/Luna'>Luna</a>" | node parse --html2wt --prefix itwiki [[Luna]]
[subbu@earth bin] echo "<ul><li>a</li><li>b</li><li>c</li></ul>" | node parse --html2wt * a * b * c
[subbu@earth bin] echo <em>foo</em>" | node parse --html2wt <em>foo</em>
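To make the remark about "markers" above more concrete: Parsoid's own HTML marks up a transclusion with RDFa-style attributes, roughly like the fragment below (the attribute layout is recalled from the Parsoid MediaWiki DOM spec, so treat it as a sketch and check the spec for the exact shape):

<!-- attribute names and the data-mw layout are recalled from the Parsoid DOM spec, not copied from it -->
<p about="#mwt1" typeof="mw:Transclusion"
   data-mw='{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"hi"}},"i":0}}]}'>hi</p>

With the typeof and data-mw annotations present, the serializer can reconstruct something like {{echo|hi}}; strip them, and the best it can do is emit the plain paragraph text.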
To add to what Eric & Subbu have said, here is a link to the API documentation for this endpoint:
https://en.wikipedia.org/api/rest_v1/?doc#!/Transforms/post_transform_html_t...
Thanks for the responses. I do want to convert HTML that cannot be assumed to be clean, so it sounds like Parsoid will not solve the problem for now.
--James
On 11/06/2015 11:15 AM, James Montalvo wrote:
Thanks for the responses. I do want to convert HTML that cannot be assumed to be clean, so it sounds like Parsoid will not solve the problem for now.
If you give us a sample of the kind of HTML you are looking at, we can see what kind of wikitext comes out and whether there are simple tweaks that can fix any problems.
You can also try it yourself at http://parsoid-lb.eqiad.wikimedia.org/_html/
Note that this public access point to Parsoid will not be around much longer.
Subbu.
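For completeness, since the question of doing this without RESTBase came up earlier: the command-line examples in Subbu's message run straight out of a Parsoid checkout. A rough sketch, with the repository URL and script path as remembered from the 2015 layout (verify against the current Parsoid docs before copying):

# clone and set up Parsoid (URL and path are assumptions; check the Parsoid docs)
git clone https://github.com/wikimedia/parsoid
cd parsoid
npm install
# convert an HTML fragment to wikitext with the bundled CLI, as in Subbu's examples
echo "<p>Hello <b>world</b></p>" | node bin/parse --html2wt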