On 11/06/2015 10:18 AM, James Montalvo wrote:
Can Parsoid be used to convert arbitrary HTML to wikitext? It's not clear
to me whether it will only work with Parsoid's HTML+RDFa. I'm wondering if
I could take snippets of HTML from non-MediaWiki webpages and convert them
into wikitext.
The right answer is: "It depends" :-)
As Eric responded in his reply, Parsoid does convert some kinds of
arbitrary HTML to clean wikitext. See some additional examples at the
end of this email.
However, if you really threw arbitrary HTML at it (e.g. <em>..</em> or
<strong>..</strong>), Parsoid wouldn't know that it could potentially use
'' or ''' for those tags. Or, if you gave it input with all kinds of CSS
and other inlined attributes, you won't necessarily get the best
wikitext from it.
Similarly, if you tried to convert HTML that you got from, say, Google
Docs, OpenOffice, Word, or other HTML-generation tools, the wikitext you
get may not be very pretty.
We do want to keep improving Parsoid's abilities here, but it has not
been a high priority for us, given that we are always playing catch-up
with all the other things we need to get done. It would be a great GSoC
or volunteer project if someone wants to play with this and improve
this feature.
That said, if your HTML isn't really arbitrary, you can get some
reasonable-looking wikitext out of it even without the markers. Things
like images, templates, and extensions, however, obviously require the
additional attributes for Parsoid to generate canonical wikitext for them.
Hope this helps.
Subbu.
-------------------------------------------------------------------------------------------
Some html -> wt examples:
[subbu@earth bin] echo "<h2>foo</h2><p>a</p><p>b</p>" | node parse --html2wt
== foo ==
a
b
[subbu@earth bin] echo "<a href='http://en.wikipedia.org/wiki/Hampi'>Hampi</a>" | node parse --html2wt
[[Hampi]]
[subbu@earth bin] echo "<a href='http://it.wikipedia.org/wiki/Luna'>Luna</a>" | node parse --html2wt
[[:it:Luna|Luna]]
[subbu@earth bin] echo "<a href='http://it.wikipedia.org/wiki/Luna'>Luna</a>" | node parse --html2wt --prefix itwiki
[[Luna]]
[subbu@earth bin] echo "<ul><li>a</li><li>b</li><li>c</li></ul>" | node parse --html2wt
* a
* b
* c
[subbu@earth bin] echo "<em>foo</em>" | node parse --html2wt
<em>foo</em>
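
One possible workaround for the <em>/<strong> case is to normalize the
HTML before handing it to Parsoid. The sketch below is not part of
Parsoid: normalize_inline_tags is a hypothetical helper that rewrites
<em> to <i> and <strong> to <b>, on the assumption (which you should
verify against your Parsoid version) that <i> and <b> are serialized
as '' and '''. It only handles lowercase tag names and is a plain text
substitution, not a real HTML parser.

```shell
# Hypothetical helper, not shipped with Parsoid: rewrite <em>/<strong>
# (opening and closing tags, with or without attributes) to <i>/<b>.
# Assumption: Parsoid serializes <i>/<b> to ''/''' in html2wt mode.
normalize_inline_tags() {
  sed -E \
    -e 's#<(/?)em(>|[[:space:]][^>]*>)#<\1i\2#g' \
    -e 's#<(/?)strong(>|[[:space:]][^>]*>)#<\1b\2#g'
}

# Example: show the normalized HTML on its own.
echo "<em>foo</em> and <strong>bar</strong>" | normalize_inline_tags
```

To use it with the examples above, you would put it in the pipeline,
e.g. echo "<em>foo</em>" | normalize_inline_tags | node parse --html2wt.
Tags that merely start with the same letters (such as <embed>) are left
alone, because the pattern requires ">" or whitespace right after the
tag name.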