Thanks for the responses. I do want to convert HTML that cannot be assumed
to be clean, so it sounds like Parsoid will not solve the problem for now.
--James
On Fri, Nov 6, 2015 at 11:06 AM, Gabriel Wicke <gwicke(a)wikimedia.org> wrote:
To add to what Eric & Subbu have said, here is a link to the API
documentation for this end point:
https://en.wikipedia.org/api/rest_v1/?doc#!/Transforms/post_transform_html_…
On Fri, Nov 6, 2015 at 8:47 AM, Subramanya Sastry <ssastry(a)wikimedia.org> wrote:
On 11/06/2015 10:18 AM, James Montalvo wrote:
> Can Parsoid be used to convert arbitrary HTML to wikitext? It's not clear
> to me whether it will only work with Parsoid's HTML+RDFa. I'm wondering if
> I could take snippets of HTML from non-MediaWiki webpages and convert them
> into wikitext.
The right answer is: "It depends" :-)
As Eric responded in his reply, Parsoid does convert some kinds of
arbitrary HTML to clean wikitext. See some additional examples at the end
of this email.
However, if you really threw arbitrary HTML at it (e.g. <em>..</em> or
<strong>..</strong>), Parsoid wouldn't know that it could potentially use ''
or ''' for those tags. Or, if you gave it input with all kinds of CSS and
other inlined attributes, you wouldn't necessarily get the best wikitext
from it.
Similarly, if you tried to convert HTML that you got from, say, Google Docs,
OpenOffice, Word, or other HTML-generation tools, the wikitext you get may
not be very pretty.
We do want to keep improving Parsoid's abilities to get there, but it has
not been a high priority for us, given that we are always playing catch-up
with all the other things we need to get done. It would be a great GSoC or
volunteer project if someone wants to play with this and improve this
feature.
If your HTML isn't really arbitrary, though, you can get some
reasonable-looking wikitext out of it even without the markers. Things like
images, templates, and extensions obviously require the additional
attributes for Parsoid to generate canonical wikitext for them.
Hope this helps.
Subbu.
-------------------------------------------------------------------------------------------
Some html -> wt examples:
[subbu@earth bin] echo "<h2>foo</h2><p>a</p><p>b</p>" | node parse --html2wt
== foo ==
a
b

[subbu@earth bin] echo "<a href='http://en.wikipedia.org/wiki/Hampi'>Hampi</a>" | node parse --html2wt
[[Hampi]]

[subbu@earth bin] echo "<a href='http://it.wikipedia.org/wiki/Luna'>Luna</a>" | node parse --html2wt
[[:it:Luna|Luna]]

[subbu@earth bin] echo "<a href='http://it.wikipedia.org/wiki/Luna'>Luna</a>" | node parse --html2wt --prefix itwiki
[[Luna]]

[subbu@earth bin] echo "<ul><li>a</li><li>b</li><li>c</li></ul>" | node parse --html2wt
* a
* b
* c

[subbu@earth bin] echo "<em>foo</em>" | node parse --html2wt
<em>foo</em>
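
As the last example shows, <em> comes back unchanged. One possible
workaround, sketched here under assumptions, is to rewrite <em>/<strong>
to <i>/<b> before piping the HTML into Parsoid. This assumes Parsoid
serializes <i> and <b> as '' and ''', which this thread does not confirm.
normalize_inline_tags is a hypothetical helper name, and the sed patterns
only match tags that carry no attributes.

```shell
# Hypothetical helper (not part of Parsoid): rewrite <em>/<strong> tags
# to <i>/<b>. Only matches opening and closing tags without attributes.
normalize_inline_tags() {
  sed -e 's#<\(/\{0,1\}\)em>#<\1i>#g' \
      -e 's#<\(/\{0,1\}\)strong>#<\1b>#g'
}

# Possible usage, assuming the same "node parse" script as in the
# examples above:
#   echo "<em>foo</em>" | normalize_inline_tags | node parse --html2wt
```

A real HTML parser would be a safer choice than sed for messy input; the
pipeline form is used here only to match the examples above.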
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation