Hi Parsoid Support Team,
I am reaching out to learn about the usage of this tool. We have a very old version (1.17.5) of MediaWiki in our organization and want to convert its pages to HTML and store them on disk for archiving. As you know, MediaWiki internally stores pages as wikitext.
Can Parsoid (https://www.mediawiki.org/wiki/Parsoid) help us here? I also saw the documentation of the VisualEditor extension (https://www.mediawiki.org/wiki/VisualEditor), which uses Parsoid internally to convert wikitext pages. Which of these two tools should we use for the job? Can you please suggest? Can Parsoid be used as a standalone application or tool instead of VE?
If we use either of them, do we just need to provide the URL of our MediaWiki page (for example, https://<our_dns_host>/wiki/TestPage), or do we need to extract the wikitext content from the DB and feed it to Parsoid to convert it to an HTML page?
Thanks
On Feb 6, 2020, at 4:35 AM, Ayaskant Swain ayaskant.swain@gmail.com wrote:
Hi Parsoid Support Team,
I am reaching out to learn about the usage of this tool. We have a very old version (1.17.5) of MediaWiki in our organization and want to convert its pages to HTML and store them on disk for archiving. As you know, MediaWiki internally stores pages as wikitext.
Can Parsoid (https://www.mediawiki.org/wiki/Parsoid) help us here?
Maybe? It's very likely that Parsoid will have some compatibility issues that you'll need to hack around.
I also saw the documentation of the VisualEditor extension (https://www.mediawiki.org/wiki/VisualEditor), which uses Parsoid internally to convert wikitext pages. Which of these two tools should we use for the job? Can you please suggest?
Parsoid is not included in VE; it just offers an API for VE to query. Adding VE to the mix is an unnecessary complication.
Can Parsoid be used as a standalone application or tool instead of VE?
Yes
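For illustration, a minimal Python sketch of fetching one page's HTML from a standalone Parsoid service. This assumes the Node.js Parsoid service of that era, already configured against your wiki and listening on its default port 8000; the host, domain string, and page title are placeholders, not details from this thread:

    # Sketch: fetch Parsoid HTML for one page from a standalone Parsoid service.
    # Assumptions: a Parsoid (Node.js) service runs on localhost:8000 and its
    # config points at your wiki under the domain name "our_dns_host".
    import urllib.parse
    import requests

    PARSOID_BASE = "http://localhost:8000"   # default Parsoid service port
    DOMAIN = "our_dns_host"                   # as configured in Parsoid's config
    TITLE = "TestPage"                        # placeholder page title

    url = "{}/{}/v3/page/html/{}".format(
        PARSOID_BASE, DOMAIN, urllib.parse.quote(TITLE, safe="")
    )
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()

    with open(TITLE + ".html", "w", encoding="utf-8") as f:
        f.write(resp.text)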
If we use either of them, do we just need to provide the URL of our MediaWiki page (for example, https://<our_dns_host>/wiki/TestPage), or do we need to extract the wikitext content from the DB and feed it to Parsoid to convert it to an HTML page?
Parsoid has traditionally interacted with MediaWiki's action API (the thing at /api.php). You would not need to do any manual extraction.
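As a rough illustration of what "interacting with the action API" looks like, here is a small Python sketch that pulls a page's wikitext through api.php; the endpoint URL and page title are placeholders for your own install:

    # Sketch: fetch a page's current wikitext through MediaWiki's action API.
    # The api.php URL and page title are placeholders for your own wiki.
    import requests

    API_URL = "https://our_dns_host/w/api.php"   # path may differ on your install

    resp = requests.get(
        API_URL,
        params={
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": "TestPage",
            "format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()

    for page in resp.json()["query"]["pages"].values():
        wikitext = page["revisions"][0]["*"]     # old-style JSON: content under "*"
        print(wikitext[:200])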
There seems to be an active project similar to what you're describing at https://github.com/openzim/mwoffliner
However, it might be less complicated to just use the parser that ships with the MediaWiki version you're running. In other words, screen scrape the pages MediaWiki is already serving you.
Thanks
Thanks, Arlo, for replying.
Can you please give me a reference link to the native MediaWiki parser that you have suggested? A native parser would be the easiest way to meet our need. We want to convert the pages of our MediaWiki (1.17.5) to either PDF or HTML pages. All the attachments (images) and comments should also come as part of the output file.
Thanks Ayaskant
On Feb 8, 2020, at 1:07 AM, Ayaskant Swain ayaskant.swain@gmail.com wrote:
Thanks, Arlo, for replying.
Can you please give me a reference link to the native MediaWiki parser that you have suggested? A native parser would be the easiest way to meet our need. We want to convert the pages of our MediaWiki (1.17.5) to either PDF or HTML pages. All the attachments (images) and comments should also come as part of the output file.
When you visit https://<host>/wiki/TestPage, MediaWiki has already parsed the content to HTML for you.
I was suggesting you scrape those pages using wget, Scrapy, HTTrack, or some other tool.
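For example, a minimal Python sketch of saving the already-rendered pages to disk; the host and titles are placeholders, and a real job would also need the images, CSS, and JavaScript those pages reference, which is what wget's or HTTrack's recursive modes are for:

    # Sketch: save the HTML that MediaWiki is already serving for a few pages.
    # The base URL and titles are placeholders; this grabs only the HTML, not
    # the images, CSS, or JavaScript the pages reference.
    import pathlib
    import urllib.parse
    import requests

    BASE_URL = "https://our_dns_host/wiki/"
    TITLES = ["TestPage", "Main_Page"]           # placeholder page titles

    out_dir = pathlib.Path("archive")
    out_dir.mkdir(exist_ok=True)

    for title in TITLES:
        resp = requests.get(BASE_URL + urllib.parse.quote(title), timeout=30)
        resp.raise_for_status()
        out_file = out_dir / (title.replace("/", "_") + ".html")
        out_file.write_text(resp.text, encoding="utf-8")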
It's also possible this extension works for you: https://www.mediawiki.org/wiki/Extension:DumpHTML
Maybe it's obvious, but remember first of all to make an XML dump. Everything else can be regenerated from it.
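If you cannot run maintenance scripts on the server (the usual route for a full dump is the maintenance/dumpBackup.php script), a hedged Python sketch of exporting a handful of pages as XML via Special:Export might look like this; the host, titles, and parameter choices are assumptions, not details from this thread:

    # Sketch: export a few pages as XML via Special:Export over HTTP.
    # Assumptions: the wiki exposes Special:Export at the usual path and the
    # listed titles exist; a full-site dump is normally made server-side with
    # maintenance/dumpBackup.php instead.
    import pathlib
    import requests

    EXPORT_URL = "https://our_dns_host/wiki/Special:Export"
    TITLES = ["TestPage", "Main_Page"]           # placeholder page titles

    resp = requests.post(
        EXPORT_URL,
        data={"pages": "\n".join(TITLES), "curonly": "1"},  # current revisions only
        timeout=60,
    )
    resp.raise_for_status()
    pathlib.Path("pages.xml").write_bytes(resp.content)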
Arlo Breault, 08/02/20 19:13:
I was suggesting you scrape those pages using wget, Scrapy, HTTrack, or some other tool.
It's also possible this extension works for you: https://www.mediawiki.org/wiki/Extension:DumpHTML
The main issue with archiving the "usual" HTML is that it's hard to tell whether you're including the resources you actually need, for instance CSS for templates; see https://phabricator.wikimedia.org/T50295 and https://phabricator.wikimedia.org/T40259
I don't recommend using typical scrapers for MediaWiki; it's a can of worms. If you want something that simple, you can get the HTML from the API: https://www.mediawiki.org/wiki/API:Parsing_wikitext#API_documentation
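Concretely, a small Python sketch of pulling the parsed HTML through the action API (action=parse), with the endpoint URL and page title as placeholders:

    # Sketch: get a page's rendered HTML from the action API (action=parse),
    # per https://www.mediawiki.org/wiki/API:Parsing_wikitext.
    # The api.php URL and page title are placeholders for your own wiki.
    import requests

    API_URL = "https://our_dns_host/w/api.php"

    resp = requests.get(
        API_URL,
        params={"action": "parse", "page": "TestPage", "format": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    html = resp.json()["parse"]["text"]["*"]     # old-style JSON: HTML under "*"

    with open("TestPage.html", "w", encoding="utf-8") as f:
        f.write(html)

Note that this returns only the page body, so the point above about missing CSS and other resources still applies.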
Depending on your installation, using dumpHTML might actually be easier. It was a pain for Wikimedia wikis, but mostly because they're huge and very complicated, plus Kiwix had to import the XML first.
The only advantage of using wget is that you can generate a WARC file with it. WARC can be fed into a WARC proxy, which could then serve your website statically: at the moment this is the closest we have to a general-purpose static website generator from any CMS. However, you'll still need to teach wget (or whatever you're using) how to handle recursion. https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
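As one possible starting point, a hedged sketch of driving wget from Python so it both mirrors a page and writes a WARC; the host, recursion depth, and output name are assumptions, and you should expect to add accept/reject rules so it doesn't crawl edit and history URLs:

    # Sketch: call wget so it mirrors a page (with its requisites) and also
    # records everything into a WARC file. Host, depth, and file names are
    # placeholders; tune the recursion rules for your wiki's URL layout.
    import subprocess

    subprocess.run(
        [
            "wget",
            "--recursive",
            "--level=2",                 # placeholder depth
            "--page-requisites",         # also fetch images, CSS, JS
            "--convert-links",           # rewrite links for offline browsing
            "--no-parent",
            "--warc-file=wiki-archive",  # wget writes wiki-archive.warc.gz
            "https://our_dns_host/wiki/TestPage",
        ],
        check=True,
    )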
Federico