Maybe it's obvious, but remember first of all to make an XML dump.
Everything else can be regenerated from it.
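If you have shell access to the server, the dump can be made with MediaWiki's standard dumpBackup.php maintenance script; a minimal sketch (paths and options vary by installation):

```shell
# Run from the MediaWiki installation directory on the wiki server.
# --full exports every revision of every page; use --current for
# only the latest revision of each page.
php maintenance/dumpBackup.php --full --quiet > wiki-dump.xml
```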
Arlo Breault, 08/02/20 19:13:
> I was suggesting you scrape those pages using
> wget, Scrapy, HTTrack, or some other tool.
> It's also possible this extension works for you:
> https://www.mediawiki.org/wiki/Extension:DumpHTML
The main issue with archiving the "usual" HTML is that it's hard to tell
whether you're including the resources you actually need, for instance
CSS for templates.
https://phabricator.wikimedia.org/T50295
https://phabricator.wikimedia.org/T40259
I don't recommend using typical scrapers for MediaWiki; they're a can of
worms. If you want something simple, you can get the HTML from the API:
https://www.mediawiki.org/wiki/API:Parsing_wikitext#API_documentation
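As a sketch of that API route, in Python with only the standard library (the wiki URL and page title below are placeholders, not from this thread):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def parse_url(api_base, page):
    """Build an action=parse request URL that asks the wiki for the
    rendered HTML of one page, as JSON."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "text",
        "format": "json",
        # formatversion=2 returns parse.text as a plain string
        # instead of a {"*": html} wrapper.
        "formatversion": "2",
    }
    return api_base + "?" + urlencode(params)

def fetch_page_html(api_base, page):
    """Fetch and return the rendered HTML for a single page."""
    with urlopen(parse_url(api_base, page)) as resp:
        data = json.load(resp)
    return data["parse"]["text"]

# Example (placeholder URL, not fetched here):
# html = fetch_page_html("https://wiki.example.org/w/api.php", "Main Page")
```

Loop that over the page list from Special:AllPages (or the allpages API module) and you have every rendered page, though you still have to fetch CSS, images and the rest separately.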
Depending on your installation, using DumpHTML might actually be easier.
It was a pain for Wikimedia's wikis, but mostly because they're huge and
very complicated, and because Kiwix had to import the XML first.
The only advantage of using wget is that it can generate a WARC file.
The WARC can be fed into a warc-proxy, which can then serve your
website statically: at the moment this is the closest thing we have to a
general-purpose static-site generator for any CMS. However, you'll
still need to teach wget (or whatever you're using) how to handle recursion.
https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
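A hedged sketch of such a wget run (placeholder URL; the reject pattern is one guess at taming MediaWiki's action/diff/history links, not a tested recipe for every wiki):

```shell
# Mirror the wiki while writing a WARC on the side.
# --mirror turns on recursion with timestamping; --page-requisites
# also pulls CSS, images and scripts; --warc-file records everything
# fetched into mywiki.warc.gz.
wget --mirror --page-requisites --convert-links \
     --warc-file=mywiki \
     --reject-regex '[?&](action|diff|oldid|printable)=' \
     --wait=1 \
     https://wiki.example.org/
```

Without a reject rule like that, recursion tends to wander into every edit form and old revision, which is exactly the can of worms mentioned above.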
Federico