On Sat, Feb 14, 2015 at 12:06 PM, Emmanuel Engelhart <kelson@kiwix.org> wrote:
On 14.02.2015 20:52, Nitin Gupta wrote:
I hope HTML would be made available with the same frequency as the XML
(wikitext) dumps; it would save me yet another attempt to write a
wikitext parser. Thanks.

For any API endpoints you provide, it would be helpful if you could also
mention the expected maximum load from a client (req/s), so client
writers can throttle accordingly.
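(For illustration only, a minimal client-side throttle in Node.js could look like the sketch below; the limit and URL are made up, the real numbers would come from whatever the API documents.)

  var http = require('http');

  var MAX_REQ_PER_SEC = 2;                // assumed limit, not a documented one
  var interval = 1000 / MAX_REQ_PER_SEC;
  var queue = [];
  var timer = null;

  // Queue requests and drain the queue at a fixed rate.
  function throttledGet(target, onResponse) {
    queue.push({ target: target, onResponse: onResponse });
    if (!timer) {
      timer = setInterval(function () {
        var job = queue.shift();
        if (!job) { clearInterval(timer); timer = null; return; }
        http.get(job.target, job.onResponse).on('error', function (err) {
          console.error('request failed:', err.message);
        });
      }, interval);
    }
  }

  // throttledGet('http://example.org/api/some/page', function (res) { /* read res */ });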
Kiwix already publishes full HTML snapshots packed in ZIM files
(snapshots with and without pictures). We publish monthly updates
for most Wikimedia projects and are working to do it for all the
projects:
http://www.kiwix.org
I somehow missed the Kiwix project, and an HTML dump is all I'm interested
in (text only for now, since images can have copyright issues).
Surprisingly, I could not find a link to a Kiwix ZIM dump without images;
I assume the default offered for download has thumbnails.
Have a look at the "all_nopic" links:
http://www.kiwix.org/wiki/Wikipedia_in_all_languages
The latest all_nopic dump for English Wikipedia I can see is from 2014-01.
Anyway, as Gabriel mentioned, it looks like Wikimedia is going to generate
and provide regularly updated HTML dumps for the various projects directly --
hopefully sometime soon, so maybe that can then be used as the gold source.

The solution is coded in Node.js and uses the Parsoid API:
https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
We face recurring stability problems (with the 'http' module) which
are impairing the rollout for all projects. If you are a Node.js
expert, your help is really welcome.
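(As an aside, without knowing the exact failure mode: a common mitigation for flaky bulk downloads with the core 'http' module is a keep-alive agent plus a small retry wrapper. A rough sketch, purely illustrative and not a patch against mwoffliner:)

  var http = require('http');
  var urlmod = require('url');

  // keepAlive on the core Agent needs a reasonably recent Node; drop it otherwise.
  var agent = new http.Agent({ keepAlive: true, maxSockets: 10 });

  function getWithRetry(target, retries, callback) {
    var opts = urlmod.parse(target);
    opts.agent = agent;
    http.get(opts, function (res) {
      var chunks = [];
      res.on('data', function (c) { chunks.push(c); });
      res.on('end', function () { callback(null, Buffer.concat(chunks)); });
    }).on('error', function (err) {
      if (retries > 0) {
        // back off briefly and retry the same URL
        setTimeout(function () { getWithRetry(target, retries - 1, callback); }, 1000);
      } else {
        callback(err);
      }
    });
  }

  // getWithRetry('http://localhost:8000/some/article/url', 3, function (err, body) { /* ... */ });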
I'm no http expert, but I see that you are downloading full article
content from the Parsoid API. Have you considered the approach of just
downloading the entire XML dump and then extracting articles out of
that? You would still need to download images and do template expansion
over http, but it still saves a lot. I have used this approach here:
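(A rough sketch of that extract-from-the-dump idea, not the code linked above: it streams a pages-articles dump with the 'sax' npm module and pulls out title + wikitext per page; element names are assumed from the standard dump schema, and the file name is made up. The extracted wikitext could then be handed to a local Parsoid service for HTML.)

  var fs = require('fs');
  var sax = require('sax');            // npm install sax

  var parser = sax.createStream(true); // strict mode keeps tag names as-is
  var page = null;                     // page currently being assembled
  var field = null;                    // 'title' or 'text' while inside that element

  parser.on('opentag', function (node) {
    if (node.name === 'page') page = { title: '', text: '' };
    if (page && (node.name === 'title' || node.name === 'text')) field = node.name;
  });

  parser.on('text', function (t) {
    if (page && field) page[field] += t;
  });

  parser.on('closetag', function (name) {
    if (name === 'title' || name === 'text') field = null;
    if (name === 'page' && page) {
      handlePage(page.title, page.text);
      page = null;
    }
  });

  // Here the wikitext could be posted to a local Parsoid service instead of just counted.
  function handlePage(title, wikitext) {
    console.log(title + ': ' + wikitext.length + ' bytes of wikitext');
  }

  // Assumes an already-decompressed pages-articles dump on disk.
  fs.createReadStream('enwiki-latest-pages-articles.xml').pipe(parser);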
Parsing wiki code is a nightmare (if you want to reach MediaWiki's quality
of output & maintain that code base). It's far easier to write a scraper
based on the Parsoid API.

Yes, it's a nightmare to parse wikitext markup, but in this case the
frontend (wikipedia.go) is not parsing wikitext at all. The XML dump simply
encapsulates wikitext in well-structured XML. So all the frontend is doing
is extracting the wikitext from the XML and passing it to the backend
service (server.js -- running locally), which uses the Parsoid module to
parse this wikitext to HTML.

Thanks,
Nitin
_______________________________________________
Wikitext-l mailing list
Wikitext-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitext-l