On 14.02.2015 20:52, Nitin Gupta wrote:
I hope HTML would be made available with the same frequency as the XML
(wikitext) dumps; it would save me yet another attempt at writing a
wikitext parser. Thanks.
For any API endpoints you provide, it would be helpful if you could also
state the expected maximum load from a client (req/s), so client writers
can throttle accordingly.
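A minimal client-side throttle along these lines might look as follows. This is only a sketch: the 5 req/s figure and the function names are illustrative, not a published limit or any existing API.

```javascript
// Sketch of a request throttle: space outgoing requests so they never
// exceed a given rate. Returns a function that, for a given timestamp,
// reports how many milliseconds the caller should wait before firing
// the next request.
function makeThrottle(reqPerSec) {
  const minGapMs = 1000 / reqPerSec;
  let nextSlot = 0; // earliest time (ms) the next request may go out
  return function reserve(now) {
    const wait = Math.max(0, nextSlot - now);
    nextSlot = Math.max(now, nextSlot) + minGapMs;
    return wait;
  };
}

// Illustrative use: at a (hypothetical) 5 req/s limit, back-to-back
// calls are spaced 200 ms apart.
const reserve = makeThrottle(5);
```

A real client would `setTimeout` for the returned delay before each request.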
Kiwix already publishes full HTML snapshots packed in ZIM files
(snapshots with and without pictures). We publish monthly updates for
most Wikimedia projects and are working to do it for all of them:
http://www.kiwix.org
I somehow missed the Kiwix project, and an HTML dump is all I'm interested
in (text only for now, since images can have copyright issues).
Surprisingly, I could not find a link to a Kiwix ZIM dump without images;
I assume the default offered for download has thumbnails.
Have a look at the "all_nopic" links:
http://www.kiwix.org/wiki/Wikipedia_in_all_languages
The solution is coded in Node.js and uses the Parsoid API:
https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
We face recurring stability problems (with the 'http' module) which are
impairing the rollout to all projects. If you are a Node.js expert,
your help is very welcome.
I'm no HTTP expert, but I see that you are downloading the full article
content from the Parsoid API. Have you considered the approach of just
downloading the entire XML dump and then extracting articles out of
that? You would still need to download images and do template expansion
over HTTP, but it still saves a lot. I have used this approach here:
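The dump-first approach amounts to pulling titles and wikitext bodies out of the pages-articles XML. A toy sketch, assuming the standard `<page>/<title>/<text>` dump layout; a real pipeline would stream the multi-gigabyte dump through a SAX parser rather than regex-matching an in-memory string as done here for brevity:

```javascript
// Extract { title, text } pairs from a pages-articles XML fragment.
function extractPages(xml) {
  const pages = [];
  const pageRe = /<page>([\s\S]*?)<\/page>/g;
  let m;
  while ((m = pageRe.exec(xml)) !== null) {
    const body = m[1];
    const title = (body.match(/<title>([\s\S]*?)<\/title>/) || [])[1];
    const text = (body.match(/<text[^>]*>([\s\S]*?)<\/text>/) || [])[1];
    if (title !== undefined) pages.push({ title, text: text || '' });
  }
  return pages;
}
```

Each extracted `text` is still raw wikitext, so templates and images would need the HTTP round-trips mentioned above.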
Parsing wiki code is a nightmare (if you want to reach MediaWiki's
quality of output and maintain that code base). It's far easier to
write a scraper based on the Parsoid API.
Emmanuel
--
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication