Re: [Wikitext-l] Parsoid template expansion

20 Mar 2015

Hi Dmitrijs,

We are currently waiting for hardware to be allocated. We hope to have a
first set of dumps 1-2 weeks from now, with the intention to provide dumps
at regular intervals. See https://phabricator.wikimedia.org/T17017 and
dependencies for the progress on this.

We are also considering which distribution format to use for the HTML
dumps. One option is a lzma-compressed sqlite database. Please weigh in on
this at https://phabricator.wikimedia.org/T93396.
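To make the lzma-compressed sqlite option concrete, here is a minimal sketch in Python using only the standard library. The table layout (`pages` with `title` and `html` columns) and the function names are assumptions for illustration, not a proposed dump schema.

```python
import lzma
import os
import shutil
import sqlite3
import tempfile


def write_compressed_dump(pages, xz_path):
    """Store (title, html) pairs in SQLite, then xz-compress the database file."""
    fd, db_path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    try:
        conn = sqlite3.connect(db_path)
        # Hypothetical schema: one row per page.
        conn.execute("CREATE TABLE pages (title TEXT PRIMARY KEY, html TEXT)")
        conn.executemany("INSERT INTO pages VALUES (?, ?)", pages)
        conn.commit()
        conn.close()
        with open(db_path, "rb") as src, lzma.open(xz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    finally:
        os.remove(db_path)


def read_page(xz_path, title):
    """Decompress the dump to a temporary file and look up one page by title."""
    fd, db_path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    try:
        with lzma.open(xz_path, "rb") as src, open(db_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        conn = sqlite3.connect(db_path)
        row = conn.execute(
            "SELECT html FROM pages WHERE title = ?", (title,)
        ).fetchone()
        conn.close()
        return row[0] if row else None
    finally:
        os.remove(db_path)
```

One trade-off this makes visible: the whole file has to be decompressed before SQLite can query it, so random access only pays off after a one-time unpack.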

Thanks,

Gabriel

On Mon, Mar 16, 2015 at 3:29 AM, Dmitrijs Milajevs <dimazest(a)gmail.com>
wrote:

...
  Hi,

 Is there any progress regarding HTML dumps?

 I'm not interested in HTML dumps as such, but I believe that HTML is a much
 nicer way of getting the raw text of articles out of a wiki dump. See this
 proof of concept [1].

 However, what I believe would be very useful for the scientific community
 are syntactically parsed dumps of Wikipedia. Right now everyone uses
 different pipelines to parse Wikipedia, which are often undocumented,
 outdated and unreproducible.

 At IWCS we are running a two-day hackathon [2] and I think that one useful
 project would be to come up with a documented and easily reproducible way
 of getting parsed versions of Wikipedia dumps. I've started some notes as
 part of NLTK corpus readers [3], but this might grow into a separate
 project.

 So, I see an easily deployable pipeline of:

   enwiki.bz2 -> raw_text.bz2 -> parsed_text.bz2

 as a perfect project for the hackathon. Ideally, this should be picked up
 by someone to produce regular dumps (but I don't know who will be willing
 to invest computational resources).
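A minimal sketch of how such a pipeline could be chained in Python, where each stage streams one bz2 file into the next. The two transform functions are placeholders assumed for illustration: `strip_markup` stands in for the real text extraction and `toy_parse` for a real syntactic parser.

```python
import bz2


def run_stage(src_path, dst_path, transform):
    """Stream one bz2 file into another, applying `transform` to each line."""
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
         bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            dst.write(transform(line.rstrip("\n")) + "\n")


def strip_markup(line):
    # Placeholder for the real wikitext/HTML-to-text step:
    # only removes bold and italic quote markers.
    return line.replace("'''", "").replace("''", "")


def toy_parse(line):
    # Placeholder for a real syntactic parser: whitespace tokenization
    # with token indices.
    return " ".join(f"{i}:{tok}" for i, tok in enumerate(line.split()))


def run_pipeline(wiki_path, raw_path, parsed_path):
    """enwiki.bz2 -> raw_text.bz2 -> parsed_text.bz2, one stage at a time."""
    run_stage(wiki_path, raw_path, strip_markup)
    run_stage(raw_path, parsed_path, toy_parse)
```

Keeping every intermediate file on disk means each stage can be rerun or swapped out independently, which helps with reproducibility.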

 Do you have any ideas/suggestions that I should take care of?

 In case you are in London on April 11-12 you are welcome to take part in
 the hackathon.

 [1] http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/ti…
 [2] http://iwcs2015.github.io/hackathon.html
 [3] http://iwcs2015.github.io/hackathon.html#nltk-corpus-readers

 --
 Dima

 On Sun, Feb 15, 2015 at 8:36 AM, Nitin Gupta <nitingupta910(a)gmail.com>
 wrote:

 On Sat, Feb 14, 2015 at 12:06 PM, Emmanuel Engelhart <kelson(a)kiwix.org>
 wrote:

  On 14.02.2015 20:52, Nitin Gupta wrote:

         I hope HTML would be made available with the same frequency as XML
         (wikitext) dumps; it would save me yet another attempt to make
         a wikitext parser. Thanks.

         For any API points you provide, it would be helpful if you could
         also mention expected maximum load from a client (req/s), so client
         writers can throttle accordingly.
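For illustration, client-side throttling against a published req/s limit could look like the following minimal Python sketch. The `Throttle` class and its parameters are hypothetical; no actual limit for any Wikimedia API is assumed here.

```python
import time


class Throttle:
    """Space calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock    # injectable so the class can be tested
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

A client would call `throttle.wait()` immediately before each request; with `Throttle(2)` that keeps it at or below two requests per second.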

     Kiwix already publishes full HTML snapshots packed in ZIM files
     (snapshots with and without pictures). We publish monthly updates
     for most Wikimedia projects and are working to do it for all the
     projects:
     http://www.kiwix.org

 I somehow missed the Kiwix project, and an HTML dump is all I'm interested
 in (text only for now, since images can have copyright issues).
 Surprisingly, I could not find a link to a Kiwix ZIM dump without images,
 assuming the default offered for download has thumbnails.

 Have a look at the "all_nopic" links:
 http://www.kiwix.org/wiki/Wikipedia_in_all_languages

 The latest all_nopic dump for English Wikipedia I can see is from
 2014-01.  Anyway, as Gabriel mentioned, it looks like Wikimedia is going
 to generate and provide regularly updated HTML dumps for various projects
 directly -- hopefully sometime soon, so maybe that can then be used as the
 gold source.

     The solution is coded in Node.js and uses the Parsoid API:
     https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/

     We face recurring stability problems (with the 'http' module) which
     are impairing the rollout for all projects. If you are a Node.js
     expert your help is really welcome.

 I'm no HTTP expert but I see that you are downloading full article
 content from the Parsoid API. Have you considered the approach of just
 downloading the entire XML dump and then extracting articles out of
 that? You would still need to download images and do template expansion
 over HTTP, but it still saves a lot. I have used this approach here:

 Parsing wiki code is a nightmare (if you want to reach MediaWiki quality
 of output & maintain that code base). It's far easier to write a scraper
 based on the Parsoid API.

  Yes, it's a nightmare to parse wikitext markup but in this case, the
 frontend (wikipedia.go) is not parsing wikitext at all. The XML dump simply
 encapsulates wikitext in well-structured XML. So, all the frontend is
 doing is extracting the wikitext from XML and passing it to the backend
 (server.js -- running locally) service, which uses the Parsoid module to
 parse this wikitext to HTML.
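The extraction half of that approach can be sketched in a few lines of Python (the frontend mentioned above is written in Go; this is only an illustration, not a port of wikipedia.go). It assumes the usual export layout of `<page>` elements containing `<title>` and `<revision><text>`, and leaves the handoff to the backend out, since its interface is not described here.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_file):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `xml_file` is a filename or a binary file object, for example
    bz2.open(path, "rb") for a compressed dump.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page; dumps are large
```

No wikitext is interpreted at this stage; each yielded string would be passed unchanged to whatever service converts wikitext to HTML.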

 Thanks,
 Nitin

 _______________________________________________
 Wikitext-l mailing list
 Wikitext-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitext-l

