[Foundation-l] Project to create offline-wikipedia DVD distribution

shantanu choudhary shantanu at sarai.net
Fri Mar 20 10:21:02 UTC 2009


Hello all,
I have been working on this project for the past few months:
http://code.google.com/p/offline-wikipedia/. I also presented a talk about
it at freed.in 09.

My aims with this project are:

   - To create a DVD distribution of the English Wikipedia that meets the
   standards needed to be listed at http://download.wikimedia.org/dvd.html.
   - To make it easy to install and usable straight from the DVD.

The target audience is:

   - Those who don't have Internet access.
   - Those who want to access Wikipedia content regardless of their Internet
   connection.
   - Those who use existing proprietary encyclopedias available in the market.

Present status:

   - Apart from the source code hosted at Google, the whole setup is also
   available at http://92.243.5.147/offline-wiki. There are two parts: for
   the complete English Wikipedia you have to get blocks.tgz and
   offline-wikipedia.tgz, and there are instructions in the README file
   available there. There is also a small prototype, sample.tar.bz2, in case
   one wants to check the quality of the work.
   - I am following the approach taken by
   http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html,
   but with some differences: I am using Python to convert wiki-text to HTML
   and Django for the server.
   - As of now I have the XML dumps provided by Wikimedia (last October's
   dump was 4.1G), CSV files of about 300M to locate articles inside those
   dumps, and a small Django configuration to access the articles and convert
   them to HTML. All of this fits on a DVD.
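
A minimal sketch of the lookup described in the last point, not the
project's actual code. It assumes a hypothetical two-column CSV index
(article title, block file name) and bz2 block files that each hold whole
<page> elements; the names find_block, extract_page and read_article are
made up for illustration. Pages that straddle a block boundary, and titles
containing XML-escaped characters, are not handled here.

```python
import bz2
import csv
import os


def find_block(index_path, title):
    """Return the block file name recorded for `title`, or None."""
    # Assumed index layout: one "title,block_file" row per article.
    with open(index_path, newline="", encoding="utf-8") as f:
        for row_title, block_name in csv.reader(f):
            if row_title == title:
                return block_name
    return None


def extract_page(xml_text, title):
    """Return the <page>...</page> fragment whose <title> matches, or None."""
    marker = "<title>%s</title>" % title
    pos = xml_text.find(marker)
    if pos == -1:
        return None
    start = xml_text.rfind("<page>", 0, pos)
    end = xml_text.find("</page>", pos)
    if start == -1 or end == -1:
        return None  # the page is cut off by a block boundary
    return xml_text[start:end + len("</page>")]


def read_article(blocks_dir, index_path, title):
    """Look up `title` in the index, decompress its block, return the page."""
    block_name = find_block(index_path, title)
    if block_name is None:
        return None
    block_path = os.path.join(blocks_dir, block_name)
    with bz2.open(block_path, "rt", encoding="utf-8") as f:
        return extract_page(f.read(), title)
```

Only the one block that holds the article is decompressed, which is why a
per-article index keeps browsing fast. A linear scan of a ~300M CSV would
be slow in practice, so a sorted index or a small database would probably
be needed for real use.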

Issues at hand:

   - My Python parser to create HTML out of wiki-text is not perfect. I could
   replace it with a better existing one, but I have yet to find one.
   - To access articles faster I am splitting the single bz2 file using
   bzip2recover, which gives me 20k-odd files for the English content. I am
   trying to avoid having that many files without compromising the speed of
   browsing the articles.
   - The Django server could be replaced with something lighter and simpler,
   provided it does not bring in dependency cycles that make the setup hard
   to access, use or install.
   - The March 09 English content is 4.6G, which makes things even tighter.
   - It is text content only, excluding multimedia and pictures (which are
   an important part and can't be neglected).
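
One possible way around the 20k-files issue, sketched under assumptions
rather than taken from the project: instead of keeping the pieces from
bzip2recover, recompress the text in chunks, append every chunk to a single
archive as its own bz2 stream, and record each stream's byte offset and
length. The names pack_chunks and read_chunk are hypothetical, and the
offsets would have to be stored alongside the title index.

```python
import bz2


def pack_chunks(chunks, archive_path):
    """Write each chunk as an independent bz2 stream into one file.

    Returns a list of (offset, length) pairs, one per chunk, giving the
    position of its compressed stream inside the archive.
    """
    offsets = []
    with open(archive_path, "wb") as out:
        for chunk in chunks:
            data = bz2.compress(chunk)
            offsets.append((out.tell(), len(data)))
            out.write(data)
    return offsets


def read_chunk(archive_path, offset, length):
    """Seek to one stream and decompress only that stream."""
    with open(archive_path, "rb") as f:
        f.seek(offset)
        return bz2.decompress(f.read(length))
```

Random access then costs one seek plus the decompression of a single chunk,
and the DVD holds one large file instead of thousands of small ones. The
trade-off is that the dump has to be recompressed once, and smaller chunks
compress slightly worse than one continuous stream.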

Target:

   - To make it updatable.
   - To make it editable.
   - To manage different categories of articles and segregate content based
   on them, to make a more refined and better education/learning tool.

There are other issues too. I know of other attempts like wiki-taxi and
wikipedia-dumpreader, and I am trying to patch things together to get better
results, but I don't know why those existing parsers/attempts never made it
to that DVD distribution list. I am working to improve this project; in the
meantime, any suggestions, feedback or contributions are most welcome.

-- 
Regards
Shantanu Choudhary

