Hello all,
I have been working on this project for the past few months: http://code.google.com/p/offline-wikipedia/. I also presented a talk related to it at freed.in '09.
My aims with this project are:
- To create a DVD distribution of the English Wikipedia that is up to the standard needed to make it onto http://download.wikimedia.org/dvd.html.
- To make it easy to install and usable straight from the DVD.
The target audience is:
- Those who don't have Internet access.
- Those who want access to Wikipedia content irrespective of an Internet connection.
- Those who use the proprietary encyclopedias currently available in the market.
Present status:
- Apart from the source code hosted on Google Code, the whole setup is also available at http://92.243.5.147/offline-wiki.
There are two parts: for the complete English Wikipedia you have to get
blocks.tgz and offline-wikipedia.tgz; instructions are in the README
file available there. There is also a small prototype,
sample.tar.bz2, in case one wants to check the quality of the work.
- I am following the approach taken by http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html, with some differences: I am using Python to convert wiki-text to HTML and Django for the server.
- As of now, with the XML dumps provided by MediaWiki, last year's
October dump was 4.1G. I have CSV files of roughly 300M to locate articles inside those
dumps, plus a small Django configuration to access the articles and
convert them to HTML, and all of this fits on a DVD. (A rough sketch of the lookup flow follows this list.)
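Roughly, the article lookup works along these lines. This is only a simplified sketch, not the exact code from the repository; the file names and the CSV column layout here are placeholders:

    import bz2
    import csv
    import os

    # Illustrative only -- the index name and its "title,block_file"
    # layout are placeholders, not the exact format used in the repo.
    def find_block(title, index_path="articles.csv"):
        # In the real setup the index would be loaded once,
        # not scanned per request.
        with open(index_path) as f:
            for row in csv.reader(f):
                if row and row[0] == title:
                    return row[1]
        return None

    def load_article(title, blocks_dir="blocks"):
        block = find_block(title)
        if block is None:
            return None
        # Each piece produced by bz2recover is a valid bz2 stream of
        # its own, so it can be decompressed without touching the
        # full dump.
        f = bz2.BZ2File(os.path.join(blocks_dir, block))
        try:
            data = f.read().decode("utf-8")
        finally:
            f.close()
        # 'data' is a chunk of the dump's XML; the article's wiki-text
        # still has to be cut out of it before conversion (omitted here).
        return data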
Issues at hand:
- My Python parser that creates HTML out of
wiki-text is not perfect. I could replace it with something better that
already exists, but I am yet to find it. (A toy example of what the parser has to do is sketched after this list.)
- To access
articles faster I am splitting the single bz2 dump using bz2recover, which gives
me some 20k files for the English content. I am trying to avoid having that many
files without compromising the speed of browsing the articles.
- We could replace the Django server with something lighter and simpler,
provided the replacement doesn't bring in its own dependency chain and
make the setup hard to access/use/install. (A minimal standard-library sketch also follows this list.)
- The March '09 English content is 4.6G, making things even tighter.
- It is text-only content, excluding multimedia and pictures (which are an important part and can't be neglected).
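To give an idea of what the parser has to deal with, here is a toy sketch of the kind of conversion involved. It handles only a couple of constructs; templates, tables and references are what make a real parser hard, and this is not the code from the repository:

    import re

    def wikitext_to_html(text):
        # '''bold''' and ''italic'' (bold must be handled first)
        text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
        text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)

        # [[Page]] and [[Page|label]] internal links
        def link(m):
            target, label = m.group(1), m.group(2) or m.group(1)
            return '<a href="/wiki/%s">%s</a>' % (target.replace(" ", "_"), label)
        text = re.sub(r"\[\[([^\]|]+)\|?([^\]]*)\]\]", link, text)

        # == Heading == (one level only)
        text = re.sub(r"(?m)^==\s*(.+?)\s*==$", r"<h2>\1</h2>", text)
        return text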
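And on the lighter-server point, something along the lines of Python's built-in wsgiref is what I have in mind. A rough sketch, reusing the load_article() and wikitext_to_html() helpers sketched above and assuming a /wiki/Article_title URL scheme:

    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        title = path.split("/wiki/", 1)[-1].replace("_", " ")
        wikitext = load_article(title)      # from the lookup sketch above
        if wikitext is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Article not found"]
        html = wikitext_to_html(wikitext)   # from the parser sketch above
        start_response("200 OK",
                       [("Content-Type", "text/html; charset=utf-8")])
        return [html.encode("utf-8")]

    if __name__ == "__main__":
        # No dependencies beyond the standard library.
        make_server("127.0.0.1", 8000, app).serve_forever()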
Target:
There are other issues too. I know of other attempts like
WikiTaxi and wikipedia-dumpreader, and I am trying to patch things together to get
better results. But I don't know why those existing parsers/attempts never
made it onto that DVD distribution list. I am working to improve this; in the
meanwhile, any suggestions, feedback, and contributions are most welcome.
--
Regards
Shantanu Choudhary