Hello all, I have been working on this project for the past few months: http://code.google.com/p/offline-wikipedia/. I also presented a talk on it at freed.in 09.
My aims with this project are:
- To create a DVD distribution of the English Wikipedia that is good enough to be listed at http://download.wikimedia.org/dvd.html.
- To make it easy to install and usable straight from the DVD.
The target audience is:
- Those who don't have Internet access.
- Those who want to access Wikipedia content irrespective of their Internet connection.
- Those who use the existing proprietary encyclopedias available on the market.
Present status:
- Apart from the source code hosted at Google, the whole setup is also available at http://92.243.5.147/offline-wiki. There are two parts: for the complete English Wikipedia you have to get blocks.tgz and offline-wikipedia.tgz, and there are instructions in the README file available there. There is also a small prototype, sample.tar.bz2, in case one wants to check the quality of the work.
- I am following the approach taken by http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html, with some differences: I am using Python to convert wiki-text to HTML and Django for the server.
- As of now I work from the XML dumps provided by Wikimedia; last year's October dump was 4.1G. I have CSV files of about 300M to locate articles inside those dumps, plus a small Django configuration to access the articles and convert them to HTML. All of this fits on a DVD.
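To make the lookup step concrete, here is a rough sketch of how a CSV index can lead from a title to the wiki-text inside one compressed block. It is not the project's actual code: the two-column CSV layout (title, block name), the block names, and the function names load_index and extract_wikitext are all assumptions for illustration. It also ignores the case where an article straddles two blocks.

```python
import bz2
import csv
import io
from xml.sax.saxutils import escape, unescape


def load_index(csv_text):
    """Parse index rows of the assumed form: article title, block name."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def extract_wikitext(block_bytes, title):
    """Return the raw wiki-text of `title` from one compressed block, or None."""
    xml = bz2.decompress(block_bytes).decode("utf-8", errors="replace")
    # Titles are XML-escaped in the dump, e.g. "&" is stored as "&amp;".
    start = xml.find("<title>%s</title>" % escape(title))
    if start == -1:
        return None
    # A block is only a fragment of the dump, so plain string search is used
    # instead of an XML parser, which would reject the incomplete document.
    tag = xml.find("<text", start)
    if tag == -1:
        return None
    body_start = xml.find(">", tag)
    body_end = xml.find("</text>", tag)
    if body_start == -1 or body_end == -1 or body_end < body_start:
        return None
    return unescape(xml[body_start + 1:body_end])
```

The server would then look up the block name in the index, read that one file, and pass the returned wiki-text to the wiki-text-to-HTML converter.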
Issues at hand:
- My Python parser for creating HTML out of wiki-text is not perfect. I could replace it with something better that already exists, but I have yet to find that.
- To access articles faster I am splitting the single bz2 file using bzip2recover, which gives me 20k-odd files for the English content. I am trying to avoid having that many files without compromising the speed of browsing the articles.
- The Django server could be replaced with something lighter and simpler, provided it does not bring in dependency cycles or make the setup harder to access, use, or install.
- The March 09 English content is 4.6G, which makes things tighter.
- It is text content only, excluding multimedia and pictures (which are an important part and can't be neglected).
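One possible way around the 20k files, sketched here only as an idea and not something the project does today: append all the recovered blocks to a single container file and remember each block's byte offset and length. Reading an article then costs one seek plus the decompression of one block, the same as with separate files. The names pack_blocks and read_block are hypothetical.

```python
import bz2


def pack_blocks(blocks, out):
    """Append each compressed block to `out`; return a list of (offset, length)."""
    table = []
    for data in blocks:
        table.append((out.tell(), len(data)))
        out.write(data)
    return table


def read_block(container, offset, length):
    """Seek to one block inside the container and decompress only that block."""
    container.seek(offset)
    return bz2.decompress(container.read(length))
```

Both functions take binary file objects, so `out` would be a file opened with "wb" and `container` one opened with "rb". The offset and length could be stored in the existing CSV index in place of a block file name.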
Target:
- To make it updatable.
- To make it editable.
- To manage different categories of articles and segregate content by category, making it a more refined and better education/learning tool.
There are other issues too. I know of other attempts, such as wiki-taxi and wikipedia-dumpreader, and I am trying to patch things together to get better results. I don't know why the existing parsers/attempts never made it onto that DVD distribution list. I am working on improving the project; in the meantime, any suggestions, feedback, or contributions are most welcome.