[Foundation-l] Project to create offline-wikipedia DVD distribution
shantanu choudhary
shantanu at sarai.net
Fri Mar 20 10:21:02 UTC 2009
Hello all,
I have been working on this project for the past few months:
http://code.google.com/p/offline-wikipedia/. I also presented a talk on it
at freed.in 09.
My aims with this project are:
- To create a DVD distribution of the English Wikipedia that meets the
standard of the distributions listed at http://download.wikimedia.org/dvd.html.
- To make it easy to install and usable straight from the DVD.
Target audience:
- Those who don't have Internet access.
- Those who want access to Wikipedia content regardless of their Internet
connection.
- Those who use the existing proprietary encyclopedias available in the market.
Present status:
- Apart from the source code hosted at Google, the whole setup is also
available at http://92.243.5.147/offline-wiki. There are two parts: for the
complete English Wikipedia you have to get blocks.tgz and
offline-wikipedia.tgz, and there are instructions in the README file
available there. There is also a small prototype, sample.tar.bz2, in case
one wants to check the quality of the work.
- I am following the approach taken by
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html,
but with some differences: I am using Python to convert wiki-text to HTML
and Django for the server.
- As of now I have the XML dump provided by media-wiki (last year's October
dump was 4.1G), csv files of ~300M to locate articles inside that dump, and
a small Django configuration to access the articles and convert them to
HTML. All of this fits onto a DVD.
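
The lookup described in the last item could work roughly as follows. This is
only a minimal sketch, not the project's actual code: the two-column csv
layout (article title, block file name), the function names load_index and
fetch_wikitext, and the block file names are all assumptions made for
illustration. It is written for Python 3.

```python
import bz2
import csv
import os
import re


def load_index(csv_path):
    """Read a title -> block-file mapping from a two-column csv (assumed layout)."""
    index = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for title, block_name in csv.reader(handle):
            index[title] = block_name
    return index


def fetch_wikitext(title, index, blocks_dir):
    """Decompress only the block that holds `title` and return its wiki-text.

    Returns None when the title is not in the index or not in the block.
    A real dump XML-escapes titles and text, and an article may straddle two
    blocks; both cases are ignored here to keep the sketch short.
    """
    block_name = index.get(title)
    if block_name is None:
        return None
    block_path = os.path.join(blocks_dir, block_name)
    with bz2.open(block_path, "rt", encoding="utf-8") as handle:
        chunk = handle.read()
    # Find the page whose <title> matches, then capture its <text> body.
    pattern = re.compile(
        r"<title>" + re.escape(title) + r"</title>.*?<text[^>]*>(.*?)</text>",
        re.DOTALL,
    )
    match = pattern.search(chunk)
    return match.group(1) if match else None
```

Only one small block is decompressed per request, which is what keeps
browsing fast; the price is the large number of block files.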
Issues at hand:
- My Python parser that creates HTML out of wiki-text is not perfect. I can
replace it with an existing, better one, but I have yet to find
one.
- To access articles faster I am breaking the single bz2 file into blocks
using bzip2recover, which gives me 20k-odd files for the English content. I
am trying to avoid having that many files without compromising the speed of
browsing the articles.
- We could replace the Django server with something lighter and simpler,
provided it does not bring in dependency chains that make it hard to
access, use or install.
- The March 09 English dump is 4.6G, which makes things even tighter.
- It is text content only, excluding multimedia and pictures (which are an
important part and can't be neglected).
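
As an illustration of the "lighter server" idea above, here is a minimal
sketch that uses only the Python 3 standard library, so it adds no
dependencies beyond Python itself. It is not part of the project: the /wiki/
URL scheme, the names title_from_path, make_handler and serve, and the
lookup callback (which is assumed to return ready-made HTML for a title, or
None) are all hypothetical.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote


def title_from_path(path):
    """Map a request path such as /wiki/Some_Title to an article title."""
    prefix = "/wiki/"
    if not path.startswith(prefix):
        return None
    return unquote(path[len(prefix):]).replace("_", " ")


def make_handler(lookup):
    """Build a request handler around lookup(title) -> HTML string or None."""

    class ArticleHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            title = title_from_path(self.path)
            body = lookup(title) if title else None
            if body is None:
                self.send_error(404, "Article not found")
                return
            page = "<html><head><title>%s</title></head><body>%s</body></html>" % (
                html.escape(title),
                body,
            )
            payload = page.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return ArticleHandler


def serve(lookup, port=8000):
    """Serve articles on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(lookup)).serve_forever()
```

Calling serve() with a function that fetches an article and converts it to
HTML would make pages available at http://127.0.0.1:8000/wiki/<title>. The
standard library server is single-threaded and basic, which should be
acceptable for one reader browsing from a DVD.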
Targets:
- To make it updatable.
- To make it editable.
- To manage different categories of articles, and to segregate content
based on them, to make a more refined and better education/learning tool.
There are other issues too. I know of other attempts such as wiki-taxi and
wikipedia-dumpreader, and I am trying to patch things together to get better
results, but I don't know why those existing parsers/attempts never made it
onto that DVD distribution list. I am working to improve this; in the
meantime any suggestions, feedback and contributions are most welcome.
--
Regards
Shantanu Choudhary