On Sat, 2002-11-16 at 18:36, Steve Rawlinson wrote:
There seems to be some interest in creating a static HTML distribution
(dump) of Wikipedia; most notably, it is requested on the
Wikipedia:Database_download page and in Feature Requests #596830 on
Sourceforge. This would allow people to download the Wikipedia for use
offline, for example from a CDROM.
So, I have started work and made my initial version (English only)
available online for anyone on this list to evaluate and test. I am
looking for feedback, suggestions, bug reports and general comments.
http://www.rawlinson.ca:8080/wikipedia/index.html
A cool beginning, thanks! :)
I'm not sure how to distribute this static HTML version when it's ready
for a public release. Currently it's about 500 Meg in size (that includes
everything). As I mentioned above I have limited server resources. For
distribution maybe it could be put on the Sourceforge download page, or on
the Wikipedia.org server somewhere (/tarballs)?
I expect we could provide both a tarball and a static tree which could
be rsync'ed.
Finally, since I am new to Wikipedia and this list, please excuse me while
I learn how things work around here. I am open to criticism, suggestions
and discussion. I am looking forward to working with everyone on
Wikipedia and contributing where I can.
Some Technical Details (for those interested):
- English only (currently)
That will need to be fixed, of course! :)
- uses "printable" pages, no top or side navigation bars
Could probably stand to be purtied up at least a little bit.
- added links to home, back, copyright and Wikipedia.org to bottom of all
pages (TODO: if a talk page exists a link should be added)
A link to that particular page on the live server would be a *very* good
idea. The regular printable pages include this.
- pages are stored in directories based on first two characters of MD5
hash, same as image storage scheme
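For readers unfamiliar with the scheme, here is a minimal sketch of how a page path might be derived. The function name, the `wikipedia` root directory, the `.html` suffix, and hashing the UTF-8 bytes of the title are all assumptions for illustration, not details taken from the actual dump script.

```python
import hashlib
import posixpath


def page_path(title: str, root: str = "wikipedia") -> str:
    """Return a relative path for a page, bucketed by an MD5 prefix.

    Hypothetical helper: the real script's naming rules may differ.
    """
    # Assumption: the hash is taken over the UTF-8 bytes of the title.
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    # The first two hex characters give at most 256 directories.
    return posixpath.join(root, digest[:2], title + ".html")
```

Two hex characters spread the pages over at most 256 directories, which keeps any single directory from growing unmanageably large.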
Some things to think about as far as the actual filenames:
* Length. Wikipedia titles can, I think, get up to ~255 characters; this
may be too long for some systems.
* Acceptable characters. Colons, slashes, quotes, and various non-ascii
characters may appear in titles that cannot be reliably reproduced on
many filesystems. I notice that colons and commas at least are changed
to underscores, possibly some other characters too; conflicts may occur.
Non-ascii chars appear to be left intact; will this work consistently
across different filesystems which may be configured for different
character encodings?
* Case sensitivity. Many filesystems are not case sensitive; we may have
conflicts.
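The three concerns above could be checked mechanically before writing any files. The following is only a sketch: the set of replaced characters, the 255-byte limit, and all function names are assumptions chosen for illustration, not the dump script's actual rules.

```python
from collections import defaultdict

# Assumption: these characters and this limit are illustrative only.
UNSAFE = set(':,/\\"\'?*<>|')
MAX_NAME_BYTES = 255


def safe_name(title: str) -> str:
    """Replace characters that many filesystems cannot store reliably."""
    return "".join("_" if ch in UNSAFE else ch for ch in title)


def find_conflicts(titles):
    """Group titles whose filenames collide once case is ignored."""
    groups = defaultdict(list)
    for title in titles:
        groups[safe_name(title).casefold()].append(title)
    return [group for group in groups.values() if len(group) > 1]


def too_long(title: str, suffix: str = ".html") -> bool:
    """Compare the UTF-8 encoded filename against the assumed byte limit."""
    return len((safe_name(title) + suffix).encode("utf-8")) > MAX_NAME_BYTES
```

Note that the length check counts bytes rather than characters, so a title made of non-ASCII characters can exceed the limit well before it reaches 255 characters.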
- includes all namespaces (talk, users, users_talk, wikipedia_talk, etc.)
User and talk pages are probably not necessary; if you're looking to
discuss the page, you'll be doing it on the live site where you can edit
it (and see the last 6 months' worth of edits which aren't on your
CD-ROM). And, of course, they take up a large chunk of valuable CD real
estate better devoted to future articles.
Thoughts?
- created a list with links to all the items in each namespace to allow
for basic searching of page titles
A simple JavaScript-based title search could probably be rigged up out
of that.
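As a rough sketch of the idea, written in Python rather than JavaScript: the title list could be serialised once as JSON, and the matching logic below would then be ported to a small script on the page. Both function names and the JSON format are assumptions for illustration.

```python
import json


def build_title_index(titles):
    """Serialise the titles, sorted ignoring case, as a JSON array."""
    return json.dumps(sorted(titles, key=str.casefold), ensure_ascii=False)


def search_titles(index_json: str, query: str):
    """Return the titles that contain the query, ignoring case."""
    needle = query.casefold()
    return [t for t in json.loads(index_json) if needle in t.casefold()]
```

A plain substring match over the whole list is linear in the number of titles, which is probably acceptable for an offline title search but would not scale to full-text search.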
- redirects replaced with direct link to article
Nice.
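One thing to watch when rewriting links is chained or circular redirects. A minimal sketch, assuming the redirects are available as a dictionary from redirecting title to target (a hypothetical shape; the real data may be stored differently):

```python
def resolve_redirect(title, redirects, max_hops=10):
    """Follow redirects to the final article title.

    Returns None for a redirect loop or a chain longer than max_hops.
    """
    seen = {title}
    for _ in range(max_hops):
        target = redirects.get(title)
        if target is None:
            return title  # not a redirect: this is the article itself
        if target in seen:
            return None  # redirect loop; the caller decides what to do
        seen.add(target)
        title = target
    return None  # chain exceeded max_hops
```

For example, with `{"US": "USA", "USA": "United States"}`, a link to `US` would be rewritten to point directly at `United States`.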
-- brion vibber (brion @ pobox.com)