Hello,
I just subscribed (I'm the wikipedia user At18) to ask about the automatic html dump function. I see from the database page that it's "in development".
If anyone is interested, I have a rudimentary Perl script that can read the downloadable SQL dump and output all the articles as separate files in a number of alphabetical directories. It's not very fast, but it works.
What's missing from the script: wikimarkup -> HTML conversion, some intelligence to autodetect redirects, dealing with images, and so on. I don't know if someone is in charge of this function. If so, I can post the script. Otherwise, I can develop it further myself, given some directions.
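Roughly, the shape of it is something like this (sketched here in Python rather than Perl, and with the actual dump-reading left out, so this is not the real script):

import os

def get_articles():
    """Placeholder: yield (title, text) pairs read out of the SQL dump."""
    yield "Example_article", "Article text with ''wiki markup'' in it."

def write_tree(outdir="html_dump"):
    for title, text in get_articles():
        # Bucket each article into a directory named after its first letter.
        first = title[0].upper() if title[:1].isalpha() else "_"
        subdir = os.path.join(outdir, first)
        os.makedirs(subdir, exist_ok=True)
        # No wikitext -> HTML conversion yet; the raw text is written as-is.
        with open(os.path.join(subdir, title + ".html"), "w", encoding="utf-8") as out:
            out.write(text)

if __name__ == "__main__":
    write_tree()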
Alfio
On Tuesday, 20 May 2003, at 03:53, Alfio Puglisi wrote:
I just subscribed (I'm the wikipedia user At18) to ask about the automatic html dump function. I see from the database page that it's "in development".
Welcome!
If anyone is interested, I have a rudimentary Perl script that can read the downloadable SQL dump and output all the articles as separate files in a number of alphabetical directories. It's not very fast, but it works.
What's missing from the script: wikimarkup -> HTML conversion, some intelligence to autodetect redirects, dealing with images, and so on. I don't know if someone is in charge of this function. If so, I can post the script. Otherwise, I can develop it further myself, given some directions.
Cool! I don't think anyone's really actively working on this at the moment, so if you'd like to, that would be great.
A few things to consider:
Last year someone started on a static HTML dump system with a hacked-up version of the wiki code and some post-processing, but never quite finished it up. I don't think he posted the code, but if you can get ahold of him he may still have it available: http://mail.wikipedia.org/pipermail/wikitech-l/2002-November/001292.html
There's also a partial, very experimental offline reader program which sucks the data out of the dump files. It includes a simplified wiki parser which, I believe, outputs HTML for use in the wxWindows HTML viewer widget, and may be useful to you: http://meta.wikipedia.org/wiki/WINOR
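Just to give a flavor of what "simplified" means there, a toy conversion (this is not WINOR's actual code, and a real parser handles far more) could look something like:

import re

def wiki_to_html(text):
    """Toy conversion of a few wikitext constructs; a real parser does much more."""
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)   # '''bold'''
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)     # ''italic''
    # [[Page title]] -> a link into the alphabetical tree described above
    text = re.sub(r"\[\[([^\]|]+)\]\]",
                  lambda m: '<a href="%s.html">%s</a>'
                            % (m.group(1).replace(" ", "_"), m.group(1)),
                  text)
    return text

print(wiki_to_html("''Hello'' from [[Main Page]]"))
# -> <i>Hello</i> from <a href="Main_Page.html">Main Page</a>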
The latest revisions of the wikipedia code can cache the HTML output pages, but it's not clear whether this would be easy to adapt for purposes of generating static output.
A couple of the big questions that have come up before are:
* filenames -- making sure they can stay within reasonable limits on common filesystems, keeping in mind that non-ascii characters and case-sensitivity may be handled differently on different OSs, and there may be stronger limits on filename lengths.
* search -- an offline search would be very useful for an offline reader. JavaScript, Java, and local programs are all possibilities.
* size! with markup, header and footer text tacked onto every page, a static html dump can be very large. The English wiki could at this point approach or exceed the size of a CD-ROM without compression (a rough estimate is sketched after this list). Is there a way to get the data compressed and still let it be accessible to common web browsers accessing the filesystem directly? Less important for a mirror site than a CD, perhaps.
* interlanguage links - it would be nice to be able to include all languages in a single browsable tree, with appropriate cross-links.
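To make the size worry concrete, a back-of-the-envelope estimate (every number below is a guess):

# Back-of-the-envelope only; all of these numbers are assumptions.
articles = 120000          # very rough size of the English wiki at the moment
avg_text = 3 * 1024        # average article text, in bytes
overhead = 2 * 1024        # header/footer/markup tacked onto every page
total_mb = articles * (avg_text + overhead) / 1024.0 / 1024.0
print("about %.0f MB uncompressed" % total_mb)   # ~586 MB, uncomfortably close to a 650 MB CD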
-- brion vibber (brion @ pobox.com)
On Tue, 20 May 2003, Brion Vibber wrote:
Welcome!
Thanks!
Cool! I don't think anyone's really actively working on this at the moment, so if you'd like to, that would be great.
A few things to consider:
[...]
Thanks, I'll take a look at the links.
A couple of the big questions that have come up before are:
- filenames -- making sure they can stay within reasonable limits on
common filesystems, keeping in mind that non-ascii characters and case-sensitivity may be handled differently on different OSs, and there may be stronger limits on filename lengths.
I'll have to find some lowest common denominator. I have already run into the upper/lower-case problem: case-sensitive titles work in URLs and on Unix machines, but not on Windows. I expect the problem of truncated filenames to be similar.
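Something like the following mangling might work as a lowest common denominator (just a sketch; the length cap and the hash suffix are arbitrary choices):

import hashlib
import re

def safe_filename(title, max_len=64):
    """Map a title to a name that survives case-insensitive, length-limited
    filesystems. The 64-char cap and the hash suffix are arbitrary choices."""
    name = re.sub(r"[^A-Za-z0-9_-]", "_", title)
    # "Ant" and "ANT" would collide on a case-insensitive filesystem;
    # a short hash of the original title keeps them apart.
    suffix = hashlib.md5(title.encode("utf-8")).hexdigest()[:8]
    return name[:max_len] + "_" + suffix + ".html"

print(safe_filename("Ant"))   # Ant_<8 hex chars>.html
print(safe_filename("ANT"))   # different suffix, so no collision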
- search -- an offline search would be very useful for an offline
reader. JavaScript, Java, and local programs are all possibilities.
This would be hard to do without some sort of index file, at least for article titles. We don't want the search app to scan an entire CD-ROM! :-) I suspect that full-text search would be impossible (or deadly slow) from CD, but quite possible from an "installed" version. An article-title search may be workable from CD.
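One possible shape for that title index: a small sorted title-to-path file that the search tool can binary-search without touching the rest of the CD. Nothing is written yet; this is only a sketch:

import bisect

def build_title_index(titles_and_paths, index_path="titles.idx"):
    """Write one 'title<TAB>relative/path' line per article, sorted by title."""
    with open(index_path, "w", encoding="utf-8") as f:
        for title, path in sorted(titles_and_paths):
            f.write("%s\t%s\n" % (title, path))

def title_search(prefix, index_path="titles.idx"):
    """Return the paths of all titles starting with `prefix` (titles only,
    no full text). The whole index is read, but it is tiny next to the articles."""
    lines = open(index_path, encoding="utf-8").read().splitlines()
    titles = [line.split("\t", 1)[0] for line in lines]
    hits = []
    i = bisect.bisect_left(titles, prefix)
    while i < len(titles) and titles[i].startswith(prefix):
        hits.append(lines[i].split("\t", 1)[1])
        i += 1
    return hits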
- size! with markup, header and footer text tacked onto every page, a
static html dump can be very large. The English wiki could at this point approach or exceed the size of a CD-ROM without compression. Is there a way to get the data compressed and still let it be accessible to common web browsers accessing the filesystem directly? Less important for a mirror site than a CD, perhaps.
Header/footer overhead could be avoided using frames, but that's a less portable solution. I will investigate the compression options.
- interlanguage links - it would be nice to be able to include all
languages in a single browsable tree, with appropriate cross-links.
I think I'll leave this for a future improvement plan.... :-))
Ciao, Alfio