[Mediawiki-l] Export MediaWiki articles to static HTML using the WebZIP spider (long)

Michael Kelley michael.kelley at argonst.com
Thu Oct 27 19:36:02 UTC 2005


I've embarked on an experiment to spider my wiki and create HTML
pages for offline use.  Using WebZIP (http://www.spidersoft.com/)
on Windows, I've been able to get an almost entirely offline wiki
representation, complete with full CSS and image support.  I use ugly
URLs in my wiki (MW 1.4.X with the page restrict patch).  In WebZIP,
you define a URL exclusion list to keep the spider from following
certain links (following delete, move, or rollback links is
especially bad).  I wanted to try this approach because a user can
take a snapshot whenever they want.  To get WebZIP started, I bring
up my wiki main page in WebZIP's internal browser window and log in.
I've created an "archive" user for this experiment, and the archive
user is a member of the (restrict) group.  My WebZIP project
properties include the full URL of my main page.  I start it up and
let it crank.  I'm still refining my WebZIP setup, but it basically
works.  I've also got a backup WAMP server running MW that I can load
backup databases onto, so I don't unintentionally make changes to my
production site.
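
For anyone who'd rather script this than drive a GUI tool, here's a
minimal sketch of the same idea in Python.  This is not what WebZIP
does internally, just an illustration; the wiki root URL, the page
names, and the skipped login step are all assumptions you'd adapt to
your own site.
-------------------------
#!/usr/bin/env python3
# Hypothetical stand-in for the WebZIP approach: crawl outward from
# the main page, skipping any URL that contains one of the exclusion
# substrings.  Login/cookie handling is omitted for brevity, and all
# names and URLs below are assumptions.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

BASE = "http://intranet.example.com/wiki/"   # assumed wiki root
EXCLUDE = ["action=edit", "action=delete", "action=move",
           "action=rollback", "diff=", "Userlogout"]

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def excluded(url):
    # WebZIP-style matching: skip the URL if any entry is a substring
    return any(pat in url for pat in EXCLUDE)

def crawl(start):
    seen, queue = set(), [start]
    while queue:
        url = queue.pop()
        if url in seen or excluded(url) or not url.startswith(BASE):
            continue
        seen.add(url)
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        # ...save html to disk here, then follow the page's links...
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            queue.append(urllib.parse.urljoin(url, link))

crawl(BASE + "index.php?title=Main_Page")
-------------------------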

My WebZIP URL exclusion list for MediaWiki 1.4.X is currently:
-------------------------
action=edit
action=history
action=protect
action=restrict
action=unrestrict
action=delete
action=move
action=watch
title=Help:
Movepage
Recentchanges
Userlogin
Whatlinkshere
Userlogout
diff=
action=unprotect
Undelete
action=markpatrolled
action=rollback
redirect=no
Maintenance&subfunction
Contributions
Log&
Popularpages&
Wantedpages&
Uncategorizedpages&
Longpages&
Shortpages&
action=revert
Special_Newpages
Special_Newimages
Special_Lonelypages
Special_Listusers
Special_Listadmins
Special_DoubleRedirects
Special_Deadendpages
Special_Categories
Special_BrokenRedirects
-------------------------
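
As far as I can tell, WebZIP treats each entry as a plain substring
match against the full URL, which is why a short list covers so many
pages.  Under that assumption, the list above filters like this (the
sample URLs are made up):
-------------------------
# Assumed substring semantics for the exclusion entries above.
EXCLUDE = ["action=edit", "diff=", "Userlogin", "redirect=no"]

def excluded(url):
    return any(pat in url for pat in EXCLUDE)

print(excluded("index.php?title=Foo&action=edit"))    # True: skipped
print(excluded("index.php?title=Foo&diff=0&oldid=5")) # True: skipped
print(excluded("index.php?title=Foo"))                # False: archived
-------------------------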

I'm also using the same approach to create an offline version of my
Bugzilla database.  Under my archive user, I create stored queries
that I might want to use offline (show all, show open, show (dev)
closed, show project open, show project (dev) closed, etc.).  I've
still got some work to do to finish my offline Bugzilla, but I'm
well on the way.
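
The stored queries give the spider stable entry points.  If you were
scripting the crawl, the seed list might look something like this
(the query names are mine, and the buglist.cgi?cmdtype=runnamed form
is how Bugzilla of this vintage runs a saved query, as far as I
know):
-------------------------
# Hypothetical seed URLs for crawling saved Bugzilla queries.
import urllib.parse

BUGZILLA = "http://intranet.example.com/bugzilla/"  # assumed root
QUERIES = ["show all", "show open", "show (dev) closed",
           "show project open", "show project (dev) closed"]

for name in QUERIES:
    print(BUGZILLA + "buglist.cgi?cmdtype=runnamed&namedcmd="
          + urllib.parse.quote(name))
-------------------------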

My WebZIP URL exclusion list for Bugzilla is starting to look like:
-------------------------
action=del
action=add
action=edit
action=enter
showdependencytree
showdependencygraph
showactivity.cgi
dobugcounts=1
action=copy
action=confirmdelete
action=changeform
relogin.cgi
editparams.cgi
userprefs.cgi
remaction=forget
colchange.cgi
format=advanced
enter
query.cgi
editusers.cgi
-------------------------

One caveat: this spidering approach leaves the excluded links in the
saved pages pointing at your "production" URLs.  In my case, I'm
making an offline copy for onsite technical access, and I would
either not be connected to the Internet or the addresses wouldn't
resolve because they belong to inaccessible intranet servers.
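
If those leftover links bother you, a post-processing pass over the
saved pages could neutralize them.  A rough sketch, assuming the
pages were saved as .html files under one directory (a crude regex
rewrite like this is workable, though a real HTML parser would be
safer):
-------------------------
# Rough post-processing sketch: dead-end any anchor whose href still
# contains an excluded pattern.  Paths and patterns are assumptions.
import pathlib
import re

EXCLUDE = ["action=edit", "action=delete", "diff=", "Userlogout"]
HREF = re.compile(r'href="([^"]*)"')

def neutralize(match):
    url = match.group(1)
    if any(pat in url for pat in EXCLUDE):
        return 'href="#"'   # replace the live link with a dead end
    return match.group(0)

for page in pathlib.Path("offline-wiki").rglob("*.html"):
    text = page.read_text(encoding="utf-8", errors="replace")
    page.write_text(HREF.sub(neutralize, text), encoding="utf-8")
-------------------------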

I thought I'd share this in case someone else would like to try a
client-side approach to generating an offline wiki.

Regards,
-Mike Kelley




