There seems to be some interest in creating a static HTML distribution (dump) of Wikipedia, most notably it is requested on the Wikipedia:Database_download page and in Feature Requests #596830 on Sourceforge. This would allow people to download the Wikipedia for use offline, for example from a CDROM.
So, I have started work and made my initial version (English only) available online for anyone on this list to evaluate and test. I am looking for feedback, suggestions, bug reports and general comments.
http://www.rawlinson.ca:8080/wikipedia/index.html
Please do not attempt to mirror the site as my server and bandwidth won't be able to handle it. The site is only intended for developers to try and give feedback. Once everyone is happy with it I will make .tar and .iso packages available for distribution.
At the moment, the method I use to create the static HTML version is very lengthy in terms of processing time and requires a number of manual steps. It takes my 1 GHz machine about 5 hours to generate all the pages. Ideally, I'll have something more automated and efficient as time goes on. My plan is to produce an updated static HTML version every few months or at significant milestones.
I'm not sure how to distribute this static HTML version when it's ready for a public release. Currently it's about 500 Meg in size (that includes everything). As I mentioned above I have limited server resources. For distribution maybe it could be put on the Sourceforge download page, or on the Wikipedia.org server somewhere (/tarballs)?
Finally, since I am new to Wikipedia and this list, please excuse me while I learn how things work around here. I am open to criticism, suggestions and discussion. I am looking forward to working with everyone on Wikipedia and contributing where I can.
Some Technical Details (for those interested):
- English only (currently)
- uses "printable" pages, no top or side navigation bars
- added links to home, back, copyright and Wikipedia.org to bottom of all pages (TODO: if a talk page exists a link should be added)
- pages are stored in directories based on first two characters of MD5 hash, same as image storage scheme
- includes all namespaces (talk, users, users_talk, wikipedia_talk, etc.)
- created a list with links to all the items in each namespace to allow for basic searching of page titles
- redirects replaced with direct link to article
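A minimal sketch of the directory scheme described above, under assumptions that the message does not confirm: that the MD5 hash is taken over the UTF-8 page title, that spaces become underscores, and that each page ends in ".html". The function name page_path and the root directory "wikipedia" are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def page_path(title: str, root: str = "wikipedia") -> PurePosixPath:
    """Return an illustrative output path for a page title.

    Assumption: the bucket is the first two hex characters of the MD5
    digest of the UTF-8 encoded title, giving 16 * 16 = 256 directories.
    """
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    bucket = digest[:2]
    # Assumption: spaces are written as underscores, as in wiki URLs.
    filename = title.replace(" ", "_") + ".html"
    return PurePosixPath(root) / bucket / filename
```

Calling page_path("Babel fish") yields a path of the form wikipedia/<two hex characters>/Babel_fish.html.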
Regards, Steve Rawlinson
Cool! I like it! Hope to see the CDs at Wal-Mart soon ;-)
A few minor thoughts:
* On every page, there should be a link to the online article, and to the online edit.
* Images should link to the image text, like wikipedia does, not to the image itself. One can save the image directly from the page.
* A nice CSS will make it look less "plain", without compromising compatibility with older browsers.
* The CD could also have a recent Mozilla installer on it ;-)
* There could be other index.html files with frames in them, where a small frame on top, bottom, or side, could carry useful links (Main Page, Copyrights, the "live" wikipedia, "live" search engine, etc.) and maybe the logo.
* If you want statistics (date of last edit) at all, maybe add some more ("this was edited 123 times"). [Might need modification of the "printable" function]
* Should non-existent links be displayed as links? I think so, because people might feel tempted to write something ;-) [Might need modification of the "printable" function]
There should also be a "real" offline search function. It shouldn't be too hard to hack together some smallish programs for Windows, Linux, maybe Mac. Does anyone know if there's something like that out there?
We might also think of adding special "view" functions to the software (like the "printable version"), to aid projects like this, and maybe Larry's wikipedia sifter.
Magnus
Wikitech-l mailing list Wikitech-l@wikipedia.org http://www.wikipedia.org/mailman/listinfo/wikitech-l
Magnus and Brion:
[I've merged both of your comments into one email. The message is a bit long, but it keeps it all in one place. Note that M> and B> refer to Magnus and Brion respectively.]
M> Cool! I like it! Hope to see the CDs at Wal-Mart soon ;-)
B> A cool beginning, thanks! :)
Thanks, I'm glad to see the idea receive such a positive response :)
M> * On every page, there should be a link to the online article, and to the online edit.
B> A link to that particular page on the live server would be a *very* good idea. The regular printable pages include this.
I will add links to the current online article, online edit, and offline talk. The offline talk page will link to the online talk and online talk edit. I think that provides access to the online version of everything.
B> User and talk pages are probably not necessary; if you're looking to discuss the page, you'll be doing it on the live site where you can edit it (and see the last 6 months' worth of edits which aren't on your CD-ROM). And, of course, they take up a large chunk of valuable CD real estate better devoted to future articles.
I have debated whether to include the User and Talk pages myself. As space on the CD gets tight they'll probably be the first things to go. However, I noticed that articles sometimes provide a link to the corresponding Talk page. If the talk page was somehow relevant to the article (which it probably shouldn't be) I wanted to make sure people had access to it offline.
As for the User pages, I thought it was a good way to give credit to the Wikipedia contributors. Also, many of the Wikipedia pages and Talk pages link to User pages where people have made comments.
M> * Images should link to the image text, like wikipedia does, not to the image itself. One can save the image directly from the page.
I'm not sure what happened there, I'll fix it so it behaves the same as Wikipedia.
M> * A nice CSS will make it look less "plain", without compromising compatibility with older browsers.
B> Could probably stand to be purtied up at least a little bit.
I agree, I'll try using a different CSS to improve things. My intention is to keep a simple look and feel, but also to stay true to the Wikipedia style.
M> * The CD could also have a recent Mozilla installer on it ;-)
My assumption is that everyone has an HTML browser on their machine or can easily get one. Maybe I didn't get your joke?
M> * There could be other index.html files with frames in them, where a small frame on top, bottom, or side, could carry useful links (Main Page, Copyrights, the "live" wikipedia, "live" search engine, etc.) and maybe the logo.
Good idea. Thanks for the sample you emailed me personally. I'll create two interfaces, one frame based and the other with just the pages.
M> * If you want statistics (date of last edit) at all, maybe add some more ("this was edited 123 times").
The number of edits could be useful to a person reading the article to determine how "mature" the article is. I'll look into how easy it would be to implement.
M> * Should non-existent links be displayed as links? I think so, because people might feel tempted to write something ;-)
I agree that links to non-existent articles would be a good way to encourage people to contribute, but it's at the expense of linking online which might not be available to them. I have tried to keep the number of links to online sources low so that offline users aren't frustrated by numerous broken (in their mind) links. Maybe using the stub marker (like an ! or *) after the topic would be best (I've seen this preference setting somewhere). Using an "external" CSS tag with a different color, as is done with the online version, would also help people distinguish.
B> Some things to think about as far as the actual filenames:
Some good observations about potential filename problems. I suppose I could use something other than the title, like the unique cur_id integer, but it wouldn't be as meaningful to people. Using eight characters (26 letters and 10 numbers) there are about 2.8 trillion possible filenames, so each article could be mapped to a unique 8 character filename. I think this would avoid all the length and conflict issues you've raised. Or maybe 64 characters would be better since some meaning could be kept in the name. I'll think about these issues and see what I can come up with.
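The eight-character idea can be sketched as a base-36 encoding of the cur_id integer. This is only an illustration: the function name id_to_filename, the lowercase alphabet, and the zero padding are assumptions, not part of the current setup.

```python
import string

# 10 digits + 26 letters = 36 symbols, as in the estimate above.
ALPHABET = string.digits + string.ascii_lowercase

# Number of distinct 8-character names: 36 ** 8 = 2,821,109,907,456.
NAME_SPACE = len(ALPHABET) ** 8


def id_to_filename(cur_id: int, width: int = 8) -> str:
    """Encode a non-negative article id as a fixed-width base-36 name."""
    if cur_id < 0 or cur_id >= len(ALPHABET) ** width:
        raise ValueError("cur_id does not fit in the requested width")
    chars = []
    n = cur_id
    for _ in range(width):
        n, rem = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[rem])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(chars))
```

Under these assumptions id_to_filename(36) gives "00000010", and every name fits the 8+3 convention once ".htm" is appended. The names carry no meaning for a reader, which is the trade-off mentioned above.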
CD-ROM formats such as Rock Ridge Extension and Joliet support longer filenames (at least 64 characters), but the ISO 9660 standard only supports the 8+3 convention. I read that Joliet also supports Unicode characters for international support. Perhaps it would be enough to ensure the filenames conform to one of these CD-ROM standards and leave the rest up to the OS?
The other thing to note about potential filename conflicts with my current setup is that the MD5 hash directory structure provides some safety. Even if two filenames conflicted (through removing invalid characters), with the 256 subdirectories there is only a 0.4% chance they'll be put in the same directory and actually conflict. It's not perfect, but it's not very likely to happen.
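The 0.4% figure can be checked with a short calculation. It assumes that the bucket comes from the hash of the original (unsanitized) titles and that MD5 output is spread evenly over the directories; same_bucket_probability is a hypothetical helper name.

```python
def same_bucket_probability(hex_chars: int = 2) -> float:
    """Chance that two independent titles hash into the same directory.

    Assumption: each directory is equally likely, so the chance is
    1 / (number of directories) = 1 / 16 ** hex_chars.
    """
    return 1 / (16 ** hex_chars)


print(f"{same_bucket_probability():.2%}")  # prints 0.39%
```

This is the chance for one pair of titles. With many conflicting pairs, roughly one pair in 256 would still be expected to collide.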
B> Non-ascii chars appear to be left intact; will this work consistently across different filesystems which may be configured for different character encodings?
I don't know much about different character encodings. I'll have to do some research on this unless anyone here knows the answer?
M> There should also be a "real" offline search function. Shouldn't be too hard to hack some smallish programs for Windows, Linux, maybe Mac.
B> A simple JavaScript-based title search could probably be rigged up out of that.
I agree, there needs to be a "real" way to search the articles. My focus at the moment is to get the HTML and layout correct. The JavaScript-based search is a nice idea to provide some basic functionality.
One of my goals is to try and keep everything OS and browser independent. This may not be possible for extra features like a search function, but I think it should be the goal. It's also one of my goals to try and ensure it works on the most basic setup that libraries, schools and third-world countries might have.
M> We might also think of adding special "view" functions to the software (like the "printable version"), to aid projects like this, and maybe Larry's wikipedia sifter.
By software I assume you mean the Wikipedia code base. Doesn't the current Skin object provide something like this already? If you want to view things differently you just need to implement a new skin.
- English only (currently)
B> That will need to be fixed, of course! :)
It's on my todo list :) Once the English version is working, I'll try some of the other language Wikipedias. I'm looking forward to learning more about different character sets and internationalization.
B> > - redirects replaced with direct link to article
B>
B> Nice.
It was really fun to implement because I got to use recursion to follow the redirects to the final page :) Hopefully, Wikipedia never has any infinite redirection loops.
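A minimal sketch of following redirects while guarding against loops. It is not the actual implementation: it uses a loop with a set of visited titles instead of recursion, and it assumes the redirects are available as a plain dictionary from title to target. The name resolve_redirect is hypothetical.

```python
def resolve_redirect(title: str, redirects: dict) -> str:
    """Follow a chain of redirects to the final page title.

    Assumption: `redirects` maps each redirecting title to its target.
    Raises ValueError if the chain loops back on itself.
    """
    seen = {title}
    current = title
    while current in redirects:
        current = redirects[current]
        if current in seen:
            raise ValueError(f"redirect loop involving {current!r}")
        seen.add(current)
    return current
```

For example, with redirects {"A": "B", "B": "C"}, resolve_redirect("A", redirects) returns "C", while a pair of pages redirecting to each other raises an error instead of running forever.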
Thank you both Brion and Magnus for your comments. Anyone else with some thoughts or ideas? I'll post my next revision when it's ready.
Best Regards, Steve Rawlinson
Steve Rawlinson wrote: [cut]
Thank you both Brion and Magnus for your comments. Anyone else with some thoughts or ideas? I'll post my next revision when it's ready.
Why should someone buy or use an encyclopedia on a CD-ROM and not use the “normal way” online?
I suppose because they do not have Internet access, or only a very slow and/or expensive connection.
Maybe those CD-ROM readers would also like to write or improve articles. It would be nice if there were a way for offline Wikipedians to work on Wikipedia.
Idea: someone reads an article on the Wikipedia CD-ROM and wants to change something. He clicks on “edit this page”. He gets a form that sends an email to Wikipedia via his email program.
Something like:
to: offlinewiki-request@wikipedia.org
subject: request [[Babel fish]]
He gets back an email with instructions on how to import the attachment into the offline Wikipedia. After the article is updated, he can modify it as on the online Wikipedia. When he is done with the article, he selects “Send update to online Wikipedia”. The update is sent back to Wikipedia via his email program, along with the revision number of that article. If the online article has not changed, the update goes online directly. If it has changed, the update goes into a queue where an online Wikipedian must check it.
Or something like that, you get the general idea.
On Sun, 17 Nov 2002, Giskart wrote:
Why should someone buy or use an encyclopedia on a CD-ROM and not use the "normal way" online?
As you suggest, I expect the primary use for the static HTML will be for a CDROM that people without Internet access (or with expensive access) could use. I think that schools and libraries might also greatly benefit from it.
There are some other good reasons for doing this. Plain HTML format means that anyone can easily put Wikipedia on their server or LAN for read-only use. They won't need to install PHP, MySQL, configure the software, import the database, etc.
Another possibility: if the updates to the static HTML were frequent enough, others could mirror the site, which would reduce the load on Wikipedia. Editing and new articles would still need to be done on the main Wikipedia site, but read-only access could be distributed.
Finally, having lots of HTML copies of Wikipedia around also serves as a backup. If for some unfortunate reason the Wikipedia server was ever destroyed or taken offline, the articles (in HTML format) wouldn't be lost.
It's also just neat to have your own copy of Wikipedia on CD :)
Maybe those CD-ROM readers would also like to write or improve articles. It would be nice if there were a way for offline Wikipedians to work on Wikipedia.
I agree, the more people working on Wikipedia the better, but I think having some type of network connection is absolutely necessary for contributing to Wikipedia.
Idea: someone reads an article on the Wikipedia CD-ROM and wants to change something. He clicks on “edit this page”. He gets a form that sends an email to Wikipedia via his email program.
An interesting idea, but I think if they have email then they are also very likely to have web access and could submit changes in the "normal" way online. I know of people who currently work this way because their Internet connection is expensive. They only connect long enough to get the most recent changes and upload any updates.
Best Regards, Steve Rawlinson
--- Steve Rawlinson steve@rawlinson.ca wrote:
As you suggest, I expect the primary use for the static HTML will be for a CDROM that people without Internet access (or with expensive access) could use.
It doesn't even have to be expensive: as soon as your internet access is metered, and the clock is constantly ticking, your surfing habits change drastically. You simply don't read a three page article about Catherine the Great "just for fun" online in such a setting. And metered internet access and metered telephone charges are the norm throughout the world.
So I think the CDROM Wikipedia meets a great need. In fact, I think many users in this situation would even want to download Wikipedia so that they can read it afterwards in peace and quiet. For that, a minimalistic version that cuts out User: and Talk: and is optimally compressed would be desirable. Maybe even put the images in a separate package. How small can you make it?
Axel
Steve Rawlinson wrote: [cut]
Idea: someone reads an article on the Wikipedia CD-ROM and wants to change something. He clicks on “edit this page”. He gets a form that sends an email to Wikipedia via his email program.
An interesting idea, but I think if they have email then they are also very likely to have web access and could submit changes in the "normal" way online. I know of people who currently work this way because their Internet connection is expensive. They only connect long enough to get the most recent changes and upload any updates.
I think there are people who have email but no WWW access. As for connecting only to download and submit and then disconnecting again: connecting is the most expensive thing to do. It depends on your telecom operator of course, but I know from the past, when I used a modem and not ADSL, that making the connection is very expensive.
Thanks for doing this!
Two little comments:
* The Back link on every page seems redundant: every browser has a back button, but not every browser has javascript.
* The link to www.wikipedia.org on every page should be turned into a link to [[Wikipedia]] where the offline user can find some information about the project and the online user can find the URL.
Axel
Axel, thanks for the comments.
- The Back link on every page seems redundant: every browser has a back
button, but not every browser has javascript.
Now that I think about it some more, you're right. The back button is redundant and the Javascript has the potential to cause problems (or not work if Javascript has been disabled). I'll remove it.
- The link to www.wikipedia.org on every page should be turned into a
link to [[Wikipedia]] where the offline user can find some information about the project and the online user can find the URL.
An excellent idea. My original reason for the link was to give people a way to find the online Wikipedia. A link to [[Wikipedia]] will accomplish this and provide more information about the project as you suggest. Also, Magnus and Brion suggested adding a link to the current online version of the article and a link to the online edit page, which should also help people find the online version.
Thanks, Steve
On Sat, 2002-11-16 at 18:36, Steve Rawlinson wrote:
There seems to be some interest in creating a static HTML distribution (dump) of Wikipedia, most notably it is requested on the Wikipedia:Database_download page and in Feature Requests #596830 on Sourceforge. This would allow people to download the Wikipedia for use offline, for example from a CDROM.
So, I have started work and made my initial version (English only) available online for anyone on this list to evaluate and test. I am looking for feedback, suggestions, bug reports and general comments.
A cool beginning, thanks! :)
I'm not sure how to distribute this static HTML version when it's ready for a public release. Currently it's about 500 Meg in size (that includes everything). As I mentioned above I have limited server resources. For distribution maybe it could be put on the Sourceforge download page, or on the Wikipedia.org server somewhere (/tarballs)?
I expect we could provide both a tarball and a static tree which could be rsync'ed.
Finally, since I am new to Wikipedia and this list, please excuse me while I learn how things work around here. I am open to criticism, suggestions and discussion. I am looking forward to working with everyone on Wikipedia and contributing where I can.
Some Technical Details (for those interested):
- English only (currently)
That will need to be fixed, of course! :)
- uses "printable" pages, no top or side navigation bars
Could probably stand to be purtied up at least a little bit.
- added links to home, back, copyright and Wikipedia.org to bottom of all pages (TODO: if a talk page exists a link should be added)
A link to that particular page on the live server would be a *very* good idea. The regular printable pages include this.
- pages are stored in directories based on first two characters of MD5 hash, same as image storage scheme
Some things to think about as far as the actual filenames:
* Length. Wikipedia titles can, I think, get up to ~255 characters; this may be too long for some systems.
* Acceptable characters. Colons, slashes, quotes, and various non-ascii characters may appear in titles that cannot be reliably reproduced on many filesystems. I notice that colons and commas at least are changed to underscores, possibly some other characters too; conflicts may occur. Non-ascii chars appear to be left intact; will this work consistently across different filesystems which may be configured for different character encodings?
* Case sensitivity. Many filesystems are not case sensitive; we may have conflicts.
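The three concerns above could be checked before writing any files. The sketch below is only an illustration: the set of replaced characters (beyond the colons and commas already observed), the 64-character limit, and the names safe_name and find_conflicts are all assumptions.

```python
import re
from collections import defaultdict

# Assumption: characters replaced with underscores. Only colons and
# commas are known to be replaced; the rest are added for illustration.
UNSAFE = re.compile(r"[:/\\\"'*?<>|,]")


def safe_name(title: str, max_len: int = 64) -> str:
    """Replace unsafe characters and truncate to a hypothetical limit."""
    name = UNSAFE.sub("_", title.replace(" ", "_"))
    return name[:max_len]


def find_conflicts(titles, max_len: int = 64) -> dict:
    """Group titles whose names collide on a case-insensitive filesystem."""
    groups = defaultdict(list)
    for title in titles:
        # casefold() approximates case-insensitive filename comparison.
        groups[safe_name(title, max_len).casefold()].append(title)
    return {name: group for name, group in groups.items() if len(group) > 1}
```

For example, find_conflicts(["AC/DC", "AC:DC", "Ac_dc", "Babel fish"]) reports the first three titles under the single name "ac_dc". Non-ascii characters are left untouched here, so the encoding question remains open.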
- includes all namespaces (talk, users, users_talk, wikipedia_talk, etc.)
User and talk pages are probably not necessary; if you're looking to discuss the page, you'll be doing it on the live site where you can edit it (and see the last 6 months' worth of edits which aren't on your CD-ROM). And, of course, they take up a large chunk of valuable CD real estate better devoted to future articles.
Thoughts?
- created a list with links to all the items in each namespace to allow for basic searching of page titles
A simple JavaScript-based title search could probably be rigged up out of that.
- redirects replaced with direct link to article
Nice.
-- brion vibber (brion @ pobox.com)
On 16-11-2002, Steve Rawlinson wrote thusly :
There seems to be some interest in creating a static HTML distribution (dump) of Wikipedia, most notably it is requested on the Wikipedia:Database_download page and in Feature Requests #596830 on Sourceforge. This would allow people to download the Wikipedia for use offline, for example from a CDROM.
So, I have started work and made my initial version (English only) available online for anyone on this list to evaluate and test. I am looking for feedback, suggestions, bug reports and general comments.
http://www.rawlinson.ca:8080/wikipedia/index.html
Please do not attempt to mirror the site as my server and bandwidth won't be able to handle it. The site is only intended for developers to try and give feedback. Once everyone is happy with it I will make .tar and .iso packages available for distribution.
Thanks for your work. I am looking forward to the "final" release.
I think this information should be passed on to the major Linux distribution vendors. It might be a plus for them to bundle a free encyclopedia in their distribution packs. This way Wikipedia would reach a very wide audience.
Regards, Kpjas.
wikitech-l@lists.wikimedia.org