I've done some HTML dumps of Wikipedia:
This is something I've been working on for the last 6 months or so. It was requested by WiderNet, for their eGranary Digital Library project:
http://www.widernet.org/digitalLibrary/index.htm
The project distributes information to universities in developing countries, by filling large hard drives with free or donated content and delivering them. The dumps have been produced to their specifications, but I thought they might be useful to other people as well.
This is a "beta" release; I would recommend that most people wait for the second release, when a number of known bugs will be fixed.
-- Tim Starling
--- Tim Starling t.starling@physics.unimelb.edu.au wrote:
I've done some HTML dumps of Wikipedia:
Neat! Although things like stub and clean-up messages don't make any sense in a static version. Would it be possible to somehow meta tag those templates so they don't show up in a static dump?
stub example: http://static.wikipedia.org/en/m/a/n/MANPADS_5a96.html
-- mav
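A minimal sketch of how the suggestion above might work, assuming templates such as stub and clean-up notices were rendered inside an element carrying a marker class. The class name `nostatic` and the function `strip_marked` are hypothetical; nothing in this thread says MediaWiki or the dump script works this way. The sketch also assumes well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class _Stripper(HTMLParser):
    """Re-emit HTML, omitting any element whose class list contains `marker`."""

    def __init__(self, marker):
        super().__init__(convert_charrefs=False)
        self.marker = marker
        self.out = []
        self.skip_depth = 0  # > 0 while inside a marked element

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.marker in classes:
            if tag not in VOID:
                self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_marked(html, marker="nostatic"):
    """Return `html` without elements marked with the (hypothetical) class."""
    parser = _Stripper(marker)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The same marker could in principle be hidden with a stylesheet rule instead; removing the markup outright also saves space in the dump.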
Earlier, elian wrote:
If we're going to offer it for download, we should keep in mind that people won't put it only at their computers at home, but also back on the web. In this case, it absolutely should conform to the license, with the usual wikipedia back links, history links and a copy of the license included. And we should also remove the wikipedia logo not to invite any imitations and trademark violations.
I just wanted to be clear that I agree with this completely. At the bare minimum of a first change, the page http://static.wikipedia.org/ should make it very clear that this dump is not for commercial use with our logo still on it.
Jimmy Wales wrote:
Earlier, elian wrote:
If we're going to offer it for download, we should keep in mind that people won't put it only at their computers at home, but also back on the web. In this case, it absolutely should conform to the license, with the usual wikipedia back links, history links and a copy of the license included. And we should also remove the wikipedia logo not to invite any imitations and trademark violations.
I just wanted to be clear that I agree with this completely. At the bare minimum of a first change, the page http://static.wikipedia.org/ should make it very clear that this dump is not for commercial use with our logo still on it.
I thought it did. I wrote:
"There are a number of known bugs, also it is likely that these dumps do not fully satisfy the license requirements of the GFDL. Also, note that putting one of these dumps on the web unmodified will constitute a trademark violation. They are intended for private viewing in an intranet or desktop installation."
-- Tim Starling
Tim Starling wrote:
I just wanted to be clear that I agree with this completely. At the bare minimum of a first change, the page http://static.wikipedia.org/ should make it very clear that this dump is not for commercial use with our logo still on it.
I thought it did. I wrote:
"There are a number of known bugs, also it is likely that these dumps do not fully satisfy the license requirements of the GFDL. Also, note that putting one of these dumps on the web unmodified will constitute a trademark violation. They are intended for private viewing in an intranet or desktop installation."
Ok, my mistake, I'm very very sorry. I didn't see that. I apologize for any confusion.
--Jimbo
I like it, but I think that you should make some sort of convention for redlinks. Seeing as there's no page there, it should link to a generic 'nothing here' page or something of the sort.
On 10/15/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
I've done some HTML dumps of Wikipedia:
This is something I've been working on for the last 6 months or so. It was requested by WiderNet, for their eGranary Digital Library project:
http://www.widernet.org/digitalLibrary/index.htm
The project distributes information to universities in developing countries, by filling large hard drives with free or donated content and delivering them. The dumps have been produced to their specifications, but I thought they might be useful to other people as well.
This is a "beta" release; I would recommend that most people wait for the second release, when a number of known bugs will be fixed.
-- Tim Starling
Wikipedia-l mailing list Wikipedia-l@Wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikipedia-l
-- ~Ilya N.
Ilya N. wrote:
I like it, but I think that you should make some sort of convention for redlinks. Seeing as there's no page there, it should link to a generic 'nothing here' page or something of the sort.
On a static dump such links aren't really very useful: they don't go anywhere, and you can't do anything once you've got there. So they're just silently dropped.
-- brion vibber (brion @ pobox.com)
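As an illustration of "silently dropped", here is a minimal sketch that replaces each redlink with its plain text. It assumes redlinks are anchors whose `class` attribute contains `new`, as in MediaWiki's default HTML output; the actual dump script may do this differently, and `drop_redlinks` is a hypothetical name.

```python
import re

# Assumption: a redlink is an <a> element whose class attribute contains "new".
REDLINK = re.compile(
    r'<a\b[^>]*\bclass="[^"]*\bnew\b[^"]*"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def drop_redlinks(html: str) -> str:
    """Replace every redlink anchor with its visible link text."""
    return REDLINK.sub(r"\1", html)
```

Keeping the link text while removing the anchor means the sentence still reads normally; only the dead link disappears.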
Brion Vibber wrote:
Ilya N. wrote:
I like it, but I think that you should make some sort of convention for redlinks. Seeing as there's no page there, it should link to a generic 'nothing here' page or something of the sort.
On a static dump such links aren't really very useful: they don't go anywhere, and you can't do anything once you've got there. So they're just silently dropped.
They aren't useful as links, but sometimes people do imbue wikilinks with some semantic meaning, most often an implied "note that this word is being used in a technical sense rather than colloquially". So the article might read a bit strangely if the link is dropped. Of course, whether this happens often enough to outweigh the strangeness of having useless links in an article is another matter...
-Mark
"Tim Starling" t.starling@physics.unimelb.edu.au wrote in message news:dir0g4$47q$1@sea.gmane.org...
I've done some HTML dumps of Wikipedia:
Looking good so far. Congratulations.
This is a "beta" release; I would recommend that most people wait for the second release, when a number of known bugs will be fixed.
Would this include the "syntax error" which is all that IE will grudgingly divulge as an explanation why the little yellow triangle appears?
Is there a Wiki page for this somewhere? I'd like to keep track of progress...
Phil Boswell wrote:
"Tim Starling" t.starling@physics.unimelb.edu.au wrote in message news:dir0g4$47q$1@sea.gmane.org...
I've done some HTML dumps of Wikipedia:
Looking good so far. Congratulations.
This is a "beta" release; I would recommend that most people wait for the second release, when a number of known bugs will be fixed.
Would this include the "syntax error" which is all that IE will grudgingly divulge as an explanation why the little yellow triangle appears?
Is there a Wiki page for this somewhere? I'd like to keep track of progress...
I've just made one at http://meta.wikimedia.org/wiki/Static_dumps - go forth and make it sensible!
-- Alphax
Tim Starling wrote:
I've done some HTML dumps of Wikipedia:
It is very nice. But I do not see any reason to include the talk page.
Walter Vermeir wrote:
Tim Starling wrote:
I've done some HTML dumps of Wikipedia:
It is very nice. But I do not see any reason to include the talk page.
Talk pages include information vital to the diligent reader, such as factual disputes and source information. Once we are sure Wikipedia is perfect, we can remove the talk pages, and the reader can base their trust in the accuracy of our material on the force of our authority alone.
-- Tim Starling
Tim Starling wrote:
Talk pages include information vital to the diligent reader, such as factual disputes and source information. Once we are sure Wikipedia is perfect, we can remove the talk pages, and the reader can base their trust in the accuracy of our material on the force of our authority alone.
I agree with you completely. Many users will want to see the talk pages.
However, as a potential user of this, I find the 27GB size a bit daunting. I have, at this moment, 8GB free space on my laptop.
I wonder how much work it would be for you to generate a few different versions...
Complete (current version)
Complete, no images
Minimum (no talk pages, no images)
A really cool thing to have would be some kind of cleverly constructed "middle" version -- some images, but not all the images -- but I can't at the moment think of any clever way to decide algorithmically which images to include. Maybe exclude images over a certain size, or include a maximum of one image per article or similar?
--Jimbo
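The two heuristics floated above (skip images over a certain size, keep at most one image per article) can be sketched as follows. The record shape, the function name `select_images`, and the default thresholds are all assumptions made for illustration; the thread does not specify any of them.

```python
from typing import Dict, List, Tuple

# Hypothetical record: (file name, size in bytes).
Image = Tuple[str, int]


def select_images(
    articles: Dict[str, List[Image]],
    max_bytes: int = 100_000,
    max_per_article: int = 1,
) -> Dict[str, List[Image]]:
    """Pick the images to ship in a reduced "middle" dump.

    For each article, drop images larger than `max_bytes`, then keep
    the first `max_per_article` of what remains, in article order.
    """
    selected = {}
    for title, images in articles.items():
        small = [img for img in images if img[1] <= max_bytes]
        selected[title] = small[:max_per_article]
    return selected
```

Taking the first qualifying image favours whatever appears earliest in the article, which is often the lead illustration, but that is a guess rather than a measured result.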
Jimmy Wales wrote:
Tim Starling wrote:
Talk pages include information vital to the diligent reader, such as factual disputes and source information. Once we are sure Wikipedia is perfect, we can remove the talk pages, and the reader can base their trust in the accuracy of our material on the force of our authority alone.
I agree with you completely. Many users will want to see the talk pages.
However, as a potential user of this, I find the 27GB size a bit daunting. I have, at this moment, 8GB free space on my laptop.
I wonder how much work it would be for you to generate a few different versions...
Complete (current version)
Complete, no images
Minimum (no talk pages, no images)
A really cool thing to have would be some kind of cleverly constructed "middle" version -- some images, but not all the images -- but I can't at the moment think of any clever way to decide algorithmically which images to include. Maybe exclude images over a certain size, or include a maximum of one image per article or similar?
Theoretically, all images in articles are thumbnails. Just include the thumbnail versions of the images.
-- Alphax
On 10/20/05, Alphax alphasigmax@gmail.com wrote:
Theoretically, all images in articles are thumbnails. Just include the thumbnail versions of the images.
I suppose it's the same issue for the wikitext itself, but keep in mind the need to offer a copy in the preferred form for editing.
I'm not a huge fan of HTML dumps in general. Wikitext tends to be much smaller (even compressed), and if you're going to do fancy compression stuff you'll need to include a reader, which makes me wonder: why not a full wikitext viewer?
Gregory Maxwell wrote:
On 10/20/05, Alphax alphasigmax@gmail.com wrote:
Theoretically, all images in articles are thumbnails. Just include the thumbnail versions of the images.
I suppose it's the same issue for the wikitext itself, but keep in mind the need to offer a copy in the preferred form for editing.
I'm not a huge fan of HTML dumps in general. Wikitext tends to be much smaller (even compressed), and if you're going to do fancy compression stuff you'll need to include a reader, which makes me wonder: why not a full wikitext viewer?
Pilaf and Lupin have done some work on implementing a wikitext viewer in Javascript; see http://en.wikipedia.org/wiki/User:Pilaf/Live_Preview and http://en.wikipedia.org/wiki/Wikipedia:Tools/Navigation_popups for more information.
-- Alphax
Tim Starling wrote:
Walter Vermeir wrote:
Tim Starling wrote:
I've done some HTML dumps of Wikipedia:
It is very nice. But I do not see any reason to include the talk page.
Talk pages include information vital to the diligent reader, such as factual disputes and source information. Once we are sure Wikipedia is perfect, we can remove the talk pages, and the reader can base their trust in the accuracy of our material on the force of our authority alone.
That sounds like a prediction that talk pages will stay forever. :-)
Ec
wikipedia-l@lists.wikimedia.org