Hi all,
I have made a list of all 1.9M articles in NS0 (including redirects / short pages) using the Toolserver; now that I have the list, I'm going to download every single one of 'em (after a trial run tonight to see how this works out; I'd like to begin downloading the whole thing in 3 or 4 days, if no one objects) and then publish a static dump of it. Data collection will happen on the Toolserver (/mnt/user-store/dewiki-static/articles/); the request rate will be 1 article per second, and I'll copy the new files once or twice a day to my home PC, so there should be no problem with TS or Wikimedia server load. When this is finished in ~21-22 days, I'm going to compress the files and upload them as a tgz to my private server (well, if Wikimedia has an archive server, that'd be better) so others can play with it. Furthermore, though I have no idea if I'll succeed, I plan on hacking up a static Vector skin file that loads the articles using jQuery's excellent .load() feature, so that everyone with JS can enjoy a truly offline Wikipedia.
Marco
PS: When trying to invoke /w/index.php?action=render with an invalid oldid, the server returns HTTP/1.1 200 OK and an error message, but shouldn't this be a 404 or 500?
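(For illustration, here is a minimal sketch of the kind of rate-limited ?action=render fetch loop described above, written in Python with the requests library. The title-list format, output paths and User-Agent string are placeholders for this sketch, not Marco's actual setup.)

    # Minimal sketch of a rate-limited ?action=render downloader.
    # Illustrative only: the input list, file layout and contact address are placeholders.
    import os
    import time
    import urllib.parse

    import requests

    BASE = "https://de.wikipedia.org/w/index.php"
    OUT_DIR = "articles"                 # e.g. /mnt/user-store/dewiki-static/articles/
    TITLES_FILE = "titles.txt"           # hypothetical input: one NS0 title per line
    USER_AGENT = "dewiki-static-dump/0.1 (contact: you@example.org)"

    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT
    os.makedirs(OUT_DIR, exist_ok=True)

    with open(TITLES_FILE, encoding="utf-8") as f:
        for line in f:
            title = line.strip()
            if not title:
                continue
            resp = session.get(BASE, params={"title": title, "action": "render"}, timeout=30)
            resp.raise_for_status()      # note: some errors still come back as 200 (see the PS above)
            out_name = urllib.parse.quote(title, safe="") + ".html"
            with open(os.path.join(OUT_DIR, out_name), "w", encoding="utf-8") as out:
                out.write(resp.text)
            time.sleep(1)                # 1 article per second, as proposed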
On 9/23/2010 6:57 PM, Marco Schuster wrote:
Hi all,
I have made a list of all 1.9M articles in NS0 (including redirects / short pages) using the Toolserver; now that I have the list, I'm going to download every single one of 'em (after a trial run tonight to see how this works out; I'd like to begin downloading the whole thing in 3 or 4 days, if no one objects) and then publish a static dump of it.
Why not work with Wikimedia to fix/resume their static HTML dumps?
On Fri, Sep 24, 2010 at 3:21 AM, Q overlordq@gmail.com wrote:
On 9/23/2010 6:57 PM, Marco Schuster wrote:
Hi all,
I have made a list of all 1.9M articles in NS0 (including redirects / short pages) using the Toolserver; now that I have the list, I'm going to download every single one of 'em (after a trial run tonight to see how this works out; I'd like to begin downloading the whole thing in 3 or 4 days, if no one objects) and then publish a static dump of it.
Why not work with Wikimedia to fix/resume their static HTML dumps?
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
On 23-09-2010, Thu, at 21:27 -0500, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Ariel
" I want to see how this works out." - Marco
Rather than trying to rope him into something just yet, let's give him the chance to see if what he wants to do works first and then ask.
Cheers,
-Promethean
-----Original Message-----
From: Ariel T. Glenn (via toolserver-l-bounces@lists.wikimedia.org)
Sent: Friday, 24 September 2010 2:05 PM
To: toolserver-l@lists.wikimedia.org
Subject: Re: [Toolserver-l] Static dump of German Wikipedia
On 23-09-2010, Thu, at 21:27 -0500, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Ariel
On 24-09-2010, Fri, at 14:45 +0930, Brett Hillebrand wrote:
"I want to see how this works out." - Marco
Rather than trying to rope him into something just yet, let's give him the chance to see if what he wants to do works first and then ask.
Well... our priorities are roping people in sooner rather than later :-D
Ariel
On Fri, Sep 24, 2010 at 6:35 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
On 23-09-2010, Thu, at 21:27 -0500, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Yep, have both - can we talk on IRC?
Marco
Ariel T. Glenn wrote:
On 23-09-2010, Thu, at 21:27 -0500, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Ariel
Most (all?) articles should already be parsed in memcached. I think the bottleneck would be the compression. Note, however, that the ParserOutput would still need postprocessing, as would ?action=render output. The first thing that comes to my mind is removing the edit links (this use case alone seems reason enough for implementing editsection stripping). Sadly, we can't (easily) add the edit sections back after the rendering.
On Sat, Sep 25, 2010 at 12:56 AM, Platonides platonides@gmail.com wrote:
Ariel T. Glenn wrote:
On 23-09-2010, Thu, at 21:27 -0500, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Ariel
Most (all?) articles should already be parsed in memcached. I think the bottleneck would be the compression. Note, however, that the ParserOutput would still need postprocessing, as would ?action=render output. The first thing that comes to my mind is removing the edit links (this use case alone seems reason enough for implementing editsection stripping). Sadly, we can't (easily) add the edit sections back after the rendering.
This should be doable with a simple regex that just looks for <span class="editsection">.
Marco
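(For illustration, a sketch of the regex approach Marco suggests above, in Python. It assumes the 2010-era skin markup where the edit link sits in a flat <span class="editsection">...</span> with no nested spans; the helper name is made up.)

    # Sketch: strip section edit links from rendered HTML.
    # Assumes the editsection span contains no nested <span> elements.
    import re

    EDITSECTION_RE = re.compile(r'<span class="editsection">.*?</span>\s*', re.DOTALL)

    def strip_editsections(html: str) -> str:
        # Non-greedy .*? stops at the first closing </span>.
        return EDITSECTION_RE.sub("", html)

    # Example:
    # strip_editsections('<h2><span class="editsection">[<a href="#">edit</a>]</span> History</h2>')
    # returns '<h2>History</h2>'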
Perhaps build the base compilation from a database dump and then pull any updates directly/live from the DB; that would reduce some of the load on the API. -Peachey
Marco Schuster wrote:
On Sat, Sep 25, 2010 at 12:56 AM, Platonides platonides@gmail.com wrote:
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Ariel
Most (all?) articles should already be parsed in memcached. I think the bottleneck would be the compression. Note, however, that the ParserOutput would still need postprocessing, as would ?action=render output. The first thing that comes to my mind is removing the edit links (this use case alone seems reason enough for implementing editsection stripping). Sadly, we can't (easily) add the edit sections back after the rendering.
This should be doable with a simple regex that just looks for <span class="editsection">.
Marco
It is (with the current skins). I meant as a core feature, which would need to be more precise.
Well, AFAIK PediaPress, openZIM and a few others have started working on enhancing Extension:Collection to create ZIM files, which is essentially a special compressed HTML format.
We had a Skype conference two weeks ago, but I am not in the loop on what has happened since then. My last status is that Tommi from openZIM was going to fix the zimwriter interfaces so that the filesource plugin can be used for this.
/Manuel
On 24.09.2010 04:27, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Marco Schuster marco@harddisk.is-a-geek.org wrote:
Hi all,
I have made a list of all 1.9M articles in NS0 (including redirects / short pages) using the Toolserver; now that I have the list, I'm going to download every single one of 'em
There are static dumps available here:
http://download.wikimedia.org/dewiki/
Is there any problem with using them?
//Marcin
On 9/24/10, Marcin Cieslak saper@saper.info wrote:
There are static dumps available here:
http://download.wikimedia.org/dewiki/
Is there any problem with using them?
I think they are from June 2008.
A fresh static dump would be good.
-- John Vandenberg
John Vandenberg jayvdb@gmail.com wrote:
http://download.wikimedia.org/dewiki/
Is there any problem with using them?
I think they are from June 2008.
Are they?
http://download.wikimedia.org/dewiki/20100903/
//Marcin
On Fri, Sep 24, 2010 at 3:44 AM, Marcin Cieslak saper@saper.info wrote:
John Vandenberg jayvdb@gmail.com wrote:
http://download.wikimedia.org/dewiki/
Is there any problem with using them?
I think they are from June 2008.
Are they?
These are the database dumps. In order to get any HTML out of them, you need to set up either MediaWiki or a replacement parser; not to mention the delicate things the enWP folks did with template magic, which requires setting up ParserFunctions - and these might even depend on whatever version is currently running live. That's why static dumps (or ?action=render output) are what you need when you want to create offline versions or things like Mobipocket Wikipedia (which is my actual goal with the static dump).
Marco
I suppose you have already read about doing requests single-threaded, the maxlag parameter and so on. Make sure you use a User-Agent that clearly leads back to you in case it causes problems.
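(For illustration, a sketch of what honouring maxlag could look like in Python with requests, on top of the fetch loop sketched earlier in the thread. When replication lag exceeds the threshold, MediaWiki answers with a 503 and a Retry-After header; the threshold, retry cap and User-Agent below are placeholder values.)

    # Sketch: fetch one rendered page with maxlag, backing off when the servers report lag.
    import time

    import requests

    BASE = "https://de.wikipedia.org/w/index.php"
    HEADERS = {"User-Agent": "dewiki-static-dump/0.1 (contact: you@example.org)"}  # placeholder

    def fetch_rendered(title: str, maxlag: int = 5, max_retries: int = 5) -> str:
        params = {"title": title, "action": "render", "maxlag": maxlag}
        for _ in range(max_retries):
            resp = requests.get(BASE, params=params, headers=HEADERS, timeout=30)
            if resp.status_code == 503 and "Retry-After" in resp.headers:
                # Lag above threshold: wait as instructed, then retry.
                time.sleep(int(resp.headers["Retry-After"]))
                continue
            resp.raise_for_status()
            return resp.text
        raise RuntimeError(f"giving up on {title!r} after {max_retries} lag retries")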