Hi all,
I have made a list of all 1.9M articles in NS0 (including redirects / short pages) using the Toolserver; now that I have the list, I'm going to download every single one of 'em (after a trial run tonight to see how this works out; I'd like to begin downloading the whole thing in 3 or 4 days, if no one objects) and then publish a static dump of it. Data collection will happen on the Toolserver (/mnt/user-store/dewiki-static/articles/); the request rate will be 1 article per second, and I'll copy the new files once or twice a day to my home PC, so there should be no problem with TS or Wikimedia server load. When this is finished in ~21-22 days, I'm going to compress the files and upload them as a tgz to my private server (well, if Wikimedia has an archive server, that'd be better) so others can play with it. Furthermore, though I have no idea if I'll succeed, I plan on hacking up a static Vector skin file that loads the articles using jQuery's excellent .load() feature, so that everyone with JS can enjoy a truly offline Wikipedia.
Marco
PS: When trying to invoke /w/index.php?action=render with an invalid oldid, the server returns HTTP/1.1 200 OK and an error message, but shouldn't this be a 404 or 500?
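(For illustration, here is a minimal sketch of the kind of rate-limited ?action=render fetch loop described above, written in Python with the requests library. The title-list format, output paths and User-Agent string are placeholders for this sketch, not Marco's actual setup.)

    # Minimal sketch of a rate-limited ?action=render downloader.
    # Illustrative only: the input list, file layout and contact address are placeholders.
    import os
    import time
    import urllib.parse

    import requests

    BASE = "https://de.wikipedia.org/w/index.php"
    OUT_DIR = "articles"                 # e.g. /mnt/user-store/dewiki-static/articles/
    TITLES_FILE = "titles.txt"           # hypothetical input: one NS0 title per line
    USER_AGENT = "dewiki-static-dump/0.1 (contact: you@example.org)"

    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT
    os.makedirs(OUT_DIR, exist_ok=True)

    with open(TITLES_FILE, encoding="utf-8") as f:
        for line in f:
            title = line.strip()
            if not title:
                continue
            resp = session.get(BASE, params={"title": title, "action": "render"}, timeout=30)
            resp.raise_for_status()      # note: some errors still come back as 200 (see the PS above)
            out_name = urllib.parse.quote(title, safe="") + ".html"
            with open(os.path.join(OUT_DIR, out_name), "w", encoding="utf-8") as out:
                out.write(resp.text)
            time.sleep(1)                # 1 article per second, as proposed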
On 9/23/2010 6:57 PM, Marco Schuster wrote:
Hi all,
I have made a list of all 1.9M articles in NS0 (including redirects / short pages) using the Toolserver; now that I have the list, I'm going to download every single one of 'em (after a trial run tonight to see how this works out; I'd like to begin downloading the whole thing in 3 or 4 days, if no one objects) and then publish a static dump of it.
Why not work with Wikimedia to fix/resume their static HTML dumps?
On Fri, Sep 24, 2010 at 3:21 AM, Q overlordq@gmail.com wrote:
On 9/23/2010 6:57 PM, Marco Schuster wrote:
Hi all,
I have made a list of all 1.9M articles in NS0 (including redirects / short pages) using the Toolserver; now that I have the list, I'm going to download every single one of 'em (after a trial run tonight to see how this works out; I'd like to begin downloading the whole thing in 3 or 4 days, if no one objects) and then publish a static dump of it.
Why not work with Wikimedia to fix/resume their static HTML dumps?
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
On 23-09-2010, Thu, at 21:27 -0500, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Ariel
" I want to see how this works out." - Marco
Rather than trying to rope him into something just yet, let's give him the chance to see if what he wants to do works first and then ask.
Cheers,
-Promethean
-----Original Message-----
From: Ariel T. Glenn (via toolserver-l-bounces@lists.wikimedia.org)
Sent: Friday, 24 September 2010 2:05 PM
To: toolserver-l@lists.wikimedia.org
Subject: Re: [Toolserver-l] Static dump of German Wikipedia
On 23-09-2010, Thu, at 21:27 -0500, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Ariel
On 24-09-2010, Fri, at 14:45 +0930, Brett Hillebrand wrote:
"I want to see how this works out." - Marco
Rather than trying to rope him into something just yet, let's give him the chance to see if what he wants to do works first and then ask.
Well... our priorities are roping people in sooner rather than later :-D
Ariel
On Fri, Sep 24, 2010 at 6:35 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
On 23-09-2010, Thu, at 21:27 -0500, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Yep, have both - can we talk on IRC?
Marco
Ariel T. Glenn wrote:
On 23-09-2010, Thu, at 21:27 -0500, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Ariel
Most (all?) articles should already be parsed in memcached. I think the bottleneck would be the compression. Note, however, that the ParserOutput would still need postprocessing, as would ?action=render output. The first thing that comes to my mind is removing the edit links (this use case alone seems reason enough for implementing editsection stripping). Sadly, we can't (easily) add the edit sections back after the rendering.
On Sat, Sep 25, 2010 at 12:56 AM, Platonides platonides@gmail.com wrote:
Ariel T. Glenn wrote:
On 23-09-2010, Thu, at 21:27 -0500, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Ariel
Most (all?) articles should already be parsed in memcached. I think the bottleneck would be the compression. Note, however, that the ParserOutput would still need postprocessing, as would ?action=render output. The first thing that comes to my mind is removing the edit links (this use case alone seems reason enough for implementing editsection stripping). Sadly, we can't (easily) add the edit sections back after the rendering.
This should be doable with a simple regex that just looks for <span class="editsection">.
Marco
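(For illustration, a sketch of the regex approach Marco suggests above, in Python. It assumes the 2010-era skin markup where the edit link sits in a flat <span class="editsection">...</span> with no nested spans; the helper name is made up.)

    # Sketch: strip section edit links from rendered HTML.
    # Assumes the editsection span contains no nested <span> elements.
    import re

    EDITSECTION_RE = re.compile(r'<span class="editsection">.*?</span>\s*', re.DOTALL)

    def strip_editsections(html: str) -> str:
        # Non-greedy .*? stops at the first closing </span>.
        return EDITSECTION_RE.sub("", html)

    # Example:
    # strip_editsections('<h2><span class="editsection">[<a href="#">edit</a>]</span> History</h2>')
    # returns '<h2>History</h2>'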
Perhaps build the base compilation from a database dump and then pull any updates directly/live from the DB; that would reduce some of the load on the API. -Peachey
Marco Schuster wrote:
On Sat, Sep 25, 2010 at 12:56 AM, Platonides platonides@gmail.com wrote:
Actually, it's not so much that they are at the bottom of the list as that there are only two people potentially looking at them: Tomasz (who is also doing mobile) and me (and I am doing the XML dumps rather than the HTML ones, until those are reliable and I'm happy with them).
However, if you are interested in working on these, I am *very* happy to help with suggestions, testing, feedback, etc., even while I am still working on the XML dumps. Do you have time and interest?
Ariel
Most (all?) articles should already be parsed in memcached. I think the bottleneck would be the compression. Note, however, that the ParserOutput would still need postprocessing, as would ?action=render output. The first thing that comes to my mind is removing the edit links (this use case alone seems reason enough for implementing editsection stripping). Sadly, we can't (easily) add the edit sections back after the rendering.
This should be doable with a simple regex that just looks for <span class="editsection">.
Marco
It is (with the current skins). I meant as a core feature, which would need to be more precise.
Well, AFAIK PediaPress, openZIM and a few others have started working on enhancing Extension:Collection to create ZIM files, which is essentially a special compressed HTML format.
We had a Skype conference two weeks ago, but I am not in the loop on what has happened since then. My last status is that Tommi from openZIM was going to fix the zimwriter interfaces so that the filesource plugin can be used for this.
/Manuel
On 24.09.2010 04:27, Q wrote:
Given that static dumps have been broken for *years* now and are at the bottom of WMF's priority list, I thought it would be best if I just went ahead and built something that can be used (and, of course, improved).
Marco
That's what I just said: work with them to fix it, i.e. volunteer, i.e. you fix it.
Marco Schuster marco@harddisk.is-a-geek.org wrote:
Hi all,
I have made a list of all 1.9M articles in NS0 (including redirects / short pages) using the Toolserver; now that I have the list, I'm going to download every single one of 'em
There are static dumps available here:
http://download.wikimedia.org/dewiki/
Is there any problem with using them?
//Marcin
On 9/24/10, Marcin Cieslak saper@saper.info wrote:
There are static dumps available here:
http://download.wikimedia.org/dewiki/
Is there any problem with using them?
I think they are from June 2008.
A fresh static dump would be good.
-- John Vandenberg
John Vandenberg jayvdb@gmail.com wrote:
http://download.wikimedia.org/dewiki/
Is there any problem with using them?
I think they are from June 2008.
Are they?
http://download.wikimedia.org/dewiki/20100903/
//Marcin
On Fri, Sep 24, 2010 at 3:44 AM, Marcin Cieslak saper@saper.info wrote:
John Vandenberg jayvdb@gmail.com wrote:
http://download.wikimedia.org/dewiki/
Is there any problem with using them?
I think they are from June 2008.
Are they?
These are the database dumps. In order to get any HTML out of them, you need to set up either MediaWiki or a replacement parser; not to mention the delicate things the enWP folks did with template magic, which requires setting up ParserFunctions - and these might even depend on whatever version is currently running live. That's why static dumps (or ?action=render output) are what you need when you want to create offline versions or things like Mobipocket Wikipedia (which is my actual goal with the static dump).
Marco
I suppose you have already read about doing requests single-threaded, the maxlag parameter and so on. Make sure you use a User-Agent that clearly leads back to you in case it causes problems.
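(For illustration, a sketch of what honouring maxlag could look like in Python with requests, on top of the fetch loop sketched earlier in the thread. When replication lag exceeds the threshold, MediaWiki answers with a 503 and a Retry-After header; the threshold, retry cap and User-Agent below are placeholder values.)

    # Sketch: fetch one rendered page with maxlag, backing off when the servers report lag.
    import time

    import requests

    BASE = "https://de.wikipedia.org/w/index.php"
    HEADERS = {"User-Agent": "dewiki-static-dump/0.1 (contact: you@example.org)"}  # placeholder

    def fetch_rendered(title: str, maxlag: int = 5, max_retries: int = 5) -> str:
        params = {"title": title, "action": "render", "maxlag": maxlag}
        for _ in range(max_retries):
            resp = requests.get(BASE, params=params, headers=HEADERS, timeout=30)
            if resp.status_code == 503 and "Retry-After" in resp.headers:
                # Lag above threshold: wait as instructed, then retry.
                time.sleep(int(resp.headers["Retry-After"]))
                continue
            resp.raise_for_status()
            return resp.text
        raise RuntimeError(f"giving up on {title!r} after {max_retries} lag retries")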