Here's a little side project idea for anyone interested: I often see people wanting dumps of one or two articles from a Wikipedia wiki (most commonly en), and they usually have to resort to screen scraping via third-party tools, which isn't very nice and is sometimes even blocked (at the Squid level). There is the API, but it isn't exactly the easiest system for anyone to use. What might be nice is a little tool where people can enter a few article names in a box, click a button, and have it produce static HTML dumps of the desired article(s).
The tools exist. Ample documentation exists. Both programmatic interfaces and easy form-based interfaces exist.
Screen scraping still happens not only because of laziness but also because the correct way is not promoted. For example, if I access the English Wikipedia main page with libwww-perl (a banned UA), the response body says (among other things):
Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please _try again_ in a few minutes.
This is of course bullshit of the highest degree. It's certainly a permanent problem, and no amount of retrying will get me around the ban. The response body should say something like this:
Screen scraping is forbidden as it causes undue burden on the infrastructure and servers. Use the export feature to parse single pages, use the dump feature to parse a whole wiki.
http://mediawiki.org/wiki/Special:Export http://en.wikipedia.org/wiki/WP:Export
http://download.wikimedia.org/ http://en.wikipedia.org/wiki/WP:Download
Can anyone with access to the appropriate bugtracker file a bug on this, please?
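As an aside, fetching a single page through Special:Export from a script takes only a few lines. A rough Python sketch; the page title, User-Agent string and contact address are placeholders, not anything officially blessed:

import urllib.parse
import urllib.request

def export_article(title, site="https://en.wikipedia.org"):
    # Special:Export/<Title> returns the page as MediaWiki export XML
    # (current revision only by default), not rendered HTML.
    url = f"{site}/wiki/Special:Export/{urllib.parse.quote(title)}"
    req = urllib.request.Request(url, headers={
        # identify the client; blank or generic user agents may be blocked
        "User-Agent": "export-example/0.1 (contact: someone@example.org)",
    })
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    print(export_article("Perl")[:500])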
On Sat, Jun 27, 2009 at 5:40 PM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 <daxim@cpan.org> wrote:
The tools exist. Ample documentation exists. Both programmatic interfaces and easy form-based interfaces exist.
. . .
http://mediawiki.org/wiki/Special:Export http://en.wikipedia.org/wiki/WP:Export
http://download.wikimedia.org/ http://en.wikipedia.org/wiki/WP:Download
The [[Special:Export]] function exports the page in an XML format, whilst most people want a plain/static HTML dump of the page. The download packages are for whole-database collections, which is way too much for the people I'm suggesting this tool for: they just want several articles, not a several-gigabyte collection of every article page.
K. Peachey schrieb:
The [[Special:Export]] function exports the page in an XML format, whilst most people want a plain/static HTML dump of the page. The download packages are for whole-database collections, which is way too much for the people I'm suggesting this tool for: they just want several articles, not a several-gigabyte collection of every article page.
You are looking for action=render, then. I don't think there is a form-based interface for it, but creating one would be very trivial.
Example: http://en.wikipedia.org/wiki/Tool?action=render
-- daniel
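A form-based wrapper for action=render really would be small. A rough Python sketch of one possible shape, run locally; the port, the /render path and the redirect approach are arbitrary choices, not an existing tool:

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs, quote

WIKI = "https://en.wikipedia.org/w/index.php"

FORM = b"""<form method="get" action="/render">
  Article title: <input name="title">
  <button type="submit">Render</button>
</form>"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path == "/render":
            title = parse_qs(url.query).get("title", [""])[0]
            # just bounce the browser to MediaWiki's own render view
            self.send_response(302)
            self.send_header("Location",
                             f"{WIKI}?title={quote(title)}&action=render")
            self.end_headers()
        else:
            # serve the little input form for anything else
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(FORM)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()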
On Sat, Jun 27, 2009 at 3:40 AM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 <daxim@cpan.org> wrote:
This is of course bullshit of the highest degree. It's certainly a permanent problem, and no amount of retrying will get me around the ban. The response body should say something like this:
. . .
Can anyone with access to the appropriate bugtracker file a bug on this, please?
https://bugzilla.wikimedia.org/ is open for anyone to make an account. I'm not sure if any surgery would have to be done on Squid to get this to work, or what. I'm pretty sure you'd need a sysadmin to do it.
K. Peachey:
What might be nice is a little tool where people can enter a few article names in a box, click a button, and have it produce static HTML dumps of the desired article(s).
before anyone runs off to implement this, please remember that the Toolserver rules forbid serving article content to users.
- river.
River Tarnell wrote:
K. Peachey:
What might be nice is a little tool where people can enter a few article names in a box, click a button, and have it produce static HTML dumps of the desired article(s).
before anyone runs off to implement this, please remember that the Toolserver rules forbid serving article content to users.
- river.
It could be done in the downloads server. However, the result is pretty much the same as downloading the webpage.
Peachey, can you explain more thoroughly the exact format you want and the benefits it provides? The static HTML dump format isn't that useful for just a couple of articles.
Platonides:
It could be done in the downloads server.
what is the downloads server? (since this is toolserver-l, i'm talking about the toolserver; it can't be done on any of our servers, as it's against the rules.)
- river.
River Tarnell wrote:
Platonides:
It could be done in the downloads server.
what is the downloads server?
I presume he meant download.wikimedia.org
-- Regards,
Simon Walker
User:Stwalkerster on all public Wikimedia Foundation wikis
Administrator on the English Wikipedia
Developer of Helpmebot, the ACC tool, and Nubio 2 FAQ repository
Simon Walker:
I presume he meant download.wikimedia.org
i think it's very unlikely we'd run something like this on the static file server.
- river.
River Tarnell wrote:
Simon Walker:
I presume he meant download.wikimedia.org
i think it's very unlikely we'd run something like this on the static file server.
- river.
*If* there's such a need, it would be worth taking some CPU time from the servers doing the dumps to fulfill it, instead of making dumps faster when people are not using them and are wishing for something different. Changing a rendered MediaWiki page into the "static dump format" is really cheap. I don't see the benefit, though. So let Peachey convince us and perhaps then it can be done :)
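For a sense of how cheap that conversion is: a rough Python sketch that fetches the parsed body of a page via action=render and wraps it in a bare standalone HTML file. This is not the exact layout the static dumps use (no skin or stylesheets), and the User-Agent string and output filename are placeholders:

import urllib.request
from urllib.parse import quote

def save_static(title, site="https://en.wikipedia.org"):
    # action=render returns just the parsed article body as HTML
    url = f"{site}/w/index.php?title={quote(title)}&action=render"
    req = urllib.request.Request(
        url, headers={"User-Agent": "static-wrap-example/0.1"})
    with urllib.request.urlopen(req) as resp:
        body = resp.read().decode("utf-8")
    # wrap the fragment in a minimal standalone page and write it out
    page = (f"<!DOCTYPE html><html><head><meta charset='utf-8'>"
            f"<title>{title}</title></head><body>{body}</body></html>")
    with open(f"{title.replace('/', '_')}.html", "w", encoding="utf-8") as f:
        f.write(page)

if __name__ == "__main__":
    save_static("Tool")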