Hi all, I have built a LocalWiki. Now I want to keep its data consistent with Wikipedia, and one task I need to do is fetch the updated data from Wikipedia. I get the URLs by parsing the recent-changes RSS feed (http://zh.wikipedia.org/w/index.php?title=Special:%E6%9C%80%E8%BF%91%E6%9B%B...), and for each URL I open the page, click "edit this page", and extract all the HTML content of the edit box. (e.g. http://zh.wikipedia.org/w/index.php?title=%E8%B2%A1%E7%A5%9E%E5%88%B0_(%E9%8... , whose edit interface is http://zh.wikipedia.org/w/index.php?title=%E8%B2%A1%E7%A5%9E%E5%88%B0_(%E9%8... .) However, I have run into two problems. Firstly, sometimes I cannot open a URL taken from the RSS feed and I do not know why. Is it because I visit the site too frequently and my IP address has been blocked, or is the network just too slow? If it is the former, how often may I request a page from Wikipedia? Is there a timeout? Secondly, as mentioned above, I want to download all the HTML content of the edit box, but while this sometimes works, at other times I can only download part of it. What could be the reason?
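Roughly, my current approach looks like the following Python sketch (the feed URL is written here with the canonical Special:RecentChanges name rather than the Chinese one, and the edit-box extraction is simplified):

import html
import re
import urllib.request
import xml.etree.ElementTree as ET

# the localized feed URL in my message should behave the same way
FEED = "http://zh.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=rss"

def fetch(url):
    req = urllib.request.Request(url, headers={"User-Agent": "LocalWikiSyncBot/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")

def changed_page_urls():
    # every <item><link> in the feed points at one changed page
    root = ET.fromstring(fetch(FEED))
    return [item.findtext("link") for item in root.iter("item")]

def edit_box_content(page_url):
    # open the edit interface and pull the wpTextbox1 textarea,
    # which MediaWiki uses for the edit box
    page = fetch(page_url + "&action=edit")   # assumes the link already has ?title=...
    m = re.search(r'<textarea[^>]*name="wpTextbox1"[^>]*>(.*?)</textarea>', page, re.S)
    return html.unescape(m.group(1)) if m else None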
Thanks
vanessa
On 28 January 2010 15:06, 李琴 qli@ica.stc.sh.cn wrote:
Hi all, I have built a LocalWiki. Now I want to keep its data consistent with Wikipedia, and one task I need to do is fetch the updated data from Wikipedia. I get the URLs by parsing the recent-changes RSS feed (http://zh.wikipedia.org/w/index.php?title=Special:%E6%9C%80%E8%BF%91%E6%9B%B...), and for each URL I open the page, click "edit this page", and extract all the HTML content of the edit box.
....
Is it because I visit the site too frequently and my IP address has been blocked, or is the network just too slow?
李琴: well... that's web scraping, which is a fragile technique, one that is prone to errors and generates a lot of traffic.
One thing a robot must do is read and follow the http://zh.wikipedia.org/robots.txt file (you should probably read it too). As a general rule of the Internet, a "rude" robot will be banned by the site admins.
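For example, a quick check with Python's standard robotparser (the bot name in the User-Agent is only a placeholder):

from urllib import robotparser

USER_AGENT = "LocalWikiSyncBot/0.1"   # placeholder name, not a registered bot

rp = robotparser.RobotFileParser("http://zh.wikipedia.org/robots.txt")
rp.read()   # download and parse robots.txt once at startup

url = "http://zh.wikipedia.org/w/index.php?title=Special:RecentChanges"
if rp.can_fetch(USER_AGENT, url):
    print("allowed to fetch", url)
else:
    print("robots.txt disallows", url)   # a polite robot simply skips it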
It would be a good idea to announce your bot as a bot in the User-Agent string. Good bot behaviour means reading the website roughly the way a human would. I don't know, maybe 10 requests per minute? I don't know what Wikipedia's own rules about this are.
What you are suffering could be automatic or manual throttling, triggered because an abusive number of requests has been detected from your IP.
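Something like this would at least keep the bot polite (the delay and the User-Agent value are only a guess; check the site's own rules):

import time
import urllib.request

USER_AGENT = "LocalWikiSyncBot/0.1 (contact: you@example.org)"   # placeholder
MIN_DELAY = 6.0       # seconds between requests, roughly 10 requests per minute
_last_request = 0.0

def polite_fetch(url):
    """Fetch a URL, identifying the bot and throttling to MIN_DELAY."""
    global _last_request
    wait = MIN_DELAY - (time.time() - _last_request)
    if wait > 0:
        time.sleep(wait)                      # slow down before the next request
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()
    _last_request = time.time()
    return body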
"Wikipedia" seems to provide fulldumps of his wiki, but are unusable for you, since are giganteous :-/, trying to rebuilt wikipedia on your PC with a snapshot would be like summoning Tchulu in a teapot. But.. I don't know, maybe the zh version is smaller, or your resources powerfull enough. One feels that what you have built has a severe overload (wastage of resources) and there must be better ways to do it...
On Thu, Jan 28, 2010 at 5:02 PM, Tei oscar.vives@gmail.com wrote:
....
"Wikipedia" seems to provide fulldumps of his wiki, but are unusable for you, since are giganteous :-/, trying to rebuilt wikipedia on your PC with a snapshot would be like summoning Tchulu in a teapot. But.. I don't know, maybe the zh version is smaller, or your resources powerfull enough. One feels that what you have built has a severe overload (wastage of resources) and there must be better ways to do it...
Indeed there are. What you need:
1) the Wikimedia IRC live feed - last time I looked at it, it was at irc://irc.wikimedia.org/ and each project had its own channel;
2) a PHP IRC bot framework - Net_SmartIRC is well-written and easy to get started with;
3) the page source, which you can EASILY get either in rendered form, http://zh.wikipedia.org/w/index.php?title=TITLE&action=render, or in raw form, http://zh.wikipedia.org/w/index.php?title=TITLE&action=raw (this is the page source).
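The same idea as a rough Python sketch, without any framework: join the live-feed channel and fetch the raw wikitext of each changed page. The channel name, the nick, and the "[[Title]] ..." line format are assumptions, so check them against the actual feed:

import re
import socket
import urllib.parse
import urllib.request

IRC_HOST, IRC_PORT = "irc.wikimedia.org", 6667
CHANNEL = "#zh.wikipedia"            # assumed channel for zh.wikipedia
NICK = "localwiki-sync"              # placeholder nick
UA = "LocalWikiSyncBot/0.1 (contact: you@example.org)"   # placeholder

MIRC_CODES = re.compile(r"\x03\d{0,2}(?:,\d{1,2})?|[\x02\x0f\x16\x1d\x1f]")

def fetch_raw(title):
    # point 3: get the wikitext via action=raw
    url = ("http://zh.wikipedia.org/w/index.php?title="
           + urllib.parse.quote(title.replace(" ", "_")) + "&action=raw")
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")

def listen():
    sock = socket.create_connection((IRC_HOST, IRC_PORT))
    sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :{NICK}\r\n".encode())
    joined = False
    buf = b""
    while True:
        data = sock.recv(4096)
        if not data:
            break                                   # server closed the connection
        buf += data
        *lines, buf = buf.split(b"\r\n")
        for raw in lines:
            line = raw.decode("utf-8", errors="replace")
            if line.startswith("PING"):
                sock.sendall(("PONG" + line[4:] + "\r\n").encode())
            elif not joined and " 001 " in line:    # welcome message: registration done
                sock.sendall(f"JOIN {CHANNEL}\r\n".encode())
                joined = True
            elif " PRIVMSG " in line and " :" in line:
                text = MIRC_CODES.sub("", line.split(" PRIVMSG ", 1)[1].split(" :", 1)[1])
                m = re.match(r"\[\[(.+?)\]\]", text)  # feed lines start with [[Page title]]
                if m:
                    print("changed:", m.group(1))
                    # fetch_raw(m.group(1)) would pull the new wikitext here

if __name__ == "__main__":
    listen()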
Marco
李琴 wrote:
....
Using the API or Special:Export you can request several pages per HTTP request, which is easier on the system. You should also add a maxlag parameter. Obviously you must send a proper User-Agent, so that if your bot causes issues you can be contacted/banned.
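A rough sketch of such a batched API request with maxlag (the User-Agent is a placeholder, and the JSON layout should be double-checked against the live API):

import json
import urllib.parse
import urllib.request

API = "http://zh.wikipedia.org/w/api.php"
UA = "LocalWikiSyncBot/0.1 (contact: you@example.org)"   # placeholder

def fetch_pages(titles):
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": "|".join(titles),   # several pages in one request
        "maxlag": "5",                # back off when the servers are lagged
        "format": "json",
    }
    req = urllib.request.Request(API + "?" + urllib.parse.urlencode(params),
                                 headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    texts = {}
    for page in data.get("query", {}).get("pages", {}).values():
        revs = page.get("revisions")
        if revs:
            texts[page["title"]] = revs[0].get("*", "")
    return texts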
The Wikimedia Foundation also offers a live feed for keeping wikis up to date; check http://meta.wikimedia.org/wiki/Wikimedia_update_feed_service