On 28 January 2010 15:06, 李琴 <qli(a)ica.stc.sh.cn> wrote:
Hi all,
I have built a LocalWiki. Now I want to keep its data consistent with
Wikipedia, and one thing I need to do is fetch the updates from
Wikipedia.
I get the URLs by parsing the RSS feed
(
http://zh.wikipedia.org/w/index.php?title=Special:%E6%9C%80%E8%BF%91%E6%9B%…
)
and then get the full HTML content of the edit box by opening each of
these URLs and clicking 'edit this page'.
....
Is that because I visit the site too frequently and my IP address has
been blocked, or because the network is too slow?
李琴, well... that's web scraping, which is a poor technique: it is
error-prone and generates a lot of traffic.
One thing a robot must do is read and follow the
http://zh.wikipedia.org/robots.txt file (you should probably read it
too).
As a general rule of the Internet, a "rude" robot will be banned by the
site admins.
It would be a good idea to announce your bot as a bot in the User-Agent
string. A well-behaved bot reads a website at the pace a human would. I
don't know, something like 10 requests per minute? I don't know what
this "Wikipedia" site's rules are about it.
What you are seeing could be automatic or manual throttling, applied
because an abusive number of requests was detected from your IP.
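The robots.txt, User-Agent and request-rate advice above can be put
together roughly like this. It is only a sketch: the bot name, the
contact address and the helper names (load_rules, Throttle, may_fetch)
are made up for illustration, and the 10 requests per minute is just
the guess from above, not a documented Wikipedia limit. It uses
Python's standard urllib.robotparser on robots.txt text you have
already downloaded.

```python
import time
from urllib import robotparser

# Hypothetical bot name and contact address; replace with your own.
USER_AGENT = "LocalWikiSyncBot/0.1 (contact: you@example.org)"
# Assumed pace: 10 requests per minute, i.e. one request every 6 seconds.
MIN_INTERVAL = 60.0 / 10


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous request, if any

    def wait(self):
        """Sleep if the last request was too recent; return seconds slept."""
        delay = 0.0
        if self.last is not None:
            elapsed = self.clock() - self.last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self.sleep(delay)
        self.last = self.clock()
        return delay


def may_fetch(rules, throttle, url):
    """Return False if robots.txt forbids url; otherwise pace and return True."""
    if not rules.can_fetch(USER_AGENT, url):
        return False
    throttle.wait()
    return True
```

You would call may_fetch() before every request and send USER_AGENT as
the User-Agent header of the request itself. A disallowed URL is
rejected without waiting, so it does not use up a slot.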
"Wikipedia" seems to provide full dumps of its wiki, but they are unusable
for you, since they are gigantic :-/. Trying to rebuild Wikipedia on your
PC from a snapshot would be like summoning Cthulhu in a teapot. But... I
don't know, maybe the zh version is smaller, or your resources are
powerful enough. One feels that what you have built carries a severe
overhead (a waste of resources) and that there must be better ways to do
it...
Indeed there are. What you need:
1) the Wikimedia IRC live feed - last time I looked at it, it was at
and then each project had its own channel.
2) A PHP IRC bot framework - Net_SmartIRC is well-written and easy to get
started with
3) the page source you can EASILY get either in rendered form
(this is page source).
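For point 3, one way to get at the page source, as a sketch only:
MediaWiki's index.php returns a page's wikitext when called with
action=raw. That this is the form meant here is my assumption, and the
function name raw_source_url is made up for illustration. The base URL
is the zh.wikipedia.org index.php from the RSS link quoted earlier.

```python
from urllib.parse import urlencode

# index.php entry point of the Chinese Wikipedia, as in the RSS link above.
INDEX_PHP = "http://zh.wikipedia.org/w/index.php"


def raw_source_url(title):
    """Build a URL asking MediaWiki for a page's wikitext (action=raw)."""
    # MediaWiki treats underscores and spaces in titles as equivalent.
    params = {"title": title.replace(" ", "_"), "action": "raw"}
    # urlencode percent-encodes non-ASCII titles as UTF-8.
    return INDEX_PHP + "?" + urlencode(params)
```

Fetch the resulting URL with the same pacing and User-Agent rules as
any other request; once per changed page is enough.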
Marco
--
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert