李琴 wrote:
Hi all, I have built a LocalWiki and I want to keep its data consistent with Wikipedia, so one task is to fetch the recent updates from Wikipedia. I get the URLs by parsing the RSS feed (http://zh.wikipedia.org/w/index.php?title=Special:%E6%9C%80%E8%BF%91%E6%9B%B...), and for each URL I open the page, click 'edit this page', and download the HTML content of the edit box. (For example, for http://zh.wikipedia.org/w/index.php?title=%E8%B2%A1%E7%A5%9E%E5%88%B0_(%E9%8... the edit interface is http://zh.wikipedia.org/w/index.php?title=%E8%B2%A1%E7%A5%9E%E5%88%B0_(%E9%8... .) However, I have run into two problems.

First, sometimes I cannot open a URL taken from the RSS feed and I don't know why. Is it because I request pages too frequently and my IP address has been blocked, or is the network just too slow? If it is the former, how often may I request a page from Wikipedia? Is there a timeout?

Second, as mentioned above, I want to download all of the HTML content of the edit box. Sometimes this works, but other times I only get part of it. What could be the reason?
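(A rough sketch of this workflow in Python, assuming the third-party feedparser and requests libraries; the feed URL uses the canonical Special:RecentChanges name and the User-Agent string is only a placeholder, so this is illustrative rather than the exact code in use:)

    import time
    import urllib.parse

    import feedparser  # third-party: pip install feedparser
    import requests    # third-party: pip install requests

    # RSS feed of recent changes; the canonical English name
    # Special:RecentChanges also works on zh.wikipedia.org.
    FEED_URL = "http://zh.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=rss"

    # Identify the client so operators can contact you if something goes wrong.
    HEADERS = {"User-Agent": "LocalWikiSyncBot/0.1 (contact: you@example.com)"}

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # entry.title is the page title; request its edit view to get the
        # HTML page that contains the edit box.
        edit_url = ("http://zh.wikipedia.org/w/index.php?title="
                    + urllib.parse.quote(entry.title) + "&action=edit")
        resp = requests.get(edit_url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        html = resp.text  # full HTML of the edit page; the wikitext sits in the <textarea>
        time.sleep(1)     # pause between requests to avoid hammering the servers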
Thanks
vanessa
Using the API or Special:Export you can request several pages per HTTP request, which is easier on the servers. You should also add a maxlag parameter. And obviously you must set a proper User-Agent, so that if your bot causes problems you can be contacted (or banned).
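A minimal sketch of the API approach, in Python with the requests library (the page titles and the User-Agent string are placeholders):

    import requests

    API_URL = "http://zh.wikipedia.org/w/api.php"

    # Identify the bot so operators can contact you if it misbehaves.
    HEADERS = {"User-Agent": "LocalWikiSyncBot/0.1 (contact: you@example.com)"}

    def fetch_wikitext(titles):
        """Fetch the current wikitext of several pages in one API request."""
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": "|".join(titles),  # several pages per request
            "format": "json",
            "maxlag": 5,                 # back off when the servers are lagged
        }
        resp = requests.get(API_URL, params=params, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        if "error" in data and data["error"].get("code") == "maxlag":
            # Server replication lag is above the threshold: wait and retry.
            raise RuntimeError("maxlag hit, retry after a pause")
        result = {}
        for page in data["query"]["pages"].values():
            if "revisions" in page:
                result[page["title"]] = page["revisions"][0]["*"]
        return result

    pages = fetch_wikitext(["Wikipedia", "MediaWiki"])  # placeholder titles

If the maxlag error does come back, wait a bit (the response carries a Retry-After header) and retry. Special:Export likewise accepts several page titles in one request and returns their wikitext wrapped in XML.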
The Wikimedia Foundation also offers a live feed to keep wikis up to date; see http://meta.wikimedia.org/wiki/Wikimedia_update_feed_service