Hello. I'm Jong Beom Kim, web search product manager at Naver Corporation (www.naver.com). Please check the robots rule issue below.
=====================================================
We offer search results by collecting data from Wikipedia.
However, transferring the data via dumps does not satisfy our freshness requirements, so we want to collect it through the API (https://www.mediawiki.org/wiki/API) instead.
Your site's robots rules are restricting our API access (/w/api.php). Therefore, YETI (Naver Corporation's web robot crawler) would collect the data through the API while ignoring robots.txt.
If that method is not allowed, can you tell us the correct process and policy for access?
I will wait for your guidance on the policy and process for collecting data.
==========================================
Best regards,
Thanks
-----Original Message-----
From: "Wikipedia information team" <info-en@wikimedia.org>
To: <jongbeom.kim@nhn.com>
Cc: <answers@wikimedia.org>
Sent: 2013-09-03 (Tue) 11:33:13
Subject: Re: [Ticket#2013090310000891] Questions about Wiki robots rule policy
Dear 김종범,
For the best chance of a quick resolution to the issue you are having, you should email the mailing list that has a team of volunteers that look over technical matters relating to MediaWiki software and interface. This team can be reached at mediawiki-l@lists.wikimedia.org.
I should note at this point that while this correspondence is private, emails to most Wikimedia mailing lists (including mediawiki-l) are public.
Yours sincerely, Kosten Frosch
Jong Beom Kim, if I am not mistaken, robots.txt is designed for crawlers that blindly follow all links. If your API client knows how to use the MediaWiki API to download exactly the data it needs, you should follow the general API guidelines described on the etiquette page (https://www.mediawiki.org/wiki/API:Etiquette) and the getting-started page (https://www.mediawiki.org/wiki/API:Main_page#Getting_started).
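A minimal sketch of such an API client, assuming Python with the requests library (the endpoint, page titles, and User-Agent contact below are placeholders, not anything from this thread), might look like this:

import time
import requests

API_URL = "https://en.wikipedia.org/w/api.php"                      # placeholder wiki endpoint
HEADERS = {"User-Agent": "ExampleBot/0.1 (bot-owner@example.com)"}   # identify yourself per API:Etiquette

def fetch_page_text(title):
    # Request only the data actually needed: the latest revision text of one page.
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "maxlag": 5,       # ask the servers to refuse the request when replication lag is high
        "format": "json",
    }
    while True:
        data = requests.get(API_URL, params=params, headers=HEADERS, timeout=30).json()
        if data.get("error", {}).get("code") == "maxlag":
            time.sleep(5)  # wait and retry instead of hammering a lagged server
            continue
        return data

# Requests are made serially, one page at a time, rather than in parallel.
for title in ["Seoul", "Naver"]:
    print(fetch_page_text(title))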
--Yuri
On Tue, Sep 3, 2013 at 1:12 AM, 김종범 jongbeom.kim@nhn.com wrote:
We offer search results by collecting data from Wikipedia.
However, transferring the data via dumps does not satisfy our freshness requirements, so we want to collect it through the API (https://www.mediawiki.org/wiki/API) instead.
Search engine crawlers should be crawling the normal webpages as anonymous users with no cookies set, to take maximum advantage of caching. Crawlers using the API to fetch page contents are liable to be blocked.
If you follow the guidelines in https://meta.wikimedia.org/wiki/Data_request_limitations, you should probably be able to poll the API's list=recentchanges for the page titles (not the content!) that your crawler needs to re-crawl without issue, but in the end that decision is up to people who are not me. Alternatively, you could use the IRC recent changes feed (https://meta.wikimedia.org/wiki/IRC/Channels#Recent_changes) to get that list.
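As a rough sketch of that approach (again Python with requests; the endpoint, timestamp, and User-Agent are placeholder assumptions), a crawler might poll for changed titles like this and then re-crawl the corresponding article pages, rather than fetching content through the API:

import requests

API_URL = "https://en.wikipedia.org/w/api.php"                      # placeholder wiki endpoint
HEADERS = {"User-Agent": "ExampleBot/0.1 (bot-owner@example.com)"}

def changed_titles(since):
    # Yield titles (not content) of main-namespace pages edited since the given timestamp.
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcstart": since,
        "rcdir": "newer",           # walk forward from the last poll
        "rcprop": "title|timestamp",
        "rcnamespace": 0,
        "rclimit": 500,
        "format": "json",
    }
    while True:
        data = requests.get(API_URL, params=params, headers=HEADERS, timeout=30).json()
        for rc in data["query"]["recentchanges"]:
            yield rc["title"]
        if "continue" not in data:      # no more batches to fetch
            break
        params.update(data["continue"])  # follow the API's continuation parameters

# These titles feed the normal-webpage crawl, e.g. https://en.wikipedia.org/wiki/<title>.
for title in changed_titles("2013-09-03T00:00:00Z"):
    print(title)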