On Sat, May 6, 2017 at 9:12 PM, Abdulfattah Safa fattah.safa@gmail.com wrote:
I'm trying to get all the page titles in Wikipedia in the main namespace (0) using the API, as follows:
https://en.wikipedia.org/w/api.php?action=query&format=xml&list=allpages&apnamespace=0&apfilterredir=nonredirects&aplimit=max&continue=-||&apcontinue=BASE_PAGE_TITLE
I keep requesting this URL and checking whether the response contains a continue tag. If it does, I send the same request again, but with BASE_PAGE_TITLE replaced by the value of the apcontinue attribute in the response. My application has been running for 3 days now and the number of retrieved titles exceeds 30M, whereas the dumps contain only about 13M. Any idea?
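Roughly, the loop looks like this (a simplified Python sketch using the requests library; the real application differs in details such as error handling and writing the titles out):

    import requests
    import xml.etree.ElementTree as ET

    API = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "format": "xml",
        "list": "allpages",
        "apnamespace": "0",
        "apfilterredir": "nonredirects",
        "aplimit": "max",
        "continue": "-||",
    }

    count = 0
    while True:
        root = ET.fromstring(requests.get(API, params=params).content)
        # Count the <p> elements under <allpages> in this batch.
        count += len(root.findall(".//allpages/p"))
        cont = root.find("continue")
        if cont is None:
            break
        # Carry the apcontinue value from the response into the next request.
        params["apcontinue"] = cont.get("apcontinue")

    print(count)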
Please do not scrape the web for this kind of request - it is a waste of resources both for you and for the Wikimedia servers (especially given that there is a faster and more reliable alternative).
Looking at https://dumps.wikimedia.org/enwiki/20170501/ you can find:
2017-05-03 07:26:20 done List of all page titles
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-all-titles.gz (221.7 MB)

2017-05-03 07:22:02 done List of page titles in main namespace
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-all-titles-in-ns... (70.8 MB)
Use one of the above. Not only is it faster, you will also get consistent results: by the time your loop finishes, pages will have been created and deleted underneath it. The above exports are generated so as to capture as consistent a state as is practically possible, and they are actively monitored by WMF staff.
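If it helps, reading such a file takes only a few lines of Python. A sketch, assuming the tab-separated page_namespace/page_title layout with a one-line header that these dumps normally have (adjust the parsing if the format differs):

    import gzip
    import urllib.request

    # Uses the first (non-truncated) URL from the listing above.
    URL = ("https://dumps.wikimedia.org/enwiki/20170501/"
           "enwiki-20170501-all-titles.gz")

    urllib.request.urlretrieve(URL, "all-titles.gz")

    count = 0
    with gzip.open("all-titles.gz", "rt", encoding="utf-8") as f:
        next(f)  # skip the header line
        for line in f:
            ns, title = line.rstrip("\n").split("\t", 1)
            if ns == "0":  # main namespace only, as in your query
                count += 1

    print(count)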