On Sat, May 6, 2017 at 9:12 PM, Abdulfattah Safa fattah.safa@gmail.com wrote:
I'm trying to get all the page titles in Wikipedia in the main namespace (0) using the API, as follows:
https://en.wikipedia.org/w/api.php?action=query&format=xml&list=allpages&apnamespace=0&apfilterredir=nonredirects&aplimit=max&continue=-||&apcontinue=BASE_PAGE_TITLE
I keep requesting this URL and checking whether the response contains a continue tag. If it does, I send the same request again, but with BASE_PAGE_TITLE replaced by the value of the apcontinue attribute in the response. My application has been running for 3 days now and the number of retrieved titles exceeds 30M, whereas the dumps contain only about 13M. Any idea?
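Roughly, the loop looks like this (a simplified Python sketch using the requests library; the real application differs in details such as error handling and writing the titles out):

    import requests
    import xml.etree.ElementTree as ET

    API = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "format": "xml",
        "list": "allpages",
        "apnamespace": "0",
        "apfilterredir": "nonredirects",
        "aplimit": "max",
        "continue": "-||",
    }

    count = 0
    while True:
        root = ET.fromstring(requests.get(API, params=params).content)
        # Count the <p> elements under <allpages> in this batch.
        count += len(root.findall(".//allpages/p"))
        cont = root.find("continue")
        if cont is None:
            break
        # Carry the apcontinue value from the response into the next request.
        params["apcontinue"] = cont.get("apcontinue")

    print(count)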
Please do not scrape the web for this kind of request - it is a waste of resources both for you and for the Wikimedia servers (especially given that there is a faster and more reliable alternative).
Looking at https://dumps.wikimedia.org/enwiki/20170501/ you can find:
2017-05-03 07:26:20 done List of all page titles
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-all-titles.gz (221.7 MB)

2017-05-03 07:22:02 done List of page titles in main namespace
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-all-titles-in-ns... (70.8 MB)
Use one of the above. Not only is it faster, you will also get consistent results: by the time your loop finishes, pages will have been created and deleted underneath it. The above exports are generated so as to capture as consistent a state as is practically possible, and they are actively monitored by WMF staff.
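If it helps, reading such a file takes only a few lines of Python. A sketch, assuming the tab-separated page_namespace/page_title layout with a one-line header that these dumps normally have (adjust the parsing if the format differs):

    import gzip
    import urllib.request

    # Uses the first (non-truncated) URL from the listing above.
    URL = ("https://dumps.wikimedia.org/enwiki/20170501/"
           "enwiki-20170501-all-titles.gz")

    urllib.request.urlretrieve(URL, "all-titles.gz")

    count = 0
    with gzip.open("all-titles.gz", "rt", encoding="utf-8") as f:
        next(f)  # skip the header line
        for line in f:
            ns, title = line.rstrip("\n").split("\t", 1)
            if ns == "0":  # main namespace only, as in your query
                count += 1

    print(count)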