Those are official; I ran the report from Tool Labs, Wikimedia's
developer platform, which includes a copy of en.Wikipedia's database (with
sensitive fields removed). Without looking at your code and doing some
testing, which unfortunately I don't have the time for, I cannot help
debug why your code isn't working. Those two files were created by
running

sql enwiki_p "select page_title from page where page_is_redirect = 0 and page_namespace = 0;" > ns_0.txt

and then compressing the resulting text file with 7zip. For the category
namespace I just changed page_namespace = 0 to page_namespace = 14.
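(For reference, a minimal scripted version of the same report, assuming a
Tool Labs account with the standard ~/replica.my.cnf credential file and the
pymysql library installed; the replica host name "enwiki.labsdb" is the
convention at the time of writing and may differ in your environment:)

    # Sketch: dump non-redirect article titles from the enwiki replica.
    # Host name and credential file are Tool Labs conventions, not fixed.
    import os
    import pymysql

    conn = pymysql.connect(
        host="enwiki.labsdb",
        db="enwiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        charset="utf8mb4",
    )

    # Stream rows instead of buffering all ~13M titles in memory.
    with conn.cursor(pymysql.cursors.SSCursor) as cur:
        cur.execute(
            "SELECT page_title FROM page "
            "WHERE page_is_redirect = 0 AND page_namespace = 0"
        )
        with open("ns_0.txt", "w", encoding="utf-8") as out:
            for (title,) in cur:
                # page_title is VARBINARY, so pymysql returns bytes.
                out.write(title.decode("utf-8") + "\n")
    conn.close()

For namespace 14, the only change is the page_namespace value, as above.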
On Sun, May 7, 2017 at 3:41 AM, Abdulfattah Safa <fattah.safa(a)gmail.com>
wrote:
Hello John,
Thanks for your effort. Actually I need official dumps, as I need to use
them in my thesis.
Could you please tell me how you generated these ones?
Also, any idea why the API doesn't work properly for en.Wikipedia? I used
the same code for other languages and it worked.
Thanks,
Abed
On Sun, May 7, 2017 at 1:45 AM John <phoenixoverride(a)gmail.com> wrote:
Here you go
ns_0.7z <http://tools.wmflabs.org/betacommand-dev/reports/ns_0.7z>
ns_14.7z <http://tools.wmflabs.org/betacommand-dev/reports/ns_14.7z>
On Sat, May 6, 2017 at 5:27 PM, John <phoenixoverride(a)gmail.com> wrote:
> Give me a few minutes I can get you a database dump of what you need.
>
> On Sat, May 6, 2017 at 5:25 PM, Abdulfattah Safa <fattah.safa(a)gmail.com>
> wrote:
>
>> 1. I'm using max as the limit parameter.
>> 2. I'm not sure the dumps have the data I need. I need to get the
>> titles of all articles (namespace = 0), with no redirects, and also the
>> titles of all categories (namespace = 14), without redirects.
>>
>> On Sat, May 6, 2017 at 11:39 PM Eran Rosenthal <eranroz89(a)gmail.com>
>> wrote:
>>
>> > 1. You can use the limit parameter to get more titles in each request.
>> > 2. For getting many entries it is recommended to extract from the
>> > dumps or from the database using Quarry.
> >
> > On May 6, 2017 22:36, "Abdulfattah Safa" <fattah.safa(a)gmail.com>
> > wrote:
> >
> > > For the & in $continue=-||, it's a typo. It doesn't exist in the
> > > code.
> > >
> > > On Sat, May 6, 2017 at 10:12 PM Abdulfattah Safa <fattah.safa(a)gmail.com>
> > > wrote:
> > >
> > > > I'm trying to get all the page titles in Wikipedia in namespace 0
> > > > using the API as follows:
> > > >
> > > > https://en.wikipedia.org/w/api.php?action=query&format=xml&list=allpages&apnamespace=0&apfilterredir=nonredirects&aplimit=max&$continue=-||$apcontinue=BASE_PAGE_TITLE
> > > >
> > > > I keep requesting this URL and checking whether the response
> > > > contains a continue tag. If yes, I send the same request but change
> > > > *BASE_PAGE_TITLE* to the value of the apcontinue attribute in the
> > > > response.
> > > My application has been running for 3 days and the number of
> > > retrieved titles exceeds 30M, whereas it is about 13M in the dumps.
> > > Any idea?
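(For comparison, a minimal sketch of that loop, assuming Python 3 with the
requests library and using format=json rather than xml for easier parsing.
It opts in to the current continuation format with an empty continue
parameter on the first request, then echoes back the entire continue block
the API returns instead of rebuilding apcontinue by hand:)

    # Sketch: list all non-redirect article titles via the allpages API.
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    session = requests.Session()

    base = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apnamespace": 0,
        "apfilterredir": "nonredirects",
        "aplimit": "max",
    }

    titles = set()
    # An empty continue parameter opts in to the current continuation format.
    cont = {"continue": ""}
    while True:
        data = session.get(API, params={**base, **cont}).json()
        for page in data["query"]["allpages"]:
            titles.add(page["title"])
        if "continue" not in data:
            break
        # Echo the whole continue block back verbatim; values such as
        # continue=-|| and apcontinue=... come from the response itself.
        cont = data["continue"]

    print(len(titles))

Collecting into a set also makes overlap visible: if the raw count climbs
well past the set size, the continuation is restarting or overlapping
somewhere, which would explain 30M results where the dumps hold about 13M.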
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l