On Sat, May 6, 2017 at 10:12 PM Abdulfattah Safa <fattah.safa@gmail.com> wrote:

I'm trying to get all the page titles in Wikipedia in a given namespace using the API, with the following request:

https://en.wikipedia.org/w/api.php?action=query&format=xml&list=allpages&apnamespace=0&apfilterredir=nonredirects&aplimit=max&$continue=-||$apcontinue=BASE_PAGE_TITLE

I keep requesting this URL and checking whether the response contains a continue tag. If it does, I send the same request again, but change BASE_PAGE_TITLE to the value of the apcontinue attribute in the response. My application has been running for 3 days and the number of titles retrieved exceeds 30M, whereas it is about 13M in the dumps. Any idea?
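[A minimal sketch of the continuation loop described above, in Python with the requests library, using format=json instead of xml only to keep the parsing short; the pagination logic is the same. The fetch_all_titles name is illustrative, not from the original code. The key detail is that every key the API returns under "continue" has to be fed back into the next request, not only apcontinue.]

import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_all_titles(namespace):
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apnamespace": namespace,
        "apfilterredir": "nonredirects",
        "aplimit": "max",
    }
    titles = []
    while True:
        data = requests.get(API, params=params).json()
        titles.extend(p["title"] for p in data["query"]["allpages"])
        if "continue" not in data:
            break
        # Feed back *everything* under "continue" (both the generic
        # "continue" token and "apcontinue"), not just apcontinue.
        params.update(data["continue"])
    return titles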
On May 6, 2017 22:36, "Abdulfattah Safa" <fattah.safa@gmail.com> wrote:

Regarding the & in $continue=-||: it's a typo; it doesn't exist in the code.
On Sat, May 6, 2017 at 11:39 PM Eran Rosenthal <eranroz89@gmail.com> wrote:

1. You can use the limit parameter to get more titles in each request.
2. For getting many entries, it is recommended to extract them from the dumps or from the database using Quarry.
On Sat, May 6, 2017 at 5:25 PM, Abdulfattah Safa <fattah.safa@gmail.com> wrote:

1. I'm using max as the limit parameter.
2. I'm not sure the dumps have the data I need. I need the titles of all articles (namespace = 0) with no redirects, and also the titles of all categories (namespace = 14) without redirects.
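[For reference, list=allpages enumerates a single namespace per request, so the two namespaces take two passes of the loop sketched earlier, reusing the illustrative fetch_all_titles helper from that sketch:]

# allpages takes one apnamespace value at a time, so namespaces 0 and
# 14 are fetched in two separate passes.
articles = fetch_all_titles(0)     # article titles, redirects excluded
categories = fetch_all_titles(14)  # category titles, redirects excluded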
On Sat, May 6, 2017 at 5:27 PM, John <phoenixoverride@gmail.com> wrote:

Give me a few minutes; I can get you a database dump of what you need.
On Sun, May 7, 2017 at 1:45 AM John <phoenixoverride@gmail.com> wrote:

Here you go:
ns_0.7z <http://tools.wmflabs.org/betacommand-dev/reports/ns_0.7z>
ns_14.7z <http://tools.wmflabs.org/betacommand-dev/reports/ns_14.7z>
On Sun, May 7, 2017 at 3:41 AM, Abdulfattah Safa <fattah.safa@gmail.com> wrote:

Hello John, thanks for your effort. Actually, I need official dumps, as I need to use them in my thesis. Could you please point me to how you got these? Also, any idea why the API doesn't work properly for the English Wikipedia? I use the same code for other languages and it worked.

Thanks, Abed
Those are official; I ran the report from Tool Labs, which is Wikimedia's developer platform and includes a copy of en.Wikipedia's database (with sensitive fields removed). Without looking at your code and doing some testing, which unfortunately I don't have the time for, I cannot help debug why your code isn't working. The two files were created by running

sql enwiki_p "select page_title from page where page_is_redirect = 0 and page_namespace = 0;" > ns_0.txt

and then compressing the resulting text file with 7zip. For the category namespace I just changed page_namespace = 0 to page_namespace = 14.
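[A hypothetical sketch of running the same query directly against the Tool Labs replica from Python with the pymysql library; the host name and credentials file below follow Tool Labs conventions of the time and are assumptions, not details given in this thread.]

import os
import pymysql

conn = pymysql.connect(
    host="enwiki.labsdb",  # assumed replica host name
    db="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),  # assumed credentials file
    cursorclass=pymysql.cursors.SSCursor,  # stream rows instead of buffering ~13M of them
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT page_title FROM page "
        "WHERE page_is_redirect = 0 AND page_namespace = %s",
        (0,),
    )
    with open("ns_0.txt", "w", encoding="utf-8") as out:
        for (title,) in cur:
            # page_title is a binary column, so rows typically arrive as bytes
            text = title.decode("utf-8") if isinstance(title, bytes) else title
            out.write(text + "\n")
conn.close()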
Please do not scrape the web for this kind of request; it is a waste of resources for you and for the Wikimedia servers, given that there is a faster and more reliable alternative.

Looking at https://dumps.wikimedia.org/enwiki/20170501/ you can find:

2017-05-03 07:26:20 done: List of all page titles
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-all-titles.gz (221.7 MB)

2017-05-03 07:22:02 done: List of page titles in main namespace
https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-all-titles-in-ns0.gz (70.8 MB)

Use one of the above. Not only is it faster, you will also get consistent results; by the time you finish looping over the live API, pages will have been created and deleted. These exports are produced trying to get the most consistent state practically possible, and they are actively monitored by WMF staff.
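[A minimal sketch of consuming the main-namespace titles dump, assuming it has been downloaded locally; it is a gzipped text file with one page title per line (underscores instead of spaces), possibly preceded by a page_title column header.]

import gzip

titles = []
with gzip.open("enwiki-20170501-all-titles-in-ns0.gz", "rt", encoding="utf-8") as f:
    for line in f:
        title = line.rstrip("\n")
        if title == "page_title":  # skip the column header, if present
            continue
        titles.append(title)

print(len(titles))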
If you want to do the namespace and redirect analysis on your own, you can also use https://dumps.wikimedia.org/enwiki/20170501/enwiki-20170501-page.sql.gz. It is larger, but you can filter on the page_is_redirect and page_namespace columns on your own terms.