Apologies if this is not the appropriate forum for the question.
I am trying to access the content of Category pages from either the dump or APIs.
For example, I would like to get a complete list of rivers:
http://en.wikipedia.org/wiki/Category:Lists_of_rivers
The API does provide the content, but it is throttled:
https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&...
I would therefore like to find the content in the dumps, but I cannot locate it there. I have looked inside
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml
and found nothing. The pages are referenced in the page SQL dump:
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz
Do any of the dumps contain the category page content?
Thank you.
If you want to find out which pages are in some category, you need the categorylinks.sql dump. And to use that, you will probably also need page.sql.
Those dumps are in SQL, so probably the simplest way to use them is to import them into a MySQL database and then access that.
Documentation explaining what the imported tables contain is at https://www.mediawiki.org/wiki/Manual:Categorylinks_table and https://www.mediawiki.org/wiki/Manual:Page_table.
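To make the relationship between the two tables concrete, here is a minimal sketch of the join, not a definitive recipe. It assumes the column names described in the Manual pages above (cl_from, cl_to, page_id, page_namespace, page_title); newer schema versions may differ. It uses an in-memory SQLite database in place of MySQL so it runs without a server, and the rows and the category_members helper are made up for illustration.

```python
import sqlite3

# SQLite stands in for MySQL here (an assumption made for portability).
# Column names follow the Manual pages for the page and categorylinks tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id        INTEGER PRIMARY KEY,
    page_namespace INTEGER,
    page_title     TEXT
);
CREATE TABLE categorylinks (
    cl_from INTEGER,  -- page_id of the page that is in the category
    cl_to   TEXT      -- category name, without the "Category:" prefix
);
""")

# Made-up sample rows; real data would come from the imported dumps.
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (1, 0, "List_of_rivers_of_Africa"),
    (2, 0, "List_of_rivers_of_Europe"),
    (3, 14, "Lists_of_rivers_by_country"),  # namespace 14 = Category
    (4, 0, "Unrelated_page"),
])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (1, "Lists_of_rivers"),
    (2, "Lists_of_rivers"),
    (3, "Lists_of_rivers"),
    (4, "Some_other_category"),
])


def category_members(conn, category):
    """Return (namespace, title) for every page linked to `category`."""
    rows = conn.execute(
        """
        SELECT p.page_namespace, p.page_title
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE cl.cl_to = ?
        ORDER BY p.page_namespace, p.page_title
        """,
        (category,),
    )
    return rows.fetchall()


members = category_members(conn, "Lists_of_rivers")
for namespace, title in members:
    print(namespace, title)
```

Rows with namespace 14 are subcategories, so a complete list would require repeating the same query for each of them.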
Petr Onderka [[en:User:Svick]]
On Fri, Oct 18, 2013 at 12:18 AM, Peyman Faratin peyman@robustlinks.com wrote:
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
Hi Petr,
Yes, that makes sense. I will try your suggestion.
Thank you for your help.
Best,
Peyman
On Oct 17, 2013, at 6:24 PM, Petr Onderka gsvick@gmail.com wrote: