hi all
i just joined this list so i'm not sure if this is the correct forum to ask this, but how do i get a master list of all wikipedia categories and sub-categories? i looked at the database download page and took a peek at pages_articles.xml and see that one can parse out Categories from the mediawiki tags embedded in the text. but is there some other dump with all the categories (even ones that may not have articles) as well as their subcategories available somewhere?
apologies in advance if this was the wrong list to post this question to, but if someone could direct me to the right one it would be most appreciated.
thanks much!
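For readers following along, the "parse out Categories from the mediawiki tags" approach mentioned in the question can be sketched roughly as below. This is only an illustration, assuming the page text has already been pulled out of pages_articles.xml; the pattern and the function name extract_categories are hypothetical, not anything from the dump tooling. It also misses categories that a page only receives through templates, which is one reason a dedicated dump is the better source.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]].
# A leading colon, as in [[:Category:Name]], is a plain link and is not matched.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)

def extract_categories(wikitext):
    """Return the category names a page's wikitext assigns, in order, without repeats."""
    names = []
    for match in CATEGORY_LINK.finditer(wikitext):
        # Underscores and spaces are interchangeable in titles; normalise to spaces.
        name = match.group(1).strip().replace("_", " ")
        if name and name not in names:
            names.append(name)
    return names

sample = (
    "Some article text. [[Category:Physics]] "
    "[[category:Living_people|Marcelo, Karen]] "
    "See also [[:Category:Foo]]."
)
print(extract_categories(sample))
```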
On 9/7/06, karen marcelo karen@radarnetworks.com wrote:
> [...] how do i get a master list of all wikipedia categories and sub-categories? [...]
Isn't one of the dumps offered something like "categorylinks"?
~maru
You could try Special:Allpages and select the category namespace. If that doesn't serve your needs, check out the toolserver.
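A scripted counterpart to browsing Special:Allpages might look like the sketch below. It assumes the MediaWiki web API (list=allpages with apnamespace=14), which the thread itself does not mention; the endpoint URL and the helper names allpages_url and parse_allpages are illustrative. No request is made here: the code only builds the query URL and reads titles out of an already-decoded JSON response. Like Special:Allpages, this would list only category pages that actually exist.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"  # assumed endpoint, not from the thread
CATEGORY_NAMESPACE = 14  # namespace number of Category: pages

def allpages_url(continue_from=None, limit=500):
    """Build a query URL listing pages in the category namespace."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": CATEGORY_NAMESPACE,
        "aplimit": limit,
        "format": "json",
    }
    if continue_from is not None:
        # Token taken from the previous response, used to fetch the next batch.
        params["apcontinue"] = continue_from
    return API_ENDPOINT + "?" + urlencode(params)

def parse_allpages(response):
    """Return (titles, next_token) from a decoded JSON response dict."""
    pages = response.get("query", {}).get("allpages", [])
    titles = [page["title"] for page in pages]
    next_token = response.get("continue", {}).get("apcontinue")
    return titles, next_token

# Made-up response in the shape the API returns, for illustration only.
fake_response = {
    "continue": {"apcontinue": "Physics", "continue": "-||"},
    "query": {"allpages": [
        {"pageid": 1, "ns": 14, "title": "Category:Physicists"},
        {"pageid": 2, "ns": 14, "title": "Category:Physicians"},
    ]},
}
titles, token = parse_allpages(fake_response)
```

Calling allpages_url(token) would then give the URL for the next batch, until a response comes back without a continue entry.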
On 9/8/06, maru dubshinki marudubshinki@gmail.com wrote:
> Isn't one of the dumps offered something like "categorylinks"?
>
> ~maru
_______________________________________________
WikiEN-l mailing list
WikiEN-l@Wikipedia.org
To unsubscribe from this mailing list, visit: http://mail.wikipedia.org/mailman/listinfo/wikien-l
If you look at http://download.wikimedia.org/enwiki/latest/ you can see that there is a dump called categorylinks ( http://download.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.... ). It's just under 100 MB. That should be it. The schema for it can be found at http://meta.wikimedia.org/wiki/Categorylinks_table
It only identifies the articles that belong to a given category by the article's id number, which means that if you want to know which categories are sub- or super-categories of others, you need to download another dump that contains the names of the categories as well as their ids. cur certainly contains this, but it's 1.5 GB, which may be a bit large. You could try enwiki-latest-all-titles-in-ns0.gz, which might contain the ids. It's just 14 MB, so it's not hard to download it and try.
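To make the join described above concrete, here is a minimal sketch. It assumes the categorylinks rows have already been loaded as (page id, category title) pairs and that an id-to-title lookup has been built from whichever dump turns out to carry the ids; the function names and the sample data are made up for illustration. A subcategory is simply a member page that itself lives in the Category namespace (number 14).

```python
from collections import defaultdict

CATEGORY_NAMESPACE = 14  # namespace number of Category: pages

def build_subcategory_map(categorylinks, pages):
    """Map each category title to the sorted titles of its direct subcategories.

    categorylinks: iterable of (member_page_id, category_title) pairs.
    pages: dict of page id -> (namespace, title), loaded from a separate dump.
    """
    children = defaultdict(set)
    for member_id, category in categorylinks:
        page = pages.get(member_id)
        if page is None:
            continue  # id missing from the lookup; skip rather than guess
        namespace, title = page
        if namespace == CATEGORY_NAMESPACE:
            children[category].add(title)
    return {parent: sorted(subs) for parent, subs in children.items()}

def all_category_titles(categorylinks, pages):
    """Every category that has members or has its own page."""
    used = {category for _, category in categorylinks}
    existing = {title for ns, title in pages.values() if ns == CATEGORY_NAMESPACE}
    return used | existing

# Made-up sample data, not taken from any real dump.
pages = {
    1: (0, "Albert_Einstein"),
    2: (14, "Physicists"),
    3: (14, "Scientists"),
}
links = [(1, "Physicists"), (2, "Scientists"), (3, "People_by_occupation")]

subcategories = build_subcategory_map(links, pages)
categories = all_category_titles(links, pages)
```

In this sample, "People_by_occupation" has no page of its own but still shows up because something is filed under it, which is the "categories that may not have articles" case from the original question.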
Note: I'm not a (very good) mediawiki developer ;)
Good luck!
--Oskar