Hello everyone,
I am trying to use the MediaWiki API to create a dictionary based on
categories or lists on Wikipedia. I would like to be able to select a
category, or perhaps a list page, and get all members of that list.
I've read through the API documentation and implemented a prototype. It
works, but only when the data happens to be structured exactly the way I
need. For example, I can easily get a list of all
English-language films. I'm using action=query with list=categorymembers
for this. I get 500 films at a time, and I can continue as needed
to get all 60k or so. This works because each English-language film's
individual page is tagged with that category.
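For concreteness, here is a minimal sketch of the kind of query described above, assuming Python with only the standard library. The endpoint URL, the User-Agent string, and the names fetch_json and category_members are illustrative assumptions, not taken from the prototype itself.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the English Wikipedia endpoint; other wikis use their own host.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON response."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent string, for illustration only.
    req = urllib.request.Request(
        url, headers={"User-Agent": "category-dictionary-sketch/0.1"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def category_members(category, fetch=fetch_json):
    """Yield the titles of all members of a category, 500 per request."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        # Copy the continuation values into the next request.
        params = {**params, **data["continue"]}
```

Something like list(category_members("Category:English-language films")) would then walk the whole category. The fetch parameter is there only so the paging logic can be exercised without network access.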
On the other hand, if I want to get a list of all National Hockey League
(NHL) players, this is a lot more difficult. The category "Category:Lists
of National Hockey League players" exists, but it's a category of lists of
players. Much of the categorization of Wikipedia turns out to be in lists,
not categories. I could write a web scraper for this, but that would
probably be very unreliable.
Is there a standardized way to deal with lists and sublists that I might
have missed? I don't mind writing a bunch of code to recursively crawl
sublists and expand them. But I would like to avoid something as
non-standard as web scraping the content, because it will be very fragile.
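To make the recursive crawl concrete, here is a rough sketch of one possible approach, assuming Python and the API's prop=links query. The "List of " title prefix used to decide which links to descend into is a guess at a naming convention, not a documented rule, and the names page_links and expand_list are hypothetical. Links on a list page also include non-members (teams, seasons, and so on), so the result would still need filtering.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the English Wikipedia endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON response."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent string, for illustration only.
    req = urllib.request.Request(
        url, headers={"User-Agent": "list-crawl-sketch/0.1"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def page_links(title, fetch=fetch_json):
    """Yield titles of main-namespace pages linked from one page."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "plnamespace": "0",
        "pllimit": "500",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            for link in page.get("links", []):
                yield link["title"]
        if "continue" not in data:
            break
        params = {**params, **data["continue"]}


def expand_list(title, fetch=fetch_json, seen=None):
    """Collect linked titles, descending into pages named 'List of ...'."""
    seen = set() if seen is None else seen
    seen.add(title)  # guards against list pages that link to each other
    members = set()
    for linked in page_links(title, fetch):
        if linked.startswith("List of "):  # heuristic, see note above
            if linked not in seen:
                members |= expand_list(linked, fetch, seen)
        else:
            members.add(linked)
    return members
```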
Thank you for the help,
-mike