It’s not quite ready for “gold standard” evaluation, but I wonder if wptools could be
helpful?
<https://github.com/siznax/wptools>
We just added support for category continuations.
On Jan 10, 2018, at 2:19 PM, Mike MacHenry
<mike.machenry(a)gmail.com> wrote:
Hello everyone,
I am trying to use the MediaWiki API to create a dictionary based on categories or lists
on Wikipedia. I would like to be able to select a category, or perhaps a list page, and
get all members of that list.
I've done some reading of the API and implemented a prototype. It works, but only
when the data is structured just right for my purposes. For example, I can
easily get a list of all of the English-language films. I'm using the action=query and
list=categorymembers for this. I end up with 500 films at a time, and I can continue as
needed to get all 60k or so. This is because there is a category that is tagged to each
English-language film's individual page.
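For reference, the categorymembers loop described above might look roughly like the sketch below. It is not the poster's actual prototype: the endpoint URL, the User-Agent string, and the helper names fetch_json and category_members are assumptions made for illustration. Continuation is handled by merging the "continue" object from each reply back into the next request's parameters.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint (English Wikipedia)


def fetch_json(params):
    """Send one GET request to the API and decode the JSON reply (hypothetical helper)."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-sketch/0.1 (example)"}  # placeholder UA
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_members(category, fetch=fetch_json):
    """Yield the title of every member of `category`, following continuations."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",  # batch size mentioned in the message
        "format": "json",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break  # no more batches
        # Merge the continuation values (e.g. cmcontinue) into the next request.
        params = {**params, **data["continue"]}
```

Passing the fetch function as a parameter keeps the paging logic testable without network access; something like `for title in category_members("Category:English-language films"): ...` would then walk the whole category one batch at a time.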
On the other hand, if I want to get a list of all National Hockey League (NHL) players,
this is a lot more difficult. The category "Category:Lists of National Hockey League
players" exists, but it's a category of lists of players. Much of the
categorization of Wikipedia turns out to be in lists, not categories. I could write a
web scraper for this, but that would probably be very unreliable.
Is there a standardized way to deal with lists and sublists that I might have missed? I
don't mind writing a bunch of code to recursively crawl sublists and expand them. But I
would like to avoid something as non-standard as web scraping the content because it will
be very fragile.
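One possible way to crawl sublists without scraping HTML is to ask the API for the links on each list page (action=query with prop=links) and recurse into linked pages that look like lists. The sketch below rests on several assumptions: the endpoint URL, the helper names fetch_json, page_links, and expand_list, and above all the heuristic that a sublist is any title starting with "List of " or "Lists of ". It is not a standardized mechanism, and it will over-collect, because a list page also links to things that are not list members (teams, seasons, and so on).

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint (English Wikipedia)


def fetch_json(params):
    """Send one GET request to the API and decode the JSON reply (hypothetical helper)."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "list-sketch/0.1 (example)"}  # placeholder UA
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def page_links(title, fetch=fetch_json):
    """Yield titles of main-namespace pages linked from `title`, following continuations."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "plnamespace": "0",  # articles only
        "pllimit": "max",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            for link in page.get("links", []):
                yield link["title"]
        if "continue" not in data:
            break
        params = {**params, **data["continue"]}


def expand_list(title, fetch=fetch_json, max_depth=2, seen=None):
    """Collect non-list links reachable from a list page, descending into sublists."""
    seen = set() if seen is None else seen
    seen.add(title)  # remember visited list pages to avoid cycles
    entries = set()
    for linked in page_links(title, fetch):
        if linked in seen:
            continue
        # Heuristic (an assumption): titles with these prefixes are treated as sublists.
        if linked.startswith(("List of ", "Lists of ")):
            if max_depth > 0:
                entries |= expand_list(linked, fetch, max_depth - 1, seen)
        else:
            entries.add(linked)
    return entries
```

The max_depth limit and the seen set are there only to keep the recursion bounded; filtering the result down to actual players would still need some extra signal, such as checking each page's categories.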
Thank you for the help,
-mike
_______________________________________________
Mediawiki-api mailing list
Mediawiki-api(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api