Hello everyone,
I am trying to use the MediaWiki API to create a dictionary based on categories or lists on Wikipedia. I would like to be able to select a category, or perhaps a list page, and get all members of that list.
I've done some reading of the API and implemented a prototype. It works, but only when the data is structured exactly the way I need. For example, I can easily get a list of all of the English-language films. I'm using action=query with list=categorymembers for this. I get 500 films at a time, and I can continue as needed to get all 60k or so. This works because each English-language film's individual page is tagged with a single category.
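For reference, the categorymembers loop described above could look roughly like the sketch below. It is only an illustration under assumptions: the endpoint is taken to be English Wikipedia, the User-Agent string and the function names fetch_json and category_members are made up for this example, and the fetch argument exists only so the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

# Assumption: English Wikipedia; other wikis expose the same api.php path.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON response."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace it with one that identifies your tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-dictionary-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_members(category, fetch=fetch_json):
    """Yield every direct member of a category, 500 per request."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    cont = {}
    while True:
        data = fetch({**params, **cont})
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break  # no further batches
        # Send the returned continuation values back with the next request.
        cont = data["continue"]
```

Something like `[m["title"] for m in category_members("Category:English-language films")]` would then collect the titles, issuing one request per 500 members.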
On the other hand, if I want to get a list of all National Hockey League (NHL) players, this is a lot more difficult. The category "Category:Lists of National Hockey League players" exists, but it's a category of lists of players. Much of the categorization of Wikipedia turns out to be in lists, not categories. I could write a web scraper for this, but that would probably be very unreliable.
Is there a standardized way to deal with lists and sublists that I might have missed? I don't mind writing a bunch of code to recursively crawl sublists and expand them. But I would like to avoid something as nonstandard as web scraping the content, because it would be very fragile.
Thank you for the help, -mike
On Wed, Jan 10, 2018 at 2:19 PM, Mike MacHenry mike.machenry@gmail.com wrote:
On the other hand, if I want to get a list of all National Hockey League (NHL) players, this is a lot more difficult. The category "Category:Lists of National Hockey League players" exists, but it's a category of lists of players. Much of the categorization of Wikipedia turns out to be in lists, not categories. I could write a web scraper for this, but that would probably be very unreliable.
There is a Category:National Hockey League players. You'll have to handle subcategories on your own but that's still a lot less messy than parsing HTML.
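One way to handle the subcategories yourself is a breadth-first walk over the category tree. The sketch below is only an illustration under assumptions: the names direct_members and pages_in_tree, the max_depth cutoff, and the User-Agent string are made up for this example, and the fetch argument is there so the traversal can be tested offline. It relies on the "ns" field returned for each member, where 14 is the Category namespace and 0 is the article namespace.

```python
import json
import urllib.parse
import urllib.request
from collections import deque

# Assumption: English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the API and decode the JSON response."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace it with one that identifies your tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-dictionary-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def direct_members(category, fetch):
    """Yield the pages and subcategories directly inside one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page|subcat",
        "cmlimit": "500",
        "format": "json",
    }
    cont = {}
    while True:
        data = fetch({**params, **cont})
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break
        cont = data["continue"]


def pages_in_tree(root, max_depth=3, fetch=fetch_json):
    """Collect article titles under root, descending at most max_depth levels."""
    seen = {root}        # categories already queued; guards against cycles
    titles = set()
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        for member in direct_members(category, fetch):
            if member["ns"] == 14:  # subcategory
                if depth < max_depth and member["title"] not in seen:
                    seen.add(member["title"])
                    queue.append((member["title"], depth + 1))
            elif member["ns"] == 0:  # ordinary article
                titles.add(member["title"])
    return titles
```

The depth limit and the seen set matter in practice, since category graphs can contain loops and can wander far from the original topic a few levels down.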
Is there a standardized way to deal with lists and sublists that I might have missed? I don't mind writing a bunch of code to recursively crawl sublists and expand them. But I would like to avoid something as nonstandard as web scraping the content, because it would be very fragile.
There is not. You can check if Wikidata has something appropriate (e.g. all humans with the P3522 (NHL.com player ID) property), but otherwise you are on your own. Also, there is no guarantee that Wikipedia and Wikidata have the same data (every Wikipedia article has an item in Wikidata, but often the properties are not fleshed out yet).
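As a rough illustration of the Wikidata route, the sketch below asks the public Wikidata Query Service for items that have a P3522 value. Treat it as an assumption-laden example: the "instance of human" pattern (P31 = Q5) is my reading of "all humans", and the names QUERY, run_sparql, and nhl_players plus the User-Agent string are invented for this sketch. The fetch argument only exists so the parsing can be checked without network access.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
SPARQL_URL = "https://query.wikidata.org/sparql"

# Humans (P31 = Q5) that have an NHL.com player ID (P3522), with English labels.
QUERY = """
SELECT ?player ?playerLabel ?nhlId WHERE {
  ?player wdt:P31 wd:Q5 ;
          wdt:P3522 ?nhlId .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def run_sparql(query):
    """Run one SPARQL query and return the decoded JSON result."""
    url = SPARQL_URL + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    # Hypothetical User-Agent; replace it with one that identifies your tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-dictionary-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def nhl_players(fetch=run_sparql):
    """Map each NHL.com player ID to the item's English label."""
    data = fetch(QUERY)
    return {
        row["nhlId"]["value"]: row["playerLabel"]["value"]
        for row in data["results"]["bindings"]
    }
```

As noted above, the result is only as complete as the property coverage on Wikidata, so it may not match the Wikipedia category.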
It’s not quite ready for “gold standard” evaluation, but I wonder if wptools could be helpful?
https://github.com/siznax/wptools
We just added support for category continuations.
_______________________________________________
Mediawiki-api mailing list
Mediawiki-api@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api