Wow, thank you! It would take me a whole month to write such a query. :-|
Ettore Rizza
On Sat, 12 Jan 2019 at 15:42, Lucas Werkmeister mail@lucaswerkmeister.de wrote:
SELECT ?item ?titleEn WITH {
  SELECT ?item WHERE {
    ?item wdt:P31 wd:Q5;
          wdt:P106 wd:Q36180;
          wdt:P21 wd:Q6581097;
          wikibase:sitelinks ?sitelinks.
  }
  # ORDER BY DESC(?sitelinks)
  LIMIT 50
} AS %maleAuthors WHERE {
  INCLUDE %maleAuthors.
  hint:SubQuery hint:optimizer "None".
  ?article schema:about ?item;
           schema:isPartOf <https://en.wikipedia.org/>;
           schema:name ?titleEn.
  BIND(STR(?titleEn) AS ?title)
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "Generator";
                    wikibase:endpoint "en.wikipedia.org";
                    mwapi:generator "allpages";
                    mwapi:gapfrom ?title;
                    mwapi:gapminsize "10000";
                    mwapi:gaplimit "1";
                    wikibase:limit 1 .
    ?item_ wikibase:apiOutputItem mwapi:item.
  }
  FILTER(?item = ?item_)
}
LIMIT 50
Conveniently, it has a minimum size parameter built in, so we don’t even need to get the size as a revision property and filter on it afterwards.
However, this requires one API call per item, so it doesn’t scale at all – this query with just 50 arbitrary author items already takes about half a minute. (The commented-out ORDER BY DESC(?sitelinks) is intended as a heuristic to find larger articles first, but all the top 50 authors by sitelinks have articles longer than 10000 bytes on enwiki, so in that case you might as well just skip the MWAPI part altogether.)
I don’t think this can work very well. Even if MWAPI were expanded so that we could directly feed 50 or even 500 titles to the query API (as the titles parameter, skipping generators altogether), that would probably still be too much of a bottleneck for this kind of query.

On 12.01.19 15:00, Ettore RIZZA wrote:
Hi,
Since the MediaWiki API makes it possible to get the size in bytes of the last revision https://en.wikipedia.org/w/api.php?action=query&format=json&titles=barack%20obama&prop=revisions&rvprop=size of a Wikipedia page, isn't it possible to retrieve this information with a generator? (It's a genuine question; I'm not at all comfortable with this API.)
Ettore Rizza
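For readers who want to try the revision-size call from the question above outside the query service, here is a minimal Python sketch. It only builds the request URL and reads sizes out of an already-decoded JSON response; it makes no network call. The function names and the 10000-byte default are illustrative, not part of any library, and the response shape assumed here is the API's default JSON layout (pages keyed by page ID, each with a "revisions" list).

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"
# Assumption: ordinary clients may pass at most 50 titles per request.
MAX_TITLES_PER_REQUEST = 50


def build_size_query(titles):
    """Return a URL asking for the latest revision size of each title."""
    if len(titles) > MAX_TITLES_PER_REQUEST:
        raise ValueError("too many titles for one request")
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "rvprop": "size",
        "titles": "|".join(titles),  # multiple titles are pipe-separated
    }
    return API_URL + "?" + urlencode(params)


def extract_sizes(response):
    """Map page title -> size in bytes from a decoded JSON response."""
    sizes = {}
    for page in response.get("query", {}).get("pages", {}).values():
        revisions = page.get("revisions")
        if revisions:  # missing pages carry no "revisions" key
            sizes[page["title"]] = revisions[0]["size"]
    return sizes


def filter_by_min_size(sizes, min_bytes=10000):
    """Return the titles whose size is at least min_bytes, sorted by name."""
    return sorted(title for title, size in sizes.items() if size >= min_bytes)
```

Fetching the URL with any HTTP client and passing the decoded JSON to extract_sizes gives the sizes for one batch; a long list of titles would need one request per batch of 50, which is the bottleneck discussed in the reply.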
On Sat, 12 Jan 2019 at 14:41, Reem Al-Kashif reemalkashif@gmail.com wrote:
Right, I see what you mean. Thanks a lot!
On Sat, 12 Jan 2019 at 15:35, Lucas Werkmeister mail@lucaswerkmeister.de wrote:
Well, if you take just the MWAPI part of the query https://query.wikidata.org/#SELECT%20%3Ftitle%20WHERE%20%7B%0A%20%20SERVICE%20wikibase%3Amwapi%20%7B%0A%20%20%20%20bd%3AserviceParam%20wikibase%3Aendpoint%20%22en.wikipedia.org%22%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wikibase%3Aapi%20%22Generator%22%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20mwapi%3Agenerator%20%22querypage%22%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20mwapi%3Agqppage%20%22Longpages%22%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20mwapi%3Agqplimit%20%22max%22.%0A%20%20%20%20%3Ftitle%20wikibase%3AapiOutput%20mwapi%3Atitle.%0A%20%20%7D%0A%7D, you’ll get exactly 10000 results, but most of them aren’t male authors (a lot of them seem to be lists of various kinds). And I think those 10000 results are all we can get from the API, so if we limit those to male authors afterwards, we only get a few results (about 100), and there’s no way to increase that number as far as I’m aware, because apparently we can’t get more than 10000 total pages from MWAPI.
Cheers, Lucas

On 12.01.19 13:57, Reem Al-Kashif wrote:
Thank you so much, Nicolas & Lucas!
@Lucas this helps a lot! At least I will get an idea about what I need until PetScan is sorted out. Would you elaborate a bit more on what you mean by "most of its results are linked to items we don’t care about"?
Best, Reem
On Sat, 12 Jan 2019 at 14:18, Lucas Werkmeister <mail@lucaswerkmeister.de> wrote:
You can’t directly query for the size as far as I know, but you can use the longpages query page generator to get a list of the longest enwiki pages, then filter the associated items for male authors. But this will only get you about a hundred results until the longpages list is exhausted (most of its results are linked to items we don’t care about), and it won’t get you the actual size (and therefore the order of results isn’t necessarily meaningful either, you just know they’re among the longest pages).
SELECT ?item ?titleEn WHERE {
  hint:Query hint:optimizer "None".
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "en.wikipedia.org";
                    wikibase:api "Generator";
                    mwapi:generator "querypage";
                    mwapi:gqppage "Longpages";
                    mwapi:gqplimit "max".
    ?title wikibase:apiOutput mwapi:title.
  }
  BIND(STRLANG(?title, "en") AS ?titleEn)
  ?sitelink schema:name ?titleEn;
            schema:isPartOf <https://en.wikipedia.org/>;
            schema:about ?item.
  ?item wdt:P31 wd:Q5;
        wdt:P106 wd:Q36180;
        wdt:P21 wd:Q6581097.
}
Try it!
Cheers, Lucas

On 12.01.19 12:56, Nicolas VIGNERON wrote:
Hi Reem,
If this page https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI is up to date, it does not seem possible to get the size of a Wikipedia article (but I must admit I don't use or know "wikibase:mwapi" much; maybe this has changed or will change).
Cheers, Nicolas
On Sat, 12 Jan 2019 at 12:16, Reem Al-Kashif reemalkashif@gmail.com wrote:
Hello!
Hope this finds you well. I put together a query https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3FsitelinkEn%0A%0AWHERE%20%7B%0A%20%3Fitem%20wdt%3AP31%20wd%3AQ5.%0A%20%3Fitem%20wdt%3AP106%20wd%3AQ36180.%0A%20%3Fitem%20wdt%3AP21%20wd%3AQ6581097.%0A%20%3FsitelinkEn%20schema%3Aabout%20%3Fitem%3B%0A%20%20%09%09%09%20%20%20%20schema%3AisPartOf%20%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E.%0A%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22.%20%7D%0A%20%20%7D to create a list of English Wikipedia articles about male writers. Is it possible to filter the results by size? For example, articles that are larger than or equal to 10k bytes?
I understand that this is better done by PetScan, but my PetScan query https://petscan.wmflabs.org/?language=en&project=wikipedia&depth=50&categories=Male%20writers&ns%5B0%5D=1&larger=10000&search_max_results=500&interface_language=en&&doit= refuses to cooperate for a reason I don't know yet.. :/
Thanks in advance.
Best, Reem
--
*Kind regards, Reem Al-Kashif*
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata