Perhaps more controversial, but interesting as a query exploring the instance-of sub-classes, 'fictional characters':
https://github.com/fhoffa/code_snippets/blob/master/wikidata/fictional_chara...
SELECT en_wiki, SUM(requests) requests, FIRST(instance_of) instance_of, VARIANCE(LOG(requests)) logvar FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a JOIN ( SELECT en_wiki, GROUP_CONCAT(b.en_label) instance_of FROM FLATTEN([wikidata.latest_en_v1], instance_of) a JOIN ( SELECT numeric_id, GROUP_CONCAT(en_label) en_label FROM [fh-bigquery:wikidata.subclasses] b WHERE subclass_of_numeric_id=95074 # fictional character GROUP BY 1 ) b ON a.instance_of.numeric_id=b.numeric_id GROUP BY 1 ) b ON a.title=b.en_wiki #WHERE language='en' GROUP BY 1 HAVING logvar<2 ORDER BY 2 DESC LIMIT 8000
https://github.com/fhoffa/code_snippets/blob/master/wikidata/fictional_chara...
Row en_wiki requests instance_of logvar 1 Deadpool 1728814 mutant,superhero 0.386713546 2 Negasonic_Teenage_Warhead 577439 mutant 0.6765129273 3 Cable_(comics) 356178 mutant,superhero 1.387845036 4 Angel_Dust_(comics) 348442 mutant,superhero 0.5164479432 5 Copycat_(Marvel_Comics) 325989 mutant,supervillain,superhero 0.5722760146 6 Ajax_(comics) 304475 fictional character,mutate,superhero 0.6385063943 7 Colossus_(comics) 229808 mutant,superhero 0.5811843835 8 Geordi_La_Forge 183000 fictional human 1.593820584 9 Batman 168041 fictional character 0.07953897313 10 Hunter_Zolomon 164547 comic character 0.4568131007 11 Jessica_Jones 160619 fictional character,superhero 0.05243812944 12 Jesus 159111 deity,mythical character,human biblical character 0.08203965668 13 Weasel_(Marvel_Comics) 139374 fictional character 0.5396001692 14 ... ... ... ...
On Mon, Mar 14, 2016 at 10:55 PM, Felipe Hoffa felipe.hoffa@gmail.com wrote:
Thanks for your feedback on my first try on loading Wikidata on BigQuery
https://lists.wikimedia.org/pipermail/wikidata/2016-March/008414.html
I think I figured out the 'convolution tree' for sub-classes, I left it here:
https://bigquery.cloud.google.com/table/fh-bigquery:wikidata.subclasses
It seems we have:
SELECT level, COUNT(*) c FROM [fh-bigquery:wikidata.subclasses] b GROUP BY 1 ORDER BY 1
- 453072 classes (level 0).
- 629663 x subclass_of y relations (level 1).
- 635074 x is a subclass_of y and y is a subclass_of z relations (level 2).
- 773622 level 3.
- ...
- 61920 level 11.
- ...
- 196 level 20.
- and that's it... the tree doesn't go deeper than 20.
https://i.imgur.com/BUv8Hdp.png
Now I can ask for the Wikipedia pageviews of everyone that has an occupation that's a sub-class of 'musician' (or 'politician', or any other class):
SELECT en_wiki, SUM(requests) requests, FIRST(occupation) occupation, VARIANCE(LOG(requests)) logvar FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a JOIN ( SELECT en_wiki, GROUP_CONCAT(b.en_label) occupation FROM FLATTEN([wikidata.latest_en_v1], occupation) a JOIN ( SELECT numeric_id, GROUP_CONCAT(en_label) en_label FROM [fh-bigquery:wikidata.subclasses] b WHERE subclass_of_numeric_id=639669 GROUP BY 1 ) b ON a.occupation.numeric_id=b.numeric_id GROUP BY 1 ) b ON a.title=b.en_wiki #WHERE language='en' GROUP BY 1 HAVING logvar<2 ORDER BY 2 DESC LIMIT 8000
https://github.com/fhoffa/code_snippets/blob/master/wikidata/musicians_all_p...
And the results:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/musicians_all_p...
(this query took 5.2s, for 6.77 GB processed)
Hopefully you'll find this useful! I know that SQL is way less expressive than SPARQL, but it might save the day whenever the speed of BigQuery could be required. Try it out if you have a minute.
Please keep the feedback and advise coming, Felipe Hoffa https://twitter.com/felipehoffa