Thanks for your feedback on my first try on loading Wikidata on BigQuery
https://lists.wikimedia.org/pipermail/wikidata/2016-March/008414.html
I think I figured out the 'convolution tree' for sub-classes, I left it here:
https://bigquery.cloud.google.com/table/fh-bigquery:wikidata.subclasses
It seems we have:
SELECT level, COUNT(*) c FROM [fh-bigquery:wikidata.subclasses] b GROUP BY 1 ORDER BY 1
- 453072 classes (level 0). - 629663 x subclass_of y relations (level 1). - 635074 x is a subclass_of y and y is a subclass_of z relations (level 2). - 773622 level 3. - ... - 61920 level 11. - ... - 196 level 20. - and that's it... the tree doesn't go deeper than 20.
https://i.imgur.com/BUv8Hdp.png
Now I can ask for the Wikipedia pageviews of everyone that has an occupation that's a sub-class of 'musician' (or 'politician', or any other class):
SELECT en_wiki, SUM(requests) requests, FIRST(occupation) occupation, VARIANCE(LOG(requests)) logvar FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a JOIN ( SELECT en_wiki, GROUP_CONCAT(b.en_label) occupation FROM FLATTEN([wikidata.latest_en_v1], occupation) a JOIN ( SELECT numeric_id, GROUP_CONCAT(en_label) en_label FROM [fh-bigquery:wikidata.subclasses] b WHERE subclass_of_numeric_id=639669 GROUP BY 1 ) b ON a.occupation.numeric_id=b.numeric_id GROUP BY 1 ) b ON a.title=b.en_wiki #WHERE language='en' GROUP BY 1 HAVING logvar<2 ORDER BY 2 DESC LIMIT 8000
https://github.com/fhoffa/code_snippets/blob/master/wikidata/musicians_all_p...
And the results:
en_wiki requests occupation logvar Kanye_West 940181 singer,rapper 0.6882018066 Sia_Furler 562789 singer,songwriter,composer 0.561231088 Brie_Larson 555301 singer,musician,singer-songwriter 0.6390245475 Beyonc%C3%A9 503342 record producer,singer-songwriter,singer,composer,musician 0.477047463 David_Bowie 502810 singer-songwriter,guitarist,saxophonist,composer,record producer 0.1213822659 Adele 480541 singer-songwriter,singer,guitarist 0.1618690338 Keanu_Reeves 471244 singer,musician 0.53960377 Rihanna 419017 singer 0.1801038939 Taylor_Swift 409519 singer-songwriter,pianist,bajista,composer,guitarist 0.2761908317 Zayn_Malik 405848 singer 0.08530145229 Kesha 402165 singer,composer,singer-songwriter,yodeler 1.781225996 Lady_Gaga 390866 singer,songwriter,record producer,composer,pianist,musician 0.5030283604 Michael_Jackson 361344 singer,singer-songwriter,composer,musician,songwriter,record producer 0.1256237352 Bill_Clinton 347856 saxophonist 0.150877313 Kendrick_Lamar 338141 singer,songwriter,rapper 0.4858019514 Justin_Bieber 329738 singer-songwriter,singer,musician 0.1038992156 ... ... ... ...
https://github.com/fhoffa/code_snippets/blob/master/wikidata/musicians_all_p...
(this query took 5.2s, for 6.77 GB processed)
Hopefully you'll find this useful! I know that SQL is way less expressive than SPARQL, but it might save the day whenever the speed of BigQuery could be required. Try it out if you have a minute.
Please keep the feedback and advise coming, Felipe Hoffa https://twitter.com/felipehoffa