Thanks for your feedback on my first attempt at loading Wikidata into BigQuery:

https://lists.wikimedia.org/pipermail/wikidata/2016-March/008414.html


I think I figured out the 'convolution tree' for sub-classes. I left it here:

https://bigquery.cloud.google.com/table/fh-bigquery:wikidata.subclasses

Counting the rows at each level:

SELECT level, COUNT(*) c
FROM [fh-bigquery:wikidata.subclasses] b
GROUP BY 1
ORDER BY 1

It seems we have:

- 453072 classes (level 0).
- 629663 direct "x subclass_of y" relations (level 1).
- 635074 two-step relations, where x is a subclass_of y and y is a subclass_of z (level 2).
- 773622 at level 3.
- ...
- 61920 at level 11.
- ...
- 196 at level 20.
- And that's it: the tree doesn't go deeper than 20.
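
To make the levels concrete, here is a minimal Python sketch of how such a table could be derived from direct subclass_of pairs. The toy class names, the `subclass_levels` helper and the `max_level` cap are made up for illustration; this is not the code that produced the BigQuery table, and I'm assuming a level is simply the number of subclass_of hops between a class and its ancestor.

```python
from collections import Counter


def subclass_levels(direct_edges, max_level=20):
    """Expand direct (child, parent) pairs into (class, ancestor, level) rows.

    Level 0 pairs each class with itself, level 1 holds the direct
    subclass_of relations, level 2 the two-hop relations, and so on.
    max_level is a hypothetical safety cap so that cycles cannot loop forever.
    """
    parents = {}
    for child, parent in direct_edges:
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())

    rows = []
    for cls in sorted(parents):
        frontier = {cls}
        level = 0
        while frontier and level <= max_level:
            rows.extend((cls, ancestor, level) for ancestor in sorted(frontier))
            # Move one subclass_of hop up from every class in the frontier.
            frontier = {p for node in frontier for p in parents[node]}
            level += 1
    return rows


# Toy data, not taken from Wikidata.
toy_edges = [
    ("rapper", "singer"),
    ("singer", "musician"),
    ("guitarist", "musician"),
    ("musician", "artist"),
]

rows = subclass_levels(toy_edges)

# Same shape as the "SELECT level, COUNT(*)" query: rows per level.
counts = Counter(level for _, _, level in rows)
for level in sorted(counts):
    print(level, counts[level])

# Everything that is (transitively) a sub-class of 'musician'.
musician_classes = sorted(cls for cls, ancestor, _ in rows if ancestor == "musician")
print(musician_classes)
```

On this toy input the counts per level come out as 5, 4, 3 and 1, and filtering on one ancestor is what makes the "sub-class of 'musician'" lookup a plain WHERE clause.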


https://i.imgur.com/BUv8Hdp.png


Now I can ask for the Wikipedia pageviews of everyone whose occupation is a sub-class of 'musician' (or 'politician', or any other class):


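# Total requests per page, for everyone whose occupation is a sub-class of musician.
SELECT en_wiki, SUM(requests) requests, FIRST(occupation) occupation,
               VARIANCE(LOG(requests)) logvar
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
  # One row per en_wiki title, with all matching occupation labels concatenated.
  SELECT en_wiki, GROUP_CONCAT(b.en_label) occupation
  FROM FLATTEN([wikidata.latest_en_v1], occupation) a
  JOIN (
    # Every class listed in the subclasses table as a sub-class of musician (Q639669).
    SELECT numeric_id, GROUP_CONCAT(en_label) en_label
    FROM [fh-bigquery:wikidata.subclasses] b
    WHERE subclass_of_numeric_id=639669
    GROUP BY 1
  ) b
  ON a.occupation.numeric_id=b.numeric_id
  GROUP BY 1
) b
ON a.title=b.en_wiki
#WHERE language='en'
GROUP BY 1
# Keep only pages whose variance of LOG(requests) is below 2.
HAVING logvar<2
ORDER BY 2 DESC
LIMIT 8000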


https://github.com/fhoffa/code_snippets/blob/master/wikidata/musicians_all_pageviews.sql

And the results:

en_wiki         | requests | occupation | logvar
Kanye_West      | 940181   | singer,rapper | 0.6882018066
Sia_Furler      | 562789   | singer,songwriter,composer | 0.561231088
Brie_Larson     | 555301   | singer,musician,singer-songwriter | 0.6390245475
Beyonc%C3%A9    | 503342   | record producer,singer-songwriter,singer,composer,musician | 0.477047463
David_Bowie     | 502810   | singer-songwriter,guitarist,saxophonist,composer,record producer | 0.1213822659
Adele           | 480541   | singer-songwriter,singer,guitarist | 0.1618690338
Keanu_Reeves    | 471244   | singer,musician | 0.53960377
Rihanna         | 419017   | singer | 0.1801038939
Taylor_Swift    | 409519   | singer-songwriter,pianist,bajista,composer,guitarist | 0.2761908317
Zayn_Malik      | 405848   | singer | 0.08530145229
Kesha           | 402165   | singer,composer,singer-songwriter,yodeler | 1.781225996
Lady_Gaga       | 390866   | singer,songwriter,record producer,composer,pianist,musician | 0.5030283604
Michael_Jackson | 361344   | singer,singer-songwriter,composer,musician,songwriter,record producer | 0.1256237352
Bill_Clinton    | 347856   | saxophonist | 0.150877313
Kendrick_Lamar  | 338141   | singer,songwriter,rapper | 0.4858019514
Justin_Bieber   | 329738   | singer-songwriter,singer,musician | 0.1038992156
...

https://github.com/fhoffa/code_snippets/blob/master/wikidata/musicians_all_pageviews.csv

(This query took 5.2s and processed 6.77 GB.)

Hopefully you'll find this useful! I know that SQL is way less expressive than SPARQL, but it might save the day whenever you need the speed of BigQuery. Try it out if you have a minute.

Please keep the feedback and advice coming,
Felipe Hoffa
https://twitter.com/felipehoffa