It seems we have:
SELECT level, COUNT(*) c
FROM [fh-bigquery:wikidata.subclasses] b
GROUP BY 1
ORDER BY 1
- 453072 classes (level 0).
- 629663 x subclass_of y relations (level 1).
- 635074 x is a subclass_of y and y is a subclass_of z relations (level 2).
- 773622 level 3.
- ...
- 61920 level 11.
- ...
- 196 level 20.
- and that's it... the tree doesn't go deeper than 20.
Now I can ask for the Wikipedia pageviews of everyone that has an occupation that's a sub-class of 'musician' (or 'politician', or any other class):
SELECT en_wiki, SUM(requests) requests, FIRST(occupation) occupation,
VARIANCE(LOG(requests)) logvar
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
SELECT en_wiki, GROUP_CONCAT(b.en_label) occupation
FROM FLATTEN([wikidata.latest_en_v1], occupation) a
JOIN (
SELECT numeric_id, GROUP_CONCAT(en_label) en_label
FROM [fh-bigquery:wikidata.subclasses] b
WHERE subclass_of_numeric_id=639669
GROUP BY 1
) b
ON a.occupation.numeric_id=b.numeric_id
GROUP BY 1
) b
ON a.title=b.en_wiki
#WHERE language='en'
GROUP BY 1
HAVING logvar<2
ORDER BY 2 DESC
LIMIT 8000
https://github.com/fhoffa/code_snippets/blob/master/wikidata/musicians_all_pageviews.sqlAnd the results:
en_wiki | requests | occupation | logvar |
Kanye_West | 940181 | singer,rapper | 0.6882018066 |
Sia_Furler | 562789 | singer,songwriter,composer | 0.561231088 |
Brie_Larson | 555301 | singer,musician,singer-songwriter | 0.6390245475 |
Beyonc%C3%A9 | 503342 | record producer,singer-songwriter,singer,composer,musician | 0.477047463 |
David_Bowie | 502810 | singer-songwriter,guitarist,saxophonist,composer,record producer | 0.1213822659 |
Adele | 480541 | singer-songwriter,singer,guitarist | 0.1618690338 |
Keanu_Reeves | 471244 | singer,musician | 0.53960377 |
Rihanna | 419017 | singer | 0.1801038939 |
Taylor_Swift | 409519 | singer-songwriter,pianist,bajista,composer,guitarist | 0.2761908317 |
Zayn_Malik | 405848 | singer | 0.08530145229 |
Kesha | 402165 | singer,composer,singer-songwriter,yodeler | 1.781225996 |
Lady_Gaga | 390866 | singer,songwriter,record producer,composer,pianist,musician | 0.5030283604 |
Michael_Jackson | 361344 | singer,singer-songwriter,composer,musician,songwriter,record producer | 0.1256237352 |
Bill_Clinton | 347856 | saxophonist | 0.150877313 |
Kendrick_Lamar | 338141 | singer,songwriter,rapper | 0.4858019514 |
Justin_Bieber | 329738 | singer-songwriter,singer,musician | 0.1038992156 |
... | ... | ... | ... |
(this query took 5.2s, for 6.77 GB processed)
Hopefully you'll find this useful! I know that SQL is way less expressive than SPARQL, but it might save the day whenever the speed of BigQuery could be required. Try it out if you have a minute.
Please keep the feedback and advise coming,