Thanks for your feedback on my first try on loading Wikidata on BigQuery
https://lists.wikimedia.org/pipermail/wikidata/2016-March/008414.html
I think I figured out the 'convolution tree' for sub-classes, I left it
here:
https://bigquery.cloud.google.com/table/fh-bigquery:wikidata.subclasses
It seems we have:
SELECT level, COUNT(*) c
FROM [fh-bigquery:wikidata.subclasses] b
GROUP BY 1
ORDER BY 1
- 453072 classes (level 0).
- 629663 x subclass_of y relations (level 1).
- 635074 x is a subclass_of y and y is a subclass_of z relations (level 2).
- 773622 level 3.
- ...
- 61920 level 11.
- ...
- 196 level 20.
- and that's it... the tree doesn't go deeper than 20.
https://i.imgur.com/BUv8Hdp.png
Now I can ask for the Wikipedia pageviews of everyone that has an
occupation that's a sub-class of 'musician' (or 'politician', or any
other
class):
SELECT en_wiki, SUM(requests) requests, FIRST(occupation) occupation,
VARIANCE(LOG(requests)) logvar
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
SELECT en_wiki, GROUP_CONCAT(b.en_label) occupation
FROM FLATTEN([wikidata.latest_en_v1], occupation) a
JOIN (
SELECT numeric_id, GROUP_CONCAT(en_label) en_label
FROM [fh-bigquery:wikidata.subclasses] b
WHERE subclass_of_numeric_id=639669
GROUP BY 1
) b
ON a.occupation.numeric_id=b.numeric_id
GROUP BY 1
) b
ON a.title=b.en_wiki
#WHERE language='en'
GROUP BY 1
HAVING logvar<2
ORDER BY 2 DESC
LIMIT 8000
https://github.com/fhoffa/code_snippets/blob/master/wikidata/musicians_all_…
And the results:
en_wiki requests occupation logvar
Kanye_West 940181 singer,rapper 0.6882018066
Sia_Furler 562789 singer,songwriter,composer 0.561231088
Brie_Larson 555301 singer,musician,singer-songwriter 0.6390245475
Beyonc%C3%A9 503342 record
producer,singer-songwriter,singer,composer,musician 0.477047463
David_Bowie 502810 singer-songwriter,guitarist,saxophonist,composer,record
producer 0.1213822659
Adele 480541 singer-songwriter,singer,guitarist 0.1618690338
Keanu_Reeves 471244 singer,musician 0.53960377
Rihanna 419017 singer 0.1801038939
Taylor_Swift 409519 singer-songwriter,pianist,bajista,composer,guitarist
0.2761908317
Zayn_Malik 405848 singer 0.08530145229
Kesha 402165 singer,composer,singer-songwriter,yodeler 1.781225996
Lady_Gaga 390866 singer,songwriter,record producer,composer,pianist,musician
0.5030283604
Michael_Jackson 361344
singer,singer-songwriter,composer,musician,songwriter,record
producer 0.1256237352
Bill_Clinton 347856 saxophonist 0.150877313
Kendrick_Lamar 338141 singer,songwriter,rapper 0.4858019514
Justin_Bieber 329738 singer-songwriter,singer,musician 0.1038992156
... ... ... ...
https://github.com/fhoffa/code_snippets/blob/master/wikidata/musicians_all_…
(this query took 5.2s, for 6.77 GB processed)
Hopefully you'll find this useful! I know that SQL is way less expressive
than SPARQL, but it might save the day whenever the speed of BigQuery could
be required. Try it out if you have a minute.
Please keep the feedback and advise coming,
Felipe Hoffa
https://twitter.com/felipehoffa