Thanks for your feedback on my first try at loading Wikidata into BigQuery:
https://lists.wikimedia.org/pipermail/wikidata/2016-March/008414.html

I think I figured out the 'convolution tree' for sub-classes; I left it here:
https://bigquery.cloud.google.com/table/fh-bigquery:wikidata.subclasses

It seems we have:
SELECT level, COUNT(*) c
FROM [fh-bigquery:wikidata.subclasses] b
GROUP BY 1
ORDER BY 1

- 453072 classes (level 0).
- 629663 "x subclass_of y" relations (level 1).
- 635074 "x subclass_of y and y subclass_of z" relations (level 2).
- 773622 level 3.
- ...
- 61920 level 11.
- ...
- 196 level 20.
- and that's it... the tree doesn't go deeper than 20.

Now I can ask for the Wikipedia pageviews of everyone who has an occupation that's a sub-class of 'musician' (or 'politician', or any other class):
SELECT en_wiki, SUM(requests) requests, FIRST(occupation) occupation,
  VARIANCE(LOG(requests)) logvar
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
  # map each person's en_wiki title to their matching occupation labels
  SELECT en_wiki, GROUP_CONCAT(b.en_label) occupation
  FROM FLATTEN([wikidata.latest_en_v1], occupation) a
  JOIN (
    # every class in the subclass tree under 'musician' (Q639669)
    SELECT numeric_id, GROUP_CONCAT(en_label) en_label
    FROM [fh-bigquery:wikidata.subclasses] b
    WHERE subclass_of_numeric_id=639669
    GROUP BY 1
  ) b
  ON a.occupation.numeric_id=b.numeric_id
  GROUP BY 1
) b
ON a.title=b.en_wiki
#WHERE language='en'
GROUP BY 1
HAVING logvar<2
ORDER BY 2 DESC
LIMIT 8000
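As an aside on the subclasses table: each level above counts subclass_of chains of that length, so you can reproduce the counts from the raw (child, parent) pairs by repeatedly joining the current set of chains against the direct relations. A minimal Python sketch of that path counting (a hypothetical helper of my own, not the code that built the table; it assumes the subclass graph has no cycles):

```python
from collections import defaultdict

def chain_counts(edges):
    """Count subclass_of chains of each length.

    edges: iterable of (child, parent) pairs, i.e. "x subclass_of y" rows.
    Level 1 counts the direct relations; level k counts chains
    x1 -> x2 -> ... -> x(k+1). Assumes the graph is acyclic.
    """
    by_child = defaultdict(list)  # parents reachable in one step from a child
    for child, parent in edges:
        by_child[child].append(parent)

    # chains at the current level, keyed by (start, end) with multiplicity,
    # so two distinct equal-length paths between the same pair both count
    current = defaultdict(int)
    for child, parent in edges:
        current[(child, parent)] += 1

    counts = {1: sum(current.values())}
    level = 1
    while current:
        nxt = defaultdict(int)
        # extend every chain by one more subclass_of step
        for (start, end), n in current.items():
            for parent in by_child.get(end, ()):
                nxt[(start, parent)] += n
        level += 1
        if not nxt:
            break
        counts[level] = sum(nxt.values())
        current = nxt
    return counts
```

With the four edges 1→2, 2→3, 3→4, 1→3 this returns {1: 4, 2: 3, 3: 1}: four direct relations, three two-step chains, and one three-step chain.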
https://github.com/fhoffa/code_snippets/blob/master/wikidata/musicians_all_pageviews.sql

And the results:
(this query took 5.2s, for 6.77 GB processed)

Hopefully you'll find this useful! I know that SQL is way less expressive than SPARQL, but it might save the day whenever the speed of BigQuery is needed. Try it out if you have a minute.

Please keep the feedback and advice coming,