(thx Denny for the introduction to all things Wikidata!)
It's all very early, but I wanted to share some results, and ask for advice on how to continue.
The best news about Wikidata in BigQuery: You can process the whole raw JSON dump in about 7 seconds:
SELECT MIN(LENGTH(item))
FROM [fh-bigquery:wikidata.latest_raw]
WHERE LENGTH(item)>5
(the shortest element in Wikidata is 102 characters, and the WHERE LENGTH(item)>5 filters out the first and last rows of the dump file, which are just square brackets)
You can also parse each JSON record on the fly:
SELECT JSON_EXTRACT_SCALAR(item, '$.id')
FROM [fh-bigquery:wikidata.latest_raw]
WHERE LENGTH(item)=102
(4 seconds; the shortest element is https://www.wikidata.org/wiki/Q2307693)
Or to find cats:
SELECT
  JSON_EXTRACT_SCALAR(item, '$.id') id,
  JSON_EXTRACT_SCALAR(item, '$.sitelinks.enwiki.title') title,
  JSON_EXTRACT_SCALAR(item, '$.labels.en.value') label,
  item
FROM [fh-bigquery:wikidata.latest_raw]
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' # Q146 = house cat
AND LENGTH(item)>10
LIMIT 300
(Wikidata has 54 cats)
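To just count them, the same filter works with a COUNT(*) (note that, like the query above, this only checks the first P31 claim):

SELECT COUNT(*) cats
FROM [fh-bigquery:wikidata.latest_raw]
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146'
AND LENGTH(item)>10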
SQL is very limited though - how about running some JavaScript inside SQL? Here I'm looking for Japanese and Arabic cats, and URL encoding their links:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/find_cats_japanese_arabic.sql (25 links to the Japanese and Arabic Wikipedia)
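If you haven't seen legacy SQL JavaScript UDFs, the shape is roughly this (just a sketch; extractCats is a hypothetical name for the JavaScript function defined alongside the query, as in the linked file):

SELECT id, jawiki, arwiki
FROM extractCats(  # hypothetical JavaScript UDF: parses each JSON item, keeps cats, URL encodes the sitelinks
  SELECT item FROM [fh-bigquery:wikidata.latest_raw]
)
LIMIT 300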
(the full Wikidata dump has ~20M rows, while my new "English Wikidata" table, which keeps only items that have an English Wikipedia page, has ~6M)
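A quick sanity check on those sizes (run one at a time):

SELECT COUNT(*) items FROM [fh-bigquery:wikidata.latest_raw]   # ~20M
SELECT COUNT(*) items FROM [fh-bigquery:wikidata.latest_en_v1] # ~6M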
With this new table, I can write simpler queries that ask questions like "who has female and male genders assigned on Wikidata":
SELECT en_wiki, GROUP_CONCAT(UNIQUE(STRING(gender.numeric_id))) WITHIN RECORD genders
FROM [fh-bigquery:wikidata.latest_en_v1]
# keep only records that have both Q6581072 (female) and Q6581097 (male)
OMIT RECORD IF (EVERY(gender.numeric_id!=6581072) OR EVERY(gender.numeric_id!=6581097))
(33 records, and some look like they shouldn't have been assigned both genders)
Finally, why did I URL encode the titles to the English Wikipedia? So I can run JOINs with the Wikipedia pageviews dataset to find out the most visited cats (or movies?):
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602] a
JOIN (
  SELECT en_wiki
  FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=146  # Q146 = house cat
) b
ON a.title=b.en_wiki
WHERE language='en'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(13 seconds, Grumpy Cat got 19,342 requests in February)
Or, to process way less data, a JOIN that only looks at the top 365k pages from the English Wikipedia:
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
  SELECT en_wiki
  FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=146  # Q146 = house cat
) b
ON a.title=b.en_wiki
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(15 seconds, same answers, but only 14 of the 39 cats are in the top 365k pages)
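And the "(or movies?)" idea from above is the same JOIN with a different class; a sketch assuming films are instances of Q11424 (I haven't verified the results):

SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
  SELECT en_wiki
  FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=11424  # Q11424 = film (assumed; swap in any class)
) b
ON a.title=b.en_wiki
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100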
What I need help with:
- Any advice or feedback?
Thanks, hopefully this is useful to the Wikidata community.
--Felipe Hoffa