Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data in BigQuery:
https://twitter.com/felipehoffa/status/705068002522304512
(thx Denny for the introduction to all things Wikidata!)
It's all very early, but I wanted to share some results, and ask for advice on how to continue.
The best news about Wikidata in BigQuery: You can process the whole raw JSON dump in about 7 seconds:
SELECT MIN(LENGTH(item)) FROM [fh-bigquery:wikidata.latest_raw] WHERE LENGTH(item)>5
(the shortest element in Wikidata is 102 characters; the LENGTH(item)>5 filter drops the first and last rows of the dump file, which are just square brackets)
You can also parse each JSON record on the fly:
SELECT JSON_EXTRACT_SCALAR(item, '$.id') FROM [fh-bigquery:wikidata.latest_raw] WHERE LENGTH(item)=102
(4 seconds, the shortest element is https://www.wikidata.org/wiki/Q2307693)
Or to find cats:
SELECT
  JSON_EXTRACT_SCALAR(item, '$.id') id,
  JSON_EXTRACT_SCALAR(item, '$.sitelinks.enwiki.title') title,
  JSON_EXTRACT_SCALAR(item, '$.labels.en.value') label,
  item
FROM [fh-bigquery:wikidata.latest_raw]
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' # cats
  AND LENGTH(item)>10
LIMIT 300
(Wikidata has 54 cats)
SQL is very limited though - how about running some JavaScript inside SQL? Here I'm looking for cats with pages on the Japanese and Arabic Wikipedias, and URL-encoding their links:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/find_cats_japan...
(25 links to the Japanese and Arabic Wikipedias)
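For reference, here's a minimal sketch of what a legacy BigQuery JavaScript UDF for this kind of query can look like. The function name, output columns, and URL-encoding details are simplified guesses on my part, not the exact code in the linked script:

// JavaScript (UDF editor): parse each raw row, keep cats that have ja/ar sitelinks
function findCats(row, emit) {
  try {
    var item = JSON.parse(row.item);
    if (item.claims.P31[0].mainsnak.datavalue.value['numeric-id'] != 146) return; // not a cat
    ['jawiki', 'arwiki'].forEach(function(wiki) {
      if (item.sitelinks && item.sitelinks[wiki]) {
        var title = item.sitelinks[wiki].title;
        emit({id: item.id,
              url: 'https://' + wiki.replace('wiki', '') + '.wikipedia.org/wiki/'
                   + encodeURIComponent(title.replace(/ /g, '_'))});
      }
    });
  } catch (e) {} // skips the '[' and ']' rows and anything else that won't parse
}

bigquery.defineFunction(
  'findCats',                     // name used from SQL
  ['item'],                       // input column
  [{name: 'id', type: 'string'},  // output schema
   {name: 'url', type: 'string'}],
  findCats);

# SQL: call the UDF as a table function over the raw dump
SELECT id, url FROM findCats(SELECT item FROM [fh-bigquery:wikidata.latest_raw])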
Now that I have full control of each element with JavaScript, I can create a more traditional relational table, with nested elements, that only contains Wikidata items that have a page in the English Wikipedia:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
(Wikidata has ~20M rows, while my "English Wikidata" has ~6M)
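The trick that makes this table possible is that a UDF's output schema can declare nested and repeated fields, so each Wikidata item becomes a single row with arrays inside it. A rough, hypothetical sketch of such an output schema (the real column list lives in the script linked above):

// Hypothetical output schema for the "raw to table" UDF: one row per item,
// with repeated records for multi-valued properties
[
  {name: 'id', type: 'string'},
  {name: 'en_wiki', type: 'string'},     // URL-encoded English Wikipedia title
  {name: 'label_en', type: 'string'},
  {name: 'instance_of', type: 'record', mode: 'repeated',
   fields: [{name: 'numeric_id', type: 'integer'}]},
  {name: 'gender', type: 'record', mode: 'repeated',
   fields: [{name: 'numeric_id', type: 'integer'}]}
]

// ...and the UDF can simply return early for any item without an enwiki sitelink:
// if (!item.sitelinks || !item.sitelinks.enwiki) return;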
With this new table, I can write simpler queries that ask questions like "who has both female and male genders assigned on Wikidata":
SELECT en_wiki,
  GROUP_CONCAT(UNIQUE(STRING(gender.numeric_id))) WITHIN RECORD genders
FROM [fh-bigquery:wikidata.latest_en_v1]
OMIT RECORD IF (EVERY(gender.numeric_id!=6581072) OR EVERY(gender.numeric_id!=6581097))
(33 records, and some look like they shouldn't have been assigned both genders)
Finally, why did I URL encode the titles to the English Wikipedia? So I can run JOINs with the Wikipedia pageviews dataset to find out the most visited cats (or movies?):
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602] a
JOIN (
  SELECT en_wiki
  FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=146
) b
ON a.title=b.en_wiki
WHERE language='en'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(13 seconds, Grumpy Cat got 19,342 requests in February)
Or, to process way less data, a JOIN that only looks at the top 365k pages from the English Wikipedia:
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
  SELECT en_wiki
  FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=146
) b
ON a.title=b.en_wiki
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(15 seconds, same answers, but only 14 of the 39 cats are in the top 365k pages)
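The exact recipe behind pagecounts_201602_en_top365k isn't spelled out here, but a one-off aggregation along these lines would produce a similar "top English pages" table:

SELECT title, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602]
WHERE language='en'
GROUP BY title
ORDER BY requests DESC
LIMIT 365000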
What I need help with:
- Advice, feedback?
- My "raw to table" JavaScript code is incomplete and not very pretty - which columns would you want extracted? https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
Try it out... it's free (up to a replenishing monthly limit), and I wrote instructions to get started while at the last Wikimania in Mexico: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikip...
Thanks, hopefully this is useful to the Wikidata community.
--Felipe Hoffa https://twitter.com/felipehoffa