Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data in BigQuery:
https://twitter.com/felipehoffa/status/705068002522304512
(thx Denny for the introduction to all things Wikidata!)
It's all very early, but I wanted to share some results, and ask for advice on how to continue.
The best news about Wikidata in BigQuery: You can process the whole raw JSON dump in about 7 seconds:
SELECT MIN(LENGTH(item)) FROM [fh-bigquery:wikidata.latest_raw] WHERE LENGTH(item)>5
(the shortest element in Wikidata is 102 characters, and I use LENGTH(item)>5 to filter out the first and last rows of the dump file, which are just the opening and closing square brackets)
You can also parse each JSON record on the fly:
SELECT JSON_EXTRACT_SCALAR(item, '$.id') FROM [fh-bigquery:wikidata.latest_raw] WHERE LENGTH(item)=102
(4 seconds, the shortest element is https://www.wikidata.org/wiki/Q2307693)
Or to find cats:
SELECT
  JSON_EXTRACT_SCALAR(item, '$.id') id,
  JSON_EXTRACT_SCALAR(item, '$.sitelinks.enwiki.title') title,
  JSON_EXTRACT_SCALAR(item, '$.labels.en.value') label,
  item
FROM [fh-bigquery:wikidata.latest_raw]
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
  AND LENGTH(item)>10
LIMIT 300
(Wikidata has 54 cats)
SQL is very limited though - how about running some JavaScript inside SQL? Here I'm looking for Japanese and Arabic cats, and URL encoding their links:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/find_cats_japan...
(25 links to the Japanese and Arabic Wikipedia)
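The actual query is behind that link (it needs a JavaScript UDF for the URL-encoding), but just to sketch the SQL half of the idea - pulling the jawiki/arwiki sitelink titles of items whose first P31 claim is Q146 - something roughly like this works (the ja_title/ar_title aliases are just placeholders):

SELECT
  JSON_EXTRACT_SCALAR(item, '$.id') id,
  JSON_EXTRACT_SCALAR(item, '$.sitelinks.jawiki.title') ja_title,  # Japanese Wikipedia sitelink
  JSON_EXTRACT_SCALAR(item, '$.sitelinks.arwiki.title') ar_title   # Arabic Wikipedia sitelink
FROM [fh-bigquery:wikidata.latest_raw]
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
  AND LENGTH(item)>10
  AND (JSON_EXTRACT_SCALAR(item, '$.sitelinks.jawiki.title') IS NOT NULL
    OR JSON_EXTRACT_SCALAR(item, '$.sitelinks.arwiki.title') IS NOT NULL)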
Now that I have full control of each element with JavaScript, I can create a more traditional relational table, with nested elements, that only contains Wikidata items that have a page in the English Wikipedia:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
(Wikidata has ~20M rows, while my "English Wikidata" has ~6M)
With this new table, I can write simpler queries that ask questions like "who has both female and male genders assigned on Wikidata":
SELECT en_wiki, GROUP_CONCAT(UNIQUE(STRING(gender.numeric_id))) WITHIN RECORD genders
FROM [fh-bigquery:wikidata.latest_en_v1]
OMIT RECORD IF (EVERY(gender.numeric_id!=6581072) OR EVERY(gender.numeric_id!=6581097))
(33 records, and some look like they shouldn't have been assigned both genders)
Finally, why did I URL encode the titles to the English Wikipedia? So I can run JOINs with the Wikipedia pageviews dataset to find out the most visited cats (or movies?):
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602] a
JOIN (
  SELECT en_wiki
  FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=146
) b
ON a.title=b.en_wiki
WHERE language='en'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(13 seconds, Grumpy Cat got 19,342 requests in February)
Or, to process way less data, a JOIN that only looks at the top 365k pages from the English Wikipedia:
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
  SELECT en_wiki
  FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=146
) b
ON a.title=b.en_wiki
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(15 seconds, same answers, but only 14 of the 39 cats are in the top 365k pages)
What I need help with:
- Advice, feedback?
- My "raw to table" JavaScript code is incomplete and not very pretty - which columns would you want extracted? https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
Try it out... it's free (up to a replenishing monthly limit), and I wrote instructions to get started while at the last Wikimania in Mexico: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikip...
Thanks, hopefully this is useful to the Wikidata community.
--Felipe Hoffa https://twitter.com/felipehoffa
Hoi,
Do not say that you query Wikidata when you restrict it to only those items that exist on Wikipedia. Why do that? Why confuse the results in this way?
Thanks,
GerardM
Hi!
> Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data in BigQuery:
This looks pretty cool!
> (Wikidata has 54 cats)
query.wikidata.org has 59:
I wonder why the difference?
Also, looks like your query:
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
only checks the first claim. But there could be more than one P31 claim in Wikidata.
> - Advice, feedback?
More cool examples! :)
- My "raw to table" javascript code is incomplete and not very pretty -
which columns would you want extracted? https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
That's hard to say - properties can vary a lot, especially depending on the kind of item (a birth date is useful for humans, less so for buildings or paintings :) So it probably has to be rather flexible to cover varied cases.
On Fri, Mar 11, 2016 at 11:57 AM, Stas Malyshev smalyshev@wikimedia.org wrote:
> Hi!
>> Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data in BigQuery:
> This looks pretty cool!
Thanks!
>> (Wikidata has 54 cats)
> query.wikidata.org has 59:
> I wonder why the difference?
> Also, looks like your query:
> WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
> only checks the first claim. But there could be more than one P31 claim in Wikidata.
Exactly, that's the problem (with a solution):
- JSON_EXTRACT_SCALAR() is not expressive enough to allow me to visit all elements in an array.
- But being able to write JS code gives me the flexibility to go and visit each of the claims.
- Instead of running JS code each time, I can create these derived tables that allow me to write code like:
SELECT en_wiki FROM [fh-bigquery:wikidata.latest_en_v1] OMIT RECORD IF EVERY(instance_of.numeric_id!=146)
That query looks at every claim of each item (the claims are stored as nested, repeated elements within each row of this table).
But now I only get 41 cats - because these are the cats that have an enwiki page.
I could do the same for the whole Wikidata, instead of only the elements connected to Wikipedia, but to GerardM's point, why is this so bad? I'm interested in linking Wikidata to external sources, hence I derive a table for that purpose. I'm happy to derive other tables for different purposes, if someone sees the use case.
>> - Advice, feedback?
> More cool examples! :)
Ok, ok. Fictional characters and superheroes:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/fictional_and_s...
SELECT
  en_wiki,
  SUM(requests) requests,
  FIRST(superhero) superhero,
  FIRST(fictional_character) fictional_character,
  FIRST(occupation_superhero) occupation_superhero,
  VARIANCE(LOG(requests)) varlog
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
  SELECT
    en_wiki,
    EVERY(instance_of.numeric_id!=188784) WITHIN RECORD superhero,
    EVERY(instance_of.numeric_id!=95074) WITHIN RECORD fictional_character,
    EVERY(occupation.numeric_id!=188784) WITHIN RECORD occupation_superhero
  FROM [wikidata.latest_en_v1]
  OMIT RECORD IF (
    EVERY(instance_of.numeric_id!=188784)
    AND EVERY(instance_of.numeric_id!=95074)
    AND EVERY(occupation.numeric_id!=188784))
) b
ON a.title=b.en_wiki
GROUP BY 1
HAVING varlog<0.5
ORDER BY 2 DESC
LIMIT 100
What's interesting about this query:
- Some are instance of fictional character.
- Some are instance of superhero.
- Some have occupation superhero.
Notice that I'm filtering by VARIANCE(LOG(requests)), because I don't like spiky data. It makes my results look prettier, but it also shows how having hourly data instead of a monthly summary of pageviews allows me to do some fancier filtering.
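To see what that filter is doing on its own - a small sketch, assuming (as the join above implies) that the top-365k table keeps one row per page per hour - you can rank pages by how spiky their hourly traffic is:

SELECT
  title,
  SUM(requests) requests,          # total pageviews for the month
  COUNT(*) hours,                  # number of hourly rows for this page
  VARIANCE(LOG(requests)) varlog   # large value = spiky hourly traffic
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k]
GROUP BY 1
ORDER BY varlog DESC
LIMIT 100

The HAVING varlog<0.5 in the query above then keeps only the pages at the calm end of that ranking.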
- My "raw to table" javascript code is incomplete and not very pretty -
which columns would you want extracted?
https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
That's hard to say - properties can vary a lot, especially depending on kind of items (birth date useful for humans, less so for buildings or paintings :) So it probably has to be rather flexible to cover varied cases.
Something I need to figure out now: How to navigate the tree of sub-classes. Probably I'll create a 'convolution table', but maybe there's something out there I can use?
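One way I could start - a rough sketch, assuming I first extract a hypothetical two-column table wikidata.subclass_of with one row per P279 "subclass of" claim (id, parent). Legacy BigQuery SQL has no recursion, so each extra level of the tree costs one more self-join:

SELECT
  s1.id direct_subclass,           # classes one level below the starting class
  s2.id subclass_of_subclass       # classes two levels below (NULL if none)
FROM [wikidata.subclass_of] s1     # hypothetical table of P279 claims: (id, parent)
LEFT OUTER JOIN [wikidata.subclass_of] s2
  ON s2.parent = s1.id
WHERE s1.parent = 146              # start from any class id, e.g. Q146

Repeating the join (or materializing its output as a lookup table) extends this to deeper levels.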
Thanks!
Cool :)
Hi!
> I could do the same for the whole Wikidata, instead of only the elements connected to Wikipedia, but to GerardM's point, why is this so bad?
Well, I'm not sure about "bad", but certainly as you can see it won't be complete. Wikidata is a separate data source from Wikipedia, and while there is a huge overlap in the data, they are not identical. Moreover, English Wikipedia is not the only Wikipedia, of course, so limiting the lookup to English only would again omit data. Sometimes this way is good enough, sometimes it would not be.
> Something I need to figure out now: How to navigate the tree of sub-classes. Probably I'll create a 'convolution table', but maybe there's something out there I can use?
This tree is huge :) And it may also be a bit weird. I can't remember off the top of my head any place that had the full tree... There's this tool: http://tools.wmflabs.org/wikidata-todo/tree.html but I think if you ask it to generate the full tree it may not work well.
On 12.03.2016 07:21, Stas Malyshev wrote:
>> Something I need to figure out now: How to navigate the tree of sub-classes. Probably I'll create a 'convolution table', but maybe there's something out there I can use?
> This tree is huge :) And it may also be a bit weird. I can't remember off the top of my head any place that had the full tree... There's this tool: http://tools.wmflabs.org/wikidata-todo/tree.html but I think if you ask it to generate the full tree it may not work well.
You can probably get a list of all subclass-of relations from a SPARQL query, if this is what you need. I am using this in a tool without problems (and I am actually querying for full statements, not for wdt-simplified properties there).
If you are looking for a way to browse the "tree", you could look at http://sergestratan.bitbucket.org/ but the graph is not a tree, and not even acyclic, so it is a bit of a mess to navigate in full.
Markus