Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data in BigQuery:

https://twitter.com/felipehoffa/status/705068002522304512

(thx Denny for the introduction to all things Wikidata!)

It's all very early, but I wanted to share some results, and ask for advice on how to continue.

The best news about Wikidata in BigQuery: You can process the whole raw JSON dump in about 7 seconds:

    SELECT MIN(LENGTH(item))
    FROM [fh-bigquery:wikidata.latest_raw]
    WHERE LENGTH(item)>5

(the shortest element in Wikidata is 102 characters; the LENGTH(item)>5 condition filters out the first and last rows of the dump file, which are just square brackets)

You can also parse each JSON record on the fly:

    SELECT JSON_EXTRACT_SCALAR(item, '$.id')
    FROM [fh-bigquery:wikidata.latest_raw]
    WHERE LENGTH(item)=102

(4 seconds, the shortest element is https://www.wikidata.org/wiki/Q2307693)

Or to find cats:

    SELECT JSON_EXTRACT_SCALAR(item, '$.id') id,
           JSON_EXTRACT_SCALAR(item, '$.sitelinks.enwiki.title') title,
           JSON_EXTRACT_SCALAR(item, '$.labels.en.value') label,
           item
    FROM [fh-bigquery:wikidata.latest_raw]
    WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
    AND LENGTH(item)>10
    LIMIT 300

(Wikidata has 54 cats, i.e. items whose 'instance of' (P31) is 'house cat' (Q146))

SQL alone is pretty limited though - how about running some JavaScript inside SQL? Here I'm looking for cats with Japanese and Arabic Wikipedia pages, and URL-encoding their titles:

https://github.com/fhoffa/code_snippets/blob/master/wikidata/find_cats_japanese_arabic.sql

(25 links to the Japanese and Arabic Wikipedias)
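
In case you don't want to click through: legacy SQL's inline js() function takes a table expression, the input columns, an output schema, and a JavaScript function that emits rows. Here's a trimmed-down sketch of what the linked query does (not a verbatim copy; the ja_title/ar_title column names are just for illustration):

    SELECT id, ja_title, ar_title
    FROM js(
      (SELECT item FROM [fh-bigquery:wikidata.latest_raw] WHERE LENGTH(item)>5),
      item,
      "[{name: 'id', type: 'string'},
        {name: 'ja_title', type: 'string'},
        {name: 'ar_title', type: 'string'}]",
      "function(row, emit) {
        try {
          var e = JSON.parse(row.item);
          // keep only instances of Q146 (house cat)
          var p31 = (e.claims || {}).P31 || [];
          var isCat = p31.some(function(c) {
            var dv = c.mainsnak.datavalue;
            return dv && dv.value['numeric-id'] == 146;
          });
          if (!isCat) return;
          // URL-encode titles the way they show up in the Wikipedia request logs
          var sl = e.sitelinks || {};
          var enc = function(link) {
            return link ? encodeURIComponent(link.title.replace(/ /g, '_')) : null;
          };
          var ja = enc(sl.jawiki), ar = enc(sl.arwiki);
          if (!ja && !ar) return;  // only cats with a Japanese or Arabic page
          emit({id: e.id, ja_title: ja, ar_title: ar});
        } catch (ex) {}  // skip anything that doesn't parse
      }")

That encodeURIComponent trick is what later lets these titles JOIN against the pageviews logs.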


Now that I have full control over each element with JavaScript, I can create a more traditional relational table, with nested fields, that only contains Wikidata items that have a page on the English Wikipedia:

https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_table.sql

(Wikidata has ~20M rows, while my "English Wikidata" has ~6M)
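
The gist of that script, if you don't want to read the whole file: the js() output schema can declare nested, repeated records. A minimal sketch under the same assumptions (the real script extracts more columns, e.g. gender from the P21 claims, and its result is saved to a destination table with "Allow Large Results"):

    SELECT id, en_wiki, instance_of.numeric_id
    FROM js(
      (SELECT item FROM [fh-bigquery:wikidata.latest_raw] WHERE LENGTH(item)>5),
      item,
      "[{name: 'id', type: 'string'},
        {name: 'en_wiki', type: 'string'},
        {name: 'instance_of', type: 'record', mode: 'repeated',
         fields: [{name: 'numeric_id', type: 'integer'}]}]",
      "function(row, emit) {
        try {
          var e = JSON.parse(row.item);
          // keep only items with an English Wikipedia page
          var enwiki = e.sitelinks && e.sitelinks.enwiki;
          if (!enwiki) return;
          // collect the P31 ('instance of') claims as a repeated record
          var p31 = (e.claims || {}).P31 || [];
          var instanceOf = [];
          for (var i = 0; i < p31.length; i++) {
            var dv = p31[i].mainsnak.datavalue;
            if (dv) instanceOf.push({numeric_id: dv.value['numeric-id']});
          }
          emit({
            id: e.id,
            en_wiki: encodeURIComponent(enwiki.title.replace(/ /g, '_')),
            instance_of: instanceOf
          });
        } catch (ex) {}  // skip anything that doesn't parse
      }")

Save the result as a table and the nested columns become queryable with the dot notation below.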

With this new table, I can write simpler queries that ask questions like "who has both female (Q6581072) and male (Q6581097) genders assigned on Wikidata":

    SELECT en_wiki, GROUP_CONCAT(UNIQUE(STRING(gender.numeric_id))) WITHIN RECORD genders
    FROM [fh-bigquery:wikidata.latest_en_v1]
    OMIT RECORD IF (EVERY(gender.numeric_id!=6581072) OR EVERY(gender.numeric_id!=6581097))

(33 records, and some look like they shouldn't have been assigned both genders)


Finally, why did I URL-encode the English Wikipedia titles? So I can run JOINs with the Wikipedia pageviews dataset to find out the most visited cats (or movies?):

    SELECT en_wiki, SUM(requests) requests
    FROM [fh-bigquery:wikipedia.pagecounts_201602] a
    JOIN (
      SELECT en_wiki
      FROM [fh-bigquery:wikidata.latest_en_v1]
      WHERE instance_of.numeric_id=146
    ) b
    ON a.title=b.en_wiki
    WHERE language='en'
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 100

(13 seconds, Grumpy Cat got 19,342 requests in February)

Or, to process way less data, a JOIN that only looks at the top 365k pages from the English Wikipedia:

    SELECT en_wiki, SUM(requests) requests
    FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
    JOIN (
      SELECT en_wiki
      FROM [fh-bigquery:wikidata.latest_en_v1]
      WHERE instance_of.numeric_id=146
    ) b
    ON a.title=b.en_wiki
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 100

(15 seconds, same answers, but only 14 of the 39 cats are in the top 365k pages)


What I need help with:

- Advice, feedback?
- My "raw to table" javascript code is incomplete and not very pretty - which columns would you want extracted? https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_table.sql


Try it out... it's free (up to a monthly quota that replenishes), and I wrote instructions to get started while at the last Wikimania in Mexico: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/
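
If you just want a first query to paste in, this hits the same pageviews table as above and should reproduce the Grumpy Cat number:

    SELECT SUM(requests) requests
    FROM [fh-bigquery:wikipedia.pagecounts_201602]
    WHERE title='Grumpy_Cat'
    AND language='en'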


Thanks, hopefully this is useful to the Wikidata community.

--Felipe Hoffa
https://twitter.com/felipehoffa