Hoi, do not say that you are querying Wikidata when you restrict it to only the items that exist on Wikipedia. Why do that? Why confuse the results in this way? Thanks, GerardM
On 11 March 2016 at 07:13, Felipe Hoffa felipe.hoffa@gmail.com wrote:
Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data in BigQuery:
https://twitter.com/felipehoffa/status/705068002522304512
(thx Denny for the introduction to all things Wikidata!)
It's all very early, but I wanted to share some results, and ask for advice on how to continue.
The best news about Wikidata in BigQuery: You can process the whole raw JSON dump in about 7 seconds:
SELECT MIN(LENGTH(item)) FROM [fh-bigquery:wikidata.latest_raw] WHERE LENGTH(item)>5
(the shortest element in Wikidata is 102 characters; I use LENGTH(item)>5 to filter out the first and last rows of the dump file, which are just square brackets)
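As a quick sanity check (just a sketch along the same lines, untested), the rows at or below that cutoff should be exactly the two bracket lines:

SELECT LENGTH(item) len, COUNT(*) c
FROM [fh-bigquery:wikidata.latest_raw]
WHERE LENGTH(item)<=5  # only the opening and closing brackets of the dump should match
GROUP BY 1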
You can also parse each JSON record on the fly:
SELECT JSON_EXTRACT_SCALAR(item, '$.id') FROM [fh-bigquery:wikidata.latest_raw] WHERE LENGTH(item)=102
(4 seconds, the shortest element is https://www.wikidata.org/wiki/Q2307693 )
Or to find cats:
SELECT JSON_EXTRACT_SCALAR(item, '$.id') id, JSON_EXTRACT_SCALAR(item, '$.sitelinks.enwiki.title') title, JSON_EXTRACT_SCALAR(item, '$.labels.en.value') label, item
FROM [fh-bigquery:wikidata.latest_raw]
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146'  # cats (Q146)
  AND LENGTH(item)>10
LIMIT 300
(Wikidata has 54 cats)
SQL is very limited though - how about running some JavaScript inside SQL? Here I'm looking for Japanese and Arabic cats, and URL encoding their links:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/find_cats_japan...
(25 links to the Japanese and Arabic Wikipedia)
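(In legacy BigQuery SQL the JavaScript itself lives in a separate UDF definition, registered with bigquery.defineFunction, and the query then calls it like a table function. Roughly, the calling side looks like the sketch below; find_cats and its output columns are hypothetical names standing in for the actual UDF in the gist above.)

SELECT id, ja_sitelink, ar_sitelink
FROM find_cats(  # hypothetical UDF: parses each JSON item, keeps P31=Q146, URL-encodes the sitelinks
  SELECT item
  FROM [fh-bigquery:wikidata.latest_raw]
  WHERE LENGTH(item)>10
)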
Now that I have full control of each element with JavaScript, I can create a more traditional relational table, with nested elements, that only contains Wikidata items that have a page in the English Wikipedia:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
(Wikidata has ~20M rows, while my "English Wikidata" has ~6M)
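For instance, a sketch that uses only the nested fields appearing later in this mail (en_wiki, instance_of.numeric_id): pulling the items that are instances of human (Q5) and have an English Wikipedia page becomes a plain filter:

SELECT en_wiki
FROM [fh-bigquery:wikidata.latest_en_v1]
WHERE instance_of.numeric_id=5  # human (Q5)
LIMIT 10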
With this new table, I can write simpler queries that ask questions like "who has female and male genders assigned on Wikidata":
SELECT en_wiki, GROUP_CONCAT(UNIQUE(STRING(gender.numeric_id))) WITHIN RECORD genders
FROM [fh-bigquery:wikidata.latest_en_v1]
OMIT RECORD IF (EVERY(gender.numeric_id!=6581072) OR EVERY(gender.numeric_id!=6581097))
(33 records, and some look like they shouldn't have been assigned both genders)
Finally, why did I URL encode the titles to the English Wikipedia? So I can run JOINs with the Wikipedia pageviews dataset to find out the most visited cats (or movies?):
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602] a
JOIN (
  SELECT en_wiki FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=146
) b ON a.title=b.en_wiki
WHERE language='en'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(13 seconds, Grumpy Cat got 19,342 requests in February)
Or, to process much less data, a JOIN that only looks at the top 365k pages of the English Wikipedia:
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
  SELECT en_wiki FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=146
) b ON a.title=b.en_wiki
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(15 seconds, same answers, but only 14 of the 39 cats are in the top 365k pages)
What I need help with:
- Advice, feedback?
- My "raw to table" javascript code is incomplete and not very pretty -
which columns would you want extracted? https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
Try it out... it's free (up to a replenishing monthly limit), and I wrote instructions to get started while at the last Wikimania in Mexico: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikip...
Thanks, hopefully this is useful to the Wikidata community.
--Felipe Hoffa https://twitter.com/felipehoffa
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata