Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data in BigQuery:

https://twitter.com/felipehoffa/status/705068002522304512

(thx Denny for the introduction to all things Wikidata!)

It's all very early, but I wanted to share some results, and ask for advice on how to continue.

The best news about Wikidata in BigQuery: You can process the whole raw JSON dump in about 7 seconds:
SELECT MIN(LENGTH(item))
FROM [fh-bigquery:wikidata.latest_raw]
WHERE LENGTH(item)>5

(the shortest element in Wikidata is 102 characters, and I use the LENGTH(item)>5 condition to filter out the first and last rows of the dump file, which are simple square brackets)
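As a quick sanity check (just a sketch against the same table), the only rows that short should be those two bracket lines:

SELECT item, LENGTH(item) len
FROM [fh-bigquery:wikidata.latest_raw]
WHERE LENGTH(item)<=5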
You can also parse each JSON record on the fly:

SELECT JSON_EXTRACT_SCALAR(item, '$.id')
FROM [fh-bigquery:wikidata.latest_raw]
WHERE LENGTH(item)=102

(4 seconds, the shortest element is https://www.wikidata.org/wiki/Q2307693)

Or to find cats:
SELECT JSON_EXTRACT_SCALAR(item, '$.id') id,
JSON_EXTRACT_SCALAR(item, '$.sitelinks.enwiki.title') title,
JSON_EXTRACT_SCALAR(item, '$.labels.en.value') label,
item
FROM [fh-bigquery:wikidata.latest_raw]
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
AND LENGTH(item)>10
LIMIT 300

(Wikidata has 54 cats)

SQL is very limited though - how about running some JavaScript inside SQL? Here I'm looking for Japanese and Arabic cats, and URL encoding their links:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/find_cats_japanese_arabic.sql

(25 links to the Japanese and Arabic Wikipedia)
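Roughly, the mechanism looks like this - just a sketch of the idea, not the linked query; names like catLinks, ja_wiki and ar_wiki are made up for illustration. In legacy SQL the JavaScript goes into the UDF editor, and the query then selects FROM the function:

function catLinks(row, emit) {
  try {
    var item = JSON.parse(row.item);
    var ja = item.sitelinks && item.sitelinks.jawiki ? item.sitelinks.jawiki.title : null;
    var ar = item.sitelinks && item.sitelinks.arwiki ? item.sitelinks.arwiki.title : null;
    // Wikipedia URLs use underscores instead of spaces, then percent-encoding
    emit({
      id: item.id,
      ja_wiki: ja ? encodeURIComponent(ja.replace(/ /g, '_')) : null,
      ar_wiki: ar ? encodeURIComponent(ar.replace(/ /g, '_')) : null
    });
  } catch (e) {}  // skip rows that aren't valid JSON (e.g. the bracket lines)
}
bigquery.defineFunction(
  'catLinks', ['item'],
  [{name: 'id', type: 'string'},
   {name: 'ja_wiki', type: 'string'},
   {name: 'ar_wiki', type: 'string'}],
  catLinks);

SELECT id, ja_wiki, ar_wiki
FROM catLinks(
  SELECT item FROM [fh-bigquery:wikidata.latest_raw]
  WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
)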
Now that I have full control of each element with JavaScript, I can create a more traditional relational table, with nested elements, that only contains Wikidata items that have a page in the English Wikipedia:

https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_table.sql

(Wikidata has ~20M rows, while my "English Wikidata" has ~6M)
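To get a feel for the nested layout (a minimal sketch - en_wiki and instance_of.numeric_id are the columns the queries below rely on; any other column names would be guesswork):

SELECT en_wiki, instance_of.numeric_id
FROM [fh-bigquery:wikidata.latest_en_v1]
LIMIT 10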
With this new table, I can write simpler queries that ask questions like "who has female and male genders assigned on Wikidata":

SELECT en_wiki, GROUP_CONCAT(UNIQUE(STRING(gender.numeric_id))) WITHIN RECORD genders
FROM [fh-bigquery:wikidata.latest_en_v1]
OMIT RECORD IF (EVERY(gender.numeric_id!=6581072) OR EVERY(gender.numeric_id!=6581097))

(33 records, and some look like they shouldn't have been assigned both genders)

Finally, why did I URL encode the titles to the English Wikipedia? So I can run JOINs with the Wikipedia pageviews dataset to find out the most visited cats (or movies?):
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602] a
JOIN (
SELECT en_wiki
FROM [fh-bigquery:wikidata.latest_en_v1]
WHERE instance_of.numeric_id=146
) b
ON a.title=b.en_wiki
WHERE language='en'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100

(13 seconds, Grumpy Cat got 19,342 requests in February)

Or, to process way less data, a JOIN that only looks at the top 365k pages from the English Wikipedia:
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
SELECT en_wiki
FROM [fh-bigquery:wikidata.latest_en_v1]
WHERE instance_of.numeric_id=146
) b
ON a.title=b.en_wiki
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100

(15 seconds, same answers, but only 14 of the 39 cats are in the top 365k pages)

What I need help with:

- Advice, feedback?
- My "raw to table" JavaScript code is incomplete and not very pretty - which columns would you want extracted? https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_table.sql

Try it out... it's free (up to a replenishing monthly limit), and I wrote instructions to get started while at the last Wikimania in Mexico: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/

Thanks, hopefully this is useful to the Wikidata community.

--Felipe Hoffa