Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data in BigQuery:
https://twitter.com/felipehoffa/status/705068002522304512
(thx Denny for the introduction to all things Wikidata!)
It's all very early, but I wanted to share some results, and ask for advice on how to continue.
The best news about Wikidata in BigQuery: You can process the whole raw JSON dump in about 7 seconds:
SELECT MIN(LENGTH(item)) FROM [fh-bigquery:wikidata.latest_raw] WHERE LENGTH(item)>5
(the shortest element in Wikidata is 102 characters, and I use LENGTH(item)>5 to filter out the first and last rows of the dump file, which are just the opening and closing square brackets)
You can also parse each JSON record on the fly:
SELECT JSON_EXTRACT_SCALAR(item, '$.id') FROM [fh-bigquery:wikidata.latest_raw] WHERE LENGTH(item)=102
(4 seconds, the shortest element is https://www.wikidata.org/wiki/Q2307693)
Or to find cats:
SELECT
  JSON_EXTRACT_SCALAR(item, '$.id') id,
  JSON_EXTRACT_SCALAR(item, '$.sitelinks.enwiki.title') title,
  JSON_EXTRACT_SCALAR(item, '$.labels.en.value') label,
  item
FROM [fh-bigquery:wikidata.latest_raw]
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
  AND LENGTH(item)>10
LIMIT 300
(Wikidata has 54 cats)
SQL is very limited though - how about running some JavaScript inside SQL? Here I'm looking for Japanese and Arabic cats, and URL encoding their links:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/find_cats_japan...
(25 links to the Japanese and Arabic Wikipedia)
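The actual query is behind that link (it needs a JavaScript UDF for the URL-encoding), but just to sketch the SQL half of the idea - pulling the jawiki/arwiki sitelink titles of items whose first P31 claim is Q146 - something roughly like this works (the ja_title/ar_title aliases are just placeholders):

SELECT
  JSON_EXTRACT_SCALAR(item, '$.id') id,
  JSON_EXTRACT_SCALAR(item, '$.sitelinks.jawiki.title') ja_title,  # Japanese Wikipedia sitelink
  JSON_EXTRACT_SCALAR(item, '$.sitelinks.arwiki.title') ar_title   # Arabic Wikipedia sitelink
FROM [fh-bigquery:wikidata.latest_raw]
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
  AND LENGTH(item)>10
  AND (JSON_EXTRACT_SCALAR(item, '$.sitelinks.jawiki.title') IS NOT NULL
    OR JSON_EXTRACT_SCALAR(item, '$.sitelinks.arwiki.title') IS NOT NULL)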
Now that I have full control of each element with JavaScript, I can create a more traditional relational table, with nested elements, that only contains Wikidata items that have a page in the English Wikipedia:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
(Wikidata has ~20M rows, while my "English Wikidata" has ~6M)
With this new table, I can write simpler queries that ask questions like "who has both female and male genders assigned on Wikidata":
SELECT en_wiki, GROUP_CONCAT(UNIQUE(STRING(gender.numeric_id))) WITHIN RECORD genders
FROM [fh-bigquery:wikidata.latest_en_v1]
OMIT RECORD IF (EVERY(gender.numeric_id!=6581072) OR EVERY(gender.numeric_id!=6581097))
(33 records, and some look like they shouldn't have been assigned both genders)
Finally, why did I URL encode the titles to the English Wikipedia? So I can run JOINs with the Wikipedia pageviews dataset to find out the most visited cats (or movies?):
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602] a
JOIN (
  SELECT en_wiki
  FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=146
) b
ON a.title=b.en_wiki
WHERE language='en'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(13 seconds, Grumpy Cat got 19,342 requests in February)
Or, to process way less data, a JOIN that only looks at the top 365k pages from the English Wikipedia:
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
  SELECT en_wiki
  FROM [fh-bigquery:wikidata.latest_en_v1]
  WHERE instance_of.numeric_id=146
) b
ON a.title=b.en_wiki
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(15 seconds, same answers, but only 14 of the 39 cats are in the top 365k pages)
What I need help with:
- Advice, feedback?
- My "raw to table" JavaScript code is incomplete and not very pretty - which columns would you want extracted? https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
Try it out... it's free (up to a replenishing monthly limit), and I wrote instructions to get started while at the last Wikimania in Mexico: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikip...
Thanks, hopefully this is useful to the Wikidata community.
--Felipe Hoffa https://twitter.com/felipehoffa
Hoi,
Do not say that you query Wikidata when you restrict it to only those items that exist on Wikipedia. Why do that? Why confuse the results in this way?
Thanks,
GerardM
Hi!
> Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data in BigQuery:
This looks pretty cool!
> (Wikidata has 54 cats)
query.wikidata.org has 59:
I wonder why the difference?
Also, looks like your query:
WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
only checks the first claim. But there could be more than one P31 claim in Wikidata.
> - Advice, feedback?
More cool examples! :)
- My "raw to table" javascript code is incomplete and not very pretty -
which columns would you want extracted? https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
That's hard to say - properties can vary a lot, especially depending on the kind of item (a birth date is useful for humans, less so for buildings or paintings :) So it probably has to be rather flexible to cover varied cases.
On Fri, Mar 11, 2016 at 11:57 AM, Stas Malyshev smalyshev@wikimedia.org wrote:
> Hi!
>> Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data in BigQuery:
> This looks pretty cool!
Thanks!
>> (Wikidata has 54 cats)
> query.wikidata.org has 59:
> I wonder why the difference?
> Also, looks like your query:
> WHERE JSON_EXTRACT_SCALAR(item, '$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
> only checks the first claim. But there could be more than one P31 claim in Wikidata.
Exactly, that's the problem (with a solution):
- JSON_EXTRACT_SCALAR() is not expressive enough to allow me to visit all elements in an array.
- But being able to write JS code gives me the flexibility to go and visit each of the claims.
- Instead of running JS code each time, I can create these derived tables that allow me to write code like:
SELECT en_wiki FROM [fh-bigquery:wikidata.latest_en_v1] OMIT RECORD IF EVERY(instance_of.numeric_id!=146)
That query looks at every claim of each item (the claims are stored as nested, repeated elements within each row of this table).
But now I only get 41 cats - because these are the cats that have an enwiki page.
I could do the same for the whole Wikidata, instead of only the elements connected to Wikipedia, but to GerardM's point, why is this so bad? I'm interested in linking Wikidata to external sources, hence I derive a table for that purpose. I'm happy to derive other tables for different purposes, if someone sees the use case.
>> - Advice, feedback?
> More cool examples! :)
Ok, ok. Fictional characters and superheroes:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/fictional_and_s...
SELECT
  en_wiki,
  SUM(requests) requests,
  FIRST(superhero) superhero,
  FIRST(fictional_character) fictional_character,
  FIRST(occupation_superhero) occupation_superhero,
  VARIANCE(LOG(requests)) varlog
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
  SELECT
    en_wiki,
    EVERY(instance_of.numeric_id!=188784) WITHIN RECORD superhero,
    EVERY(instance_of.numeric_id!=95074) WITHIN RECORD fictional_character,
    EVERY(occupation.numeric_id!=188784) WITHIN RECORD occupation_superhero
  FROM [wikidata.latest_en_v1]
  OMIT RECORD IF (
    EVERY(instance_of.numeric_id!=188784)
    AND EVERY(instance_of.numeric_id!=95074)
    AND EVERY(occupation.numeric_id!=188784))
) b
ON a.title=b.en_wiki
GROUP BY 1
HAVING varlog<0.5
ORDER BY 2 DESC
LIMIT 100
What's interesting about this query:
- Some are instance of fictional character.
- Some are instance of superhero.
- Some have occupation superhero.
Notice that I'm filtering by VARIANCE(LOG(requests)), because I don't like spiky data. It makes my results look prettier, but it also shows how having hourly data instead of a monthly summary of pageviews allows me to do some fancier filtering.
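To see what that filter is doing on its own - a small sketch, assuming (as the join above implies) that the top-365k table keeps one row per page per hour - you can rank pages by how spiky their hourly traffic is:

SELECT
  title,
  SUM(requests) requests,          # total pageviews for the month
  COUNT(*) hours,                  # number of hourly rows for this page
  VARIANCE(LOG(requests)) varlog   # large value = spiky hourly traffic
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k]
GROUP BY 1
ORDER BY varlog DESC
LIMIT 100

The HAVING varlog<0.5 in the query above then keeps only the pages at the calm end of that ranking.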
- My "raw to table" javascript code is incomplete and not very pretty -
which columns would you want extracted?
https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en_...
That's hard to say - properties can vary a lot, especially depending on kind of items (birth date useful for humans, less so for buildings or paintings :) So it probably has to be rather flexible to cover varied cases.
Something I need to figure out now: How to navigate the tree of sub-classes. Probably I'll create a 'convolution table', but maybe there's something out there I can use?
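One way I could start - a rough sketch, assuming I first extract a hypothetical two-column table wikidata.subclass_of with one row per P279 "subclass of" claim (id, parent). Legacy BigQuery SQL has no recursion, so each extra level of the tree costs one more self-join:

SELECT
  s1.id direct_subclass,           # classes one level below the starting class
  s2.id subclass_of_subclass       # classes two levels below (NULL if none)
FROM [wikidata.subclass_of] s1     # hypothetical table of P279 claims: (id, parent)
LEFT OUTER JOIN [wikidata.subclass_of] s2
  ON s2.parent = s1.id
WHERE s1.parent = 146              # start from any class id, e.g. Q146

Repeating the join (or materializing its output as a lookup table) extends this to deeper levels.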
Thanks!
Cool :)
Hi!
> I could do the same for the whole Wikidata, instead of only the elements connected to Wikipedia, but to GerardM's point, why is this so bad?
Well, I'm not sure about "bad", but certainly as you can see it won't be complete. Wikidata is a separate data source from Wikipedia, and while there is a huge overlap in the data, they are not identical. Moreover, English Wikipedia is not the only Wikipedia, of course, so limiting the lookup to English only would again omit data. Sometimes this way is good enough, sometimes it would not be.
> Something I need to figure out now: How to navigate the tree of sub-classes. Probably I'll create a 'convolution table', but maybe there's something out there I can use?
This tree is huge :) And it may also be a bit weird. I can't remember off the top of my head any place that had the full tree... There's this tool: http://tools.wmflabs.org/wikidata-todo/tree.html but I think if you ask it to generate the full tree it may not work well.
On 12.03.2016 07:21, Stas Malyshev wrote:
>> Something I need to figure out now: How to navigate the tree of sub-classes. Probably I'll create a 'convolution table', but maybe there's something out there I can use?
> This tree is huge :) And it may also be a bit weird. I can't remember off the top of my head any place that had the full tree... There's this tool: http://tools.wmflabs.org/wikidata-todo/tree.html but I think if you ask it to generate the full tree it may not work well.
You can probably get a list of all subclass-of relations from a SPARQL query, if this is what you need. I am using this in a tool without problems (and I am actually querying for full statements, not for wdt-simplified properties there).
If you are looking for a way to browse the "tree", you could look at http://sergestratan.bitbucket.org/ but the graph is not a tree, and not even acyclic, so it is a bit of a mess to navigate in full.
Markus