Cool :)
On Fri, Mar 11, 2016 at 2:32 PM, Felipe Hoffa <felipe.hoffa(a)gmail.com>
wrote:
On Fri, Mar 11, 2016 at 11:57 AM, Stas Malyshev
<smalyshev(a)wikimedia.org>
wrote:
Hi!
Hi! I'm Felipe Hoffa at Google, and I've
been playing with Wikidata's
data in BigQuery:
This looks pretty cool!
Thanks!
(Wikidata
has 54 cats)
query.wikidata.org has 59:
http://tinyurl.com/ju7xsn2
I wonder why the difference?
Also, looks like your query:
WHERE JSON_EXTRACT_SCALAR(item,
'$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
only checks the first claim. But there could be more than one P31 claims
in Wikidata.
Exactly, that's the problem (with a solution):
- JSON_EXTRACT_SCALAR() is not expressive enough to allow me to visit all
elements in an array.
- But being able to write JS code gives me the flexibility to go and visit
each of the claims.
- Instead of running JS code each time, I can create these derived tables
that allow me to write code like:
SELECT en_wiki
FROM [fh-bigquery:wikidata.latest_en_v1]
OMIT RECORD IF EVERY(instance_of.numeric_id!=146)
That looks into every claim of each item (that this table has as nested
elements in each row).
But now I only get 41 cats - because these are the cats that have an
enwiki page.
I could do the same for the whole Wikidata, instead of only the elements
connected to Wikipedia, but to GerardM's point, why is this so bad? I'm
interested in linking Wikidata to external sources, hence I derive a table
for that purpose. I'm happy to derive other tables for different purposes,
if someone sees the use case.
- Advice, feedback?
More cool examples! :)
Ok, ok. Fictional characters and superheroe:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/fictional_and_…
SELECT en_wiki, SUM(requests) requests,
FIRST(superhero) superhero, FIRST(fictional_character)
fictional_character, FIRST(occupation_superhero) occupation_superhero,
VARIANCE(LOG(requests)) varlog
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
SELECT en_wiki,
EVERY(instance_of.numeric_id!=188784) WITHIN RECORD superhero,
EVERY(instance_of.numeric_id!=95074) WITHIN RECORD fictional_character,
EVERY(occupation.numeric_id!=188784) WITHIN RECORD
occupation_superhero,
FROM [wikidata.latest_en_v1]
OMIT RECORD IF (EVERY(instance_of.numeric_id!=188784)
AND EVERY(instance_of.numeric_id!=95074)
AND EVERY(occupation.numeric_id!=188784))
) b
ON a.title=b.en_wiki
GROUP BY 1
HAVING varlog<0.5
ORDER BY 2 DESC
LIMIT 100
What's interesting about this query:
- Some are instance of fictional characters.
- Some are instance of superhero.
- Some have occupation superhero.
Notice that I'm filtering by VARIANCE(LOG(pageviews)), because I don't
like spiky data. Makes my results look prettier, but it also shows how
having hourly data instead of monthly summary of pageviews allow me to do
some fancier filtering.
- My "raw to table" javascript code is
incomplete and not very pretty -
which columns would you want extracted?
https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en…
That's hard to say - properties can vary a lot, especially depending on
kind of items (birth date useful for humans, less so for buildings or
paintings :) So it probably has to be rather flexible to cover varied
cases.
Something I need to figure out now: How to navigate the tree of
sub-classes. Probably I'll create a 'convolution table', but maybe
there's
something out there I can use?
Thanks!
--
Stas Malyshev
smalyshev(a)wikimedia.org
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
- Scott MacLeod - Founder & President
- Please donate to tax-exempt 501 (c) (3)
- World University and School
- via PayPal, or credit card, here -
-
- or send checks to
- 415 480 4577
- PO Box 442, (86 Ridgecrest Road), Canyon, CA 94516
- World University and School - like Wikipedia with best STEM-centric
OpenCourseWare - incorporated as a nonprofit university and school in
California, and is a U.S. 501 (c) (3) tax-exempt educational organization.
World University and School is sending you this because of your interest in
free, online, higher education. If you don't want to receive these, please
reply with 'unsubscribe' in the body of the email, leaving the subject line
intact. Thank you.