Hi! I'm Felipe Hoffa at Google, and I've been playing with Wikidata's data
in BigQuery:
https://twitter.com/felipehoffa/status/705068002522304512
(thx Denny for the introduction to all things Wikidata!)
It's all very early, but I wanted to share some results, and ask for advice
on how to continue.
The best news about Wikidata in BigQuery: You can process the whole raw
JSON dump in about 7 seconds:
SELECT MIN(LENGTH(item))
FROM [fh-bigquery:wikidata.latest_raw]
WHERE LENGTH(item)>5
(the shortest element in Wikidata is 102 characters; the LENGTH(item)>5
condition filters out the first and last rows of the dump file, which are
just square brackets)
You can also parse each JSON record on the fly:
SELECT JSON_EXTRACT_SCALAR(item, '$.id')
FROM [fh-bigquery:wikidata.latest_raw]
WHERE LENGTH(item)=102
(4 seconds, the shortest element is https://www.wikidata.org/wiki/Q2307693)
Or to find cats:
SELECT JSON_EXTRACT_SCALAR(item, '$.id') id,
JSON_EXTRACT_SCALAR(item, '$.sitelinks.enwiki.title') title,
JSON_EXTRACT_SCALAR(item, '$.labels.en.value') label,
item
FROM [fh-bigquery:wikidata.latest_raw]
WHERE JSON_EXTRACT_SCALAR(item,
'$.claims.P31[0].mainsnak.datavalue.value.numeric-id')='146' #cats
AND LENGTH(item)>10
LIMIT 300
(Wikidata has 54 cats)
SQL is very limited though - how about running some JavaScript inside SQL?
Here I'm looking for Japanese and Arabic cats, and URL encoding their links:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/find_cats_japa…
(25 links to the Japanese and Arabic Wikipedia)
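The gist link above is truncated, so here is a minimal sketch of what the per-row logic of such a legacy-BigQuery JavaScript UDF might look like. This is my own illustration, not the code from the gist: the function name (extractCatLinks) and output columns (id, wiki, title) are assumptions, and in BigQuery the function would additionally be registered via bigquery.defineFunction:

```javascript
// Hypothetical sketch of the core of a legacy-SQL BigQuery UDF that scans
// raw Wikidata JSON rows for cats (P31 -> Q146) and URL-encodes the titles
// of the Japanese and Arabic sitelinks. Names are illustrative only.
function extractCatLinks(row, emit) {
  var item;
  try {
    // dump lines may carry a trailing comma; strip it before parsing,
    // and skip unparseable rows (the "[" / "]" lines of the dump)
    item = JSON.parse(row.item.replace(/,\s*$/, ''));
  } catch (e) { return; }
  var claims = item.claims && item.claims.P31;
  if (!claims) return;
  var isCat = claims.some(function (c) {
    var v = c.mainsnak && c.mainsnak.datavalue && c.mainsnak.datavalue.value;
    return !!(v && v['numeric-id'] === 146);
  });
  if (!isCat) return;
  ['jawiki', 'arwiki'].forEach(function (wiki) {
    var link = item.sitelinks && item.sitelinks[wiki];
    if (link) {
      emit({ id: item.id, wiki: wiki, title: encodeURIComponent(link.title) });
    }
  });
}
```

The function body is plain JavaScript, so it can be tested locally before wiring it into a query.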
Now that I have full control of each element with JavaScript, I can create
a more traditional relational table, with nested elements, that only
contains Wikidata items that have a page in the English Wikipedia:
https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en…
(Wikidata has ~20M rows, while my "English Wikidata" has ~6M)
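For reference, the raw-to-relational mapping could be sketched roughly as below. This is an assumption-laden sketch, not the script from the gist: the column names (en_wiki, label, instance_of, gender) are inferred from the queries later in this post, and the underscore/URL-encoding of titles is my guess at what makes the pageviews JOIN work:

```javascript
// Hypothetical per-item mapper: raw Wikidata JSON -> one nested relational
// row, keeping only items that have an English Wikipedia page.
function toEnRow(rawJson) {
  var item = JSON.parse(rawJson.replace(/,\s*$/, '')); // tolerate trailing comma
  var enwiki = item.sitelinks && item.sitelinks.enwiki;
  if (!enwiki) return null; // drop items without an English Wikipedia page

  // collect the numeric-ids of a property's claims as a repeated column
  function numericIds(prop) {
    return ((item.claims && item.claims[prop]) || []).map(function (c) {
      var v = c.mainsnak && c.mainsnak.datavalue && c.mainsnak.datavalue.value;
      return v && v['numeric-id'] != null ? { numeric_id: v['numeric-id'] } : null;
    }).filter(Boolean);
  }

  return {
    id: item.id,
    // spaces -> underscores, then URL-encode, to match pageview titles
    en_wiki: encodeURIComponent(enwiki.title.replace(/ /g, '_')),
    label: item.labels && item.labels.en ? item.labels.en.value : null,
    instance_of: numericIds('P31'), // nested/repeated
    gender: numericIds('P21')       // nested/repeated
  };
}
```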
With this new table, I can write simpler queries that ask questions like
"who has female and male genders assigned on Wikidata":
SELECT en_wiki,
  GROUP_CONCAT(UNIQUE(STRING(gender.numeric_id))) WITHIN RECORD genders
FROM [fh-bigquery:wikidata.latest_en_v1]
OMIT RECORD IF (EVERY(gender.numeric_id!=6581072) OR
  EVERY(gender.numeric_id!=6581097))
(33 records, and some look like they shouldn't have been assigned both
genders)
Finally, why did I URL encode the titles to the English Wikipedia? So I can
run JOINs with the Wikipedia pageviews dataset to find out the most visited
cats (or movies?):
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602] a
JOIN (
SELECT en_wiki
FROM [fh-bigquery:wikidata.latest_en_v1]
WHERE instance_of.numeric_id=146
) b
ON a.title=b.en_wiki
WHERE language='en'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(13 seconds, Grumpy Cat got 19,342 requests in February)
Or, to process far less data, a JOIN that only looks at the top 365k pages
of the English Wikipedia:
SELECT en_wiki, SUM(requests) requests
FROM [fh-bigquery:wikipedia.pagecounts_201602_en_top365k] a
JOIN (
SELECT en_wiki
FROM [fh-bigquery:wikidata.latest_en_v1]
WHERE instance_of.numeric_id=146
) b
ON a.title=b.en_wiki
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
(15 seconds, same answers, but only 14 of the 39 cats are in the top 365k
pages)
What I need help with:
- Advice, feedback?
- My "raw to table" JavaScript code is incomplete and not very pretty -
which columns would you want extracted?
https://github.com/fhoffa/code_snippets/blob/master/wikidata/create_wiki_en…
Try it out... it's free (up to a replenishing monthly limit), and I wrote
instructions to get started while at the last Wikimania in Mexico:
https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wiki…
Thanks, hopefully this is useful to the Wikidata community.
--Felipe Hoffa
https://twitter.com/felipehoffa
Apologies for cross-posting.
========================================================
CALL FOR WORKSHOP AND TUTORIAL PROPOSALS
20th International Conference on Knowledge Engineering and Knowledge
Management (EKAW 2016)
Workshop and Tutorial days: 19-20 November 2016, Bologna, Italy
Proposal submission: May 25, 2016
Web site: http://ekaw2016.cs.unibo.it/
========================================================
The International Conference on Knowledge Engineering and Knowledge
Management (EKAW) is concerned with all aspects of eliciting, acquiring,
modeling and managing knowledge, as well as its role in the construction of
knowledge-intensive systems and services for the semantic web, knowledge
management, e-business, natural language processing, intelligent
information integration, etc.
Besides a research track, EKAW will host a number of workshops and tutorials
on topics related to the theme of the conference. We hope our workshops will
provide an informal setting where participants have the opportunity to
discuss specific technical topics in an atmosphere that fosters the active
exchange of ideas, and that our tutorials will enable attendees to fully
appreciate current issues, main schools of thought, and possible application
areas.
== TOPICS OF INTEREST ==
In order to meet these goals, workshop/tutorial proposals should address
topics that satisfy the following criteria:
- the topic falls in the general scope of EKAW 2016
(http://ekaw2016.cs.unibo.it/?q=callforpapers),
- there is a clear focus on a specific technology, problem or
application, and
- there is a sufficiently large community interested in the topic.
== SUBMISSION GUIDELINES ==
Proposals should be submitted via EasyChair, which will be made available
on EKAW’s web site shortly (http://ekaw2016.cs.unibo.it/). Submissions
should be a single PDF file of no more than 5 pages, specifying "Workshop
Proposal" or "Tutorial Proposal", and should contain the following
information.
Workshop proposals:
- Title.
- Abstract (200 words).
- Motivation on why the topic is of particular interest at this time
and its relation to the main conference topics.
- Workshop format, discussing the mix of events such as paper
presentations, invited talks, panels, and general discussion.
- Intended audience and expected number of participants.
- List of (potential) members of the program committee (at least 25%
have to be confirmed at the time of the proposal, confirmed participants
should be marked specifically).
- Indication of whether the workshop should be considered for a
half-day or full-day.
- The tentative dates (submission, notification, camera-ready deadline,
etc.)
- Past versions of the workshop, including URLs as well as number of
submissions and acceptance rates.
- Data of the organizers (name, affiliation, email address, homepage)
and short CV.
Additionally
- we strongly advise having more than one organizer, preferably from
different institutions, bringing different perspectives to the workshop
topic.
- we welcome, and will prioritise, workshops with creative structures
and organizations that attract various types of contributions and ensure
rich interactions.
Tutorial proposals:
- Title.
- Abstract (200 words).
- Relation to the conference topics, i.e. why it will be of interest to
the conference attendees.
- If the tutorial, or a very similar tutorial, has been given
elsewhere, explanation of the benefit of presenting it again to the EKAW
community.
- Overview of content, description of the aims, presentation style,
potential/preferred prerequisite knowledge.
- Indication of whether the tutorial should be considered for a
half-day or full-day.
- Intended audience and expected number of participants.
- Audio-visual or technical requirements and any special room
requirements (for hands-on sessions, any software needed and download sites
must be provided by the tutorial presenters).
- Data of the presenters (name, affiliation, email address, homepage)
and short CV including also their expertise, experiences in teaching and in
tutorial presentation.
== WORKSHOP ORGANIZERS RESPONSIBILITIES ==
The organizers of accepted workshops are expected to:
- prepare a workshop webpage (linked to the official EKAW website)
containing the call for papers and detailed information about the workshop
organization and timelines.
- be responsible for the workshop publicity.
- be responsible for their own reviewing process, decide upon the final
program content and report the number of submissions and accepted papers to
the workshop chair.
- be responsible for publishing electronic proceedings (e.g., on the
CEUR-WS website).
- assure workshop participants are informed they have to register to
the main conference and the workshop.
- schedule, attend and coordinate their entire workshop.
== TUTORIAL ORGANIZERS RESPONSIBILITIES ==
The proposers of accepted tutorials are expected to prepare a tutorial
webpage (linked to the official EKAW website) containing detailed
information about the tutorial, and to distribute material to participants.
== SUBMISSION DATES AND DETAIL ==
Important Dates
- Proposals due: May 25, 2016
- Notifications: June 27, 2016
Suggested Timeline for Workshops
- Workshop website up and calls: July 18, 2016
- Deadline to submit Papers to Workshops: September 15, 2016
- Acceptance of Papers for Workshops: October 6, 2016
- Workshop days: November 19-20, 2016
== CHAIRS ==
- Jun Zhao (University of Oxford)
- Matthew Horridge (Stanford University)
Apologies for cross-posting
========================================================
13th ESWC 2016
http://2016.eswc-conferences.org/call-challenges
Call for Semantic Web Challenges Entries
Open Knowledge Extraction (OKE) Challenge
Challenge on Semantic Sentiment Analysis
Conference Live app Challenge
Open Challenge on Question Answering over Linked Data
Top-K Shortest Path in Large Typed RDF Graphs Challenge
Semantic Publishing Challenge
schema.org - Bonus Challenge
Wikidata - Bonus Challenge
========================================================
OVERVIEW
The 13th ESWC, to be held from May 29th to June 2nd in Heraklion, Crete,
features no less than seven challenges this year!
The purpose of the challenges is to showcase the maturity of state of
the art methods and tools on tasks common to the Semantic Web community
and adjacent disciplines, in a controlled setting involving rigorous
evaluation.
Semantic Web Challenges are an official track of the conference,
ensuring significant visibility for the challenges as well as
participants. Challenge participants are asked to present their
submissions as well as provide a paper describing their work. The
details of the submissions may vary per challenge and will be found in
the individual calls. These papers must undergo a peer-review by experts
relevant to the challenge task, and will be published in the official
ESWC2016 Satellite Events proceedings.
IMPORTANT DATES
Individual challenges may deviate from these dates but as a rule the
following dates apply:
* Training data ready and challenges Calls for Papers sent: Friday
January 15th, 2016
* Challenge papers submission deadline: Monday March 21st, 2016
* Challenge paper reviews due: Tuesday April 5th, 2016
* Notifications sent to participants and invitations to submit task
results: Friday April 8th, 2016
* Camera ready papers due: Sunday April 24th, 2016
CHALLENGES AT A GLANCE
Open Knowledge Extraction (OKE) Challenge
The OKE challenge, launched in its first edition at last year's Extended
Semantic Web Conference, ESWC 2015, has the ambition to provide a
reference framework for research on Knowledge Extraction from text for
the Semantic Web by re-defining a number of tasks (typically from
information and knowledge extraction), taking into account specific SW
requirements.
http://2016.eswc-conferences.org/eswc-16-open-knowledge-extraction-oke-chal…
Challenge on Semantic Sentiment Analysis
Social media evolution has given users an important opportunity for
expressing their thoughts and opinions online. The information thus
produced relates to many different areas, such as commerce, tourism,
education and health, and causes the size of the Social Web to expand
exponentially.
http://2016.eswc-conferences.org/eswc-16-challenge-semantic-sentiment-analy…
Conference Live app Challenge
In the past two years the Extended Semantic Web Conference (ESWC) has
provided a semantic Web application to browse conference data. The
application, called Conference Live, is a Web and mobile application
based on conference data from the Semantic Web Dog Food server, which
provides facilities to browse papers and authors at a specific conference.
http://2016.eswc-conferences.org/eswc-16-conference-live-app-challenge
6th Open Challenge on Question Answering over Linked Data (QALD-6)
The past years have seen a growing amount of research on question
answering over Semantic Web data, shaping an interaction paradigm that
allows end users to profit from the expressive power of Semantic Web
standards while at the same time hiding their complexity behind an
intuitive and easy-to-use interface. The Question Answering over Linked
Data challenge provides an up-to-date benchmark for assessing and
comparing systems that mediate between a user, expressing his or her
information need in natural language, and RDF data.
http://2016.eswc-conferences.org/6th-open-challenge-question-answering-over…
Top-K Shortest Path in Large Typed RDF Graphs Challenge
The advent of SPARQL 1.1 introduced property paths as a new graph
matching paradigm that allows the employment of the Kleene star * (and its
variant +) unary operators to build SPARQL queries that are agnostic of
the underlying RDF graph structure. The ability to express path patterns
that are agnostic of the underlying graph structure is certainly a step
forward.
http://2016.eswc-conferences.org/top-k-shortest-path-large-typed-rdf-graphs…
Semantic Publishing Challenge 2016 – Assessing the Quality of Scientific
Output in its Ecosystem
This is the next iteration of the successful Semantic Publishing
Challenge of ESWC 2014 and 2015. We continue pursuing the objective of
assessing the quality of scientific output, evolving the dataset
bootstrapped in 2014 and 2015 to take into account the wider ecosystem
of publications.
http://2016.eswc-conferences.org/assessing-quality-scientific-output-its-ec…
schema.org - Bonus Challenge
Rather than create a separate schema.org challenge, we encourage where
appropriate submissions to other ESWC2016 challenges to consider also
exploring schema.org's relationship with Linked Data and Semantic Web
tools, technologies, vocabularies and datasets.
http://2016.eswc-conferences.org/bonus-challenge
Wikidata - Bonus Challenge
Wikidata is the largest free and open general purpose knowledge base in
the world, collecting a wide variety of common and specialized knowledge
in a machine-readable form. Wikimedia projects like Wikipedia make use
of the data to enrich their articles. Anyone else is equally welcome to
use the data in Wikidata to enrich their applications or do research,
for example.
Over the past 3 years, Wikidata has grown rapidly and built a great
community around structured knowledge.
The purpose of this additional challenge is to explore ways of closing
key gaps in Wikidata or between Wikidata and the Linked Data and
Semantic Web community.
First we offer a brief background on Wikidata and its current key gaps,
then we outline how this relates to this year's set of challenges.
http://2016.eswc-conferences.org/wikidata-challenge
CONTACT
ESWC 2016 Challenge Chairs
* Stefan Dietze, L3S Research Center, Germany (dietze(a)l3s.de)
* Anna Tordai, Elsevier, Netherlands (a.tordai(a)elsevier.com)
--
Prof. Dr. Heiko Paulheim
Data and Web Science Group
University of Mannheim
Phone: +49 621 181 2661
B6, 26, Room C1.09
D-68159 Mannheim
Mail: heiko(a)informatik.uni-mannheim.de
Web: www.heikopaulheim.com
Hi,
There is a performance issue with the labelling service. Using labels
makes even simple queries time out. For example this one:
SELECT $p $pLabel
WHERE {
$p wdt:P31 _:bnode .
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} LIMIT 11
The workaround is to use subqueries. For example, the following query
returns immediately:
SELECT $p $pLabel
WHERE {
{ SELECT $p WHERE { $p wdt:P31 _:bnode . } LIMIT 11 }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
I strongly suspect that almost every use of the labelling service could
be rewritten like this (the only exception is when you apply further
query conditions on the label). BlazeGraph should recognize this.
Meanwhile, everybody who uses queries with labels in an application
should rewrite them as above to get the best performance (and reduce
load on the query service ;-).
Cheers,
Markus
--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/
Glad to see an effort that integrates data from both databases!
Marco
On 3/3/16 13:00, wikidata-request(a)lists.wikimedia.org wrote:
> Date: Wed, 02 Mar 2016 22:00:03 +0000
> From: Denny Vrandečić<vrandecic(a)gmail.com>
> To: "Discussion list for the Wikidata project."
> <wikidata(a)lists.wikimedia.org>
> Subject: Re: [Wikidata] nice
> Message-ID:
> <CAJVtBfcZpTeMpaXobe-Zw4zC4SakNbNxiMKY6-e3qL60MD-GOQ(a)mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> (and to make it clear, it is unclear whether this is an error due to
> DBpedia or due to the companies extraction framework, I was not diving into
> the data)
>
> On Wed, Mar 2, 2016 at 1:59 PM Denny Vrandečić <vrandecic(a)gmail.com> wrote:
>
>> Depends how good the DBpedia data really is - as the BBC article says,
>> some 2007 football match in the UK was extracted as a "Battle"...
>>
>> On Wed, Mar 2, 2016 at 1:54 PM Daniel Kinzler <daniel.kinzler(a)wikimedia.de> wrote:
>>
>>> "They found 12,703 battles which had an exact location and date, 2,657 of them
>>> are from Wikidata, the others are from DBpedia."
>>>
>>> Maybe we can do better?
>>>
>>> On 02.03.2016 at 22:14, Lydia Pintscher wrote:
>>>> On Wed, Mar 2, 2016 at 8:14 PM Gerard Meijssen <gerard.meijssen(a)gmail.com> wrote:
>>>>
>>>>> Hoi,
>>>>> Yup I missed that one.. this [1] was my source :)
>>>>> Gerard
>>>>>
>>>>> [1] http://www.bbc.com/news/magazine-35685889
>>>>
>>>> This is really great. I am thrilled about this because this isn't coverage
>>>> about Wikidata but coverage _with_ Wikidata on major news sites for the
>>>> second time this week
>>>> (http://www.faz.net/aktuell/feuilleton/kino/academy-awards-die-oscars-von-19…
>>>> being the other one). They're using Wikidata data to do meaningful
>>>> reporting. Our data and the project as a whole got (at the very least)
>>>> good enough for this. It feels to me like we've broken through a wall.
>>>> High5 everyone! :D
>>>>
>>>> Cheers
>>>> Lydia
>>>> --
>>>> Lydia Pintscher - http://about.me/lydia.pintscher
>>>> Product Manager for Wikidata
>>>>
>>>> Wikimedia Deutschland e.V.
>>>> Tempelhofer Ufer 23-24
>>>> 10963 Berlin
>>>> www.wikimedia.de
>>>>
>>>> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>>>>
>>>> Registered in the register of associations of the Amtsgericht
>>>> Berlin-Charlottenburg under number 23855 Nz. Recognized as a non-profit
>>>> by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
>>>>
>>>> _______________________________________________
>>>> Wikidata mailing list
>>>> Wikidata(a)lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>
>>> --
>>> Daniel Kinzler
>>> Senior Software Developer
>>>
>>> Wikimedia Deutschland
>>> Gesellschaft zur Förderung Freien Wissens e.V.
>>>
>>> _______________________________________________
>>> Wikidata mailing list
>>> Wikidata(a)lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
Hello Wikidata community!

Wikidata is a great platform for collecting information, and the
high-quality work of many authors yields very reliable information. Still,
a challenge for users of Wikidata is that there is no way to see whether
*all* data on a certain topic is in Wikidata. For instance, it is easy to
see that Malia and Sasha are children of Obama, but there is no way to
state that these are all his children. More generally, Wikidata stores
many facts, but no information about the topics for which it contains all
facts.

Today we are happy to share with you a prototype that allows adding and
managing such completeness information, and we would be happy to get your
feedback on how useful you consider this tool, or where you see space for
improvements.

With our prototype, called COOL-WD (Completeness Tool for Wikidata), one can:
1. See completeness statements for Wikidata facts
2. Add, remove, aggregate and filter completeness statements
3. See how completeness statements allow conclusions about the
completeness of SPARQL queries over Wikidata

COOL-WD is available at http://cool-wd.inf.unibz.it/ and a 3-min demo
video can be found at http://cool-wd.inf.unibz.it/coolwd-hd.mp4
It employs various libraries, most importantly GWT, Apache Jena, SQLite
and the Wikidata API.

The formal background and description of the tool, including an indexing
technique for completeness statements, have been accepted as a research
paper at ICWE 2016 (http://icwe2016.inf.usi.ch/), available for download
at http://bit.ly/1VOsRCH

Below are some ideas of how completeness could be useful to users:

Use Case 1: Rido is a geographer who would like to contribute to Wikidata
about the administrative divisions of regions. He cares deeply about data
quality, especially data completeness, and is collaborating with Simon,
another geographer. However, when completing data on Wikidata, there is
currently no way to mark which data is complete. Rido and Simon must make
these notes about completeness manually in, say, a Google Doc. Worse
still, the effort by Rido and Simon to complete data cannot be appreciated
by Wikidata users, since to the users' eyes there is no difference between
complete data and incomplete data on Wikidata.
Demo: Wikidata is complete for all administrative divisions of Saxony
(http://cool-wd.inf.unibz.it/?p=Q1202)

Use Case 2: Jen is a developer of a moviegoer application. She usually
integrates data from multiple sources, including Wikidata. If some movies
on Wikidata have completeness statements, she might optimize her
application to not search other data sources for those movies.
Demo: When her app asks COOL-WD at http://cool-wd.inf.unibz.it/?p=query
for the cast and screenwriters of the movie Before Sunset
(http://cool-wd.inf.unibz.it/?p=Q652186):

SELECT * WHERE {
  wd:Q652186 wdt:P161 ?c .
  wd:Q652186 wdt:P58 ?s
}

her app gets not only the query answers but also the completeness
information of her query.

We are looking forward to your feedback!

Best,
Fariz, Simon, Rido, and Werner
Free University of Bozen-Bolzano, Italy