Hey Scott,
While I'm not sure I can help with the details of the specific example
you mention, the general area you are in -- dealing with
answering questions posed in natural language -- is called "Question
Answering".
When dealing with data in an RDF format (as per Wikidata), there's quite
a lot of research done in the context of "Question Answering over Linked
Data" (QALD).
The methods are not 100% accurate, but given data in a structured format
(like RDF), with good labels, and assuming relatively simple objective
questions (like "what age is the current Italian president?") that can
be answered over the data, I believe these techniques can get quite good
results. One can check out the QALD evaluation series for more details
on how good [1].
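To make the "simple objective question" case concrete, here is a minimal sketch of how the Italian-president example might be turned into a structured query. It assumes the Wikidata identifiers Q38 (Italy), P35 (head of state) and P569 (date of birth), and the helper names build_query and age_on are hypothetical. It only builds the SPARQL text and computes an age from a birth date; it does not contact any endpoint.

```python
from datetime import date

# Assumed identifiers: Q38 = Italy, P35 = head of state, P569 = date of birth.
QUERY_TEMPLATE = """
SELECT ?person ?personLabel ?birth WHERE {{
  wd:{country} wdt:{role} ?person .
  ?person wdt:{birth} ?birth .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def build_query(country="Q38", role="P35", birth="P569"):
    """Return SPARQL text asking for the office holder and their birth date."""
    return QUERY_TEMPLATE.format(country=country, role=role, birth=birth).strip()


def age_on(birth, today):
    """Whole years elapsed between a birth date and a reference date."""
    years = today.year - birth.year
    # Subtract one if the birthday has not yet occurred in the reference year.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


if __name__ == "__main__":
    print(build_query())
    # Illustrative birth date only; a real answer would come from the query result.
    print(age_on(date(1941, 7, 23), date(2016, 8, 7)))
```

The hard part that QALD systems address is the step before this: mapping the words of the question onto those identifiers automatically.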
I'm not really in that area myself, but perhaps the keywords might be
useful if you want to read more.
Probably this will not be so helpful though if your focus is on using
Wikidata to answer one specific question. :)
Cheers,
Aidan
[1]
Thanks, Aidan, Stas and Wikidatans,
Thanks for the feedback.
While I'm not yet a SQL/SPARQL programmer, I wonder if one could make
each word in the question concrete, as a Q identifier with rankable
outcomes, and create Wikidata Q-items/identifiers with attributes,
possibly for each MIT OCW course (in 7 languages), each Yale OYC
course, and each WUaS subject page. It's the ranking of
responses that would lessen the significance of the question's
inherent ill-definition/subjectivity, I think (?) - and there might be
other SQL/SPARQL related approaches to this problem too.
Then hypothetically one could compare, for example, the list of MIT OCW
Earth, Atmosphere and Planetary Science courses (e.g.
http://ocw.mit.edu/courses/earth-atmospheric-and-planetary-sciences/ and
in Spanish -
http://ocw.mit.edu/courses/translated-courses/spanish/#earth-atmospheric-an…
and WUaS's Earth wiki subject -
http://worlduniversity.wikia.com/wiki/Earth,_Atmospheric,_and_Planetary_Sci…),
Statistics (e.g.
http://ocw.mit.edu/courses/mathematics/ and in Spanish
http://ocw.mit.edu/courses/translated-courses/spanish/#mathematics and
WUaS's Statistics' wiki page
http://worlduniversity.wikia.com/wiki/Statistics),
Space/Astronautics courses
(http://ocw.mit.edu/courses/aeronautics-and-astronautics/ and WUaS's
Space wiki subject -
http://worlduniversity.wikia.com/wiki/Space) with
perhaps wiki-added WUaS
Journalism wiki subject page (e.g.
http://ocw.mit.edu/courses/comparative-media-studies-writing/ and
Journalism
http://worlduniversity.wikia.com/wiki/Journalism and various
forms of writing at WUaS
http://worlduniversity.wikia.com/wiki/writing)
... with Q items, newspaper articles and ask a variety of related
questions of the results?
It would be some sort of correlation of the relative rankings of these
outputs in response to the queries, which could yield results
somehow paralleling Google Search results, for example. (Possible
collaboration with Google Search could even lead eventually to
collaboration in voice on Android smartphones, and in Google group video
Hangouts for ASL and other forms of sign language, for example.)
I haven't been able to find any Mandarin Chinese MIT OCW Statistics,
Earth, Space, or Journalism courses -
http://ocw.mit.edu/courses/translated-courses/traditional-chinese
(accessible here
http://ocw.mit.edu/courses/translated-courses/) yet, to
speak of, although these MIT OCW Writing courses in Mandarin Chinese -
http://ocw.mit.edu/courses/translated-courses/traditional-chinese/#comparat…
- could possibly work for some of these hypothetical Wikidata query
performance questions I'm seeking to explore - in this "if one builds
it" approach.
For example, and hypothetically, if there were 3 relatively recent and
new MIT OCW Earth courses, and 2 new MIT OCW Statistics courses, and 10
journalism articles from best newspapers and best academic journals in
English on Earth/Space
(http://ocw.mit.edu/courses/aeronautics-and-astronautics/), and 4 in
Chinese, and 5 in Spanish, for example, perhaps one could get helpful
and useful outputs (that could eventually be asked for in voice/natural
language processing) - by ranking relative importance partly according
to the newness of the course, and getting objective relative outcomes as
a group. The importance of a specific set of journals to a specific
discipline / subject could be another source of ranking of importance,
for example - to highlight the operative item in this question, and add
some further relative rankings as useful SQL coding possibilities.
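As a rough illustration of the ranking idea above, the sketch below scores items by combining newness with a per-venue importance weight. Everything here is an assumption made for illustration: the Item fields, the placeholder identifiers, the weights and the scoring formula are hypothetical and are not part of Wikidata or of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    ident: str           # placeholder identifier, not a real Wikidata Q item
    year: int            # year the course or article appeared
    venue_weight: float  # hypothetical importance of the source, in [0, 1]


def rank(items, current_year, recency_weight=0.5):
    """Order items from most to least important under a simple blended score."""

    def score(item):
        age = max(current_year - item.year, 0)
        recency = 1.0 / (1 + age)  # 1.0 for this year, decaying with age
        return recency_weight * recency + (1 - recency_weight) * item.venue_weight

    return sorted(items, key=score, reverse=True)


EXAMPLE_ITEMS = [
    Item("course-a", 2016, 0.2),
    Item("article-b", 2010, 0.9),
    Item("course-c", 2015, 0.5),
]

if __name__ == "__main__":
    for item in rank(EXAMPLE_ITEMS, current_year=2016):
        print(item.ident)
```

Changing recency_weight shifts the balance between newness and source importance, which is where the subjective choices would end up.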
Wikidata would generate or get a lot of valuable new fact-oriented and
knowledge-oriented Q items/identifiers/attributes (for CC MIT OCW's 2300
courses in English, the other courses in 6 other languages, CC
Yale OYC, and CC WUaS subjects, with planning for major
universities with these and a growing number of wiki subjects in all
languages).
I have no idea yet how to write the SQL/SPARQL for this, but rankable Q*
identifiers, new Q* identifiers and Google would be places I'd begin if
I did. What do you think?
Cheers, Scott
On Sun, Aug 7, 2016 at 2:02 PM, Aidan Hogan <aidhog(a)gmail.com> wrote:
Hey Scott,
On 07-08-2016 16:15, Info WorldUniversity wrote:
Hi Aidan, Markus, Daniel and Wikidatans,
As an emergence out of this conversation on Wikidata query performance,
and re cc World University and School/Wikidata, as a theoretical
challenge, how would you suggest coding WUaS/Wikidata initially to be
able to answer this question - "What are most impt stats issues in
earth/space sci that journalists should understand?" -
https://twitter.com/ReginaNuzzo/status/761179359101259776 - in many
Wikipedia languages including however in American Sign Language (and
other sign languages), as well as eventually in voice. (Regina Nuzzo is
an associate Professor at Gallaudet University for the hearing
impaired/deafness, and has a Ph.D. in statistics from Stanford; Regina
was born with hearing loss herself).
I fear we are nowhere near answering these sorts of questions (by
we, I mean the computer science community, not just Wikidata). The
main problem is that the question is inherently
ill-defined/subjective: there is no correct answer here.
We would need to think about refining the question to something that
is well-defined/objective, which even as a human is difficult.
Perhaps we could consider a question such as: "what statistical
methods (from a fixed list) have been used in scientific papers
referenced by news articles that have been published in the past seven
years by media companies that have their headquarters in the US?".
Of course even then, there are still some minor subjective aspects,
and Wikidata would not have the coverage to answer such a question.
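To show why the refined question is machine-answerable in principle, here is a small sketch that evaluates it over toy in-memory data. The Paper and Article classes, the fixed method list and the sample records are all invented for illustration; as noted above, Wikidata would not actually have this coverage.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    title: str
    methods: frozenset  # statistical methods the paper reports using


@dataclass(frozen=True)
class Article:
    publisher_country: str  # country of the media company's headquarters
    published: date
    cites: tuple            # Paper objects referenced by the article


# Hypothetical fixed list of methods we are willing to report.
FIXED_METHODS = frozenset({"linear regression", "t-test", "bootstrap"})


def methods_used(articles, today, years=7, country="US"):
    """Methods from the fixed list used in papers cited by recent articles."""
    cutoff = (today.year - years, today.month, today.day)
    found = set()
    for article in articles:
        pub = article.published
        recent = (pub.year, pub.month, pub.day) >= cutoff
        if recent and article.publisher_country == country:
            for paper in article.cites:
                found |= paper.methods & FIXED_METHODS
    return sorted(found)


# Toy data, invented for this example.
P1 = Paper("A", frozenset({"t-test", "kriging"}))
P2 = Paper("B", frozenset({"bootstrap"}))
P3 = Paper("C", frozenset({"linear regression"}))
EXAMPLE_ARTICLES = [
    Article("US", date(2014, 5, 1), (P1,)),
    Article("US", date(2005, 1, 1), (P2,)),  # too old
    Article("DE", date(2015, 3, 3), (P3,)),  # publisher not in the US
]

if __name__ == "__main__":
    print(methods_used(EXAMPLE_ARTICLES, today=date(2016, 8, 7)))
```

Every condition in the question becomes an explicit filter, which is what "well-defined/objective" buys; the remaining subjectivity sits in the choice of the fixed list.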
The short answer is that machines are nowhere near answering these
sorts of questions, no more than we are anywhere near taking a raw
stream of binary data from an .mp4 video file and turning it into
visual output. If we want to use machines to do useful things, we
need to meet machines half-way. Part of that is formulating our
questions in a way that machines can hope to process.
I'm excited for when we can ask WUaS (or Wikipedia) this question, (or
so many others) in voice combining, for example, CC WUaS Statistics,
Earth, Space & Journalism wiki subject pages (with all their CC MIT OCW
and Yale OYC) - http://worlduniversity.wikia.com/wiki/Subjects - in all
of Wikipedia's 358 languages, again eventually in voice and in ASL/other
sign languages
(https://twitter.com/WorldUnivAndSch/status/761593842202050560 - see,
too - https://www.wikidata.org/wiki/Wikidata:Project_chat#Schools).
Thanks for your paper, Aidan, as well. Would designing for deafness
inform how you would approach "Querying Wikidata: Comparing SPARQL,
Relational and Graph Databases" in any new ways?
In the context of Wikidata, the question of language is mostly a
question of interface (which is itself non-trivial). But to answer
the question in whatever language or mode, the question first has to
be answered in some (machine-friendly) language. This is the
direction in which Wikidata goes: answers are first Q* identifiers,
for which labels in different languages can be generated and used to
generate a mode.
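A minimal sketch of that separation, assuming a query has already returned Q identifiers: the answer stays language-neutral, and labels are looked up only at presentation time. The label table below is a small hand-written stand-in (Q2 is Earth and Q38 is Italy on Wikidata), and the render function and its fallback rule are hypothetical.

```python
# Hand-written stand-in for labels that would normally come from Wikidata.
LABELS = {
    "Q2": {"en": "Earth", "es": "Tierra", "de": "Erde"},
    "Q38": {"en": "Italy", "es": "Italia"},
}


def render(qids, lang, fallback="en"):
    """Turn language-neutral identifiers into labels for one language.

    Falls back to another language, then to the bare identifier,
    when no label is available.
    """
    rendered = []
    for qid in qids:
        labels = LABELS.get(qid, {})
        rendered.append(labels.get(lang) or labels.get(fallback) or qid)
    return rendered


if __name__ == "__main__":
    answer = ["Q2", "Q38"]       # what the query engine produces
    print(render(answer, "es"))  # what a Spanish-language interface shows
    print(render(answer, "de"))  # German, with fallback where a label is missing
```

The same list of identifiers could equally feed a table, a map, speech output or a signed rendering; only the last step changes.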
Likewise our work is on the level of generating those Q*
identifiers, which can be later turned into tables, maps, sentences,
bubbles, etc. I think the interface question is an important one,
but a different one to that which we tackle.
Cheers,
Aidan
On Sat, Aug 6, 2016 at 12:29 PM, Markus Kroetzsch
<markus.kroetzsch(a)tu-dresden.de> wrote:
Hi Aidan,
Thanks, very interesting, though I have not read the details yet.
I wonder if you have compared the actual query results you got from
the different stores. As far as I know, Neo4J actually uses a very
idiosyncratic query semantics that is neither compatible with SPARQL
(not even on the BGP level) nor with SQL (even for
SELECT-PROJECT-JOIN queries). So it is difficult to compare it to
engines that use SQL or SPARQL (or any other standard query
language, for that matter). In this sense, it may not be meaningful
to benchmark it against such systems.
Regarding Virtuoso, the reason for not picking it for Wikidata was
the lack of load-balancing support in the open source version, not
the performance of a single instance.
Best regards,
Markus
On 06.08.2016 18:19, Aidan Hogan wrote:
Hey all,
Recently we wrote a paper discussing the query performance for Wikidata,
comparing different possible representations of the knowledge-base in
Postgres (a relational database), Neo4J (a graph database), Virtuoso (a
SPARQL database) and BlazeGraph (the SPARQL database currently in use)
for a set of equivalent benchmark queries.
The paper was recently accepted for presentation at the International
Semantic Web Conference (ISWC) 2016. A pre-print is available here:
http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf
Of course there are some caveats with these results in the sense that
perhaps other engines would perform better on different hardware, or
different styles of queries: for this reason we tried to use the most
general types of queries possible and tried to test different
representations in different engines (we did not vary the hardware).
Also in the discussion of results, we tried to give a more general
explanation of the trends, highlighting some strengths/weaknesses for
each engine independently of the particular queries/data.
I think it's worth a glance for anyone who is interested in the
technology/techniques needed to query Wikidata.
Cheers,
Aidan
P.S., the paper above is a follow-up to a previous work with Markus
Krötzsch that focussed purely on RDF/SPARQL:
http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf
(I'm not sure if it was previously mentioned on the list.)
P.P.S., as someone who's somewhat of an outsider but who's been watching
on for a few years now, I'd like to congratulate the community for
making Wikidata what it is today. It's awesome work. Keep going. :)
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
- Scott MacLeod - Founder & President
- http://worlduniversityandschool.org
- 415 480 4577
- PO Box 442, (86 Ridgecrest Road), Canyon, CA 94516
- World University and School - like Wikipedia with best STEM-centric
OpenCourseWare - incorporated as a nonprofit university and school in
California, and is a U.S. 501 (c) (3) tax-exempt educational organization.
World University and School is sending you this because of your interest
in free, online, higher education. If you don't want to receive these,
please reply with 'unsubscribe' in the body of the email, leaving the
subject line intact. Thank you.