Hey all,
Recently we wrote a paper discussing query performance for Wikidata, comparing different possible representations of the knowledge base in Postgres (a relational database), Neo4J (a graph database), Virtuoso (a SPARQL database) and BlazeGraph (the SPARQL database currently in use) over a set of equivalent benchmark queries.
The paper was recently accepted for presentation at the International Semantic Web Conference (ISWC) 2016. A pre-print is available here:
http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf
Of course there are some caveats with these results: perhaps other engines would perform better on different hardware, or with different styles of queries. For this reason we tried to use the most general types of queries possible, and tried to test different representations in different engines (we did not vary the hardware). Also, in the discussion of results, we tried to give a more general explanation of the trends, highlighting some strengths/weaknesses of each engine independently of the particular queries/data.
I think it's worth a glance for anyone who is interested in the technology/techniques needed to query Wikidata.
Cheers, Aidan
P.S., the paper above is a follow-up to previous work with Markus Krötzsch that focussed purely on RDF/SPARQL:
http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf
(I'm not sure if it was previously mentioned on the list.)
P.P.S., as someone who's somewhat of an outsider but who's been watching on for a few years now, I'd like to congratulate the community for making Wikidata what it is today. It's awesome work. Keep going. :)
Hi Aidan,
Thanks, very interesting, though I have not read the details yet.
I wonder if you have compared the actual query results you got from the different stores. As far as I know, Neo4J actually uses a very idiosyncratic query semantics that is neither compatible with SPARQL (not even on the BGP level) nor with SQL (even for SELECT-PROJECT-JOIN queries). So it is difficult to compare it to engines that use SQL or SPARQL (or any other standard query language, for that matter). In this sense, it may not be meaningful to benchmark it against such systems.
Regarding Virtuoso, the reason for not picking it for Wikidata was the lack of load-balancing support in the open source version, not the performance of a single instance.
Best regards,
Markus
Hey Markus,
On 06-08-2016 15:29, Markus Kroetzsch wrote:
I wonder if you have compared the actual query results you got from the different stores. As far as I know, Neo4J actually uses a very idiosyncratic query semantics that is neither compatible with SPARQL (not even on the BGP level) nor with SQL (even for SELECT-PROJECT-JOIN queries). So it is difficult to compare it to engines that use SQL or SPARQL (or any other standard query language, for that matter). In this sense, it may not be meaningful to benchmark it against such systems.
Yes, SPARQL has a homomorphism-based semantics (where a single result can repeat an edge or node an arbitrary number of times without problem), whereas I believe that Neo4J has a sort of pseudo-isomorphism, no-repeated-edge semantics in its evaluation (where a result cannot reuse the same edge twice, but can match the same node to multiple variables). Our queries were generated in such a way that no edges would be repeated. We also applied a distinct (set) semantics in all cases. For queries that repeat edges, there would indeed be a problem.
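To make the semantic difference concrete, here is a minimal sketch (wdt:P40, "child", is just an illustrative property):

  # Two triple patterns over the same kind of edge. Under SPARQL's
  # homomorphism-based semantics, a solution may map both patterns to
  # one and the same edge, so ?x and ?y can bind to the same node:
  SELECT DISTINCT ?x ?y WHERE {
    ?x wdt:P40 ?child .
    ?y wdt:P40 ?child .
  }
  # Under Neo4J's no-repeated-edge semantics, the analogous Cypher
  # pattern MATCH (x)-[:P40]->(c)<-[:P40]-(y) cannot map both edge
  # patterns to the same relationship, so solutions where ?x and ?y
  # denote the same node via a single edge are dropped.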
In terms of checking answers, we cross-referenced the number of results returned in each case. Where there were no errors (exceptions or timeouts), the result sizes overall were verified to be almost the same (something like 99.99%). The small differences were caused by things like BlazeGraph rejecting dates like February 30th that other engines didn't. We accepted this as close enough, i.e., not going to affect the performance results.
Our results and experiences were, in general, quite negative with respect to using Neo4J at the moment. This was somewhat counter to our initial expectations, in that we thought Wikidata would fit naturally with the property graph model that Neo4J uses, and also given the relative popularity of Neo4J more generally [1].
We encountered a lot of issues, not only in terms of performance, but also in terms of indexing and representation (limited support for lookups on edge information), query language features (no regular path queries (RPQs): only a star on simple labels), query planning (poor selectivity decisions when processing BGPs), etc. Our general impression is that Neo4J started with a specific use-case in mind (traversing nodes following paths) for which it is specialised, but it does not currently work well for general basic graph pattern matching, and hence does not match well with the Wikidata use-case.
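To illustrate the RPQ point above with a small sketch (the properties are chosen only as examples), SPARQL 1.1 property paths allow a star over a composite path expression:

  # Transitively follow the two-step path P131 then P706, i.e., a star
  # over a *concatenation* of edge labels - a regular path query:
  SELECT ?x ?place WHERE {
    ?x (wdt:P131/wdt:P706)* ?place .
  }
  # Cypher (circa 2016) only supported a star over simple relationship
  # types, e.g. (x)-[:P131*]->(place), not over composite paths.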
Regarding Virtuoso, the reason for not picking it for Wikidata was the lack of load-balancing support in the open source version, not the performance of a single instance.
This is good to know! We were admittedly curious about this.
On a side note, the results we presented for BlazeGraph could improve dramatically if one could isolate the queries that timed out. Once one query in a sequence timed out (we used server-side timeouts), we observed that a run of queries would then time out, possibly due to a locking problem or resource leak. Also, Daniel mentioned that, from discussion with the devs, they claim the current implementation works best on SSDs; our experiments were on a standard SATA hard disk.
Best, Aidan
[1] http://db-engines.com/en/ranking (anecdotal of course)
Hi!
On a side note, the results we presented for BlazeGraph could improve dramatically if one could isolate the queries that timed out. Once one query in a sequence timed out (we used server-side timeouts), we observed that a run of queries would then time out, possibly due to a locking problem or
Could you please give a bit more detail about this failure scenario? Is it that several queries are run in parallel and one query, timing out, hurts the performance of others? Does it happen even after the long query times out? Or was it a sequential run where, after one query timed out, the next query had worse performance than the same query when not preceded by the timing-out query, i.e., the timed-out query had a persistent effect beyond its initial run?
BTW, what was the timeout setting in your experiments? I see in the article that it says "timeouts are counted as 60 seconds" - does it mean that Blazegraph had internal timeout setting set to 60 seconds, or that the setting was different, but when processing results, the actual run time was replaced by 60 seconds?
Also, did you use analytic mode for the queries? https://wiki.blazegraph.com/wiki/index.php/QueryEvaluation#Analytic_Query_Ev... https://wiki.blazegraph.com/wiki/index.php/AnalyticQuery
This is the mode that is turned on automatically for the Wikidata Query Service, and it uses AFAIK different memory management which may influence how the cases you had problems with are handled.
I would appreciate as much detail as you could give on this, as this may also be useful on current query engine work. Also, if you're interested in the work done on WDQS, our experiences and the reasons for certain decisions and setups we did, I'd be glad to answer any questions.
resource leak. Also, Daniel mentioned that, from discussion with the devs, they claim the current implementation works best on SSDs; our experiments were on a standard SATA hard disk.
Yes, we run it on SSDs; judging from our tests on test servers running on virtualized SATA machines, the difference is indeed dramatic (orders of magnitude and more for some queries). Then again, this is highly unscientific, anecdotal evidence: we didn't make anything resembling formal benchmarks, since the test hardware is clearly inferior to the production hardware and is meant to be so. But the point is that an SSD is likely a must for Blazegraph to work well on this data set. It might also improve results for other engines, so I'm not sure how it influences the comparison between the engines.
On 06-08-2016 17:56, Stas Malyshev wrote:
On a side note, the results we presented for BlazeGraph could improve dramatically if one could isolate the queries that timed out. Once one query in a sequence timed out (we used server-side timeouts), we observed that a run of queries would then time out, possibly due to a locking problem or
Could you please give a bit more detail about this failure scenario? Is it that several queries are run in parallel and one query, timing out, hurts the performance of others? Does it happen even after the long query times out? Or was it a sequential run where, after one query timed out, the next query had worse performance than the same query when not preceded by the timing-out query, i.e., the timed-out query had a persistent effect beyond its initial run?
The latter was the case, yes. We ran the queries in a given batch sequentially (waiting for one to finish before the next was run), and when one timed out, the next would almost surely time out too; the engine would not recover.
We tried a few things on this, like waiting an extra 60 seconds before running the next query, and also changing memory configurations to avoid GC issues. I believe Daniel was also in contact with the devs. Ultimately we figured we probably couldn't resolve the issue without touching the source code, which would obviously not be fair.
BTW, what was the timeout setting in your experiments? I see in the article that it says "timeouts are counted as 60 seconds" - does it mean that Blazegraph had internal timeout setting set to 60 seconds, or that the setting was different, but when processing results, the actual run time was replaced by 60 seconds?
Yup, the settings are here:
http://users.dcc.uchile.cl/~dhernand/wquery/#configure-blazegraph
My understanding is that with those settings, we set an internal timeout on BlazeGraph of 60 seconds.
Also, did you use analytic mode for the queries? https://wiki.blazegraph.com/wiki/index.php/QueryEvaluation#Analytic_Query_Ev... https://wiki.blazegraph.com/wiki/index.php/AnalyticQuery
This is the mode that is turned on automatically for the Wikidata Query Service, and it uses AFAIK different memory management which may influence how the cases you had problems with are handled.
This I am not aware of. I would have to ask Daniel to be sure (I know he spent quite a lot of time playing around with different settings in the case of BlazeGraph).
I would appreciate as much detail as you could give on this, as this may also be useful on current query engine work. Also, if you're interested in the work done on WDQS, our experiences and the reasons for certain decisions and setups we did, I'd be glad to answer any questions.
I guess to start with you should have a look at the documentation here:
http://users.dcc.uchile.cl/~dhernand/wquery/
If there's some details missing from that, or if you have any further questions, I can put you in contact with Daniel who did all the scripts, ran the experiments, was in discussion with the devs, etc. in the context of BlazeGraph. (I don't think he's on this list.)
I could also perhaps ask him to try to create a minimal-ish test-case that reproduces the problem.
resource leak. Also, Daniel mentioned that, from discussion with the devs, they claim the current implementation works best on SSDs; our experiments were on a standard SATA hard disk.
Yes, we run it on SSDs; judging from our tests on test servers running on virtualized SATA machines, the difference is indeed dramatic (orders of magnitude and more for some queries). Then again, this is highly unscientific, anecdotal evidence: we didn't make anything resembling formal benchmarks, since the test hardware is clearly inferior to the production hardware and is meant to be so. But the point is that an SSD is likely a must for Blazegraph to work well on this data set. It might also improve results for other engines, so I'm not sure how it influences the comparison between the engines.
Yes, I think this was the message we got from the mailing lists when we were trying to troubleshoot these issues: it would be better to use an SSD. But we did not have one, and of course we didn't want to tailor our hardware to suit one particular engine.
Unfortunately I think all such empirical experiments are in some sense anecdotal; even ours. We cannot deduce, for example, what would happen, relatively speaking, on a machine with an SSD, or more cores, or with multiple instances. But still, one can learn a lot from good anecdotes.
Cheers, Aidan
Hi Aidan, Markus, Daniel and Wikidatans,
Emerging from this conversation on Wikidata query performance, and re cc World University and School/Wikidata, as a theoretical challenge: how would you suggest coding WUaS/Wikidata initially to be able to answer this question - "What are most impt stats issues in earth/space sci that journalists should understand?" - https://twitter.com/ReginaNuzzo/status/761179359101259776 - in many Wikipedia languages, including in American Sign Language (and other sign languages), as well as eventually in voice? (Regina Nuzzo is an associate professor at Gallaudet University for the hearing impaired, and has a Ph.D. in statistics from Stanford; Regina was born with hearing loss herself.)
I'm excited for when we can ask WUaS (or Wikipedia) this question, (or so many others) in voice combining, for example, CC WUaS Statistics, Earth, Space & Journalism wiki subject pages (with all their CC MIT OCW and Yale OYC) - http://worlduniversity.wikia.com/wiki/Subjects - in all of Wikipedia's 358 languages, again eventually in voice and in ASL/other sign languages (https://twitter.com/WorldUnivAndSch/status/761593842202050560 - see, too - https://www.wikidata.org/wiki/Wikidata:Project_chat#Schools).
Thanks for your paper, Aidan, as well. Would designing for deafness inform how you would approach "Querying Wikidata: Comparing SPARQL, Relational and Graph Databases" in any new ways?
Best, Scott
Hey Scott,
On 07-08-2016 16:15, Info WorldUniversity wrote:
Emerging from this conversation on Wikidata query performance, and re cc World University and School/Wikidata, as a theoretical challenge: how would you suggest coding WUaS/Wikidata initially to be able to answer this question - "What are most impt stats issues in earth/space sci that journalists should understand?" - https://twitter.com/ReginaNuzzo/status/761179359101259776 - in many Wikipedia languages, including in American Sign Language (and other sign languages), as well as eventually in voice? (Regina Nuzzo is an associate professor at Gallaudet University for the hearing impaired, and has a Ph.D. in statistics from Stanford; Regina was born with hearing loss herself.)
I fear we are nowhere near answering these sorts of questions (by we, I mean the computer science community, not just Wikidata). The main problem is that the question is inherently ill-defined/subjective: there is no correct answer here.
We would need to think about refining the question to something that is well-defined/objective, which is difficult even for a human. Perhaps we could consider a question such as: "what statistical methods (from a fixed list) have been used in scientific papers referenced by news articles that have been published in the past seven years by media companies with headquarters in the US?". Of course, even then there are still some minor subjective aspects, and Wikidata would not have the coverage to answer such a question.
The short answer is that machines are nowhere near answering these sorts of questions, no more than we are anywhere near taking a raw stream of binary data from an .mp4 video file and turning it into visual output. If we want to use machines to do useful things, we need to meet machines half-way. Part of that is formulating our questions in a way that machines can hope to process.
I'm excited for when we can ask WUaS (or Wikipedia) this question, (or so many others) in voice combining, for example, CC WUaS Statistics, Earth, Space & Journalism wiki subject pages (with all their CC MIT OCW and Yale OYC) - http://worlduniversity.wikia.com/wiki/Subjects - in all of Wikipedia's 358 languages, again eventually in voice and in ASL/other sign languages (https://twitter.com/WorldUnivAndSch/status/761593842202050560 - see, too - https://www.wikidata.org/wiki/Wikidata:Project_chat#Schools).
Thanks for your paper, Aidan, as well. Would designing for deafness inform how you would approach "Querying Wikidata: Comparing SPARQL, Relational and Graph Databases" in any new ways?
In the context of Wikidata, the question of language is mostly a question of interface (which is itself non-trivial). But to answer the question in whatever language or mode, the question first has to be answered in some (machine-friendly) language. This is the direction in which Wikidata goes: answers are first Q* identifiers, for which labels in different languages can be generated and used to render the answer in a given mode.
Likewise our work is on the level of generating those Q* identifiers, which can later be turned into tables, maps, sentences, bubbles, etc. I think the interface question is an important one, but a different one to that which we tackle.
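As a small sketch of that separation: the query body computes Q* identifiers, and a language-specific label is only attached at the end, e.g.:

  # The "answer" is a Q* identifier; the label is surface presentation.
  SELECT ?answer ?label WHERE {
    wd:Q298 wdt:P36 ?answer .     # e.g. the capital (P36) of Chile (Q298)
    ?answer rdfs:label ?label .
    FILTER(lang(?label) = "es")   # render the same answer in Spanish;
  }                               # swap in any other language tag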
Cheers, Aidan
Thanks, Aidan, Stas and Wikidatans,
Thanks for the feedback.
While I'm not yet a SQL/SPARQL programmer, I wonder if one could make each word in the question concrete, a Q identifier, and, with rank-able outcomes, create Wikidata Q-items/identifiers with attributes, possibly for each MIT OCW course in 7 languages and each Yale OYC course, as well as each WUaS subject page. It's the ranking of responses that would lessen the significance of the question's inherent ill-definition/subjectivity, I think (?) - and there might be other SQL/SPARQL-related approaches to this problem too.
Then hypothetically one could compare, for example, the list of MIT OCW
Earth, Atmosphere and Planetary Science courses (e.g. http://ocw.mit.edu/courses/earth-atmospheric-and-planetary-sciences/ and in Spanish - http://ocw.mit.edu/courses/translated-courses/spanish/#earth-atmospheric-and... and WUaS's Earth wiki subject - http://worlduniversity.wikia.com/wiki/Earth,_Atmospheric,_and_Planetary_Scie... ),
Statistics (e.g. http://ocw.mit.edu/courses/mathematics/ and in Spanish http://ocw.mit.edu/courses/translated-courses/spanish/#mathematics and WUaS's Statistics' wiki page http://worlduniversity.wikia.com/wiki/Statistics),
Space/Astronautics courses ( http://ocw.mit.edu/courses/aeronautics-and-astronautics/ and WUaS's Space wiki subject - http://worlduniversity.wikia.com/wiki/Space) with perhaps wiki-added WUaS
Journalism wiki subject page (e.g. http://ocw.mit.edu/courses/comparative-media-studies-writing/ and Journalism http://worlduniversity.wikia.com/wiki/Journalism and various forms of writing at WUaS http://worlduniversity.wikia.com/wiki/writing)
... with Q items and newspaper articles, and ask a variety of related questions of the results?
It would be some sort of correlation of the relative rankings of these outputs in response to the queries, which could somehow yield results paralleling Google Search results, for example. (Possible collaboration with Google Search could even eventually increase collaboration in voice on Android smartphones, and in Google group video Hangouts for ASL and other forms of sign language, for example.)
I haven't been able to find any Mandarin Chinese MIT OCW Statistics, Earth, Space, or Journalism courses - http://ocw.mit.edu/courses/translated-courses/traditional-chinese (accessible here http://ocw.mit.edu/courses/translated-courses/) yet, to speak of, although these MIT OCW Writing courses in Mandarin Chinese - http://ocw.mit.edu/courses/translated-courses/traditional-chinese/#comparati... - could work possibly for some of these hypothetical Wikidata query performance questions I'm seeking to explore - in this "if one builds it approach."
For example, and hypothetically, if there were 3 relatively recent and new MIT OCW Earth courses, and 2 new MIT OCW Statistics courses, and 10 journalism articles from the best newspapers and best academic journals in English on Earth/Space ( http://ocw.mit.edu/courses/aeronautics-and-astronautics/), and 4 in Chinese, and 5 in Spanish, for example, perhaps one could get helpful and useful outputs (that could eventually be asked for in voice/natural language processing) - by ranking relative importance partly according to the newness of the course, and getting objective relative outcomes as a group. The importance of a specific set of journals to a specific discipline/subject could be another source of importance ranking, for example - to highlight the operative item in this question, and add some further relative rankings as useful SQL coding possibilities.
Wikidata would generate or get a lot of valuable new fact-oriented and knowledge-oriented Q items/identifiers/attributes (for CC MIT OCW's 2300 courses in English, and the other courses in 6 other languages, and CC Yale OYC, as well as CC WUaS subjects, and with planning for major universities with these and a growing number of wiki subjects in all languages).
I have no idea yet how to write the SQL/SPARQL for this, but rankable Q* identifiers, new Q* identifiers and Google would be places I'd begin if I did. What do you think?
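As a purely hypothetical sketch of the kind of ranked query being described here (the class and subject QIDs below are illustrative; such course items would first have to be created in Wikidata):

  # Rank hypothetical course items on a subject by recency:
  SELECT ?course ?date WHERE {
    ?course wdt:P31 wd:Q600134 .   # instance of (P31) a "course" class (illustrative QID)
    ?course wdt:P921 wd:Q12483 .   # main subject (P921): statistics (illustrative QID)
    ?course wdt:P577 ?date .       # publication date (P577)
  }
  ORDER BY DESC(?date)
  LIMIT 10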
Cheers, Scott
Hey Scott,
While I'm not sure I can help with the details of the specific example you mention, the general area you are in -- dealing with answering questions posed in natural language -- is called "Question Answering".
When dealing with data in an RDF format (as per Wikidata), there's quite a lot of research done in the context of "Question Answering over Linked Data" (QALD).
The methods are not 100% accurate, but given data in a structured format (like RDF), with good labels, and assuming relatively simple objective questions (like "what age is the current Italian president?") that can be answered over the data, I believe these techniques can get quite good results. One can check out the QALD evaluation series for more details on how good [1].
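For instance, the Italian-president example is objective enough to map almost directly onto Wikidata's vocabulary; a minimal sketch:

  SELECT ?president ?age WHERE {
    wd:Q38 wdt:P35 ?president .               # Italy (Q38), head of state (P35)
    ?president wdt:P569 ?dob .                # date of birth (P569)
    BIND(YEAR(NOW()) - YEAR(?dob) AS ?age)    # rough age in years
  }
  # The hard part of QA is getting from the natural-language question
  # to a structured query like this one.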
I'm not really in that area myself, but perhaps the keywords might be useful if you want to read more.
Probably this will not be so helpful though if your focus is on using Wikidata to answer one specific question. :)
Cheers, Aidan
[1] http://qald.sebastianwalter.org/
Hi!
The paper was recently accepted for presentation at the International Semantic Web Conference (ISWC) 2016. A pre-print is available here:
http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf
Thank you for the link! It would be interesting to see actual data representations used for RDF (e.g. examples of the data or more detailed description). I notice that they differ substantially from what we use in the Wikidata Query service implementation, used with Blazegraph, and also some of the performance features we have implemented are probably not part of your implementation. In any case, it would be interesting to know the details of which RDF representations were used.
I also note that only statements and qualifiers are mentioned in most of the text, but very little mention of sitelinks and references. Were they part of the model too?
Due to the different RDF semantics, it would also be interesting to get more details about how the example queries were translated to the RDF representation(s) used in the article. Was it an automatic process, or were they translated manually? Is it possible to see them?
When working on the Query Service implementation, we considered a number of possible representations, with regard to both performance and semantic completeness. One of the conclusions was that achieving adequate semantic completeness and performance on a relational database, while allowing people to write complex queries (relatively) easily, is not possible, due to relational engines not being a good match for the hierarchical, graph-like structures in Wikidata.
It would be interesting to look at the Postgres implementation of the data model and queries to see whether your conclusions were different in this case.
Hi Stas,
[I'm sorry, I just realised this email was mysteriously sent before it was finished. I'll respond in a moment to your other mail.]
On 06-08-2016 17:38, Stas Malyshev wrote:
Thank you for the link! It would be interesting to see actual data representations used for RDF (e.g. examples of the data or more detailed description). I notice that they differ substantially from what we use in the Wikidata Query service implementation, used with Blazegraph, and also some of the performance features we have implemented are probably not part of your implementation. In any case, it would be interesting to know the details of which RDF representations were used.
There's a brief summary in the paper of the models used. As for all the "gory" details of how everything was generated, (hopefully) everything relevant supporting the paper should be available here:
http://users.dcc.uchile.cl/~dhernand/wquery/
The RDF representations are summarised in Figure 2. The code we used to generate those representations is mentioned here:
http://users.dcc.uchile.cl/~dhernand/wquery/#download-the-code
http://users.dcc.uchile.cl/~dhernand/wquery/#translate-the-data-to-rdf
Note we did not consider any "direct triples" in the representations since we felt this would effectively be "covered" by the Named Graphs representation. Rather than mixing direct triples and reified representations (like in the current service), we chose to keep them separate.
I also note that only statements and qualifiers are mentioned in most of the text, with very little mention of sitelinks and references. Were they part of the model too?
We just generalised sitelinks and references as a special type of qualifier (actually I don't think the paper mentions sitelinks but we mention this in the context of references).
Due to the different RDF semantics, it would also be interesting to get more details about how the example queries were translated to the RDF representation(s) used in the article. Was it an automatic process, or were they translated manually? Is it possible to see them?
I guess that depends on what you mean by "automatic" or "manual". :)
Automatic scripts were manually coded to convert from the JSON dump to each representation. The code is linked above.
We didn't put the dataset up (since the raw data and code are provided and can be used to generate it, and the RDF datasets are obviously large), but if you want a copy of the raw RDF data we generated, let me know.
When working on the Query Service implementation, we considered a number of possible representations, with regard to both performance and semantic completeness. One of the conclusions was that achieving adequate semantic completeness and performance on a relational database, while allowing people to write complex queries (relatively) easily, is not possible, because relational engines are not a good match for the hierarchical, graph-like structures in Wikidata.
I'm not sure I follow on this part, in particular on the part of "semantic completeness" and why this is hard to achieve in the context of relational databases. (I get the gist but don't understand enough to respond directly ... but perhaps below I can answer indirectly?)
It would be interesting to look at the Postgres implementation of the data model and queries to see whether your conclusions were different in this case.
A sketch of the relational schema is given in Figure 3 of the paper (not too dissimilar to the Named Graphs representation for RDF), and some more low-level details, including code and indexing, are in the link above. This was something we admittedly had to play around with quite a bit.
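For intuition, here is a minimal sketch of that kind of statement-centric layout in Postgres (hypothetical table/column/index names, not the exact schema from the paper; see the link above for the real thing):

  -- Each Wikidata statement gets a row; qualifiers hang off it.
  CREATE TABLE claims (
    statement_id text PRIMARY KEY,  -- statement identifier
    entity       text NOT NULL,     -- e.g. 'Q42'
    property     text NOT NULL,     -- e.g. 'P69'
    value        text NOT NULL      -- object value, serialised
  );

  CREATE TABLE qualifiers (
    statement_id text NOT NULL REFERENCES claims(statement_id),
    property     text NOT NULL,
    value        text NOT NULL
  );

  -- Indexes to support lookups by subject and by property/value:
  CREATE INDEX claims_entity_idx ON claims (entity, property);
  CREATE INDEX claims_prop_val_idx ON claims (property, value);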
Our general experiences of using Postgres were:
* It's very good for simple queries that involve a single join through a primary/foreign key (a caveat here: we used the "direct client" of Postgres, since we could not find an HTTP client like the ones the other engines provide).
* It's not so good when there are a lot of "self-joins" in the query (compared with Virtuoso), as in "bushy" queries (what we call "snowflake queries"), or when multiple values are given for a tuple (i.e., a single pattern contains multiple constants) but none of them is particularly selective on its own. We figure that Virtuoso perhaps has special optimisations for such self-joins, since they are much more common in an RDF/SPARQL scenario than in a relational/SQL one. (See the SQL sketch after this list.)
* Encoding object values with different datatypes (booleans, dates, etc.) was a pain. One option was to have separate tables/columns for each datatype, which would complicate queries and also leave the question of how to add calendars, precisions, etc. Another option was to use JSON strings to encode the values (the version of Postgres we used just considered these as strings, but I think the new version has some JSONB(?) support that could help get around this).
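As promised, a sketch of the self-join issue using the hypothetical tables above (the entity/property IDs are only illustrative): each extra triple pattern over the same statement table becomes another self-join, and with non-selective constants the planner has little to work with.

  -- "Snowflake" query: humans (P31 = Q5) with their alma mater (P69)
  -- and place of birth (P19); three patterns = three self-joins.
  SELECT c1.entity, c2.value AS alma_mater, c3.value AS birth_place
  FROM claims c1
  JOIN claims c2 ON c2.entity = c1.entity
  JOIN claims c3 ON c3.entity = c1.entity
  WHERE c1.property = 'P31' AND c1.value = 'Q5'
    AND c2.property = 'P69'
    AND c3.property = 'P19';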
Probably some of these issues could be resolved by playing around with the schema and/or the indexing but, perhaps relating to what you were saying, the result would be a pretty "exceptional" schema that is difficult to write queries for.
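On the datatype-encoding point above: with a newer Postgres one could try a JSONB column so that values stay self-describing; a minimal sketch (again with hypothetical names, and the JSON shape is only illustrative, not Wikidata's actual value format):

  -- Postgres 9.4+ JSONB keeps the datatype, calendar, precision, etc.
  -- together with the value, at the cost of weaker typing.
  CREATE TABLE claim_values (
    statement_id text PRIMARY KEY,
    value        jsonb NOT NULL
  );

  INSERT INTO claim_values VALUES (
    'Q42$example-statement',
    '{"type": "time", "time": "+1952-03-11T00:00:00Z", "precision": 11}'
  );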
A more general problem we encountered:
* SQL has poor support for arbitrary-length path queries (RPQs/property paths). You can do something like that using the WITH RECURSIVE feature, but this is a much more general feature that did not work well for Postgres in initial experiments. We don't really report the details of this in the paper, but our experience is that Postgres would not support these well. A lot of the examples we saw on the query service use the * or + feature of SPARQL property paths (esp. for types). This would be an issue in Postgres (perhaps it could be partially solved by materialising some transitive closures, e.g., on types, but something as flexible as property paths didn't seem feasible to us).
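To make the property-path point concrete, here is roughly what emulating a SPARQL path like "wd:Q146 wdt:P279* ?class" looks like with WITH RECURSIVE over the sketch tables above (hypothetical names again; this is the kind of construct that did not work well for us in initial experiments):

  -- All transitive superclasses of Q146 via P279 (subclass of).
  -- UNION (rather than UNION ALL) discards duplicate rows, which
  -- also stops the recursion on cycles.
  WITH RECURSIVE superclasses(class) AS (
      SELECT 'Q146'::text
    UNION
      SELECT c.value
      FROM claims c
      JOIN superclasses s ON c.entity = s.class
      WHERE c.property = 'P279'
  )
  SELECT class FROM superclasses;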
Cheers, Aidan
Hi!
There's a brief summary in the paper of the models used. In terms of all the "gory" details of how everything was generated, (hopefully) all of the relevant details supporting the paper should be available here:
Yes, the gory part is what I'm after :) Thank you, I'll read through it in the next couple of days and come back with any questions/comments I might have.
We just generalised sitelinks and references as a special type of qualifier (actually I don't think the paper mentions sitelinks but we mention this in the context of references).
Sitelinks cannot be qualifiers, since they belong to the entity, not to the statement. They could, I imagine, be considered a special case of properties (we do not do it, but in theory it is not impossible to represent them this way if one wanted to).
I am not sure how exactly one would make references a special case of qualifiers, as a qualifier has one (maybe complex) value, while each reference can have multiple properties and values; but I'll read through the details and the code before I say more about it - it's possible that I'll find my answers there.
I guess that depends on what you mean by "automatic" or "manual". :)
Automatic scripts were manually coded to convert from the JSON dump to each representation. The code is linked above.
Here I meant queries, not data.
I'm not sure I follow on this part, in particular on the part of "semantic completeness" and why this is hard to achieve in the context of relational databases. (I get the gist but don't understand enough to respond directly ... but perhaps below I can answer indirectly?)
Check out https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format
This is the range of data we need to represent and allow people to query. We found it hard to do this using the relational model. It's probably possible in theory, but producing efficient queries for it looked very challenging, unless we essentially duplicated the effort implemented in any graph database and used the DB itself only for the most basic storage needs. That's pretty much what the Titan + Cassandra combo did, which we initially used until Titan's devs were acquired by DataStax and the resulting uncertainty prompted us to look into different solutions. I imagine in theory it's also possible to create a Something+PostgreSQL combo doing the same, but PostgreSQL alone does not look sufficient.
In any case, dealing with things like property paths seems to be rather hard on an SQL-based platform, and they are practically a must for Wikidata querying.
- It's not so good when there are a lot of "self-joins" in the query (compared with Virtuoso), as in "bushy" queries (what we call "snowflake queries"), or when multiple values are given for a tuple (i.e., a single pattern contains multiple constants) but none of them is particularly selective on its own. We figure that Virtuoso perhaps has special optimisations for such self-joins, since they are much more common in an RDF/SPARQL scenario than in a relational/SQL one.
That confirms my intuition about it, thanks for the details :)
- Encoding object values with different datatypes (booleans, dates,
etc.) was a pain. One option was to have separate tables/columns for each datatype, which would complicate queries and also leave the question of how to add calendars, precisions, etc. Another option was to use JSON strings to encode the values (the version of Postgres we used just considered these as strings, but I think the new version has some JSONB(?) support that could help get around this).
That is also an issue. We have a number of specialty data types (e.g. dates extending billions of years into the future/past, coordinates on different globes, etc.) which may present a challenge unless the platform offers an easy way to encode custom types and deal with them. RDF has a rather flexible model here (basically string + type IRI), and Blazegraph does too; I'm not sure how accommodating the SQL databases would be.
Also, relational DBs mostly prefer a very predictable data-type model - i.e., the same column always contains the same type. This is obviously not true for any generic representation, and may not be true even in a very restricted context - e.g., the same property can have values of different types (rare, but it happens).
Of course, one can wrap everything into JSON - but then how do you index it, i.e., sort or range by date, if the date is a JSON object with no natural way to compare?
Given such issues, we stopped investigating the SQL direction very early (we also didn't have a lot of manpower to investigate every avenue completely, so we had to quickly evaluate a number of solutions, choose a preferred one, and concentrate there).
On 06-08-2016 18:48, Stas Malyshev wrote:
Hi!
There's a brief summary in the paper of the models used. In terms of all the "gory" details of how everything was generated, (hopefully) all of the relevant details supporting the paper should be available here:
Yes, the gory part is what I'm after :) Thank you, I'll read through it in the next couple of days and come back with any questions/comments I might have.
Okay! :)
We just generalised sitelinks and references as a special type of qualifier (actually I don't think the paper mentions sitelinks but we mention this in the context of references).
Sitelinks cannot be qualifiers, since they belong to the entity, not to the statement. They could, I imagine, be considered a special case of properties (we do not do it, but in theory it is not impossible to represent them this way if one wanted to).
Ah yes, I think in that context our results should be considered as applying to a "core" of Wikidata, in the sense that we did not directly consider somevalue, novalue, ranks, etc. (I'm not certain in the case of sitelinks; I do not remember discussing those). This is indeed all doable in RDF without too much bother (I think), but would be much more involved for the relational database or for Neo4J.
I am not sure how exactly one would make references a special case of qualifiers, as a qualifier has one (maybe complex) value, while each reference can have multiple properties and values; but I'll read through the details and the code before I say more about it - it's possible that I'll find my answers there.
I guess that depends on what you mean by "automatic" or "manual". :)
Automatic scripts were manually coded to convert from the JSON dump to each representation. The code is linked above.
Here I meant queries, not data.
Ah, so the query generation process is also described in the documentation above. The core idea was to first create "subgraphs" of the data matching the patterns we wanted to generate queries for, and then, using a random process, turn some constants into variables and select some variables to project. In summary, the queries were automatically generated from the data so as to ensure non-empty results.
I'm not sure I follow on this part, in particular on the part of "semantic completeness" and why this is hard to achieve in the context of relational databases. (I get the gist but don't understand enough to respond directly ... but perhaps below I can answer indirectly?)
Check out https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format
This is the range of data we need to represent and allow people to query. We found it hard to do this using the relational model. It's probably possible in theory, but producing efficient queries for it looked very challenging, unless we essentially duplicated the effort implemented in any graph database and used the DB itself only for the most basic storage needs. That's pretty much what the Titan + Cassandra combo did, which we initially used until Titan's devs were acquired by DataStax and the resulting uncertainty prompted us to look into different solutions. I imagine in theory it's also possible to create a Something+PostgreSQL combo doing the same, but PostgreSQL alone does not look sufficient.
Yes, this is something we did look into in some detail, in the sense that we initially had a rather complex relational structure encoding all the features mentioned (storing essentially everything from the JSON dumps), but the structure was so complex [1] that we decided to simplify and consider the final models described in the paper ... especially given the prospect of trying to do something similar in Neo4J afterwards. :)
In any case, dealing with things like property paths seems to be rather hard on an SQL-based platform, and they are practically a must for Wikidata querying.
Yep, agreed.
- Encoding object values with different datatypes (booleans, dates,
etc.) was a pain. One option was to have separate tables/columns for each datatype, which would complicate queries and also leave the question of how to add calendars, precisions, etc. Another option was to use JSON strings to encode the values (the version of Postgres we used just considered these as strings, but I think the new version has some JSONB(?) support that could help get around this).
That is also an issue. We have a number of specialty data types (e.g. dates extending billions of years into the future/past, coordinates on different globes, etc.) which may present a challenge unless the platform offers an easy way to encode custom types and deal with them. RDF has a rather flexible model here (basically string + type IRI), and Blazegraph does too; I'm not sure how accommodating the SQL databases would be.
Also, relational DBs mostly prefer a very predictable data-type model - i.e., the same column always contains the same type. This is obviously not true for any generic representation, and may not be true even in a very restricted context - e.g., the same property can have values of different types (rare, but it happens).
Of course, one can wrap everything into JSON - but then how do you index it, i.e., sort or range by date, if the date is a JSON object with no natural way to compare?
Given such issues, we stopped investigating the SQL direction very early (we also didn't have a lot of manpower to investigate every avenue completely, so we had to quickly evaluate a number of solutions, choose a preferred one, and concentrate there).
+1 to all of this, yes! We had similar experiences.
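That said, one partial workaround worth noting: Postgres can build an expression index over a field extracted from JSONB, which at least enables sorting and range scans. A hedged sketch using the hypothetical claim_values table from my earlier sketch; note it only works while the strings compare correctly lexicographically (fixed-width years), so it breaks down exactly for the "billions of years" dates you mention:

  -- Index the extracted "time" field of the JSONB value.
  CREATE INDEX claim_time_idx ON claim_values ((value ->> 'time'));

  -- Range scan and sort over the extracted (text) field:
  SELECT statement_id
  FROM claim_values
  WHERE value ->> 'time' >= '+1900' AND value ->> 'time' < '+2000'
  ORDER BY value ->> 'time';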
Cheers, Aidan
[1] https://github.com/eczerega/wikidatarelational/blob/master/Modelo/Modelo%20V...
Hi Aidan!
Thank you for this very interesting research!
Query performance was of course one of the key factors in selecting the technology to use for the query service. However, it was only one among several. The Wikidata use case is different from most common scenarios in some ways, for instance:
* We cannot optimize for specific queries, since users are free to submit any query they like.
* The data representation needs to be intuitive enough for (technically inclined) casual users to grasp and write queries.
* The data doesn't hold still; it needs to be updated continuously, multiple times per second.
* Our data types are more complex than usual - for instance, we support multiple calendar models for dates, and not only values but also different accuracies, up to billions of years; we use "quantities" with unit and uncertainty instead of plain numbers, etc.
My point is that, if we had a static data set and a handful of known queries to optimize for, we could have set up a relational or graph database that would be far more performant than what we have now. The big advantage of Blazegraph is its flexibility, not raw performance.
It might be interesting to you to know that we initially started to implement the query service against a graph database, Titan - which was discontinued while we were still getting up to speed. Luckily this happened early on; it would have been quite painful to switch after we had gone live.
-- daniel
Hey Daniel,
On 07-08-2016 7:03, Daniel Kinzler wrote:
Hi Aidan!
Thank you for this very interesting research!
Query performance was of course one of the key factors in selecting the technology to use for the query service. However, it was only one among several. The Wikidata use case is different from most common scenarios in some ways, for instance:
- We cannot optimize for specific queries, since users are free to submit any query they like.
- The data representation needs to be intuitive enough for (technically inclined) casual users to grasp and write queries.
- The data doesn't hold still; it needs to be updated continuously, multiple times per second.
- Our data types are more complex than usual - for instance, we support multiple calendar models for dates, and not only values but also different accuracies, up to billions of years; we use "quantities" with unit and uncertainty instead of plain numbers, etc.
My point is that, if we had a static data set and a handful of known queries to optimize for, we could have set up a relational or graph database that would be far more performant than what we have now. The big advantage of Blazegraph is its flexibility, not raw performance.
Understood. :) Taking everything mentioned above into account, and based on our own experiences with various experiments in the context of Wikidata and other works, I think the choice to use RDF/SPARQL was the right one (though I would be biased on this issue since I've worked in the area for a long time). I guess the more difficult question, then, is which RDF/SPARQL implementation to choose (since any such implementation should cover at least points 1, 2 and 4 in a similar way), which in turn reduces to the distinguishing questions of performance, licensing, distribution, maturity, tech support, development community, and non-standard features (keyword search), etc.
On raw query performance, based on what I have personally seen, I think Virtuoso probably has the lead at the moment, in that it has consistently outperformed other SPARQL engines, not only in our Wikidata experiments but also in benchmarks by other authors. However, taking all the other points into account, particularly licensing, Blazegraph does seem to have been a sound choice. And the current query service seems a solid base to work forward from.
It might be interesting to you to know that we initially started to implement the query service against a graph database, Titan - which was discontinued while we were still getting up to speed. Luckily this happened early on; it would have been quite painful to switch after we had gone live.
This is indeed good to know! (We considered other graph database engines, but we did not think Gremlin was a good fit for what Wikidata is trying to achieve, in the sense of being too "imperative": though one can indeed express something like BGPs in the language, it's not particularly easy, nor intuitive.)
Cheers, Aidan
Hi!
the area for a long time). I guess the more difficult question, then, is which RDF/SPARQL implementation to choose (since any such implementation should cover at least points 1, 2 and 4 in a similar way), which in turn reduces to the distinguishing questions of performance, licensing, distribution, maturity, tech support, development community, and non-standard features (keyword search), etc.
We indeed had a giant spreadsheet in which a dozen potential solutions (some of them were eliminated very early, but some put up a robust fight :) were evaluated on about 50 criteria. Of course, some of the criteria were hard to formalize, and some of the numbers were a bit arbitrary, but that's what we did, and Blazegraph came out with the best score.
On 07.08.2016 22:58, Stas Malyshev wrote:
Hi!
the area for a long time). I guess the more difficult question, then, is which RDF/SPARQL implementation to choose (since any such implementation should cover at least points 1, 2 and 4 in a similar way), which in turn reduces to the distinguishing questions of performance, licensing, distribution, maturity, tech support, development community, and non-standard features (keyword search), etc.
We indeed had a giant spreadsheet in which a dozen potential solutions (some of them were eliminated very early, but some put up a robust fight :) were evaluated on about 50 criteria. Of course, some of the criteria were hard to formalize, and some of the numbers were a bit arbitrary, but that's what we did, and Blazegraph came out with the best score.
If you want to go into Wikidata history, here is the "giant spreadsheet" Stas was referring to:
https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b...
Some criteria there are obviously rather vague and subjective, but even disregarding the scoring, it shows which systems were looked at.
Markus