Hi all,
I am happy to announce a new tool [1], written by Serge Stratan, which allows you to browse the taxonomy (subclass of & instance of relations) between Wikidata's most important class items. For example, here is the Wikidata taxonomy for Pizza (discussed recently on this list):
http://sergestratan.bitbucket.org/?draw=true&optid=s0&item=177,2095,7802,28877,35120,223557,386724,488383,666242,736427,746549,2424752,15222213,16686448
== What you see there ==
Solid green lines mean "subclass of" relations (subclasses are lower), while dashed purple lines are "instance of" relations (instances are lower). Drag and zoom the view as usual. Hover over items for more information. Click on arrows with numbers to display upper or lower neighbours. Right-click on classes to get more options.
The sidebar on the left shows statistics and presumed problems in the data (redundancies and likely errors). You can select a report type to see the reports, and click on any line to show the error. If you search for a class in the search field, the errors will be narrowed down to issues related to the taxonomy of this class.
The toolbar at the top has options to show and hide items based on the current selection (left click on any box).
Edges in red are the wrong way around (top to bottom). This occurs only when there are cycles in the "taxonomy".
== Micro tutorial ==
(1) Enter "Unicorn" in the search box, press return.
(2) Zoom out a bit by scrolling your mouse/touchpad.
(3) Click on the "Unicorn" item box. It becomes blue (selected).
(4) Click "Expand up" in the toolbar at the top.
(5) Zoom out to see the taxonomy of unicorn.
(6) Find the class "Fictional Horse" (directly above unicorn) and click its downwards arrow labelled "3" to see all three children items of "fictional horse".
(7) Click the share button on the top right to get a link to this view.
You can also create your own share link manually by just changing the Qids in the URL as you like.
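As a rough illustration of building such a link by hand, here is a minimal Python sketch. It assumes that the query parameters seen in the example link above (draw, optid, item) are all that is needed and that "item" takes a comma-separated list of numeric ids; the function name build_share_url is made up for this example, and the meaning of "optid=s0" is simply copied from the example link, not documented here.

```python
from urllib.parse import urlencode

# Base address taken from the announcement; everything else is an assumption.
BASE_URL = "http://sergestratan.bitbucket.org/"


def build_share_url(qids, optid="s0"):
    """Build a share link from item ids given as "Q177" or 177 (hypothetical helper)."""
    # The example link uses bare numbers, so strip a leading "Q" if present.
    numbers = [str(q).lstrip("Qq") for q in qids]
    if not all(n.isdigit() for n in numbers):
        raise ValueError("item ids must look like Q177 or 177")
    # Keep commas unescaped so the link looks like the one in the announcement.
    query = urlencode(
        {"draw": "true", "optid": optid, "item": ",".join(numbers)},
        safe=",",
    )
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    print(build_share_url(["Q177", 2095]))
```

Whether the tool accepts other parameter combinations is not something this sketch can tell you; it only reproduces the shape of the link shown above.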
== Status and limitations ==
This is a prototype and it still has some limits:
* It only shows "proper" classes that have at least one instance or subclass. This is to reduce the overall data size and load time.
* The data is based on dumps (the date is shown on the right). It is not a live view.
* The layout is sometimes too dense. You can find a "hidden" option to make it more spacy behind the sidebar (click "Sidebar" to see it). This helps to disentangle larger graphs.
* There are some minor bugs in the UI. You sometimes need to click more than once until the right thing happens.
* The help page at http://sergestratan.bitbucket.org/howtouse.html does not explain everything in detail yet.
It is planned to work on some of these limitations in the future.
The hope is that this tool will reveal many errors in Wikidata's taxonomy that are otherwise hard to detect. For example, you can see easily that every "Ship" is an "Event" in Wikidata, that every "Hobbit" is a "Fantasy Race", and that every "Monday" is both a "Mathematical object" and a "Unit of measurement".
Feedback is welcome (on the tool; better start new threads for feedback on the Wikidata taxonomy ;-),
Markus
[1] http://sergestratan.bitbucket.org
Great tool! The error detection is precious!
2015-10-22 17:31 GMT+02:00 Markus Kroetzsch markus.kroetzsch@tu-dresden.de :
-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
I’m constantly getting 500 errors.
On Oct 22, 2015, at 10:25 AM, Thomas Douillard thomas.douillard@gmail.com wrote:
Great tool! The error detection is precious!
Dario Taraborelli Head of Research, Wikimedia Foundation wikimediafoundation.org http://wikimediafoundation.org/ • nitens.org http://nitens.org/ • @readermeter http://twitter.com/readermeter
Works for me now.
This is fantastic. :)
Please consider adding it to Hay's tools directory, so more people can discover it. https://tools.wmflabs.org/hay/directory/?search=taxonomy#/search/taxonomy
A.
On 22.10.2015 19:29, Dario Taraborelli wrote:
I’m constantly getting 500 errors.
I also observed short outages in the past, and I sometimes had to run a request twice to get an answer. It seems that the hosting on bitbucket is not very reliable. At the moment, this is still a first preview of the tool without everything set up as it should be. The tool should certainly move to Wikimedia labs in the future.
Markus
I am having the same kinds of 500 problems. Bitbucket is generally suffering today: http://status.bitbucket.org
On 22.10.2015 21:49, Benjamin Good wrote:
I am having the same kinds of 500 problems. Bitbucket is generally suffering today: http://status.bitbucket.org
Indeed, they had a site-wide issue. Seems to be fixed now.
Markus
On 23.10.2015 09:12, Markus Krötzsch wrote:
On 22.10.2015 21:49, Benjamin Good wrote:
I am having the same kinds of 500 problems. Bitbucket is generally suffering today: http://status.bitbucket.org
Indeed, they had a site-wide issue. Seems to be fixed now.
I was rejoicing too early here ... http://status.bitbucket.org/ now reports "major outage" for most services. :-(
Maybe we should move to labs earlier ...
Markus
Hoi, The problem with tools like this is that they get only a moment of attention. Particularly when they are stand-alone and not integrated, people will lose interest in them.
Would it be an option to host this tool on Labs? Thanks, GerardM
Integration is the purpose of templates like Q' with Reasonator, or P' and Item Documentation; I don't know if they are actually used. Templates like Query have had limited success, however.
On 23.10.2015 11:16, Gerard Meijssen wrote:
Hoi, The problem with tools like this is that they get only a moment of attention. Particularly when they are stand-alone and not integrated, people will lose interest in them.
Problems, problems, ...
Would it be an option to host this tool on Labs?
Yes, this is planned for the future, especially to automate regular data updates, which Serge now has to do manually. Besides the changed URL, this move would make a big difference for users. What you see right now is a first prototype beta-release that is meant to gather user feedback on how to develop this tool further.
Markus
Hi!
I am happy to announce a new tool [1], written by Serge Stratan, which allows you to browse the taxonomy (subclass of & instance of relations) between Wikidata's most important class items. For example, here is the Wikidata taxonomy for Pizza (discussed recently on this list):
Very nice! One feature especially interesting to me is cycle detection: I just discovered that there are many (as in, hundreds of) cycles in various important properties such as P131, which break a lot of queries, and I was wondering what could be done to expose/eliminate those better.
On 23.10.2015 20:19, Stas Malyshev wrote:
Hi!
I am happy to announce a new tool [1], written by Serge Stratan, which allows you to browse the taxonomy (subclass of & instance of relations) between Wikidata's most important class items. For example, here is the Wikidata taxonomy for Pizza (discussed recently on this list):
Very nice! One feature especially interesting to me is cycle detection: I just discovered that there are many (as in, hundreds of) cycles in various important properties such as P131, which break a lot of queries, and I was wondering what could be done to expose/eliminate those better.
Yes, this should indeed be fixed, although it is theoretically possible that two distinct items (with two distinct Wikipedia articles on at least one Wikipedia) are considered to refer to equivalent classes on Wikidata, which could be expressed by a small subclass-of cycle. For example, it is really not clear to me what the difference between some of our upper level classes is. For instance, what is an example of an Entity that is not an Object? (Note that Object subsumes intangible things such as Method and Activity). Maybe a better way to deal with this situation would be to avoid using one of the items as a class altogether, so that no cycle is needed.
We also have/had cycles involving instance-of, which is definitely an error. ;-)
Counting cycles in SPARQL should be tricky. There are many distinct cyclic paths through a graph with a few cycles, even if only a small number of relations is causing them. SPARQL does not count paths as such, but one would probably count nodes that are located on a path, which would still consider all possible cyclic paths. The Taxonomy Browser is looking for shortest cycles, but you cannot do this in SPARQL. This might be why there are only 11 cycles detected there.
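To make the idea of "shortest cycles" concrete, here is a small Python sketch over an in-memory edge list. It is not the Taxonomy Browser's actual code (which is not shown in this thread); the function names and the adjacency-dict input format are assumptions made for illustration. For each node, a breadth-first search finds the shortest path that leads back to that node, and cycles over the same set of nodes are reported only once.

```python
from collections import deque


def shortest_cycle_through(graph, start):
    """Return the shortest directed cycle through `start` as a node list, or None.

    `graph` maps each node to an iterable of its direct successors
    (e.g. item -> its "subclass of" targets).
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == start:
                # Loop closed: walk the parent links back to `start`.
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def shortest_cycles(graph):
    """Collect shortest cycles, keeping one per distinct set of nodes.

    Deduplicating by node set is a simplification: two different cycles
    over exactly the same nodes would be reported as one.
    """
    seen, cycles = set(), []
    for node in graph:
        cycle = shortest_cycle_through(graph, node)
        if cycle is not None and frozenset(cycle) not in seen:
            seen.add(frozenset(cycle))
            cycles.append(cycle)
    return cycles
```

Because breadth-first search visits nodes in order of distance, the first edge found that leads back to the start node closes a cycle of minimal length, which keeps the report small even when the graph contains very many longer cyclic paths.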
Markus
Hi!
least one Wikipedia) are considered to refer to equivalent classes on Wikidata, which could be expressed by a small subclass-of cycle. For
We can do it, but I'd rather we didn't. The reason is that it would require any engine that queries such data (e.g. a SPARQL engine) to be comfortable with cycles in property paths (especially ones with + and *), and not every one is (Blazegraph, for example, does not seem to handle them out of the box). It can be dealt with, I assume, but why create trouble for ourselves?
We also have/had cycles involving instance-of, which is definitely an error. ;-)
Right. So I think we need to mark properties that should not form cycles with https://www.wikidata.org/wiki/Q18647519 (asymmetric property) and have constraint-checking scripts/bots find such cases and alert about them.
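As a sketch of what the simplest such check could look like, the Python snippet below flags direct violations of asymmetry, i.e. pairs of items that point at each other through the same property. The input format (a list of (subject, object) pairs for one property) and the function name are assumptions for illustration; a real bot would read the statements from a dump or a query service, and longer cycles would need a proper cycle search.

```python
def asymmetry_violations(statements):
    """Return sorted pairs (a, b) for which both a -> b and b -> a are stated.

    `statements` is an iterable of (subject, object) pairs for a single
    property. A statement from an item to itself is reported as (a, a).
    """
    edges = set(statements)
    return sorted(
        {(min(a, b), max(a, b)) for a, b in edges if (b, a) in edges}
    )


if __name__ == "__main__":
    sample = [("Q1", "Q2"), ("Q2", "Q1"), ("Q2", "Q3")]
    for a, b in asymmetry_violations(sample):
        print(f"{a} and {b} point at each other")
```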
On 24/10/2015 00:50, Stas Malyshev wrote:
Hi!
least one Wikipedia) are considered to refer to equivalent classes on Wikidata, which could be expressed by a small subclass-of cycle. For
We can do it, but I'd rather we didn't. The reason is that it would require any engine that queries such data (e.g. a SPARQL engine) to be comfortable with cycles in property paths (especially ones with + and *), and not every one is (Blazegraph, for example, does not seem to handle them out of the box). It can be dealt with, I assume, but why create trouble for ourselves?
It should be a basic requirement of any SPARQL engine that it should be able to handle path queries that contain cycles.
For example, consider equivalence relationships like P460 "said to be the same as", which is being used to link given names together.
If we want to find all the names in a particular equivalence class, and eg rank them by their incidence count, as is done in the 'query' columns at https://www.wikidata.org/wiki/Wikidata:WikiProject_Names/given-name_variants
then being able to handle cycles in path queries is a basic requirement for the job.
-- James.
On 24.10.2015 09:36, James Heald wrote:
On 24/10/2015 00:50, Stas Malyshev wrote:
Hi!
least one Wikipedia) are considered to refer to equivalent classes on Wikidata, which could be expressed by a small subclass-of cycle. For
We can do it, but I'd rather we didn't. The reason is that it would require any engine that queries such data (e.g. a SPARQL engine) to be comfortable with cycles in property paths (especially ones with + and *), and not every one is (Blazegraph, for example, does not seem to handle them out of the box). It can be dealt with, I assume, but why create trouble for ourselves?
It should be a basic requirement of any SPARQL engine that it should be able to handle path queries that contain cycles.
For example, consider equivalence relationships like P460 "said to be the same as", which is being used to link given names together.
If we want to find all the names in a particular equivalence class, and eg rank them by their incidence count, as is done in the 'query' columns at https://www.wikidata.org/wiki/Wikidata:WikiProject_Names/given-name_variants
then being able to handle cycles in path queries is a basic requirement for the job.
I agree. Even if we discourage cycles in other cases, there is still no guarantee that there won't be any, so the engine should be robust against this.
On the other hand, we have to live with the technical infrastructure we got. If BlazeGraph does not handle cycles well, we should encourage their team to work on fixing this, but at the same time we need to work around the issue for a while.
"Said to be the same as" is a good example of a case where cycles are unavoidable. A possible workaround in this case is to make sure that the transitive closure of "said to be the same as" is already in the data, such that the path "P460+" returns the same results as a mere "P460" would. It's not ideal, but maybe workable.
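A minimal Python sketch of this workaround, assuming the "said to be the same as" statements are available as a plain list of item pairs (the function name and input format are made up for illustration): items are grouped into equivalence classes with a small union-find structure, and every ordered pair within a class is emitted as a statement to store.

```python
from itertools import permutations


def materialize_equivalence_closure(pairs):
    """Return all ordered pairs (a, b), a != b, that lie in the same class.

    `pairs` is an iterable of (item, item) statements that are treated as
    symmetric and transitive.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the classes of every stated pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group items by their class representative.
    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), []).append(item)

    # Emit every ordered pair inside each class.
    closure = set()
    for members in groups.values():
        closure.update(permutations(members, 2))
    return closure
```

With the closure stored, a one-step lookup finds every other member of the class. Note that with symmetric statements a "P460+" path also leads back to the starting item itself; the sketch leaves out such self-statements, so the two results would differ by that one item.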
Markus
"Said to be the same as" is a good example of a case where cycles are unavoidable. A possible workaround in this case is to make sure that the transitive closure of "said to be the same as" is already in the data, such that the path "P460+" returns the same results as a mere "P460" would. It's not ideal, but maybe workable.
I think we have to distinguish between different use cases:
1) Antisymmetric transitive relations, like subclass-of or part-of, which should form an acyclic graph. For these, the "*" notation in SPARQL can be used to query a sub-graph, such as all kinds of cars or all places in Idaho. This is our primary use case for path traversal, I believe.
2) Symmetric transitive relations, such as "said to be the same as". These (should) form small "islands" of fully connected graphs that are (hopefully) unconnected to each other. Here, the "*" notation can be used to include the entire clique instead of only a single node in a query. This might be useful in some cases, but doesn't strike me as a typical use case.
3) Cycles in non-transitive properties: these are not errors at all, and problems only arise when such properties are used in a query as if they were transitive. We could perhaps detect and reject attempts to apply the "*" notation to properties that are not transitive.
4) Intransitive symmetric relations (e.g. "spouse of"). Do we need any special handling for them, or do they just get treated like (3)?
Anyway: we need a solution for (1) that allows transitive queries, and a solution for (3) that prevents pathological behavior. If we get nice handling for case (2), that's a bonus, but not a requirement, I think.
The standard algorithm for a path search is very simple:
Keep adding a new generation of links until a new generation brings in no node that has not already been seen.
This works for graphs of equivalence relations, it works for directed acyclic graphs.
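Spelled out as a Python sketch (an illustration of the generation-by-generation idea, not Blazegraph's implementation; the adjacency-dict input format is an assumption):

```python
def reachable(graph, start):
    """Return every node reachable from `start`, including `start` itself.

    `graph` maps a node to an iterable of directly linked nodes. The loop
    ends as soon as a generation adds no unseen node, so it terminates on
    cyclic graphs as well as on acyclic ones.
    """
    seen = {start}
    frontier = {start}
    while frontier:
        next_generation = set()
        for node in frontier:
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_generation.add(neighbour)
        frontier = next_generation
    return seen


if __name__ == "__main__":
    cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
    print(sorted(reachable(cyclic, "A")))
```

Each node enters the frontier at most once, so the work is bounded by the number of nodes and links regardless of how many cycles the graph contains.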
It's not the /graphs/ that are causing the problem here, because Blazegraph can handle either of them by themselves and give the right answer.
Rather, in the query like:
    SELECT (COUNT(DISTINCT(?city)) AS ?count) WHERE {
      ?city wdt:P31/wdt:P279* wd:Q515 .   # find instances of subclasses of city
      ?city wdt:P131* wd:Q1202 .
    }
something is going wrong with the way Blazegraph handles the two conditions *together*.
I suspect this may be closely related to whatever is going wrong with a query like:
    SELECT (COUNT(DISTINCT(?a)) AS ?count) WHERE {
      BIND (wd:Q3305213 AS ?class) .
      ?a wdt:P31/wdt:P279* ?class .
    }
which times out.
It's the plan of joins which is going wrong, not whether the graph is acyclic or not.
-- James.
On 25/10/2015 17:53, Daniel Kinzler wrote:
"Said to be the same as" is a good example of a case where cycles are unavoidable. A possible workaround in this case is to make sure that the transitive closure of "said to be the same as" is already in the data, such that the path "P460+" returns the same results as a mere "P460" would. It's not ideal, but maybe workable.
I think we have to distinguish between different use cases:
1) Antisymmetric transitive relations, like subclass-of or part-of, which should form an acyclic graph. For these, the "*" notation in SPARQL can be used to query a sub-graph, such as all kinds of cars or all places in Idaho. This is our primary use case for path traversal, I believe.
2) Symmetric transitive relations, such as "said to be the same as". These (should) form small "islands" of fully connected graphs that are (hopefully) unconnected to each other. Here, the "*" notation can be used to include the entire clique instead of only a single node in a query. This might be useful in some cases, but doesn't strike me as a typical use case.
3) Cycles in non-transitive properties: these are not errors at all, and problems only arise when such properties are used in a query as if they were transitive. We could perhaps detect and reject attempts to apply the "*" notation to properties that are not transitive.
4) Intransitive symmetric relations (e.g. "spouse of"). Do we need any special handling for them, or do they just get treated like (3)?
Anyway: we need a solution for (1) that allows transitive queries, and a solution for (3) that prevents pathological behavior. If we get nice handling for case (2), that's a bonus, but not a requirement, I think.
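As an illustration of the workaround mentioned above for "said to be the same as" (putting the transitive closure into the data up front), here is a rough Python sketch. It assumes the statements are available as simple pairs of item names; the function name `materialise_equivalence` and the sample names are hypothetical, and self-pairs are left out on purpose.

```python
from itertools import permutations


def materialise_equivalence(pairs):
    """Return every ordered pair (a, b), a != b, lying in the same island."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the islands of the two ends of every statement.
    for a, b in pairs:
        parent[find(a)] = find(b)

    # Group the nodes by the representative of their island.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)

    # Emit the fully connected clique for each island.
    closed = set()
    for members in groups.values():
        closed.update(permutations(members, 2))
    return closed


# Hypothetical given-name variants forming two separate islands.
statements = [("Anna", "Anne"), ("Anne", "Ana"), ("John", "Jon")]
closed = materialise_equivalence(statements)
print(len(closed))
```

With the closure stored like this, a single-step lookup already returns the whole island, at the cost of a quadratic number of statements per island.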
On 25.10.2015 at 19:20, James Heald wrote:
It's not the /graphs/ that are causing the problem here, because Blazegraph can handle either of them by itself and give the right answer.
That's an interesting observation, would you add your examples to https://phabricator.wikimedia.org/T116298?
On 25.10.2015 at 19:50, Daniel Kinzler wrote:
Oh, you just did :) thanks!
I don't see how cycle queries can be a requirement for SPARQL engines if they are not part of SPARQL spec? The closest thing you have is property paths.
On Sat, 24 Oct 2015 at 09:37, James Heald j.heald@ucl.ac.uk wrote:
On 24/10/2015 00:50, Stas Malyshev wrote:
Hi!
least one Wikipedia) are considered to refer to equivalent classes on Wikidata, which could be expressed by a small subclass-of cycle. For
We can do it, but I'd rather we didn't. The reason is that it would require any engine that queries such data (e.g. a SPARQL engine) to be comfortable with cycles in property paths (especially ones with + and *), and not every one is (Blazegraph, for example, does not seem to handle them out of the box). It can be dealt with, I assume, but why create trouble for ourselves?
It should be a basic requirement of any SPARQL engine that it should be able to handle path queries that contain cycles.
For example, consider equivalence relationships like P460 "said to be the same as", which is being used to link given names together.
If we want to find all the names in a particular equivalence class, and eg rank them by their incidence count, as is done in the 'query' columns at
https://www.wikidata.org/wiki/Wikidata:WikiProject_Names/given-name_variants
then being able to handle cycles in path queries is a basic requirement for the job.
-- James.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On 24.10.2015 12:29, Martynas Jusevičius wrote:
I don't see how cycle queries can be a requirement for SPARQL engines if they are not part of SPARQL spec? The closest thing you have is property paths.
We were talking about *cyclic data* not cyclic queries (which you can also create easily using BGPs, but that's unrelated here). Apparently, BlazeGraph has performance issues when computing a path expression over a cyclic graph.
Markus
On 10/24/15 10:51 AM, Markus Krötzsch wrote:
We were talking about *cyclic data* not cyclic queries (which you can also create easily using BGPs, but that's unrelated here). Apparently, BlazeGraph has performance issues when computing a path expression over a cyclic graph.
Markus
Markus,
Out of curiosity, can you share a SPARQL query example (text or query results url) that demonstrates your point?
On 25.10.2015 02:18, Kingsley Idehen wrote:
Markus,
Out of curiosity, can you share a SPARQL query example (text or query results url) that demonstrates your point?
You mean a query with BlazeGraph having performance issues? That problem was reported by Stas. He should have examples. In any case, it is always a combination of query and data.
Markus
On 25/10/2015 09:31, Markus Krötzsch wrote:
You mean a query with BlazeGraph having performance issues? That problem was reported by Stas. He should have examples. In any case, it is always a combination of query and data.
Hi Kingsley,
I had a problem with Blazegraph queries that had path requirements containing a compound path predicate, and ending in a variable, e.g.
wd:Q289 wdt:P31/wdt:P279* ?o.
However, this particular example now appears to work. (With the recent upgrade of the SPARQL endpoint to the latest Blazegraph production release?)
On the other hand, it appears that path queries can still fail if they involve a variable intended to be a fixed constant set by a BIND statement (usually the first thing a query engine will do).
So, for example, a query to count incidences of instances of subclasses of painting, where the key requirement statement is
?a wdt:P31/wdt:P279* wd:Q3305213
runs in about 0.4 seconds. However, a very similar query where the identity of that target superclass is set using a BIND statement,
BIND (wd:Q3305213 AS ?class) . ?a wdt:P31/wdt:P279* ?class .
times out -- or rather: it ought to be reporting that it has timed out, and used to, but now it doesn't throw a "Query Timed Out" error, but instead now after 120 seconds returns an (incorrect) count of zero. (An additional, new bug).
Complete versions of these queries can be found at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/suggestions#Path...
and as a Blazegraph bug at
https://jira.blazegraph.com/browse/BLZG-1543
(although, as with a couple of other issues described on the wiki page linked above for which I've filed Blazegraph bugs, there doesn't seem to be any indication that anybody has actually read the bug...)
I'm not sure if Stas knows of other current issues with path queries.
I did post a complaint to this list, just after the query service was publicly announced, that path queries seemed very slow. They *are* still slower than the equivalent search on WDQ. But I think it was this issue with binding variables that was underlying the worst of what I was seeing.
As for cyclical paths, as I posted a couple of days ago, the queries at https://www.wikidata.org/wiki/Wikidata:WikiProject_Names/given-name_variants for counting up incidences of given-name variants involve graphs that are anything but directed (based on the P460 "said to be the same as" property), and Blazegraph seems to handle them without any particular difficulty; though it's possible that there may have been earlier problems when the service was still at an alpha stage.
-- James.
Further to my previous message, I've just seen that Stas filed this on Phabricator this week, https://phabricator.wikimedia.org/T116298
relating to queries like:
    SELECT (COUNT(DISTINCT(?city)) AS ?count) WHERE {
      ?city wdt:P31/wdt:P279* wd:Q515 .  # find instances of subclasses of city
      ?city wdt:P131* wd:Q1202 .
    }
Either of these path requirements runs fine by itself; but something about the combination causes the engine to time out.
-- James.
On 10/25/15 5:31 AM, Markus Krötzsch wrote:
You mean a query with BlazeGraph having performance issues? That problem was reported by Stas. He should have examples. In any case, it is always a combination of query and data.
Markus
I just want to see an example of one of these cyclic graph oriented SPARQL queries.
On 25.10.2015 at 02:18, Kingsley Idehen wrote:
Out of curiosity, can you share a SPARQL query example (text or query results url) that demonstrates your point?
See https://phabricator.wikimedia.org/T116298
The example query I tried (again, just now) is:
    prefix wd: <http://www.wikidata.org/entity/>
    prefix wdt: <http://www.wikidata.org/prop/direct/>

    SELECT DISTINCT ?city WHERE {
      ?city wdt:P31/wdt:P279* wd:Q515 .  # find instances of subclasses of city
      ?city wdt:P131* wd:Q1202 .         # ...located in Saxony.
    }
This will work fine of course if the data is fixed. So if it doesn't time out for you, try it with a data set that is known to contain cycles.
Hi!
It should be a basic requirement of any SPARQL engine that it should be able to handle path queries that contain cycles.
So I did some simple checks, and on simple examples Blazegraph handles cycles just fine. However, on more complex queries, the cycles seem to be causing trouble. I don't know yet why, I'll look at it further, probably next week.
So the problem is not "handling cycles" in general, it is handling some specific data set, and most probably is a consequence of some bug. I'll report when I have more data about what exactly triggers the bug.
On 27/10/2015 08:42, Stas Malyshev wrote:
Hi!
It should be a basic requirement of any SPARQL engine that it should be able to handle path queries that contain cycles.
So I did some simple checks, and on simple examples Blazegraph handles cycles just fine. However, on more complex queries, the cycles seem to be causing trouble. I don't know yet why, I'll look at it further, probably next week.
So the problem is not "handling cycles" in general, it is handling some specific data set, and most probably is a consequence of some bug. I'll report when I have more data about what exactly triggers the bug.
The key issue when a graph contains cycles is that you cannot simply assume that each successive generation of nodes, obtained by adding another path step, consists of new nodes (as it would for an acyclic graph -- well, not entirely, because you might already have reached them by a shorter path; but nothing is going to seriously break with an acyclic graph if you get this check wrong).
In contrast, with a graph that contains cycles, you need to do some sort of hash join with what you have already seen, to specifically identify the new nodes.
If the query planner is somehow messing up those hash joins when given multiple interrelated path requirements, that could be a source of trouble.
-- James.
Blazegraph, for example, does not seem to handle them out of the box
As Wikidata is an open wiki, I think we can't avoid the query engine having to deal with cycles from time to time. I can't imagine the Wikidata query engine having trouble with cycles. It must be robust.
2015-10-24 1:50 GMT+02:00 Stas Malyshev smalyshev@wikimedia.org:
Hi!
least one Wikipedia) are considered to refer to equivalent classes on Wikidata, which could be expressed by a small subclass-of cycle. For
We can do it, but I'd rather we didn't. The reason is that it would require any engine that queries such data (e.g. a SPARQL engine) to be comfortable with cycles in property paths (especially ones with + and *), and not every one is (Blazegraph, for example, does not seem to handle them out of the box). It can be dealt with, I assume, but why create trouble for ourselves?
We also have/had cycles involving instance-of, which is definitely an error. ;-)
Right. So I think we need to mark properties that should not form cycles with https://www.wikidata.org/wiki/Q18647519 (asymmetric property) and have constraints checking scripts/bots find out such cases and alert about them. -- Stas Malyshev smalyshev@wikimedia.org
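A constraint-checking script of the kind suggested above could look for cycles roughly as follows. This is a hedged Python sketch only: it assumes the statements of one property have already been extracted as (subject, object) pairs, and `find_cycle` and the node names are invented for the example rather than taken from any existing bot.

```python
def find_cycle(edges):
    """Return one cycle as a list of nodes (first == last), or None if acyclic."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)
        successors.setdefault(target, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in successors}

    for root in successors:
        if colour[root] != WHITE:
            continue
        # Iterative depth-first search, so deep hierarchies do not hit the
        # recursion limit.
        path = [root]
        iterators = [iter(successors[root])]
        colour[root] = GREY
        while iterators:
            for nxt in iterators[-1]:
                if colour[nxt] == GREY:
                    # Back edge to a node on the current path: a cycle.
                    return path[path.index(nxt):] + [nxt]
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    path.append(nxt)
                    iterators.append(iter(successors[nxt]))
                    break
            else:
                # All successors explored: this node is finished.
                colour[path.pop()] = BLACK
                iterators.pop()
    return None


# Hypothetical statements: X -> Y -> Z -> X is a cycle worth alerting about.
print(find_cycle([("X", "Y"), ("Y", "Z"), ("Z", "X")]))
print(find_cycle([("X", "Y"), ("Y", "Z")]))
```

Returning the offending path, rather than just a yes/no answer, gives an editor something concrete to fix.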
On Thu, Oct 22, 2015 at 5:31 PM, Markus Kroetzsch markus.kroetzsch@tu-dresden.de wrote:
Hi all,
I am happy to announce a new tool [1], written by Serge Stratan, which allows you to browse the taxonomy (subclass of & instance of relations) between Wikidata's most important class items.
Nice work! Thanks for sharing.
Cheers Lydia