Hi all,
We're trying to extract the full type hierarchy of Wikidata starting from all occurrences of P31 and P279. While we have some custom code for this, we're thinking there may be a smarter/more-efficient way of doing it using SPARQL or a tool that we are probably unaware of. Any hint would be appreciated. :)
Thanks, Leila
In case you wonder why we ended up with this question and who "we" is ;):
The research is being documented at https://meta.wikimedia.org/wiki/Research_talk:Expanding_Wikipedia_stubs_acro... . (The documentation is not fully up-to-date, but it will give you the gist of what we are doing.)
We are interested in building systems that can help editors and editathon organizers identify the most common structures for different article types given the already existing articles in each type/category in Wikipedia (in a fixed language or across languages) and the information available in those articles.
The challenge we have run into, and we're not the first to run into it, is that the categories in Wikipedia don't (as a whole) form an is-a hierarchy. This is a big problem for information extraction based on the category system, and we're trying to find a way to clean it up before starting to use it for this research. (We've looked at the body of research that attempts to clean up the Wikipedia category system for knowledge extraction, and none of what we've found addresses the problem we have. More on that once we complete the documentation.)
Hi!
We're trying to extract the full type hierarchy of Wikidata starting from all occurrences of P31 and P279. While we have some custom code for this, we're thinking there may be a smarter/more-efficient way of doing it using SPARQL or a tool that we are probably unaware of. Any hint would be appreciated. :)
Well, Blazegraph implements BFS: https://wiki.blazegraph.com/wiki/index.php/RDF_GAS_API#GAS_Examples which may be useful in this case, though I am not sure it is possible to map the whole thing in one query without running into timeouts.
Also, I'm not sure P31 and P279 currently represent a hierarchy as such; i.e., loops have been known to exist in those (maybe already fixed, but not 100% sure). So one needs to be aware of that too.
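[A minimal sketch of the kind of loop check Stas is warning about, assuming the (child, parent) pairs have already been fetched offline; the function name and the Q-ids in the example are placeholders, not anything from the thread:]

```python
from collections import defaultdict

def find_cycle(edges):
    """Detect a cycle in a directed graph given as (child, parent) pairs.

    Returns one cycle as a list of nodes (first == last), or None if the
    graph is acyclic. Iterative DFS with colour marking, so it copes with
    hierarchies too deep for recursion.
    """
    graph = defaultdict(list)
    for child, parent in edges:
        graph[child].append(parent)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # defaults to WHITE

    for start in list(graph):
        if colour[start] != WHITE:
            continue
        stack = [(start, iter(graph[start]))]
        path = [start]
        colour[start] = GREY
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if colour[nxt] == GREY:
                    # Back edge: the cycle is the tail of the current path.
                    return path[path.index(nxt):] + [nxt]
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    path.append(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                colour[node] = BLACK
                path.pop()
                stack.pop()
    return None
```

[For example, `find_cycle([("Q1", "Q2"), ("Q2", "Q3"), ("Q3", "Q1")])` reports the three-node loop, while the same pairs without the last edge come back as None.]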
Hi Leila,
I am using WDQS regularly to retrieve all P279 relationships in one query. This is about the largest result you can still get, and it is not hard to compute the transitive closure of this data offline so as to get the full type hierarchy.
If you included P31 as well, you'd get essentially all Wikidata items, which would be too much for a query result within the timeout. But since essentially all items have this property, you don't need sophisticated query support to find those parts, e.g., using Wikidata Toolkit in an offline program.
Regards,
Markus
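[The offline transitive-closure step Markus describes could be sketched roughly like this, assuming the (subclass, superclass) pairs are already in memory; the function name and the letter IDs in the example are placeholders:]

```python
from collections import defaultdict

def transitive_closure(subclass_pairs):
    """Compute every ancestor of every class from (subclass, superclass) pairs.

    A breadth-first walk over the superclass edges from each class; it
    tolerates cycles, since a node already seen is never re-queued.
    """
    parents = defaultdict(set)
    for sub, sup in subclass_pairs:
        parents[sub].add(sup)

    closure = {}
    for start in parents:
        seen = set()
        frontier = [start]
        while frontier:
            nxt = []
            for node in frontier:
                for p in parents.get(node, ()):
                    if p not in seen:
                        seen.add(p)
                        nxt.append(p)
            frontier = nxt
        closure[start] = seen
    return closure
```

[So with pairs ("A", "B") and ("B", "C"), the closure for "A" is {"B", "C"}; in a loop, a class ends up in its own closure, which is one easy way to spot the cycles Stas mentions.]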
Hi Markus, Hi Stas,
Thanks for your responses. Please find our comments below.
On Fri, Jul 7, 2017 at 2:51 AM, Markus Kroetzsch markus.kroetzsch@tu-dresden.de wrote:
Hi Leila,
I am using WDQS regularly to retrieve all P279 relationships in one query. This is about the largest result you can still get, and it is not hard to compute the transitive closure of this data offline so as to get the full type hierarchy.
If you included P31 as well, you'd get essentially all Wikidata items, which would be too much for a query result within the timeout. But since essentially all items have this property, you don't need sophisticated query support to find those parts, e.g., using Wikidata Toolkit in an offline program.
It is reassuring to hear from you that some of the work will need to be performed offline (especially because we do need all the P31s). It is safe to say that if there were a ready-made solution out there, you would know about it. :) We can now proceed more confidently with a combination of WDQS and offline processing.
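[The WDQS half of that combination amounts to one query plus a small result parser. A sketch, assuming the standard SPARQL 1.1 JSON results format returned by the query service; the constant and function names are placeholders:]

```python
# The one-shot WDQS query Markus describes: all direct P279 pairs.
# (wdt: is the truthy-statement prefix that WDQS predeclares.)
SUBCLASS_QUERY = """
SELECT ?sub ?sup WHERE { ?sub wdt:P279 ?sup . }
"""

def pairs_from_bindings(result_json):
    """Turn a SPARQL JSON result into (subclass, superclass) Q-id pairs.

    Assumes entity IRIs of the form http://www.wikidata.org/entity/Q42,
    from which the trailing Q-id is taken.
    """
    pairs = []
    for row in result_json["results"]["bindings"]:
        sub = row["sub"]["value"].rsplit("/", 1)[-1]
        sup = row["sup"]["value"].rsplit("/", 1)[-1]
        pairs.append((sub, sup))
    return pairs
```

[The resulting pairs then feed straight into the offline closure/cycle processing, with P31 pairs extracted from a dump rather than the endpoint, as Markus suggests.]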
On 07.07.2017 01:47, Stas Malyshev wrote:
We're trying to extract the full type hierarchy of Wikidata starting from all occurrences of P31 and P279. While we have some custom code for this, we're thinking there may be a smarter/more-efficient way of doing it using SPARQL or a tool that we are probably unaware of. Any hint would be appreciated. :)
Well, Blazegraph implements BFS: https://wiki.blazegraph.com/wiki/index.php/RDF_GAS_API#GAS_Examples which may be useful in this case, though I am not sure it is possible to map the whole thing in one query without running into timeouts.
Thanks for the pointer. Michele tells me that he has tried to run it locally, and the BFS query has been giving him problems even on a local server, so timeouts on the remote endpoint are something we expect as well.
Also, I'm not sure P31 and P279 currently represent a hierarchy as such; i.e., loops have been known to exist in those (maybe already fixed, but not 100% sure). So one needs to be aware of that too.
Correct, it doesn't, and we are painfully ;) aware of it.
Thanks to both of you and have a good weekend! :)
Best, Leila