Thank you for the explanation. The content negotiation for an Item IRI is clear. Any request for http://www.wikidata.org/entity/Q... requires an Accept: application/rdf+xml header in order to get the RDF. The default response is JSON, and Accept: text/html returns a 200 response delivering the UI page.
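To make the negotiation above concrete, here is a minimal sketch that builds (but does not send) requests for an Item IRI with different Accept headers. The helper name build_entity_request is hypothetical, the media types are the ones named in this message, and the server behavior is as described above rather than verified here.

```python
from urllib.request import Request

# Base of the entity IRIs discussed in this thread.
ENTITY_BASE = "http://www.wikidata.org/entity/"


def build_entity_request(entity_id: str, accept: str) -> Request:
    """Build a GET request for an entity IRI with the given Accept header.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    return Request(ENTITY_BASE + entity_id, headers={"Accept": accept})


# One request per representation mentioned in the message.
rdf_request = build_entity_request("Q16521", "application/rdf+xml")
html_request = build_entity_request("Q16521", "text/html")

for request in (rdf_request, html_request):
    print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Passing either request to urllib.request.urlopen would perform the actual fetch.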
For statement resolution in the Item RDF, is this not a fragment? So in the Item context, the IRI for a statement resource would be http://www.wikidata.org/entity/Q16521#Statement_UUID. Otherwise, the statement IRI http://www.wikidata.org/entity/statement/Statement_UUID could just return the statement as a separate entity.
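The practical difference between the two IRI shapes can be sketched as follows. An HTTP client strips the fragment before making the request, so a fragment-style statement IRI would be answered by the Item document, while the /entity/statement/ form is a separate resource. The function name split_statement_iri is hypothetical, and Statement_UUID is the placeholder used above, not a real identifier.

```python
from urllib.parse import urldefrag


def split_statement_iri(iri):
    """Return (document IRI that would be requested, fragment kept client-side)."""
    document_iri, fragment = urldefrag(iri)
    return document_iri, fragment


# Fragment form: the request goes to the Item document.
fragment_form = split_statement_iri(
    "http://www.wikidata.org/entity/Q16521#Statement_UUID"
)

# Separate-entity form: the statement IRI itself is requested.
separate_form = split_statement_iri(
    "http://www.wikidata.org/entity/statement/Statement_UUID"
)

print(fragment_form)
print(separate_form)
```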
On the topic of references, a use case is to measure data quality by counting the number of "unreferenced statements". At https://phabricator.wikimedia.org/T117234#1834728, I propose the possibility of using blank reference nodes to identify these "bad" statements. Having an object to count greatly expedites the query process because of the estimated cardinality feature of Blazegraph. The only alternative to this is to count distinct statements with the prov:wasDerivedFrom predicate, and this is extremely slow (in fact, it may not be possible without a huge amount of memory).
I do not know what would be involved in implementing blank reference nodes or what performance consequences might follow. It seems to me that the pairing of statements and references is a core feature of the data model, and it is odd that there can exist statements that have no associated reference node in the RDF.
Cheers, Christopher
On 27 November 2015 at 13:00, wikidata-tech-request@lists.wikimedia.org wrote:
Today's Topics:
- RDF Item, Statement and Reference IRI Resolution? (Christopher Johnson)
- Re: RDF Item, Statement and Reference IRI Resolution? (Markus Krötzsch)
Message: 1
Date: Fri, 27 Nov 2015 07:21:10 +0100
From: Christopher Johnson christopher.johnson@wikimedia.de
To: wikidata-tech@lists.wikimedia.org, wikimedia-de-tech wikimedia-de-tech@wikimedia.de
Subject: [Wikidata-tech] RDF Item, Statement and Reference IRI Resolution?
Message-ID: <CACzuuKvGK1dM1+dn4ypocjhO= psuk4LLtWngZp1yFVP6wmVqFA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
After looking at the RDF format closely, I am asking if the item, statement and reference IRIs could/should be directly resolvable to XML/JSON formatted resources.
It seems that currently http://www.wikidata.org/entity/.... redirects to the UI at https://www.wikidata.org/wiki/ which is not what a machine reader would expect. Without a simple method to resolve the IRIs (perhaps a RESTful API?), these RDF data objects are opaque for parsers.
Of course, with wbgetclaims, it is possible to get the statement like this:
https://www.wikidata.org/w/api.php?action=wbgetclaims&format=xml&cla...
but the GUID format the API expects does not match the RDF UUID representation (there is a $, or "%24" when URL-encoded, after the item ID instead of a -), and it returns both the statement and the references.
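The mismatch described above is small enough to bridge in a few lines. This is only a sketch, assuming the RDF form is always the item ID, a hyphen, and then the UUID part; the function name is hypothetical and the example ID is the one quoted later in this message.

```python
from urllib.parse import quote


def rdf_statement_id_to_api_guid(rdf_id: str) -> str:
    """Replace the first '-' (between item ID and UUID part) with '$'."""
    entity_id, separator, uuid_part = rdf_id.partition("-")
    if not separator or not uuid_part:
        raise ValueError(f"not an item-prefixed statement ID: {rdf_id!r}")
    return f"{entity_id}${uuid_part}"


rdf_id = "Q20913766-CD281698-E1D0-43A1-BEEA-E2A60E5A88F1"
api_guid = rdf_statement_id_to_api_guid(rdf_id)

# Percent-encode for use in a query string; '$' becomes '%24'.
encoded_guid = quote(api_guid, safe="")

print(api_guid)
print(encoded_guid)
```

Only the first hyphen is replaced, because the UUID part itself contains hyphens.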
Since the reference is its own node in the RDF, it can be queried independently. For example, to ask "return all of the statements where reference R is bound." But then, the return value is a list of statement IDs and a subquery or separate query is then required to return the associated statement node.
I am also wondering why item, statement and reference "UUIDs" are not in canonical format in the RDF. This is a question of compliance with IETF guidelines, which may or may not be relevant.
Item: Q20913766
Statement: Q20913766-CD281698-E1D0-43A1-BEEA-E2A60E5A88F1
Reference: 39f3ce979f9d84a0ebf09abe1702bf22326695e9
See: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format See: http://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml and http://tools.ietf.org/html/rfc4122 for information on urn:uuid guidelines.
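As a rough check of the identifiers listed above against RFC 4122, the sketch below uses Python's standard uuid module. It takes "canonical" to mean the lowercase, hyphenated 8-4-4-4-12 string form, which is an assumption about what is intended here; the helper name is_canonical_uuid is hypothetical.

```python
import uuid


def is_canonical_uuid(text: str) -> bool:
    """True if text is already in lowercase 8-4-4-4-12 UUID form."""
    try:
        parsed = uuid.UUID(text)
    except ValueError:
        return False
    # str(UUID) always yields the lowercase hyphenated form.
    return str(parsed) == text


statement_id = "Q20913766-CD281698-E1D0-43A1-BEEA-E2A60E5A88F1"
reference_id = "39f3ce979f9d84a0ebf09abe1702bf22326695e9"

# Drop the item prefix to isolate the UUID part of the statement ID.
_, statement_uuid = statement_id.split("-", 1)

print(is_canonical_uuid(statement_uuid))          # uppercase hex digits
print(is_canonical_uuid(statement_uuid.lower()))  # lowercased form
print(is_canonical_uuid(reference_id))            # 40 hex digits, not 32
print(uuid.UUID(statement_uuid).urn)
```

Under that reading, the statement's UUID part parses as a UUID but is written in uppercase, and the reference identifier is a 40-digit hex string that does not parse as a UUID at all.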
Thanks for your feedback, Christopher
The statement-to-reference relation problem also relates to the topic of metadata reification, which, from what I can gather, is not really addressed in the current WDQS RDF approach.
In Blazegraph, this could be supported by Quads or RDR (Reification Done Right). See http://arxiv.org/pdf/1406.3399.pdf , https://wiki.blazegraph.com/wiki/index.php/Reification_Done_Right
One possible approach using triples for the use case could be to assign a blank node to a reference placeholder and introduce the valid range class for prov:wasDerivedFrom (prov:Entity) with the canonical reference UUID, like this:
wds:Q3-24bf3704-4c5d-083a-9b59-1881f82b6b37 prov:wasDerivedFrom _:refhash .
_:refhash a prov:Entity, wikibase:Reference, wdref:referenceUUID ;
    pr:P7 "Some data" ;
    pr:P8 "1976-01-12T00:00:00Z"^^xsd:dateTime ;
    prv:P8 wdv:b74072c03a5ced412a336ff213d69ef1 .
Introducing an owl:minCardinality on prov:wasDerivedFrom would mean that if there were no refhash for a statement, then a null object (similar to wdno) would identify "unreferenced statements" like this:
wds:Q3-24bf3704-4c5d-083a-9b59-1881f82b6b37 prov:wasDerivedFrom wikibase:nullRef .
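The difference this makes for counting can be sketched over plain in-memory triples. This is only an illustration of the idea, not of how Blazegraph evaluates queries: with a null object there is a concrete (predicate, object) pattern to count, whereas without it the count needs the full set of statements and a set difference. The statement and reference IDs below are made up, and wikibase:nullRef is the proposed (not existing) term from this message.

```python
PROV_WAS_DERIVED_FROM = "prov:wasDerivedFrom"
NULL_REF = "wikibase:nullRef"  # proposed sentinel, not part of the current model

# Hypothetical statements; wds:S2 and wds:S3 have no reference.
all_statements = ["wds:S1", "wds:S2", "wds:S3"]

# Current situation: unreferenced statements simply have no triple.
triples_without_sentinel = [
    ("wds:S1", PROV_WAS_DERIVED_FROM, "wdref:r1"),
]

# Proposed situation: unreferenced statements point at the null object.
triples_with_sentinel = triples_without_sentinel + [
    ("wds:S2", PROV_WAS_DERIVED_FROM, NULL_REF),
    ("wds:S3", PROV_WAS_DERIVED_FROM, NULL_REF),
]


def count_unreferenced_with_sentinel(triples):
    """Count matches of a single pattern: ?s prov:wasDerivedFrom wikibase:nullRef."""
    return sum(
        1 for _, p, o in triples if p == PROV_WAS_DERIVED_FROM and o == NULL_REF
    )


def count_unreferenced_without_sentinel(triples, statements):
    """Collect every referenced statement, then subtract from all statements."""
    referenced = {s for s, p, _ in triples if p == PROV_WAS_DERIVED_FROM}
    return len(set(statements) - referenced)


sentinel_count = count_unreferenced_with_sentinel(triples_with_sentinel)
difference_count = count_unreferenced_without_sentinel(
    triples_without_sentinel, all_statements
)
print(sentinel_count, difference_count)
```

Both approaches give the same answer; the first only needs to look at one pattern, while the second has to materialize the set of all referenced statements.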
There are a lot of ways to deal with this issue, I guess. But it seems to me that having a simple programmatic method to validate statement integrity (as supported or unsupported claims) is very important to substantiating the utility of Wikidata for the academic community.
On 28 November 2015 at 11:20, Christopher Johnson <christopher.johnson@wikimedia.de> wrote:
Hi!
In Blazegraph, this could be supported by Quads or RDR (Reification Done Right).
We considered using RDR but decided against it because RDR is not standard and existing tools and libraries would not understand it. So, in the interest of better data integration, we decided to use a regular RDF representation that can be queried by standard SPARQL.
One possible approach using triples for the use case could be to assign a blank node to a reference placeholder and introduce the valid range class for prov:wasDerivedFrom (prov:Entity) with the canonical reference UUID like this:
I'm not sure I understand - what would doing this earn us? This looks like just adding one more join to the lookups.