OK, let me try to make this clear. The use case is to be able to identify
"unreferenced statements" with a simple SPARQL query, and/or count them using
fastRangeCount.
How do we do this with the current implementation?
What this gains is a useful method for measuring data quality that does not
exist now (as far as I can tell). It could also provide a
stable URI (UUID) for references, which would fix the questionable use of the
"unstable" reference hash as a resource URI and facilitate the
implementation of reusability for them.
Thanks,
Christopher
On 30 November 2015 at 13:00, <wikidata-tech-request(a)lists.wikimedia.org>
wrote:
> Send Wikidata-tech mailing list submissions to
> wikidata-tech(a)lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
> or, via email, send a message with subject or body 'help' to
> wikidata-tech-request(a)lists.wikimedia.org
>
> You can reach the person managing the list at
> wikidata-tech-owner(a)lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Wikidata-tech digest..."
>
>
> Today's Topics:
>
> 1. Re: Wikidata-tech Digest, Vol 31, Issue 5 (Stas Malyshev)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 29 Nov 2015 16:54:29 -0800
> From: Stas Malyshev <smalyshev(a)wikimedia.org>
> To: Wikidata technical discussion <wikidata-tech(a)lists.wikimedia.org>
> Subject: Re: [Wikidata-tech] Wikidata-tech Digest, Vol 31, Issue 5
> Message-ID: <565B9E45.6090106(a)wikimedia.org>
> Content-Type: text/plain; charset=utf-8
>
> Hi!
>
> > In Blazegraph, this could be supported by Quads or RDR (Reification Done
> > Right).
>
> We considered using RDR but decided against it because RDR is not
> standard and existing tools and libraries would not understand it. So in
> the interest of better data integration we decided to use regular RDF
> representation that can be queried by standard SPARQL.
>
> > One possible approach using triples for the use case could be to assign
> > a blank node to a reference placeholder and introduce the valid range
> > class for prov:wasDerivedFrom (prov:entity) with the canonical reference
> > UUID like this:
>
> I'm not sure I understand - what would doing this earn us? This looks
> like just adding one more join to the lookups.
> --
> Stas Malyshev
> smalyshev(a)wikimedia.org
>
> End of Wikidata-tech Digest, Vol 31, Issue 8
> ********************************************
>
Thank you for the explanation. The content negotiation for an Item IRI is
clear. Any request for http://www.wikidata.org/entity/Q... requires an
Accept: application/rdf+xml header in order to get the RDF. The default
response is JSON, and Accept: text/html returns a 200 response delivering the
UI page.
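The negotiation described above can be sketched with Python's standard
library. This is a minimal illustration, not the Wikidata implementation: the
helper name build_entity_request is hypothetical, and the request is only
constructed, not sent, so the server behaviour is as described in this message
rather than verified here.

```python
from urllib.request import Request

ENTITY_BASE = "http://www.wikidata.org/entity/"  # IRI prefix quoted in the message


def build_entity_request(entity_id: str, accept: str = "application/rdf+xml") -> Request:
    """Build (but do not send) a GET request for an entity IRI.

    The Accept header is what selects the representation.
    """
    return Request(ENTITY_BASE + entity_id, headers={"Accept": accept})


# One request asking for RDF, one asking for the HTML UI page.
rdf_request = build_entity_request("Q16521")
html_request = build_entity_request("Q16521", accept="text/html")

print(rdf_request.full_url, rdf_request.get_header("Accept"))
print(html_request.full_url, html_request.get_header("Accept"))
```

Passing either request to urllib.request.urlopen would perform the actual
fetch.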
For statement resolution in the Item RDF, is this not a fragment? So in
the Item context, the IRI for a statement resource would be
http://www.wikidata.org/entity/Q16521#Statement_UUID. Otherwise, the
statement IRI http://www.wikidata.org/entity/statement/Statement_UUID could
just return the statement as a separate entity.
On the topic of references, a use case is to measure data quality by
counting the number of "unreferenced statements". At
https://phabricator.wikimedia.org/T117234#1834728, I propose the
possibility of using blank reference nodes to identify these "bad"
statements. Having an object to count greatly expedites the query process
because of the estimated cardinality feature of Blazegraph. The only
alternative to this is to count distinct statements with the
prov:wasDerivedFrom predicate, and this is extremely slow (in fact, it may
not be possible without a huge amount of memory).
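To make the two counting strategies concrete, here is a small in-memory sketch
in Python. It is an illustration only: the triples, the PLACEHOLDER node and
the function names are hypothetical, and plain Python sets stand in for the
triple store. The first function mirrors the current data shape, where an
unreferenced statement simply has no prov:wasDerivedFrom triple. The second
mirrors the proposal, where such statements point at a placeholder, so the
count becomes a single triple pattern with a fixed predicate and object.

```python
PROV_DERIVED = "prov:wasDerivedFrom"
PLACEHOLDER = "ref:none"  # hypothetical placeholder node meaning "no reference"

statements = {"s1", "s2", "s3", "s4"}  # made-up statement nodes

# Current shape: unreferenced statements have no reference triple at all.
triples_now = {
    ("s1", PROV_DERIVED, "ref:a"),
    ("s3", PROV_DERIVED, "ref:a"),
    ("s3", PROV_DERIVED, "ref:b"),
}

# Proposed shape: unreferenced statements point at the placeholder node.
triples_proposed = triples_now | {
    ("s2", PROV_DERIVED, PLACEHOLDER),
    ("s4", PROV_DERIVED, PLACEHOLDER),
}


def count_unreferenced_by_absence(all_statements, triples):
    """Anti-join: collect every referenced statement, then subtract."""
    referenced = {s for (s, p, _o) in triples if p == PROV_DERIVED}
    return len(all_statements - referenced)


def count_unreferenced_by_placeholder(triples):
    """Single pattern: count (?s, prov:wasDerivedFrom, PLACEHOLDER) triples."""
    return sum(1 for (_s, p, o) in triples if p == PROV_DERIVED and o == PLACEHOLDER)


print(count_unreferenced_by_absence(statements, triples_now))
print(count_unreferenced_by_placeholder(triples_proposed))
```

Both calls give the same answer for this toy data; the difference is that the
first has to touch every statement, while the second only needs the number of
triples matching one pattern.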
I do not know what would be involved in implementing blank reference nodes
or what the performance consequences might be. It seems to me that the
pairing of statements and references is a core feature of the data model,
and it is odd that there can exist statements that have no associated
reference node in the RDF.
Cheers,
Christopher
On 27 November 2015 at 13:00, <wikidata-tech-request(a)lists.wikimedia.org>
wrote:
>
> Today's Topics:
>
> 1. RDF Item, Statement and Reference IRI Resolution?
> (Christopher Johnson)
> 2. Re: RDF Item, Statement and Reference IRI Resolution?
> (Markus Krötzsch)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 27 Nov 2015 07:21:10 +0100
> From: Christopher Johnson <christopher.johnson(a)wikimedia.de>
> To: wikidata-tech(a)lists.wikimedia.org, wikimedia-de-tech
> <wikimedia-de-tech(a)wikimedia.de>
> Subject: [Wikidata-tech] RDF Item, Statement and Reference IRI
> Resolution?
> Message-ID:
> <CACzuuKvGK1dM1+dn4ypocjhO=
> psuk4LLtWngZp1yFVP6wmVqFA(a)mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
>
Hi,
After looking at the RDF format closely, I am asking if the item, statement
and reference IRIs could/should be directly resolvable to XML/JSON
formatted resources.
It seems that currently http://www.wikidata.org/entity/.... redirects to
the UI at https://www.wikidata.org/wiki/ which is not what a machine reader
would expect.
Without a simple method to resolve the IRIs (perhaps a RESTful API?), these
RDF data objects are opaque for parsers.
Of course, with wbgetclaims, it is possible to get the statement like this:
https://www.wikidata.org/w/api.php?action=wbgetclaims&format=xml&claim=Q209…
but the API expected GUID format does not match the RDF UUID representation
(there is a $ or "%24" after the item instead of a -) and it returns both
the statement and the references.
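The difference between the two identifier forms is mechanical, so it can be
bridged with a small helper. The sketch below assumes, as described above,
that the only difference is the separator after the item ID ("-" in the RDF,
"$" in the API GUID). The function names are hypothetical.

```python
from urllib.parse import quote


def rdf_statement_id_to_guid(statement_id: str) -> str:
    """Replace the first '-' (the one after the item ID) with '$'."""
    item_id, sep, rest = statement_id.partition("-")
    if not sep or not rest:
        raise ValueError(f"no '-' separator in {statement_id!r}")
    return f"{item_id}${rest}"


def guid_to_rdf_statement_id(guid: str) -> str:
    """Inverse direction: replace the '$' after the item ID with '-'."""
    item_id, sep, rest = guid.partition("$")
    if not sep or not rest:
        raise ValueError(f"no '$' separator in {guid!r}")
    return f"{item_id}-{rest}"


rdf_id = "Q20913766-CD281698-E1D0-43A1-BEEA-E2A60E5A88F1"
guid = rdf_statement_id_to_guid(rdf_id)

print(guid)                  # Q20913766$CD281698-E1D0-43A1-BEEA-E2A60E5A88F1
print(quote(guid, safe=""))  # Q20913766%24CD281698-E1D0-43A1-BEEA-E2A60E5A88F1
```

The percent-encoded form is what would go into the claim parameter of a
wbgetclaims URL.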
Since the reference is its own node in the RDF, it can be queried
independently. For example, to ask "return all of the statements where
reference R is bound." But then, the return value is a list of statement
IDs and a subquery or separate query is then required to return the
associated statement node.
I am also wondering why item, statement and reference "UUIDs" are not in
canonical format in the RDF. This is a question of compliance with IETF
guidelines, which may or may not be relevant.
Item: Q20913766
Statement: Q20913766-CD281698-E1D0-43A1-BEEA-E2A60E5A88F1
Reference: 39f3ce979f9d84a0ebf09abe1702bf22326695e9
See: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format
See: http://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml
and http://tools.ietf.org/html/rfc4122 for information on urn:uuid
guidelines.
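One way to see which of these identifiers fits the RFC 4122 textual form is to
run them through Python's standard uuid module. This is only a format check,
under the assumption that a statement ID is an item ID followed by "-" and a
UUID. The helper name canonical_urn is hypothetical.

```python
import uuid


def canonical_urn(candidate: str):
    """Return the urn:uuid form if candidate parses as a UUID, else None."""
    try:
        return uuid.UUID(candidate).urn
    except ValueError:
        return None


statement_id = "Q20913766-CD281698-E1D0-43A1-BEEA-E2A60E5A88F1"
reference_hash = "39f3ce979f9d84a0ebf09abe1702bf22326695e9"

# Split off the item ID prefix; what remains is the UUID part.
_item_id, _, statement_uuid = statement_id.partition("-")

print(canonical_urn(statement_uuid))  # urn:uuid:cd281698-e1d0-43a1-beea-e2a60e5a88f1
print(canonical_urn(statement_id))    # None: the item prefix is not part of a UUID
print(canonical_urn(reference_hash))  # None: 40 hex digits, a UUID has 32
```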
Thanks for your feedback,
Christopher
Hi all!
For weeks and months now, we have been discussing how to best represent "extra"
information in (or associated with) the wikibase data model. After some more
discussion and a bit of research, I think I have found what we need: The Role
Object Pattern aka Role Class Model, see
<https://en.wikipedia.org/wiki/Role_Class_Model>.
Please have a look at https://phabricator.wikimedia.org/T118860 and let me know
if you have any objections. If not, let's use this sprint to discuss the details
of the implementations, and do a task breakdown.
PS: I came across quite a few famous names during my research. Looks like
we are not the first to have this need...
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi all!
I posted this on twitter and IRC already!
But I made a lovely new dashboard showing all things dispatchy on Wikidata.
https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch
Enjoy!
And if you want anything added let me know!
--
Addshore
On 09.11.2015 at 03:26, S Page wrote:
> I think these other identifiers are all "Wikidata property representing a unique
> identifier" and there are about 350 of them [2] But surprisingly, I couldn't
> find an easy way to look up a Wikidata item using these other identifiers.
We discussed some loose plans for implementing this in Cirrus when Stas was in
Berlin a few weeks ago. On Special:Search, you would ask for
property:P212:978-2-07-027437-6, and that would find the item with that ISBN.
Stas: do we have a ticket for this somewhere? All I can find are the notes in
the etherpad.
> Also, is this a temporary thing? Will Wikidata eventually have items for every
> book published, every musical recording, etc. and become a superset of all those
> unique identifiers?
It's highly unlikely that Wikidata will become a superset of any and all
vocabularies in existence. Better integration of external identifiers is high
on our priority list right now. The first step will, however, be to properly
expose URIs for them, so we are no longer a dead end in the linked data web.
But since we need to work on Cirrus integration anyway, I expect that we will
have search-by-property soonish, too. I certainly hope so.
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.