Re: [Wikidata-l] Provenance tracking on the Web with NIF-URIs

22 Jun 2012

Dear Martynas,
as far as I understand it, Wikidata will not need to worry about named
graphs or the like.
IIRC Wikidata is building fast software for editing facts and generating
infoboxes. You do not need the full expressive power of SPARQL or graph
querying.
That is a different use case and can be handled by exporting the data and
loading it into a triple store/graph database.
I would assume that the most efficient operation is to retrieve all data
for one "entity"/entry/page?
So the database needs to be optimized for lookup/update, not graph
querying.
In another mail you said that:
...
Regarding scalability -- I can only see those possible cases: either
Wikidata will not have any query language, or its query language will
be SQL with never ending JOINs too complicated to be useful, or it's
gonna be another query language translated to SQL -- for example
SPARQL, which is doable but attempts have shown it doesn't scale. A
native RDF store is much more performant.
Do you have a reference for this? I always thought it was exactly the 
opposite, i.e. SPARQL2SQL mappers performing better than native stores.
Cheers,
Sebastian
On 06/22/2012 08:43 PM, Martynas Jusevičius wrote:
...
It says "deprecated" on the Data model wiki.
So maybe Wikidata doesn't need statement-level granularity? Maybe the named
graph approach is good enough? But it's not based on statements.
If you build this kind of data model on a relational database, not to mention
provenance, you will not be able to provide a reasonable query mechanism.
That's the reason why the development of Jena's SDB store is pretty much
abandoned.
Martynas
  On Jun 22, 2012 8:18 PM, "Sebastian Hellmann" <
hellmann@informatik.uni-leipzig.de> wrote:
...
Denny didn't even use the word "deprecated".
Reification for statement-level provenance works, but you won't be able to
sell it as an elegant solution to the problem.
So "could" - yes, "should" - ?? - probably not.
If Wikidata is using statement-level provenance, there might be better
ways to serialize it in RDF than reification in the future,
e.g. N-Quads: http://sw.deri.org/2008/07/n-quads/
or JSON ;)
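To make the comparison concrete, here is a minimal Python sketch that writes the same sourced statement once with RDF reification and once as N-Quads. The example.org namespace, the function names, and the use of the statement id as the graph term are assumptions made for illustration; the accordingTo property simply borrows the name used elsewhere in this thread. Only URI-valued objects are handled.

```python
# Illustrative sketch only: the example.org URIs are hypothetical.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"


def reified(stmt_id, s, p, o, source):
    """Serialize one statement and its source using RDF reification."""
    node = f"<{EX}{stmt_id}>"
    return [
        # Reification does not assert the triple, so it is stated separately.
        f"<{s}> <{p}> <{o}> .",
        f"{node} <{RDF}type> <{RDF}Statement> .",
        f"{node} <{RDF}subject> <{s}> .",
        f"{node} <{RDF}predicate> <{p}> .",
        f"{node} <{RDF}object> <{o}> .",
        f"{node} <{EX}accordingTo> <{source}> .",
    ]


def quads(stmt_id, s, p, o, source):
    """Serialize the same information with the statement id as graph term."""
    graph = f"<{EX}{stmt_id}>"
    return [
        f"<{s}> <{p}> <{o}> {graph} .",
        f"{graph} <{EX}accordingTo> <{source}> .",
    ]


if __name__ == "__main__":
    args = ("Statement1", EX + "Berlin", EX + "locatedIn", EX + "Germany",
            "http://www.worldatlas.com/citypops.htm")
    print("\n".join(reified(*args)))
    print("\n".join(quads(*args)))
```

In this sketch the reified form needs six lines per sourced statement, while the quad form needs two.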
For internal use I would discourage reification.
If using a relational schema, a statement id that can be joined with
another SQL table for provenance is the best way to do it imho.
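A minimal sketch of that relational layout, using Python's built-in sqlite3 module. The table and column names are hypothetical and do not describe the actual Wikidata schema; the Berlin row reuses the example figure and source mentioned elsewhere in this thread.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a statement and a provenance table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE statement (
            id       INTEGER PRIMARY KEY,
            entity   TEXT NOT NULL,
            property TEXT NOT NULL,
            value    TEXT NOT NULL
        );
        CREATE TABLE provenance (
            statement_id INTEGER NOT NULL REFERENCES statement(id),
            source_uri   TEXT NOT NULL
        );
    """)
    conn.execute(
        "INSERT INTO statement VALUES (1, 'Berlin', 'population', '3337000')")
    conn.execute(
        "INSERT INTO provenance VALUES "
        "(1, 'http://www.worldatlas.com/citypops.htm')")
    return conn


def statements_with_sources(conn, entity):
    """Fetch all statements for one entity, joined with their sources."""
    return conn.execute(
        """SELECT s.property, s.value, p.source_uri
           FROM statement AS s
           LEFT JOIN provenance AS p ON p.statement_id = s.id
           WHERE s.entity = ?
           ORDER BY s.id""",
        (entity,),
    ).fetchall()


if __name__ == "__main__":
    for row in statements_with_sources(build_db(), "Berlin"):
        print(row)
```

The per-entity lookup stays a single indexed query, and provenance is only joined in when it is needed.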
Before you drive us all mad with explaining why reification is bad, I
would really like you to justify why Wikidata should consider reification.
I really do not know many use cases (if any) where reification is the right
choice of modelling. Before going into the discussion any further [1], I
think you should name an example where reification is really better than
other options.
All the best,
Sebastian
[1]http://ceur-ws.org/Vol-699/Paper5.pdf
On 06/22/2012 06:20 PM, Martynas Jusevičius wrote:
Denny, the statement-level of granularity you're describing is achieved by
RDF reification. You describe it however as a "deprecated mechanism" of
provenance, without backing it up.
Why do you think there must be a better mechanism? Maybe you should take
another look at reification, or lower your provenance requirements, at
least initially?
Martynas
graphity.org
On Jun 22, 2012 5:20 PM, "Denny Vrandečić" <denny.vrandecic@wikimedia.de>
wrote:
Here's the use case:
Every statement in Wikidata will have a URI. Every statement can have
one or more references.
In many cases, the reference might be text on a website.
Whereas it is always possible (and probably what we will do first) as
well as correct to state:
Statement1 accordingTo SlashDot .
it would be preferable to be a bit more specific on that, and most
preferably it would be to go all the way down to the sentence saying
Statement1 accordingTo X .
with X being a URI denoting the sentence that I mean in a specific
Slashdot-Article.
I would prefer a standard or widely adopted way to do that, and
NIF-URIs seem to be a viable solution for that. We will come back to
this once we start modeling references in more detail.
The reference could be pointing to a book, to a video, to a
Mesopotamian stone tablet, etc. (OK, I admit that the different media
types will be differently prioritized).
I hope this helps,
Cheers,
Denny
2012/6/21 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>:
Hello Denny,
I was traveling for the past few weeks and can finally answer your email.
See my comments inline.
On 05/29/2012 05:25 PM, Denny Vrandečić wrote:
Hello Sebastian,
Just a few questions - as you note, it is easier if we all use the same
standards, and so I want to ask about the relation to other related
standards:

I understand that you dismiss IETF RFC 5147 because it is not stable
enough, right?
The offset scheme of NIF is built on this RFC.
So the following would hold:
@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
ld:offset_717_729  owl:sameAs ld:char=717,12 .
We might change the syntax and reuse the RFC syntax, but it has several
issues:

The optional part is not easy to handle, because you would need to add
owl:sameAs statements:
ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12;length=12 .
ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12 .
ld:char=717,12;UTF-8 owl:sameAs ld:char=717,12;length=9876 .
So theoretically OK, but annoying to implement and check.

When implementing web services, NIF allows the client to choose the prefix:
http://nlp2rdf.lod2.eu/demo/NIFStemmer?input-type=text&nif=true&pref...
returning URIs like http://this.is/a/slash/prefix/offset_10_15
So RFC 5147 would look like:
http://this.is/a/slash/prefix/char=717,12
http://this.is/a/slash/prefix/char=717,12;UTF-8
or
http://this.is/a/slash/prefix?char=717,12
http://this.is/a/slash/prefix?char=717,12;UTF-8

Characters like = and , prevent the use of prefixes:

echo "@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
ld:offset_717_729  owl:sameAs ld:char=717,12 .
" > test.ttl ; rapper -i turtle  test.ttl

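Implementation is a little bit more difficult, given that NIF URIs can be
parsed as simply as this:

// explode() replaces split(), which was removed in PHP 7
$arr = explode("_", "offset_717_729");
switch ($arr[0]) {
    case 'offset':
        $begin = $arr[1];
        $end = $arr[2];
        break;
    case 'hash':
        $clength = $arr[1];  // context length
        $slength = $arr[2];  // string length
        $hash = $arr[3];
        // merge the remaining parts with '_'
        $rest = implode("_", array_slice($arr, 4));
        break;
}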

The RFC assumes a certain MIME type, i.e. plain text. NIF makes a broader
assumption.

what is the relation to the W3C media fragment URIs? Did not find a
pointer there.
They are designed for media such as images and video, not strings.
Potentially, the same principle can be applied, but it is not yet
engineered/researched.

any plans of standardizing your approach?

We will do NIF 2.0 as a community standard and finish it in a couple of
months. It will be published under open licences, so anybody, W3C or ISO,
might pick it up easily. Other than that, there are plans by several EU
projects (see e.g. here
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0101.h...
)
and a US project to use it, and there are already several third-party
implementations. We would rather have it adopted first on a large
scale and then standardized properly, i.e. by the W3C. This worked quite
well for the FOAF project or for RDB2RDF mappers.
Chances for fast standardization are not so low, I would assume.
We would strongly prefer to just use a standard instead of advocating
contenders for one -- if one exists.
You might want to look at: http://www.w3.org/community/openannotation/wiki/TextCommentOnWebPage
and the same highlighting here:
http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annot...
NIF equivalent (4 triples instead of 14 and only one generated UUID):
ld:hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%20Web a
str:String ;
  oa:hasBody [
     oa:annotator <mailto:Bob> ;
     cnt:chars "Hey Tim, good idea that Semantic Web!"
  ] .

So you might not want to think in a "contender" way. The approaches are
complementary.
NIF is simpler and the URIs have some features that might be wanted
(stability, uniqueness, easy to implement).
This is why I was asking for your *use case*.
Note that there are still some problems when annotating the DOM with URIs,
e.g. XPointer is abandoned and was never finished. XPath has its limits
and is also expensive (i.e. SAX parsing is not possible).
I think there is no proper solution as of now.
All the best,
Sebastian
Cheers,
Denny
2012/5/18 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Hello again,
maybe the question I asked was lost, as the text was TL;DR.
I heard that it is planned to track the provenance of facts, e.g. Berlin has
3,337,000 citizens, found here: http://www.worldatlas.com/citypops.htm
Do you have a place where the use case and the requirements for this are
documented? Or is it out of scope?
Will it be coarse-grained, i.e. website level? Or fine-grained, i.e. text
paragraph level? See e.g. how Berlin is highlighted here:
http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.worldatlas.com%2Fcitypops.htm%23hash_4_30_7449e732716c8e68842289bf2e6667d5_Berlin%2C%2520Germany%2520-%25203%2C
in this very early prototype.
Could you give me a link where I can read more about any Wikidata plans
in this direction?
Sebastian
On 05/16/2012 09:10 AM, Sebastian Hellmann wrote:
Dear all,
(Note: I could not find the document where your requirements regarding
the tracking of facts on the web are written, so I am giving a general
introduction to NIF. Please send me a link to the document that specifies
your need for tracing facts on the web, thanks.)
I would like to draw your attention to the URIs used in the NLP
Interchange Format (NIF).
NIF-URIs are quite easy to use, understand and implement. NIF has a
one-triple-per-annotation paradigm. The latest documentation can be
found here: http://svn.aksw.org/papers/2012/WWW_NIF/public/string_ontology.pdf
The basic idea is to use URIs with hash fragment ids to annotate or mark
pages on the web:
An example is the first occurrence of "Semantic Web" on
http://www.w3.org/DesignIssues/LinkedData.html
as highlighted here:
http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.w3.org%2FDesignIssues%2FLinkedData.html%23hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%2520Web
Here is a NIF example for linking a part of the document to the DBpedia
entry of the Semantic Web:
<http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729>
  a str:StringInContext ;
  sso:oen <http://dbpedia.org/resource/Semantic_Web> .
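As a rough illustration of how such an offset-based URI could be produced, here is a minimal Python sketch. It assumes that offsets count characters in the plain text of the document and that the end offset is exclusive (717 to 729 spans the 12 characters of "Semantic Web"); how the text is extracted from the HTML page is not covered. The function names are hypothetical and not part of any NIF library.

```python
def offset_uri(document_uri, text, phrase, start=0):
    """Build an offset-style fragment URI for the first match of phrase.

    Assumes character offsets into the plain text, with an exclusive end.
    """
    begin = text.find(phrase, start)
    if begin == -1:
        raise ValueError(f"{phrase!r} not found in text")
    end = begin + len(phrase)
    return f"{document_uri}#offset_{begin}_{end}"


def annotation(document_uri, text, phrase, entity_uri):
    """Return a Turtle snippet linking the phrase to an entity URI."""
    subject = offset_uri(document_uri, text, phrase)
    return (f"<{subject}>\n"
            f"  a str:StringInContext ;\n"
            f"  sso:oen <{entity_uri}> .")


if __name__ == "__main__":
    # Stand-in text: 717 filler characters, then the phrase of interest.
    page_text = "x" * 717 + "Semantic Web" + " and more text"
    print(annotation("http://www.w3.org/DesignIssues/LinkedData.html",
                     page_text, "Semantic Web",
                     "http://dbpedia.org/resource/Semantic_Web"))
```

The str: and sso: prefixes are used as in the example above and would need the matching @prefix declarations in a complete Turtle file.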
We are currently preparing a new draft for the spec 2.0. The old one can
be found here: http://nlp2rdf.org/nif-1-0/
There are several EU projects that intend to use NIF. Furthermore, it is
easier for everybody if we standardize a Web annotation format together.
Please give feedback on your use cases.
All the best,
Sebastian
--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l

--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as a charitable
organization by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
