Dear all, (Note: I could not find the document where your requirements regarding the tracking of facts on the web are written down, so I am giving a general introduction to NIF instead. Please send me a link to the document that specifies your needs for tracing facts on the web, thanks.)
I would like to draw your attention to the URIs used in the NLP Interchange Format (NIF). NIF URIs are quite easy to use, understand and implement. NIF follows a one-triple-per-annotation paradigm. The latest documentation can be found here: http://svn.aksw.org/papers/2012/WWW_NIF/public/string_ontology.pdf
The basic idea is to use URIs with hash fragment IDs to annotate or mark pages on the web. An example is the first occurrence of "Semantic Web" on http://www.w3.org/DesignIssues/LinkedData.html, as highlighted here: http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annot...
Here is a NIF example linking a part of the document to the DBpedia entry for the Semantic Web:

<http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729>
    a str:StringInContext ;
    sso:oen <http://dbpedia.org/resource/Semantic_Web> .
We are currently preparing a new draft for the spec 2.0. The old one can be found here: http://nlp2rdf.org/nif-1-0/
There are several EU projects that intend to use NIF. Furthermore, it is easier for everybody if we standardize a Web annotation format together. Please give feedback on your use cases. All the best, Sebastian
Hello again, maybe the question I asked was lost, as the text was TL;DR.
I heard that it is planned to track the provenance of facts, e.g. "Berlin has 3,337,000 citizens", found here: http://www.worldatlas.com/citypops.htm Do you have a place where the use case and the requirements for this are documented? Or is it out of scope? Will it be coarse-grained, i.e. website level, or fine-grained, i.e. text paragraph level? See e.g. how Berlin is highlighted here: http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annot... in this very early prototype.
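To make the two granularities concrete, here is a minimal Turtle sketch; the ex:accordingTo property, the statement URI and the character offsets are made-up placeholders, not an agreed vocabulary:

@prefix ex:  <http://example.org/vocab#> .
@prefix wds: <http://example.org/wikidata/statement/> .

# coarse-grained: the whole website as reference
wds:Berlin_population ex:accordingTo <http://www.worldatlas.com/citypops.htm> .

# fine-grained: a NIF offset URI denoting the exact supporting string
wds:Berlin_population ex:accordingTo <http://www.worldatlas.com/citypops.htm#offset_4711_4760> .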
Could you give me a link where I can read more about any Wikidata plans in this direction? Sebastian
Hello Sebastian,
thanks for the pointers, this is indeed a very interesting project that ties in neatly with what we want to achieve in Wikidata. I didn't read your paper in detail, but merely skimmed it, so I hope I have a reasonable understanding of your proposal.
We have not yet written down the use cases, and we are aware that this still needs to happen. I do think that your approach - especially with the context-hash-based URIs - is very promising.
We aim to start with a simple website-level scope for Web references, but personally I would be very happy if we could, already in the first year, move down to the more fine-grained level that your tools support, down to the very sentence that supports a statement in Wikidata. If we get this far, your approach seems like a very strong contender for representing this.
Just a few questions - as you note, it is easier if we all use the same standards, and so I want to ask about the relation to other related standards:
* I understand that you dismiss IETF RFC 5147 because it is not stable enough, right?
* What is the relation to the W3C media fragment URIs? Did not find a pointer there.
* Any plans of standardizing your approach?
We would strongly prefer to just use a standard instead of advocating contenders for one -- if one exists.
Cheers, Denny
Hello Denny, I was traveling for the past few weeks and can finally answer your email. See my comments inline.
On 05/29/2012 05:25 PM, Denny Vrandečić wrote:
Hello Sebastian,
Just a few questions - as you note, it is easier if we all use the same standards, and so I want to ask about the relation to other related standards:
* I understand that you dismiss IETF RFC 5147 because it is not stable enough, right?
The offset scheme of NIF is built on this RFC. So the following would hold:

@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
ld:offset_717_729 owl:sameAs ld:char=717,12 .
We might change our syntax and reuse the RFC syntax, but it has several issues:

1. The optional parts are not easy to handle, because you would need to add owl:sameAs statements:

ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12;length=12 .
ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12 .
ld:char=717,12;UTF-8 owl:sameAs ld:char=717,12;length=9876 .
So it is theoretically OK, but annoying to implement and check.
2. When implementing web services, NIF allows the client to choose the prefix: http://nlp2rdf.lod2.eu/demo/NIFStemmer?input-type=text&nif=true&pref... returning URIs like http://this.is/a/slash/prefix/offset_10_15. The RFC 5147 equivalents would look like:

http://this.is/a/slash/prefix/char=717,12
http://this.is/a/slash/prefix/char=717,12;UTF-8

or

http://this.is/a/slash/prefix?char=717,12
http://this.is/a/slash/prefix?char=717,12;UTF-8
3. Characters like '=' and ',' prevent the use of Turtle prefixes:

echo "@prefix ld: <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
ld:offset_717_729 owl:sameAs ld:char=717,12 . " > test.ttl ; rapper -i turtle test.ttl
4. Implementation is a little bit more difficult, given that:

$arr = split("_", "offset_717_729");
switch ($arr[0]) {
    case 'offset':  // offset-based URI: offset_<begin>_<end>
        $begin = $arr[1];
        $end   = $arr[2];
        break;
    case 'hash':    // hash-based URI: hash_<context-length>_<string-length>_<digest>_<string>
        $clength = $arr[1];
        $slength = $arr[2];
        $hash    = $arr[3];
        $rest    = /* merge the remaining parts with '_' */;
        break;
}
5. The RFC assumes a certain MIME type, i.e. plain text. NIF makes a broader assumption.
* What is the relation to the W3C media fragment URIs? Did not find a pointer there.
They are designed for media such as images and video, not strings. Potentially the same principle can be applied, but it has not yet been engineered/researched.
* Any plans of standardizing your approach?
We will do NIF 2.0 as a community standard and finish it in a couple of months. It will be published under open licences, so anybody (W3C or ISO) might pick it up easily. Other than that, there are plans by several EU projects (see e.g. here: http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0101.h...) and a US project to use it, and there are several third-party implementations already. We would rather have it adopted on a large scale first and then standardized properly, i.e. at the W3C. This worked quite well for the FOAF project and for RDB2RDF mappers. Chances for fast standardization are not so unlikely, I would assume.
We would strongly prefer to just use a standard instead of advocating contenders for one -- if one exists.
You might want to look at: http://www.w3.org/community/openannotation/wiki/TextCommentOnWebPage and the same highlighting here: http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annot...
NIF equivalent (4 triples instead of 14 and only one generated UUID):

ld:hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%20Web
    a str:String ;
    oa:hasBody [
        oa:annotator <mailto:Bob> ;
        cnt:chars "Hey Tim, good idea that Semantic Web!"
    ] .
So you might not want to think in a "contender" way; the approaches are complementary. NIF is simpler, and the URIs have some features that might be wanted (stability, uniqueness, ease of implementation). This is why I was asking for your *use case*.
Note that there are still some problems when annotating the DOM with URIs: e.g. XPointer is abandoned and was never finished, and XPath has its limits and is also expensive (i.e. SAX processing is not possible). I think there is no proper solution as of now. All the best, Sebastian
Sorry to jump in (without really understanding the context), but you guys saw this today, right? http://www.w3.org/TR/2012/WD-prov-aq-20120619/
Barry
Hi Barry,
On 06/21/2012 08:51 PM, Barry Norton wrote:
Sorry to jump in (without really understanding the context), but you guys saw this today, right? http://www.w3.org/TR/2012/WD-prov-aq-20120619/
It seems to be very unrelated. That is only resource-level, right? "Fundamentally, provenance information is about resources" (http://www.w3.org/TR/2012/WD-prov-aq-20120619/#dfn-provenance-information). So you would need a subject first. How do you say that the fact you just added to Wikidata comes from a specific fragment of a resource, i.e. http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729, the first occurrence of "Semantic Web"?
Do you suggest that NIF URIs might be standardized by inclusion in PROV-AQ? Might work. It could be compatible.
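To make the compatibility idea concrete, a minimal Turtle sketch (assuming PROV-O's prov:wasDerivedFrom and a made-up statement URI; whether PROV-AQ sanctions fragment URIs in this role is exactly the open question):

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/vocab#> .

# the Wikidata fact, derived from the exact string in the source document
ex:statement1 prov:wasDerivedFrom <http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729> .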
Sebastian
As I admitted, I wasn't really following your discussion, but indeed, if you're giving URIs to these fragments...
Barry
Here's the use case:
Every statement in Wikidata will have a URI. Every statement can have one or more references. In many cases, the reference might be text on a website.
Whereas it is always possible (and probably what we will do first) as well as correct to state:
Statement1 accordingTo SlashDot .
it would be preferable to be a bit more specific, and most preferable would be to go all the way down to the sentence, saying
Statement1 accordingTo X .
with X being a URI denoting the sentence that I mean in a specific Slashdot article.
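For illustration, a minimal Turtle sketch of this pattern; the accordingTo property, the Slashdot URL and the offsets are invented placeholders:

@prefix ex: <http://example.org/vocab#> .

# X is a NIF offset URI denoting the supporting sentence in the article
ex:Statement1 ex:accordingTo <http://slashdot.org/story/example-article#offset_512_578> .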
I would prefer a standard or widely adopted way to do that, and NIF URIs seem to be a viable solution. We will come back to this once we start modeling references in more detail.
The reference could be pointing to a book, to a video, to a Mesopotamian stone tablet, etc. (OK, I admit that the different media types will be prioritized differently.)
I hope this helps, Cheers, Denny
2012/6/21 Sebastian Hellmann hellmann@informatik.uni-leipzig.de:
Hello Denny, I was traveling for the past few weeks and can finally answer your email. See my comments inline.
On 05/29/2012 05:25 PM, Denny VrandeÄ ić wrote:
Hello Sebastian,
Just a few questions - as you note, it is easier if we all use the same standards, and so I want to ask about the relation to other related standards:
- I understand that you dismiss IETF RFC 5147 because it is not stable
enough, right?
The offset scheme of NIF is built on this RFC. So the following would hold: @prefix ld: http://www.w3.org/DesignIssues/LinkedData.html# . @prefix owl: http://www.w3.org/2002/07/owl# . ld:offset_717_729 owl:sameAs ld:char=717,12 .
We might change the syntax and reuse the RFC syntax, but it has several issues: 1. The optional part is not easy to handle, because you would need to add owl:sameAs statements:
ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12;length=12 . ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12 . ld:char=717,12;UTF-8 owl:sameAs ld:char=717,12;length=9876 .
So theoretically ok, but annoying to implement and check.
- When implementing web services, NIF allows the client to choose the
prefix: http://nlp2rdf.lod2.eu/demo/NIFStemmer?input-type=text&nif=true&pref.... returning URIs like http://this.is/a/slash/prefix/offset_10_15 So RFC 5147 would look like: http://this.is/a/slash/prefix/char=717,12 http://this.is/a/slash/prefix/char=717,12;UTF-8 or http://this.is/a/slash/prefix?char=717,12 http://this.is/a/slash/prefix?char=717,12;UTF-8
- Character like = , prevent the use of prefixes:
echo "@prefix ld: http://www.w3.org/DesignIssues/LinkedData.html# . @prefix owl: http://www.w3.org/2002/07/owl# . ld:offset_717_729 owl:sameAs ld:char=717,12 . " > test.ttl ; rapper -i turtle test.ttl
- implementation is a little bit more difficult, given that :
$arr = split("_", "offset_717_729") ; switch ($arr[0]){ case 'offset' : $begin = $arr[1]; $end = $arr[2]; break; case 'hash' : $clength = $arr[1]; $slength = $arr[2]; $hash = $arr[3]; $rest = /*merge remaining with '_' */ break; }
- RFC assumes a certain mime type, i.e. plain text. NIF does have a broader
assumption.
- what is the relation to the W3C media fragment URIs? Did not find a
pointer there.
They are designed for media such as images, video, not strings. Potentially, the same principle can be applied, but it is not yet engineered/researched.
- any plans of standardizing your approach?
We will do NIF 2.0 as a community standard and finish it in a couple of months. It will be published under open licences, so anybody W3C or ISO might pick it up, easily. Other than that there are plans by several EU projects (see e.g. here http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0101.h...) and a US project to use it and there are several third party implementations, already. We would rather have it adopted first on a large scale and then standardized, properly, i.e. W3C. This worked quite well for the FOAF project or for RDB2RDF Mappers. Chances for fast standardization are not so unlikely, I would assume.
We would strongly prefer to just use a standard instead of advocating contenders for one -- if one exists.
You might want to look at: http://www.w3.org/community/openannotation/wiki/TextCommentOnWebPage and the same highlighting here: http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annot...
NIF equivalent (4 triples instad of 14 and only one generated uuid): ld:hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%20Web a str:String ; oa:hasBody [ oa:annotator mailto:Bob ; cnt:chars "Hey Tim, good idea that Semantic Web!" . ]
So you might not think in a "contender" way. Approaches are complementary. NIF is simpler and the URIs have some features that might be wanted (stability, uniqueness, easy to implement). This is why I was asking for your *use case* .
Note that: there are still some problems, when annotating DOM with URIs, e.g. xPointer is abandoned and was never finished. Xpath has its limits and is also expensive (i.e. SAX not possible). I think there is no proper solution as of now. All the best, Sebastian
Cheers, Denny
2012/5/18 Sebastian Hellmann hellmann@informatik.uni-leipzig.de
Hello again, maybe the question, I asked was lost, as the text was TL;DR
I heard that, it is planned to track provenance of facts. e.g. Berlin has 3,337,000 citizens found here: http://www.worldatlas.com/**citypops.htmhttp://www.worldatlas.com/citypops.htm Do you have a place where the use case and the requirements are documented for this? Or is it out of scope? Will it be course grained, i.e. website level ? Or fine grained, i.e. text paragraph level? See e.g. how Berlin is highlighted here: http://pcai042.informatik.uni-**leipzig.de/~swp12-9/** vorprojekt/index.php?**annotation_request=http%3A%2F%** 2Fwww.worldatlas.com%**2Fcitypops.htm%23hash_4_30_** 7449e732716c8e68842289bf2e6667**d5_Berlin%2C%2520Germany%2520-**%25203%2Chttp://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.worldatlas.com%2Fcitypops.htm%23hash_4_30_7449e732716c8e68842289bf2e6667d5_Berlin%2C%2520Germany%2520-%25203%2C in this very early prototype.
Could you give me a link were I can read more about any Wikidata plans towards this direction? Sebastian
On 05/16/2012 09:10 AM, Sebastian Hellmann wrote:
Dear all, (Note: I could not find the document, where your requirements regarding the tracking of facts on the web are written, so I am giving a general introduction to NIF. Please send me a link to the document that specifies your need for tracing facts on the web, thanks)
I would like to point your attention to the URIs used in the NLP Interchange Format (NIF). NIF-URIs are quite easy to use, understand and implement. NIF has a one-triple-per-annotation paradigm. The latest documentation can be found here: http://svn.aksw.org/papers/**2012/WWW_NIF/public/string_**ontology.pdfhttp://svn.aksw.org/papers/2012/WWW_NIF/public/string_ontology.pdf
The basic idea is to use URIs with hash fragment ids to annotate or mark pages on the web: An example is the first occurrence of "Semantic Web" on http://www.w3.org/**DesignIssues/LinkedData.htmlhttp://www.w3.org/DesignIssues/LinkedData.html as highlighted here: http://pcai042.informatik.uni-**leipzig.de/~swp12-9/** vorprojekt/index.php?**annotation_request=http%3A%2F%** 2Fwww.w3.org%2FDesignIssues%**2FLinkedData.html%23hash_10_**12_** 60f02d3b96c55e137e13494cf9a02d**06_Semantic%2520Webhttp://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.w3.org%2FDesignIssues%2FLinkedData.html%23hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%2520Web
Here is a NIF example for linking a part of the document to the DBpedia entry of the Semantic Web: <http://www.w3.org/**DesignIssues/LinkedData.html#**offset_717_729http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729
a str:StringInContext ; sso:oen
<http://dbpedia.org/resource/**Semantic_Webhttp://dbpedia.org/resource/Semantic_Web> .
We are currently preparing a new draft for the spec 2.0. The old one can be found here: http://nlp2rdf.org/nif-1-0/
There are several EU projects that intend to use NIF. Furthermore, it is easier for everybody, if we standardize a Web annotation format together. Please give feedback of your use cases. All the best, Sebastian
-- Dipl. Inf. Sebastian Hellmann Department of Computer Science, University of Leipzig Projects: http://nlp2rdf.org , http://dbpedia.org Homepage: http://bis.informatik.uni-**leipzig.de/SebastianHellmannhttp://bis.informatik.uni-leipzig.de/SebastianHellmann Research Group: http://aksw.org
______________________________**_________________ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/**mailman/listinfo/wikidata-lhttps://lists.wikimedia.org/mailman/listinfo/wikidata-l
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
-- Dipl. Inf. Sebastian Hellmann Department of Computer Science, University of Leipzig Projects: http://nlp2rdf.org , http://dbpedia.org Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann Research Group: http://aksw.org
Denny, the statement level of granularity you're describing is achieved by RDF reification. However, you describe it as a "deprecated mechanism" for provenance, without backing that up.
Why do you think there must be a better mechanism? Maybe you should take another look at reification, or lower your provenance requirements, at least initially?
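For reference, a minimal sketch of classic RDF reification applied to the Berlin example from earlier in this thread; only the rdf: terms are standard vocabulary, the ex: names are made up:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix ex:  <http://example.org/vocab#> .

# one resource per statement, so the source can be attached to it
ex:stmt1 a rdf:Statement ;
    rdf:subject   dbr:Berlin ;
    rdf:predicate ex:population ;
    rdf:object    "3337000"^^xsd:integer ;
    ex:source     <http://www.worldatlas.com/citypops.htm> .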
Martynas graphity.org
Here's the use case:
Every statement in Wikidata will have a URI. Every statement can have one more references. In many cases, the reference might be text on a website.
Whereas it is always possible (and probably what we will do first) as well as correct to state:
Statement1 accordingTo SlashDot .
it would be preferable to be a bit more specific on that, and most preferably it would be to go all the way down to the sentence saying
Statement1 accordingTo X .
with X being a URI denoting the sentence that I mean in a specific Slashdot-Article.
I would prefer a standard or widely adopted way to how to do that, and NIF-URIs seem to be a viable solution for that. We will come back to this once we start modeling references in more detail.
The reference could be pointing to a book, to a video, to a mesopotamic stone table, etc. (OK, I admit that the different media types will be differently prioritized).
I hope this helps, Cheers, Denny
2012/6/21 Sebastian Hellmann hellmann@informatik.uni-leipzig.de:
Hello Denny, I was traveling for the past few weeks and can finally answer your email. See my comments inline.
On 05/29/2012 05:25 PM, Denny VrandeÄ ić wrote:
Hello Sebastian,
Just a few questions - as you note, it is easier if we all use the same standards, and so I want to ask about the relation to other related standards:
- I understand that you dismiss IETF RFC 5147 because it is not stable
enough, right?
The offset scheme of NIF is built on this RFC. So the following would hold: @prefix ld: http://www.w3.org/DesignIssues/LinkedData.html# . @prefix owl: http://www.w3.org/2002/07/owl# . ld:offset_717_729 owl:sameAs ld:char=717,12 .
We might change the syntax and reuse the RFC syntax, but it has several issues:
- The optional part is not easy to handle, because you would need to
add
owl:sameAs statements:
ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12;length=12 . ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12 . ld:char=717,12;UTF-8 owl:sameAs ld:char=717,12;length=9876 .
So theoretically ok, but annoying to implement and check.
- When implementing web services, NIF allows the client to choose the
prefix:
http://nlp2rdf.lod2.eu/demo/NIFStemmer?input-type=text&nif=true&pref... .
returning URIs like http://this.is/a/slash/prefix/offset_10_15 So RFC 5147 would look like: http://this.is/a/slash/prefix/char=717,12 http://this.is/a/slash/prefix/char=717,12;UTF-8 or http://this.is/a/slash/prefix?char=717,12 http://this.is/a/slash/prefix?char=717,12;UTF-8
- Character like = , prevent the use of prefixes:
echo "@prefix ld: http://www.w3.org/DesignIssues/LinkedData.html# . @prefix owl: http://www.w3.org/2002/07/owl# . ld:offset_717_729 owl:sameAs ld:char=717,12 . " > test.ttl ; rapper -i turtle test.ttl
- implementation is a little bit more difficult, given that :
$arr = split("_", "offset_717_729") ; switch ($arr[0]){ case 'offset' : $begin = $arr[1]; $end = $arr[2]; break; case 'hash' : $clength = $arr[1]; $slength = $arr[2]; $hash = $arr[3]; $rest = /*merge remaining with '_' */ break; }
- RFC assumes a certain mime type, i.e. plain text. NIF does have a
broader
assumption.
- What is the relation to the W3C Media Fragment URIs? Did not find a pointer there.
They are designed for media such as images and video, not strings. Potentially the same principle could be applied, but it has not yet been engineered/researched.
- any plans of standardizing your approach?
We will do NIF 2.0 as a community standard and finish it in a couple of months. It will be published under open licences, so anybody, W3C or ISO, might pick it up easily. Other than that, there are plans by several EU projects (see e.g. http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0101.h... ) and a US project to use it, and there are several third-party implementations already. We would rather have it adopted first on a large scale and then standardized properly, i.e. at the W3C. This worked quite well for the FOAF project and for RDB2RDF mappers. Chances for fast standardization are not so unlikely, I would assume.
We would strongly prefer to just use a standard instead of advocating contenders for one -- if one exists.
You might want to look at: http://www.w3.org/community/openannotation/wiki/TextCommentOnWebPage and the same highlighting here:
http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annot...
NIF equivalent (4 triples instead of 14 and only one generated UUID):

ld:hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%20Web
    a str:String ;
    oa:hasBody [
        oa:annotator <mailto:Bob> ;
        cnt:chars "Hey Tim, good idea that Semantic Web!"
    ] .
So you might not want to think of it in a "contender" way; the approaches are complementary. NIF is simpler, and the URIs have some features that might be wanted (stability, uniqueness, ease of implementation). This is why I was asking for your *use case*.
Note that there are still some problems when annotating the DOM with URIs, e.g. XPointer is abandoned and was never finished. XPath has its limits and is also expensive (i.e. SAX-style streaming is not possible). I think there is no proper solution as of now. All the best, Sebastian
Cheers, Denny
Denny didn't even use the word "deprecated". Reification for statement-level provenance works, but you won't be able to sell it as an elegant solution to the problem. So "could" - yes; "should" - probably not.
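For readers following along, this is roughly what statement-level provenance via standard RDF reification looks like in Turtle; the accordingTo property and the ex: namespace are only illustrative, while the Berlin population figure and source come from earlier in this thread:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .

# four reification triples just to be able to say one thing about the statement
ex:Statement1 a rdf:Statement ;
    rdf:subject   <http://dbpedia.org/resource/Berlin> ;
    rdf:predicate <http://dbpedia.org/ontology/populationTotal> ;
    rdf:object    "3337000" ;
    ex:accordingTo <http://www.worldatlas.com/citypops.htm> .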
If Wikidata is using statement-level provenance, there might be better ways to serialize it in RDF than reification in the future, e.g. N-Quads: http://sw.deri.org/2008/07/n-quads/ or JSON ;)
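A sketch of what that could look like in N-Quads, where the fourth element names the statement so that provenance can attach to it; the statement, property, and graph URIs are invented, and only the Berlin example is taken from this thread:

<http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/populationTotal> "3337000" <http://wikidata.example.org/statement/Statement1> .
<http://wikidata.example.org/statement/Statement1> <http://wikidata.example.org/ontology/accordingTo> <http://www.worldatlas.com/citypops.htm> <http://wikidata.example.org/provenance> .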
For internal use I would discourage reification. If using a relational scheme, a statement id that can be joined with another SQL table for provenance is the best way to do it, imho.
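As a sketch of that relational scheme (all table and column names are invented, not Wikidata's actual schema):

CREATE TABLE statement (
    id       INTEGER PRIMARY KEY,
    subject  VARCHAR(255) NOT NULL,   -- e.g. 'Berlin'
    property VARCHAR(255) NOT NULL,   -- e.g. 'population'
    value    VARCHAR(255) NOT NULL    -- e.g. '3337000'
);

CREATE TABLE provenance (
    statement_id INTEGER NOT NULL REFERENCES statement(id),
    source_url   TEXT NOT NULL        -- e.g. 'http://www.worldatlas.com/citypops.htm'
);

-- all sources for one statement: a single join, no reification triples
SELECT s.subject, s.property, s.value, p.source_url
FROM statement s
JOIN provenance p ON p.statement_id = s.id
WHERE s.id = 1;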
Before you drive us all mad explaining why reification is bad, I would really like you to justify why Wikidata should consider reification. I really do not know many use cases (if any) where reification is the right modelling choice. Before going into the discussion any further [1], I think you should name an example where reification is really better than the other options.
All the best, Sebastian
[1]http://ceur-ws.org/Vol-699/Paper5.pdf
It says "deprecated" on the Data model wiki.
So maybe Wikidata doesn't need statement-level granularity? Maybe the named graph approach is good enough? But it's not based on statements.
If you build this kind of data model, not to mention the provenance, on a relational store, you will not be able to provide a reasonable query mechanism. That's the reason the development of Jena's SDB store is pretty much abandoned.
Martynas
Dear Martynas, as far as I understand it, Wikidata will not need to worry about named graphs or the like. IIRC, Wikidata is building fast software to edit facts and generate infoboxes. You do not need the full expressive power of SPARQL or graph querying for that. That is a different use case, and it can be served by exporting the data and loading it into a triple store/graph database. I would assume that the operation that has to be most efficient is retrieving all data for one "entity"/entry/page? So the database needs to be optimized for lookup/update, not graph querying.
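To illustrate the point with the hypothetical tables sketched earlier in this thread, that hot path is a single indexed lookup rather than a graph traversal:

-- all facts for one entity/page (statement table as sketched above)
SELECT property, value FROM statement WHERE subject = 'Berlin';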
In another mail you said that:
Regarding scalability -- I can only see these possible cases: either Wikidata will not have any query language, or its query language will be SQL with never-ending JOINs too complicated to be useful, or it is going to be another query language translated to SQL -- for example SPARQL, which is doable, but attempts have shown it doesn't scale. A native RDF store is much more performant.
Do you have a reference for this? I always thought it was exactly the opposite, i.e. SPARQL2SQL mappers performing better than native stores.
Cheers, Sebastian
"You do not need the full expressive power of SPARQL or graph querying" -- what kind of query mechanism is Wikidata planning to support in later stages? I don't suppose the data model will be redesigned for that? So in that case you have to have queries in mind from the start of its design.
Regarding scalability again:
"Long-term though it seems likely that native triplestores will have the advantage for performance. A difficulty with implementing triplestores over SQL is that although triples may thus be stored, implementing efficient querying of a graph-based RDF model (i.e. mapping from SPARQL) onto SQL queries is difficult." http://en.wikipedia.org/wiki/Triplestore#Implementation
"The above results indicate a superior performance of native stores like Sesame native, Mulgara and Virtuoso. This is in coherence with the current emphasis on development of native stores since their performance can be optimized for RDF." http://www.bioontology.org/wiki/images/6/6a/Triple_Stores.pdf
On 22 Jun 2012, at 17:20, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
Here's the use case:
Every statement in Wikidata will have a URI. Every statement can have one or more references. In many cases, the reference might be text on a website.
As an aside, a growing number of such pages may come with some basic machine-readable data. For example, IMDB actor pages may expose basic background facts, or e-govt sites may publish geo/demographic/etc. data.
I hope schema.org (augmented with Wikidata-derived vocabulary) will help encourage this. I'm not sure Wikidata's provenance machinery needs to worry about such things, although the lurking problem of cycles in the provenance/source graph may eventually be an issue here. For example, if some BBC music site is built from, say, MusicBrainz + Wikipedia data, should its embedded RDFa expose this sourcing, so that someone citing it in support of a Wikidata factoid can be made aware of the circularity?
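A toy Turtle sketch of such a cycle, using the W3C PROV vocabulary (a working draft at the time) and invented page URIs; no real site publishes these triples:

@prefix prov: <http://www.w3.org/ns/prov#> .

<http://bbc.example/music/artist-x>     prov:wasDerivedFrom <http://en.wikipedia.org/wiki/Artist_X> .
<http://wikidata.example/Statement42>   prov:wasDerivedFrom <http://bbc.example/music/artist-x> .
# if the Wikipedia article in turn drew on Wikidata, the chain closes into a cycle
<http://en.wikipedia.org/wiki/Artist_X> prov:wasDerivedFrom <http://wikidata.example/Statement42> .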
Dan