Hi all,
We are now offering regular RDF dumps for the content of Wikidata:
http://tools.wmflabs.org/wikidata-exports/rdf/
RDF is the Resource Description Framework of the W3C that can be used to exchange data on the Web. The Wikidata RDF exports consist of several files that contain different parts and views of the data, and which can be used independently. Details on the available exports and the RDF encoding used in each can be found in the paper "Introducing Wikidata to the Linked Data Web" [1].
The available RDF exports can be found in the directory http://tools.wmflabs.org/wikidata-exports/rdf/exports/. New exports are generated regularly from current data dumps of Wikidata and will appear in this directory shortly afterwards.
All dump files have been generated using Wikidata Toolkit [2]. There are some important differences in comparison to earlier dumps:
* Data is split into several dump files for convenience. Pick whatever you are most interested in.
* All dumps are generated using the OpenRDF library for Java (better quality than ad hoc serialization; much slower too ;-)
* All dumps are in N3 format, the simplest RDF serialization format that there is.
* In addition to the faithful dumps, some simplified dumps are also available (one statement = one triple; no qualifiers and references).
* Links to external data sets are added to the data for Wikidata properties that point to datasets with RDF exports. That's the "Linked" in "Linked Open Data".
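To make the "one statement = one triple" simplification concrete, here is a minimal sketch. The identifiers and data layout are made up for illustration; they are not the vocabulary the dumps actually use (see the paper [1] for the real encoding).

```python
# A Wikidata-style statement: a main value plus optional qualifiers.
# All identifiers below are illustrative, not the dumps' real vocabulary.
statement = {
    "subject": "wd:Q9960",      # Ronald Reagan
    "property": "wd:P39",       # position held
    "value": "wd:Q11696",       # President of the United States
    "qualifiers": {"q:start": "1981-01-20", "q:end": "1989-01-20"},
}

def simplified_triples(stmt):
    """Simplified dump style: one statement = one triple; qualifiers are dropped."""
    return [(stmt["subject"], stmt["property"], stmt["value"])]

def faithful_triples(stmt, stmt_node):
    """Faithful dump style: the statement becomes a node of its own,
    so qualifiers (and references) can be attached to it."""
    triples = [
        (stmt["subject"], stmt["property"], stmt_node),
        (stmt_node, "rdf:value", stmt["value"]),
    ]
    for qprop, qvalue in stmt["qualifiers"].items():
        triples.append((stmt_node, qprop, qvalue))
    return triples

print(len(simplified_triples(statement)))                # 1 triple
print(len(faithful_triples(statement, "wds:Q9960-S1")))  # 4 triples
```

The faithful form costs several triples per statement, which is why the simplified files are so much smaller and easier to load.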
Suggestions for improvements and contributions on github are welcome.
Cheers,
Markus
[1] http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web
[2] https://www.mediawiki.org/wiki/Wikidata_Toolkit
Hoi, It is stated that no qualifiers are included. In one of your articles you write that it is to be understood that the validity of the information depends on the existing qualifiers.
What is the value of these RDF exports with the qualifiers missing? Thanks, GerardM
-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
On 10/06/14 22:50, Gerard Meijssen wrote:
Hoi, It is stated that no qualifiers are included. In one of your articles you write that it is to be understood that the validity of the information depends on the existing qualifiers.
What is the value of these RDF exports with the qualifiers missing?
Our normal exports include all the qualifiers and references.
Our simplified exports include only those statements that don't have qualifiers. You are right that leaving out the qualifiers themselves would lead to wrong information.
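The selection rule described above can be sketched as a simple filter. The record layout here is illustrative, not Wikidata Toolkit's actual API; the property and item IDs are real Wikidata identifiers.

```python
# Illustrative statement records for one item (Ronald Reagan, Q9960);
# the property/item IDs are real Wikidata IDs, the layout is made up.
statements = [
    {"prop": "P106", "value": "Q33999", "qualifiers": {}},  # occupation: actor
    {"prop": "P39",  "value": "Q11696",                     # position held: US president
     "qualifiers": {"P580": "1981-01-20", "P582": "1989-01-20"}},
]

# Simplified export: keep only statements without qualifiers, so no
# exported triple silently loses the context that qualified it.
simplified = [s for s in statements if not s["qualifiers"]]

print([s["prop"] for s in simplified])  # ['P106'] — the qualified P39 statement is left out
```

This also shows the trade-off Gerard objects to: the qualified "position held" statement is dropped entirely rather than exported without its dates.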
Cheers,
Markus
Hoi, When you leave out qualifiers, you will find that Ronald Reagan was never president of the United States and was only an actor. Yes, omitting the statements that have qualifiers is wrong, but as a consequence the information as a whole is wrong as well.
I do not see the point of this functionality. It is wrong any way I look at it. Without qualifiers the information is wrong. Without statements the information is wrong, and without the items involved the information is incomplete and wrong.
As I see it, you cannot win. This type of RDF export produces something that, as far as I can see, serves no purpose.
Thanks, GerardM
Hi Gerard,
Surely, Wikidata will never be complete. There will always be some statements missing. If we followed your reasoning, the data would therefore never be of any use. I think this is a bit drastic.
Anyway, why argue? If you don't like the simplified exports, just use the full ones. We clearly say that "simplified" is not "faithful", and we have detailed documentation about what is in each of the files. So it does not seem likely that people will be confused.
Best regards,
Markus
Hoi, There is a huge difference between being complete and leaving out essential information. When you consider Ronald Reagan [1], it is essential information that he was a president of the USA and a governor of California. When you only make him an actor and a politician, the information you are left with gives the impression he is more relevant as an actor.
You brought attention to new functionality that is essentially broken. It does not give a fair impression of the Wikidata content. I have been arguing against overly referring to academic tools and standards. For me this announcement is yet another pointer that many of the tools are overrated and only have "academic" relevance. Thanks, GerardM
[1] http://tools.wmflabs.org/reasonator/?&q=9960
Never forget that even the full data, with all the qualifiers included, is in most cases little more information than what is contained in the lead paragraph of a complete Wikipedia article.
Wikidata will be useful but it will never replace the encyclopedia articles and will, I believe, be most useful as a tool for finding those articles.
If current tools can only manage the basic data dump then that can be a starting point. Better tools will come, because students have to do something to get those PhDs, and this is going to be the data dump they can work on without having to wait years to get permission. :)
Joe
Hoi, Joe, plain vanilla Wikidata is not informative. It provides statements in no particular order, and it does so in a way where you have to scroll a lot to see it all. It takes tools like Reasonator to organise the data so that it becomes informative. With a little code it is possible to provide some narrative about people. This works for English in Reasonator; a good example is JS Bach [1]. This script works for any human: add info and you may get a more informative text.
For most humans, most Wikipedias do not have an article. As these people are considered notable and as there is information available, that information can be served. This is how we "share in the sum of available knowledge". When you compare a Wikidata item with a Wikipedia article, you will find that for most items there is no article, and consequently Wikidata has the edge in its ability to inform. Even on the English Wikipedia there are many people who do not have an article.
I cannot wait for all those students to get cracking and deliver something that is useful.
Markus, you indicate that you do not understand my arguments. You try to refute my argument by referring to WDQ (Wikidata Query). Indeed, initially it did not support qualifiers; however, the intention was for the tool to do this eventually, and it always included all statements in a result. By comparison, the simple RDF export does not export statements with qualifiers, and there is no intention to change this in the future.
When you indicate that you disagree, I assume that you understand my arguments. However, you indicate that you do not understand them, so it is appropriate to try to clarify them.
[1] http://tools.wmflabs.org/reasonator/?q=Q1339&lang=en
Gerard,
You sometimes sound as if everything is lost just because somebody put an RDF file on the Web ;-)
If you don't like the simplified export, why don't you just use our main export which contains all the data? Can't we all be happy -- the people who want simple and the people who want complete?
Cheers,
Markus
Hoi, I do not mind RDF. I do not mind OWL. What I do mind is the assumption that everyone knows what they mean and appreciates them as being "good". When people use an RDF tool that produces obviously incomplete and therefore incorrect information, it is beyond me why an implementation of that tool is developed at all. It is plain stupid.
When you argue that it is fine for people to be stupid, I totally agree. They may be, but let them be stupid with something other than Wikidata. Because everything they do is GIGO: garbage in, garbage out. They will make the most weird and wonderful pronouncements, and it will all be wrong because it is based on incomplete and incorrect data. It will not always be obvious who is stupid. The worst part is that they do not need to be stupid; it is easily prevented: just do not provide them with the tools that will only serve us wrong.
That is the kind of RDF files that I object to. Thanks, GerardM
Hi Gerard,
As I said, I don't follow your arguments. Wikidata Query, for example, also started without any qualifiers at all, and yet it was a useful tool from the beginning.
Your feedback is always welcome, but there is a point when critique is no longer constructive, and when it is best to "agree to disagree". I think we have reached that point.
Markus
I think it is a reasonable ambition that the 'preferred' statement should always provide accurate information even when the qualifiers are missing.
For example, if we have population figures for various years and 'applies to part' figures for males, females, under 20's etc. then the most recent 'total' population figure should be the preferred value. Even without qualifiers this is a useful answer.
Similarly for Ronald Reagan the fact that he held the office of President is useful even if you don't give the start/end dates or the fact that it was 'of' the USA.
Wherever a statement would be misleading if you leave out the qualifiers, I think that is an indication that we need to have another look at the syntax and see how it can be fixed to comply with this principle.
We should do an RFC on Wikidata to make this policy and then amend the help pages to highlight this.
If this is accepted as Wikidata policy, then I suggest that a simple data dump should include all statements, even if qualifiers are not included, but that, where a statement has multiple values, some of which are 'preferred', only the preferred values should be included.
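A rough sketch of this export rule, using Wikidata-style statement ranks; the record layout is illustrative, not Wikidata's actual data model:

```python
# Illustrative population statements with Wikidata-style ranks;
# the real JSON data model (and qualifier handling) differs.
population = [
    {"value": 8000000, "rank": "normal",    "time": "2010"},
    {"value": 8400000, "rank": "preferred", "time": "2014"},
]

def export_values(statements):
    """If any value is marked 'preferred', export only those;
    otherwise fall back to all non-deprecated values."""
    preferred = [s for s in statements if s["rank"] == "preferred"]
    return preferred or [s for s in statements if s["rank"] != "deprecated"]

print([s["value"] for s in export_values(population)])  # [8400000]
```

With this rule, a simple dump would still emit a triple for every statement that has a preferred or sole value, rather than silently dropping everything that carries qualifiers.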
Yes?
Joe
Hoi, Not really. What is being discussed is a tool that is external to Wikidata. Thanks, GerardM
On 13 June 2014 12:37, Joe Filceolaire filceolaire@gmail.com wrote:
I think it is a reasonable ambition that the 'preferred' statement should always provide accurate information even when the qualifiers are missing.
For example, if we have population figures for various years and 'applies to part' figures for males, females, under-20s, etc., then the most recent 'total' population figure should be the preferred value. Even without qualifiers this is a useful answer.
Similarly for Ronald Reagan the fact that he held the office of President is useful even if you don't give the start/end dates or the fact that it was 'of' the USA.
Wherever a statement would be misleading if you leave out the qualifiers, I think that is an indication that we need to have another look at the syntax and see how it can be fixed to comply with this principle.
We should do an RFC on Wikidata to make this policy and then amend the help pages to highlight this.
If this is accepted as Wikidata policy, then I suggest that a simple data dump should include all statements, even if qualifiers are not included; but where a statement has multiple values, some of which are 'preferred', only the preferred values should be included.
Yes?
Joe
On Fri, Jun 13, 2014 at 10:08 AM, Gerard Meijssen < gerard.meijssen@gmail.com> wrote:
Hoi, When you leave out qualifiers, you will find that Ronald Reagan was never president of the United States and was only an actor. Yes, omitting the statements with qualifiers is wrong, but as a consequence the information as a whole is wrong as well.
I do not see the point of this functionality. It is wrong any way I look at it. Without qualifiers information is wrong. Without statements information is wrong and without the items involved the information is incomplete and wrong.
As I see it you cannot win. I fail to see what purpose this type of RDF export serves.
Thanks, GerardM
On 11 June 2014 12:03, Markus Krötzsch markus@semantic-mediawiki.org wrote:
On 10/06/14 22:50, Gerard Meijssen wrote:
Hoi, It is stated that there are no qualifiers included. In one of the articles you write that it is to be understood that the validity of the information is dependent on the existing qualifiers.
What is the value of these RDF exports with the qualifiers missing?
Our normal exports include all the qualifiers and references.
Our simplified exports include only those statements that don't have qualifiers. You are right that leaving out the qualifiers would lead to wrong information.
Cheers,
Markus
Did I understand you right, Markus, that you leave out all statements which have at least one qualifier? Wouldn't it make more sense to leave out only the qualifiers but to add the statements themselves anyway? This would solve Gerard's problem with Ronald Reagan, for example.
Best regards, Bene
On 13/06/14 15:52, Bene* wrote: ...
But it would introduce other problems. Qualifiers are often used with time information, for example to record many historic population figures of one town. If you just leave out the qualifiers you get many different population numbers that cannot be distinguished.
Simply put: * Leaving out statements makes the export incomplete (as Wikidata always is, just to a larger degree). * Leaving out qualifiers makes the export incorrect (since it replaces statements with different statements that may or may not hold true).
We could do both and let the users choose what they find more acceptable (if any), but we started with the first approach. If someone says they need the second approach for their application to work, we could implement this, but I'd rather wait to see if anybody wants this.
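To make the trade-off concrete, here is a small Python sketch of the two options. The statement records below are made up for illustration and do not reflect the actual Wikidata Toolkit data model:

```python
# Hypothetical statement records for one item (Ronald Reagan).
statements = [
    # position held = President of the USA, with qualifiers
    {"prop": "P39", "value": "Q11696",
     "qualifiers": {"start": "1981", "end": "1989"}},
    # occupation = actor, no qualifiers
    {"prop": "P106", "value": "Q33999", "qualifiers": {}},
]

# Option 1 (current simplified dumps): drop statements that have qualifiers.
# Incomplete: the presidency statement disappears entirely.
option1 = [s for s in statements if not s["qualifiers"]]

# Option 2 (Bene's suggestion): keep every statement, strip its qualifiers.
# Potentially incorrect: qualified claims become unconditional triples.
option2 = [{"prop": s["prop"], "value": s["value"]} for s in statements]

print(len(option1))  # 1
print(len(option2))  # 2
```

Under option 1 Reagan is only an actor; under option 2 he is president, but the same approach would also emit many indistinguishable population numbers for one town.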
Best,
Markus
Markus,
Thank you very much for this. Translating Wikidata into the language of the Semantic Web is important. Being able to explore the Wikidata taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive queries) is really neat, e.g.
SELECT ?subject WHERE { ?subject rdfs:subClassOf http://www.wikidata.org/entity/Q82586 . }
This is more of an issue of my ignorance of Protege, but I notice that the above query returns only the direct subclasses of Q82586. The full set of subclasses for Q82586 ("lepton") is visible at http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&lan... -- a few of the 2nd-level subclasses (muon neutrino, tau neutrino, electron neutrino) are shown there but not returned by that SPARQL query. It seems rdfs:subClassOf isn't being treated as a transitive property in Protege. Any ideas?
Do you know when the taxonomy data in OWL will have labels available?
Also, regarding the complete dumps, would it be possible to export a smaller subset of the faithful data? The files under "Complete Data Dumps" in http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too big to load into Protege on most personal computers, and would likely require adjusting JVM settings on higher-end computers to load. If it's feasible to somehow prune those files -- and maybe even combine them into one file that could be easily loaded into Protege -- that would be especially nice.
Thanks, Eric https://www.wikidata.org/wiki/User:Emw
1. http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/wikidata-taxo... 2. http://protege.stanford.edu/
Eric,
Two general remarks first:
(1) Protege is for small and medium ontologies, but not really for such large datasets. To get SPARQL support for the whole data, you could install Virtuoso. It also comes with a simple Web query UI. Virtuoso does not do much reasoning, but you can use SPARQL 1.1 transitive closure in queries (using "*" after properties), so you can find "all subclasses" there too. (You could also try this in Protege ...)
(2) If you want to explore the class hierarchy, you can also try our new class browser:
http://tools.wmflabs.org/wikidata-exports/miga/?classes
It has the whole class hierarchy, but without the "leaves" (= instances of classes + subclasses that have no subclasses/instances of their own). For example, it tells you that "lepton" has 5 direct subclasses, but shows only one:
http://tools.wmflabs.org/wikidata-exports/miga/?classes#_item=3338
On the other hand, it includes relationships of classes and properties that are not part of the RDF (we extract this from the data by considering co-occurrence). Example:
"Classes that have no superclasses but at least 10 instances, and which are often used with the property 'sex or gender'":
http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Direct%...
I already added superclasses for some of those in Wikidata now -- data in the browser is updated with some delay based on dump files.
More answers below:
On 14/06/14 05:52, emw wrote:
Markus,
Thank you very much for this. Translating Wikidata into the language of the Semantic Web is important. Being able to explore the Wikidata taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive queries) is really neat, e.g.
SELECT ?subject WHERE { ?subject rdfs:subClassOf http://www.wikidata.org/entity/Q82586 . }
This is more of an issue of my ignorance of Protege, but I notice that the above query returns only the direct subclasses of Q82586. The full set of subclasses for Q82586 ("lepton") is visible at http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&lan... -- a few of the 2nd-level subclasses (muon neutrino, tau neutrino, electron neutrino) are shown there but not returned by that SPARQL query. It seems rdfs:subClassOf isn't being treated as a transitive property in Protege. Any ideas?
You need a reasoner to compute this properly. For a plain class hierarchy as in our case, ELK should be a good choice [1]. You can install the ELK Protege plugin and use it to classify the ontology [2]. Protege will then show the computed class hierarchy in the browser; I am not sure what happens to the SPARQL queries (it's quite possible that they don't use the reasoner).
[1] https://code.google.com/p/elk-reasoner/ [2] https://code.google.com/p/elk-reasoner/wiki/ElkProtege
Do you know when the taxonomy data in OWL will have labels available?
We had not thought of this as a use case. A challenge is that the label data is quite big because of the many languages. Should we maybe create an English label file for the classes? Descriptions too or just labels?
Also, regarding the complete dumps, would it be possible to export a smaller subset of the faithful data? The files under "Complete Data Dumps" in http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too big to load into Protege on most personal computers, and would likely require adjusting JVM settings on higher-end computers to load. If it's feasible to somehow prune those files -- and maybe even combine them into one file that could be easily loaded into Protege -- that would be especially nice.
What kind of "pruning" do you have in mind? You can of course take a subset of the data, but then some of the data will be missing.
A general remark on mixing and matching RDF files. We use the line-based N-Triples format, where every line in the ontology is self-contained (no multi-line constructs, no header, no namespaces). Therefore, any subset of the lines of any of our files is still a valid file. So if you want to have only a slice of the data (maybe to experiment with), then you could simply do something like:
gunzip -c wikidata-statements.nt.gz | head -10000 > partial-data.nt
"head" simply selects the first 10000 lines here. You could also use grep to select specific triples instead, such as:
zgrep "http://www.w3.org/2000/01/rdf-schema#label" wikidata-terms.nt.gz | grep "@en ." > en-labels.nt
This selects all English labels. I am using zgrep here for a change; you can also use gunzip as above. Similar methods can also be used to count things in the ontology (use grep -c to count lines = triples).
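The same slicing works from a script if you prefer. This is just a sketch of the zgrep approach in Python, using a tiny stand-in file rather than the real multi-gigabyte dumps (the sample triples are invented for illustration):

```python
import gzip

# A tiny stand-in for wikidata-terms.nt.gz: three label triples,
# two in English and one in German.
sample = (
    '<http://www.wikidata.org/entity/Q1> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "universe"@en .\n'
    '<http://www.wikidata.org/entity/Q1> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Universum"@de .\n'
    '<http://www.wikidata.org/entity/Q2> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Earth"@en .\n'
)
with gzip.open("sample-terms.nt.gz", "wt", encoding="utf-8") as f:
    f.write(sample)

# Select English label triples, as in the zgrep | grep pipeline above.
# Because every N-Triples line is self-contained, line-wise filtering
# always yields a valid file again.
with gzip.open("sample-terms.nt.gz", "rt", encoding="utf-8") as f:
    en_labels = [line for line in f
                 if "rdf-schema#label" in line and '"@en .' in line]

print(len(en_labels))  # 2
```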
Finally, you can combine multiple files into one by simply concatenating them in any order:
cat partial-data-1.nt > mydata.nt cat partial-data-2.nt >> mydata.nt ...
Maybe you can experiment a bit and let us know if there is any export that would be particularly meaningful for you.
Cheers,
Markus
Markus,
Thanks for the thorough reply!
you can use SPARQL 1.1 transitive closure in queries (using "*" after
properties), so you can find "all subclasses" there too. (You could also try this in Protege ...)
I had a feeling I was missing something basic. (I'm also new to SPARQL.) Using "*" after the property got me what I was looking for by default in Protege. That is,
SELECT ?subject WHERE { ?subject rdfs:subClassOf* http://www.wikidata.org/entity/Q82586 . }
-- with an asterisk after rdfs:subClassOf -- got me the transitive closure and returned all subclasses of Q82586 / "lepton".
Should we maybe create an English label file for the classes? Descriptions
too or just labels?
A file with English labels and descriptions for classes would be great and, I think, address this use case. Per your note, I suppose one would simply concatenate that English terms file and wikidata-taxonomy.nt into a new .nt file, then import that into Protege to explore the class hierarchy. (Having every line in the ontology be self-contained in N3 is very convenient!)
Regarding the pruned subset, I think the command-line approach in your examples is enough for me to get started making my own.
I won't have time to experiment with these things for a few weeks, but I will return to this then and let you know any interesting findings.
Cheers, Eric