Hi,
I am happy to report that an initial, yet fully functional RDF export for Wikidata is now available. The exports can be created using the wda-export-data.py script of the wda toolkit [1]. This script downloads recent Wikidata database dumps and processes them to create RDF/Turtle files. Various options are available to customize the output (e.g., to export statements but not references, or to export only texts in English and Wolof). File creation takes about three hours on my machine, depending on what exactly is exported.
For your convenience, I have created some example exports based on yesterday's dumps. These can be found at [2]. There are three Turtle files: site links only, labels/descriptions/aliases only, statements only. The fourth file is a preliminary version of the Wikibase ontology that is used in the exports.
The export format is based on our earlier proposal [3], but it adds a lot of details that had not been specified there yet (namespaces, references, ID generation, compound datavalue encoding, etc.). Details might still change, of course. We might provide regular dumps at another location once the format is stable.
As a side effect of these activities, the wda toolkit [1] is also getting more convenient to use. Creating code for exporting the data into other formats is quite easy.
Features and known limitations of the wda RDF export:
(1) All current Wikidata datatypes are supported. Commons-media data is correctly exported as URLs (not as strings).
(2) One-pass processing. Dumps are processed only once, even though this means that we may not know the types of all properties when we first need them: the script queries wikidata.org to find missing information. This is only relevant when exporting statements.
(3) Limited language support. The script uses Wikidata's internal language codes for string literals in RDF. In some cases, this might not be correct. It would be great if somebody could create a mapping from Wikidata language codes to BCP 47 language codes (let me know if you think you can do this, and I'll tell you where to put it). A rough sketch of what such a mapping could look like follows after this list.
(4) Limited site language support. To specify the language of linked wiki sites, the script extracts a language code from the URL of the site. Again, this might not be correct in all cases, and it would be great if somebody had a proper mapping from Wikipedias/Wikivoyages to language codes (the sketch below touches on this as well).
(5) Some data excluded. Data that cannot currently be edited is not exported, even if it is found in the dumps. Examples include statement ranks and timezones for time datavalues. I also currently exclude labels and descriptions for Simple English, formal German, and informal Dutch, since these would pollute the label space for English, German, and Dutch without adding much benefit (apart possibly from Simple English descriptions, I cannot see any case where these languages should ever have different Wikidata texts at all).
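To give an idea of what such a mapping could look like (cf. (3) and (4) above), here is a minimal, purely illustrative Python 3 sketch. The handful of table entries and the helper names are my own assumptions for the example and are not part of wda; a real mapping would have to be checked code by code.

from urllib.parse import urlparse

# Illustrative only: a few Wikimedia-internal codes whose usual BCP 47
# counterparts differ; the full table would need careful checking.
WIKI_TO_BCP47 = {
    "als": "gsw",             # Alemannic
    "be-x-old": "be-tarask",  # Belarusian (Taraskievica)
    "zh-classical": "lzh",    # Classical Chinese
    "zh-yue": "yue",          # Cantonese
    "no": "nb",               # Norwegian Bokmal
}

def to_bcp47(wiki_code):
    """Map a Wikidata-internal language code to a BCP 47 tag (fallback: unchanged)."""
    return WIKI_TO_BCP47.get(wiki_code, wiki_code)

def site_language(site_url):
    """Guess the language of a linked wiki from its URL, e.g.
    http://de.wikipedia.org/wiki/... -> "de". Non-language subdomains such as
    "commons" or "meta" are not languages and would need extra handling."""
    host = urlparse(site_url).hostname or ""
    return to_bcp47(host.split(".")[0])

print(site_language("http://zh-classical.wikipedia.org/wiki/Example"))  # prints "lzh"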
Feedback is welcome.
Cheers,
Markus
[1] https://github.com/mkroetzsch/wda
    Run "python wda-export-data.py --help" for usage instructions
[2] http://semanticweb.org/RDF/Wikidata/
[3] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF
Update: the first bugs in the export have already been discovered -- and fixed in the script on github. The files I uploaded will be updated on Monday, when I have a better upload connection again (the links file should be fine; the statements file requires a rather tolerant Turtle string-literal parser, and the labels file has a malformed line that will hardly work anywhere).
Markus
Hi Markus, we just had a look at your Python code and created a dump. We are still getting a syntax error for the Turtle dump.
I saw that you did not use a mature framework for serializing the Turtle. Let me explain the problem:
Over the last 4 years, I have seen about two dozen people (undergraduate and PhD students, as well as Post-Docs) implement "simple" serializers for RDF.
They all failed.
This was normally not due to a lack of skill, but due to a lack of time. They wanted to do it quickly, but they didn't have the time to implement it correctly in the long run. There are some really nasty problems ahead, like encoding or special characters in URIs. I would strongly advise you to:
1. use a Python RDF framework
2. do some syntax tests on the output, e.g. with "rapper"
3. use a line-by-line format, e.g. Turtle without prefixes and just one triple per line (it's like NTriples, but with Unicode)
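To illustrate point 1: a framework takes care of all the escaping for you. A minimal sketch with rdflib (just one possible library; the entity and the literal are made up for the example):

from rdflib import Graph, Literal, Namespace, URIRef

WD = Namespace("http://www.wikidata.org/entity/")  # namespace only for the example

g = Graph()
# The library escapes quotes, newlines and non-ASCII characters in literals
# and rejects plainly invalid URIs, so the output is always parseable.
g.add((WD.Q42,
       URIRef("http://www.w3.org/2000/01/rdf-schema#label"),
       Literal('Douglas "DNA" Adams\nwriter', lang="en")))

# N-Triples output is the one-triple-per-line form mentioned in point 3;
# format="turtle" would work the same way.
g.serialize(destination="example.nt", format="nt")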
We are currently having a problem because we tried to convert the dump to NTriples (which a framework would handle as well) with rapper. We assume that the error is an extra "<" somewhere (not confirmed), and we are still searching for it since the dump is so big... so we cannot provide a detailed bug report. If we had one triple per line, this would also be easier, plus there are advantages for stream reading. bzip2 compression is very good as well, so there is no need for prefix optimization.
All the best, Sebastian
Hi Sebastian,
On 09/08/13 15:44, Sebastian Hellmann wrote:
Hi Markus, we just had a look at your python code and created a dump. We are still getting a syntax error for the turtle dump.
You mean "just" as in "at around 15:30 today" ;-)? The code is under heavy development, so changes are quite frequent. Please expect things to be broken in some cases (this is just a little community project, not part of the official Wikidata development).
I have just uploaded a new statements export (20130808) to http://semanticweb.org/RDF/Wikidata/ which you might want to try.
I saw that you did not use a mature framework for serializing the Turtle. Let me explain the problem:
Over the last 4 years, I have seen about two dozen people (undergraduate and PhD students, as well as Post-Docs) implement "simple" serializers for RDF.
They all failed.
This was normally not due to a lack of skill, but due to a lack of time. They wanted to do it quickly, but they didn't have the time to implement it correctly in the long run. There are some really nasty problems ahead, like encoding or special characters in URIs. I would strongly advise you to:
- use a Python RDF framework
- do some syntax tests on the output, e.g. with "rapper"
- use a line by line format, e.g. use turtle without prefixes and just
one triple per line (It's like NTriples, but with Unicode)
Yes, URI encoding could be difficult if we were doing it manually. Note, however, that we are already using a standard library for URI encoding in all non-trivial cases, so this does not seem to be a very likely cause of the problem (though some non-zero probability remains). In general, it is not unlikely that there are bugs in the RDF somewhere; please consider this export an early prototype that is meant for experimentation purposes. If you want an official RDF dump, you will have to wait for the Wikidata project team to get around to doing it (this will surely be based on an RDF library). Personally, I have already found the dump useful (I successfully imported some 109 million triples into an RDF store using a custom script), but I know that it can require some tweaking.
We are having a problem currently, because we tried to convert the dump to NTriples (which would be handled by a framework as well) with rapper. We assume that the error is an extra "<" somewhere (not confirmed) and we are still searching for it since the dump is so big....
OK, looking forward to hearing about the results of your search. A good tip for checking such things is to use grep. I did a quick grep on my current local statements export to count the numbers of < and > (this takes less than a minute on my laptop, including on-the-fly decompression). Both numbers were equal, making it unlikely that there is any unmatched < in the current dumps. Then I used grep to check that < and > only occur in the statements file in lines with "commons" URLs. These are created using urllib, so there should never be any < or > in them.
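(For reference, the same check is easy to script; here is a rough Python 3 equivalent of the grep counts, with the file name made up for the example:)

import bz2

lt = gt = 0
# Stream the compressed dump and count angle brackets on the fly;
# unequal counts would hint at an unmatched "<" somewhere.
with bz2.open("wikidata-statements.ttl.bz2", mode="rt", encoding="utf-8") as dump:
    for line in dump:
        lt += line.count("<")
        gt += line.count(">")

print("<:", lt, " >:", gt, "(balanced)" if lt == gt else "(UNBALANCED)")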
so we can not provide a detailed bug report. If we had one triple per line, this would also be easier, plus there are advantages for stream reading. bzip2 compression is very good as well, no need for prefix optimization.
Not sure what you mean here. Turtle prefixes in general seem to be a Good Thing, not just for reducing the file size. The code has no easy way to get rid of prefixes, but if you want a line-by-line export you could subclass my exporter and override the methods for incremental triple writing so that they remember the last subject (or property) and create full triples instead. This would give you a line-by-line export in (almost) no time (some uses of [...] blocks in object positions would remain, but maybe you could live with that).
Best wishes,
Markus
Good morning. I just found a bug in the export that was caused by a bug in the Wikidata dumps (a value that should have been a URI was not). This led to a few dozen lines with illegal qnames of the form "w: ". The updated script fixes this.
Cheers,
Markus
Hi Markus! Thank you very much.
Regarding your last email: of course, I am aware of your argument that the dump is not "official". Nevertheless, I am expecting you and others to code (or supervise) similar RDF dumping projects in the future.
Here are two really important things to consider:
1. Always use a mature RDF framework for serializing: even DBpedia published RDF with errors in it for years, which was really frustrating for maintainers (handling bug reports) and clients (trying to quick-fix it). Other small projects (in fact exactly like yours, Markus: a guy publishing some useful software) went the same way: lots of small syntax bugs, many bug reports, a lot of additional work. Some of them were abandoned because the developer didn't have time anymore.
2. Use NTriples or one-triple-per-line Turtle (Turtle supports IRIs and Unicode); compare:
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.nt.bz2 | bzcat | head
One-triple-per-line lets you a) find errors more easily and b) do further processing, e.g. calculating the out-degree of subjects:
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head -100 | cut -f1 -d '>' | grep -v '^#' | sed 's/<//;s/>//' | awk '{count[$1]++}END{for(j in count) print "<" j ">" "\t" count[j]}'
Furthermore:
- parsers can treat one-triple-per-line files more robustly, by just skipping bad lines
- compression size is the same
- alphabetical ordering of data works well (e.g. for GitHub diffs)
- you can easily split the files into several smaller files
Blank nodes have some bad properties:
- some databases react strangely to them, and they sometimes fill up indexes and make the DB slow (this depends on the implementation of course, and is just my experience)
- they make splitting one-triple-per-line files more difficult
- they are difficult for SPARQL to resolve recursively
- see http://videolectures.net/iswc2011_mallea_nodes/ or http://web.ing.puc.cl/~marenas/publications/iswc11.pdf
Turtle prefixes: why do you think they are a "good thing"? They are sometimes disputed as a premature feature. They do make data more readable, but nobody is going to read 4.4 GB of Turtle. By the way, you can always convert it to prefixed Turtle easily:
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head -100 | rapper -i turtle -o turtle -I - - file
All the best, Sebastian
Dear Sebastian,
On 10/08/13 12:18, Sebastian Hellmann wrote:
Hi Markus! Thank you very much.
Regarding your last email: Of course, I am aware of your arguments in your last email, that the dump is not "official". Nevertheless, I am expecting you and others to code (or supervise) similar RDF dumping projects in the future.
Here are two really important things to consider:
- Always use a mature RDF framework for serializing:
...
Statements that involve "always" are easy to disagree with. An important part of software engineering is to achieve one's goals with an optimal investment of resources. If you work on larger and more long-term projects, you will start to appreciate that the theoretically "best" or "cleanest" solution is not always the one that leads to a successful project. On the contrary, such a viewpoint can even make it harder to work in a "messy" environment, full of tools and data that do not quite adhere to the high ideals that one would like everyone (on the Web!) to have. You can see a good example of this in the evolution of HTML.
Turtle is *really* easy to parse in a robust and fault-tolerant way. I am tempted to write a little script that sanitizes Turtle input in a streaming fashion by discarding garbage triples. Can't take more than a weekend to do that, don't you think? But I already have plans this weekend :-)
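Just to sketch the idea (untested, and leaning on rdflib for the actual parsing; the prefix block is an assumption about the dump, and the end-of-statement test is deliberately naive, so it would mis-handle statements whose final "." does not end a line):

from rdflib import Graph

# Would have to match the @prefix declarations actually used in the dump.
PREFIXES = """@prefix w: <http://www.wikidata.org/entity/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
"""

def sanitize(lines, out):
    """Copy syntactically valid Turtle statements to `out`, drop the rest."""
    chunk = []
    for line in lines:
        if line.lstrip().startswith("@prefix"):
            continue                      # prefixes are re-added from PREFIXES
        chunk.append(line)
        if line.rstrip().endswith("."):   # naive end-of-statement test
            try:
                Graph().parse(data=PREFIXES + "".join(chunk), format="turtle")
                out.writelines(chunk)     # keep the good statement(s)
            except Exception:
                pass                      # discard the garbage
            chunk = []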
- Use NTriples or one-triple-per-line Turtle:
(Turtle supports IRIs and Unicode; compare)
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.nt.bz2 | bzcat | head
One-triple-per-line lets you a) find errors more easily and b) do further processing, e.g. calculating the out-degree of subjects:
curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head -100 | cut -f1 -d '>' | grep -v '^#' | sed 's/<//;s/>//' | awk '{count[$1]++}END{for(j in count) print "<" j ">" "\t" count[j]}'
Furthermore:
- Parsers can treat one-triple-per-line more robust, by just skipping lines
- compression size is the same
- alphabetical ordering of data works well (e.g. for GitHub diffs)
- you can split the files in several smaller files easily
See above. Why not write a little script that streams a Turtle file and creates one-triple-per-line output? This could be done with very little memory overhead in a streaming fashion. Both nested and line-by-line Turtle have their advantages and disadvantages, but nested Turtle can trivially be converted into the line-by-line form, whereas the reverse conversion is not nearly as easy.
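(For completeness: with an RDF library the conversion itself is only a few lines, e.g. with rdflib; the file names are made up, and this simple version loads the whole graph into memory rather than streaming, so it is not quite what I had in mind for a 4.4 GB file, but it shows the idea.)

from rdflib import Graph

g = Graph()
g.parse("wikidata-statements.ttl", format="turtle")
# N-Triples is exactly the requested one-triple-per-line form:
# prefixes expanded, no nesting, blank nodes get generated labels.
g.serialize(destination="wikidata-statements.nt", format="nt")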
Of course we will continue to improve our Turtle quality, but there will always be someone who would prefer a slightly different format. One will always have to draw a line somewhere.
Blank nodes have some bad properties:
- some databases react weird to them and they sometimes fill up indexes
and make the DB slow (depends on the implementations of course, this is just my experience )
- make splitting one-triple-per-line more difficult
- difficult for SPARQL to resolve recursively
- see http://videolectures.net/iswc2011_mallea_nodes/ or
Does this relate to Wikidata or are we getting into general RDF design discussions here (wrong list)? Wikidata uses blank nodes only for serialising OWL axioms, and there is no alternative in this case.
Turtle prefixes: why do you think they are a "good thing"? They are sometimes disputed as a premature feature. They do make data more readable, but nobody is going to read 4.4 GB of Turtle.
If you want to fight against existing W3C standards, this is really not the right list. I have not made Turtle, and I won't defend its design here. But since you asked: I think readability is a good thing.
By the way, you can always convert it to turtle easily: curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head -100 | rapper -i turtle -o turtle -I - - file
If conversion is so easy, it does not seem worthwhile to have much of a discussion about this at all.
Cheers,
Markus
Hi Markus (cc'ing the DBpedia discussion list),
Of course, you are right about the things you say. Sometimes I have too strong opinions, but please understand that this is only because I feel I have wasted too much time on patching bad syntax. The appropriate list for such a discussion would be the DBpedia mailing list (now in CC), because it involves best practices for publishing large RDF data dumps on the Web in a practical manner. We can remove the Wikidata list if the discussion is not interesting there. RDF serialization formats are given as standards, so it is about choosing the right one. DBpedia is one of the projects that has tried hard to make the Web of Data work.
Of course, I can understand your arguments, but there is a difference between HTML tag soup and the RDF compatibility layer and tool chains. I am confident that creating a robust Turtle parser would present quite a challenge:
@prefix : <http://example.org/ex#> .
<s> <pl> """missing quote"" ;
    <p> :works , :doesn,t , :0neither , <chars_,;[]_are_allowed_in_full_URIs_by_the_way> ;
    <p> [ <p> <c ] , [<p> <d> ] ;
    <find> <me> .
definitely more than:
while ((line = readLine()) != null) {
    try {
        parse(line);
    } catch (Exception e) {
        System.out.println("syntax error in: " + line);
    }
}
So as a best practice, I would definitely go for **alphabetically sorted, one-triple-per-line, non-prefixed Turtle with IRIs**, with a "not sure" about blank nodes.
Example:
<http://ko.dbpedia.org/resource/지미_카터> <http://dbpedia.org/ontology/country> <http://ko.dbpedia.org/resource/미국> .
<http://ko.dbpedia.org/resource/지미_카터> <http://xmlns.com/foaf/0.1/name> "James Earl Carter, Jr."@ko .
@Markus: actually, the question is important for DBpedia, because disk space on our download server is getting tight for DBpedia 3.9 and other soon-to-come data publishing projects. I'm sorry to use your thread for this, but I see the opportunity to easily establish a "best current practice", and we might be able to save a lot of space by doing so.
Maybe we can skip the NTriples .nt and .nq files?
435G  downloads.dbpedia.org
1.8G  1.0
2.5G  2.0
5.1G  3.0
7.6G  3.0rc
6.0G  3.1
6.4G  3.2
7.3G  3.3
21G   3.4
32G   3.5
35G   3.5.1
34G   3.6
44G   3.7
63G   3.7-i18n
169G  3.8
???   3.9
22M   wikicompany
1.6G  wiktionary
...
All the best, Sebastian
Am 10.08.2013 14:35, schrieb Markus Krötzsch:
Dear Sebastian,
On 10/08/13 12:18, Sebastian Hellmann wrote:
Hi Markus! Thank you very much.
Regarding your last email: Of course, I am aware of your arguments in your last email, that the dump is not "official". Nevertheless, I am expecting you and others to code (or supervise) similar RDF dumping projects in the future.
Here are two really important things to consider:
- Always use a mature RDF framework for serializing:
...
Statements that involve "always" are easy to disagree with. An important part of software engineering is to achieve one's goals with optimal investment of resources. If you work on larger and more long-term projects, you will start to appreciate that the theoretically "best" or "cleanest" solution is not always the one that leads to a successful project. To the contrary, such a viewpoint can even make it harder to work in a "messy" surrounding, full of tools and data that do not quite adhere to the high ideals that one would like everyone (on the Web!) to have. You can see good example of this in HTML evolution.
Turtle is *really* easy to parse in a robust and fault-tolerant way. I am tempted to write a little script that sanitizes Turtle input in a streaming fashion by discarding garbage triples. Can't take more than a weekend to do that, don't you think? But I already have plans this weekend :-)
- Use NTriples or one-triple-per-line Turtle:
(Turtle supports IRIs and unicode, compare) curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.nt.bz2 | bzcat | head
one-triple-per-line let's you a) find errors easier and b) aids further processing, e.g. calculate the outdegree of subjects: curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head -100 | cut -f1 -d '>' | grep -v '^#' | sed 's/<//;s/>//' | awk '{count[$1]++}END{for(j in count) print "<" j ">" "\t"count [j]}'
Furthermore:
- Parsers can treat one-triple-per-line more robust, by just skipping
lines
- compression size is the same
- alphabetical ordering of data works well (e.g. for GitHub diffs)
- you can split the files in several smaller files easily
See above. Why not write a little script that streams a Turtle file and creates one-triple-per-line output? This could be done with very little memory overhead in a streaming fashion. Both nested and line-by-line Turtle have their advantages and disadvantages, but one can trivially be converted into the other whereas the other cannot be converted back easily.
Of course we will continue to improve our Turtle quality, but there will always be someone who would prefer a slightly different format. One will always have to draw a line somewhere.
Blank nodes have some bad properties:
- some databases react weird to them and they sometimes fill up indexes
and make the DB slow (depends on the implementations of course, this is just my experience )
- make splitting one-triple-per-line more difficult
- difficult for SPARQL to resolve recursively
- see http://videolectures.net/iswc2011_mallea_nodes/ or
Does this relate to Wikidata or are we getting into general RDF design discussions here (wrong list)? Wikidata uses blank nodes only for serialising OWL axioms, and there is no alternative in this case.
Turtle prefixes: Why do you think they are a "good thing"? They are disputed as sometimes as a premature feature. They do make data more readable, but nobody is going to read 4.4 GB of Turtle.
If you want to fight against existing W3C standards, this is really not the right list. I have not made Turtle, and I won't defend its design here. But since you asked: I think readability is a good thing.
By the way, you can always convert it to turtle easily: curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head -100 | rapper -i turtle -o turtle -I - - file
If conversion is so easy, it does not seem worthwhile to have much of a discussion about this at all.
Cheers,
Markus
Am 10.08.2013 12:44, schrieb Markus Krötzsch:
Good morning. I just found a bug that was caused by a bug in the Wikidata dumps (a value that should be a URI was not). This led to a few dozen lines with illegal qnames of the form "w: ". The updated script fixes this.
Cheers,
Markus
On 09/08/13 18:15, Markus Krötzsch wrote:
Hi Sebastian,
On 09/08/13 15:44, Sebastian Hellmann wrote:
Hi Markus, we just had a look at your python code and created a dump. We are still getting a syntax error for the turtle dump.
You mean "just" as in "at around 15:30 today" ;-)? The code is under heavy development, so changes are quite frequent. Please expect things to be broken in some cases (this is just a little community project, not part of the official Wikidata development).
I have just uploaded a new statements export (20130808) to http://semanticweb.org/RDF/Wikidata/ which you might want to try.
I saw that you did not use a mature framework for serializing the Turtle. Let me explain the problem:
Over the last 4 years, I have seen about two dozen people (undergraduate and PhD students, as well as Post-Docs) implement "simple" serializers for RDF.
They all failed.
This was normally not due to a lack of skill, but due to a lack of time. They wanted to do it quickly, but they didn't have the time to implement it correctly in the long run. There are some really nasty problems ahead, like encoding or special characters in URIs. I would strongly advise you to:
- use a Python RDF framework
- do some syntax tests on the output, e.g. with "rapper"
- use a line-by-line format, e.g. Turtle without prefixes and just one triple per line (it's like NTriples, but with Unicode)
Yes, URI encoding could be difficult if we were doing it manually. Note, however, that we are already using a standard library for URI encoding in all non-trivial cases, so this does not seem to be a very likely cause of the problem (though some non-zero probability remains). In general, it is not unlikely that there are bugs in the RDF somewhere; please consider this export as an early prototype that is meant for experimentation purposes. If you want an official RDF dump, you will have to wait for the Wikidata project team to get around to doing it (this will surely be based on an RDF library). Personally, I already found the dump useful (I successfully imported some 109 million triples into an RDF store with the help of a custom script), but I know that it can require some tweaking.
We are currently having a problem because we tried to convert the dump to NTriples (which would be handled by a framework as well) with rapper. We assume that the error is an extra "<" somewhere (not confirmed), and we are still searching for it since the dump is so big...
OK, looking forward to hearing about the results of your search. A good tip for checking such things is to use grep. I did a quick grep on my current local statements export to count the numbers of < and > (this takes less than a minute on my laptop, including on-the-fly decompression). Both numbers were equal, making it unlikely that there is any unmatched < in the current dumps. Then I used grep to check that < and > only occur in the statements files in lines with "commons" URLs. These are created using urllib, so there should never be any < or > in them.
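For reference, the same check can also be scripted in Python; this is only a small sketch (the file name below is just an example), and plain grep/wc will of course be faster on large dumps:

    import bz2

    # Count '<' and '>' characters in a bz2-compressed Turtle dump; if the
    # two totals differ, there is at least one unmatched angle bracket.
    def count_angle_brackets(path):
        lt = gt = 0
        with bz2.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                lt += line.count("<")
                gt += line.count(">")
        return lt, gt

    print(count_angle_brackets("wikidata-statements.ttl.bz2"))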
So we cannot provide a detailed bug report. If we had one triple per line, this would also be easier, plus there are advantages for stream reading. bzip2 compression is very good as well, so there is no need for prefix optimization.
Not sure what you mean here. Turtle prefixes in general seem to be a Good Thing, not just for reducing the file size. The code has no easy way to get rid of prefixes, but if you want a line-by-line export you could subclass my exporter and override the methods for incremental triple writing so that they remember the last subject (or property) and create full triples instead. This would give you a line-by-line export in (almost) no time (some uses of [...] blocks in object positions would remain, but maybe you could live with that).
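For illustration only, the general pattern would look roughly like this; all class and method names below are invented for the example and do not match the actual code in epTurtleFileWriter.py, and the URIs are just examples:

    import sys

    # Purely illustrative sketch of "remember the last subject and emit full
    # triples": not the wda API, just the general idea behind the suggestion.
    class FlatTripleWriter:
        def __init__(self, out):
            self.out = out
            self.subject = None

        def start_subject(self, subject):
            self.subject = subject

        def write_property_value(self, predicate, obj):
            # repeat the subject instead of abbreviating with ';' or ','
            self.out.write("%s %s %s .\n" % (self.subject, predicate, obj))

    w = FlatTripleWriter(sys.stdout)
    w.start_subject("<http://www.wikidata.org/entity/Q42>")
    w.write_property_value("<http://www.w3.org/2000/01/rdf-schema#label>", '"Douglas Adams"@en')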
Best wishes,
Markus
All the best, Sebastian
On 03.08.2013 23:22, Markus Krötzsch wrote:
Update: the first bugs in the export have already been discovered -- and fixed in the script on github. The files I uploaded will be updated on Monday when I have a better upload again (the links file should be fine, the statements file requires a rather tolerant Turtle string literal parser, and the labels file has a malformed line that will hardly work anywhere).
Markus
-- Dipl. Inf. Sebastian Hellmann, Department of Computer Science, University of Leipzig
Given your "educating" people about software engineering principles, this may fall on deaf ears, but I too have a strong preference for the format with an independent line per triple.
On Sat, Aug 10, 2013 at 8:35 AM, Markus Krötzsch <markus.kroetzsch@cs.ox.ac.uk> wrote:
On 10/08/13 12:18, Sebastian Hellmann wrote:
By the way, you can always convert it to turtle easily: curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head -100 | rapper -i turtle -o turtle -I - - file
If conversion is so easy, it does not seem worthwhile to have much of a discussion about this at all.
The point of the discussion is to advocate for a format that is most useful to the maximum number of people out of the box. Rapper isn't installed by default on systems. A file format with independent lines can be processed using grep and other simple command-line tools without having to find and install additional software.
Tom
Hi Tom,
On 10/08/13 15:55, Tom Morris wrote:
Given your "educating" people about software engineering principles, this may fall on deaf ears, but I too have a strong preference for the format with an independent line per triple.
No worries. The eventual RDF export of Wikidata will most certainly have this (and any other standard format one could want). If you need an NTriples export earlier but do not want to use a second tool for this, then you could modify the triple-writing methods in the Python script as I suggested a few emails ago.
On Sat, Aug 10, 2013 at 8:35 AM, Markus Krötzsch <markus.kroetzsch@cs.ox.ac.uk> wrote:
On 10/08/13 12:18, Sebastian Hellmann wrote: By the way, you can always convert it to turtle easily: curl http://downloads.dbpedia.org/3.8/ko/mappingbased_properties_ko.ttl.bz2 | bzcat | head -100 | rapper -i turtle -o turtle -I - - file If conversion is so easy, it does not seem worthwhile to have much of a discussion about this at all.
The point of the discussion is to advocate for a format that is most useful to the maximum number of people out of the box. Rapper isn't installed by default on systems. A file format with independent lines can be processed using grep and other simple command-line tools without having to find and install additional software.
I think the rapper command you refer to was only for expanding prefixes, not for making line-by-line syntax. Prefixes should not create any grepping inconveniences.
Anyway, if you restrict yourself to tools that are installed by default on your system, then it will be difficult to do many interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump is really meant specifically for tools that take RDF inputs. It is not very straightforward to encode all of Wikidata in triples, and it leads to some inconvenient constructions (especially a lot of reification). If you don't actually want to use an RDF tool and you are just interested in the data, then there would be easier ways of getting it.
Out of curiosity, what kind of use do you have in mind for the RDF (or for the data in general)?
Cheers,
Markus
On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Anyway, if you restrict yourself to tools that are installed by default on your system, then it will be difficult to do many interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump is really meant specifically for tools that take RDF inputs. It is not very straightforward to encode all of Wikidata in triples, and it leads to some inconvenient constructions (especially a lot of reification). If you don't actually want to use an RDF tool and you are just interested in the data, then there would be easier ways of getting it.
A single fact per line seems like a pretty convenient format to me. What format do you recommend that's easier to process?
Tom
On 11/08/13 22:29, Tom Morris wrote:
On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Anyway, if you restrict yourself to tools that are installed by default on your system, then it will be difficult to do many interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump is really meant specifically for tools that take RDF inputs. It is not very straightforward to encode all of Wikidata in triples, and it leads to some inconvenient constructions (especially a lot of reification). If you don't actually want to use an RDF tool and you are just interested in the data, then there would be easier ways of getting it.
A single fact per line seems like a pretty convenient format to me. What format do you recommend that's easier to process?
I'd suggest some custom format that at least keeps a single data value on one line. For example, in RDF you have to do two joins to find all items that have a property with a date in the year 2010; even with a line-by-line format, you will not be able to grep for this. So I think a less normalised representation would be nicer for direct text-based processing; I would probably prefer a format where one whole statement is encoded on one line. But it really depends on what you want to do. Maybe you could also remove some data to obtain something that is easier to process.
Markus
With respect to the RDF export I'd advocate for: 1) an RDF format with one fact per line. 2) the use of a mature/proven RDF generation framework.
Optimizing too early based on a limited and/or biased view of the potential use cases may not be a good idea in the long run. I'd rather keep it simple and standard at the data publishing level, and let consumers access data easily and optimize processing to their needs.
Also, I should not have to run a preprocessing step for filtering out the pieces of data that do not follow the standard...
Note that I also understand the need for a format that groups all facts about a subject into one record and serializes them one record per line. It sometimes makes life easier for bulk processing of large datasets. But that's a different discussion.
-- Nicolas Torzec.
On 8/12/13 1:49 AM, "Markus Krötzsch" markus@semantic-mediawiki.org wrote:
On 11/08/13 22:29, Tom Morris wrote:
On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Anyway, if you restrict yourself to tools that are installed by default on your system, then it will be difficult to do many interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump is really meant specifically for tools that take RDF inputs. It is not very straightforward to encode all of Wikidata in triples, and it leads to some inconvenient constructions (especially a lot of reification). If you don't actually want to use an RDF tool and you are just interested in the data, then there would be easier ways of getting it.
A single fact per line seems like a pretty convenient format to me. What format do you recommend that's easier to process?
I'd suggest some custom format that at least keeps single data values in one line. For example, in RDF, you have to do two joins to find all items that have a property with a date in the year 2010. Even with a line-by-line format, you will not be able to grep this. So I think a less normalised representation would be nicer for direct text-based processing. For text-based processing, I would probably prefer a format where one statement is encoded on one line. But it really depends on what you want to do. Maybe you could also remove some data to obtain something that is easier to process.
Markus
On 12/08/13 17:56, Nicolas Torzec wrote:
With respect to the RDF export I'd advocate for:
- an RDF format with one fact per line.
- the use of a mature/proven RDF generation framework.
Optimizing too early based on a limited and/or biased view of the potential use cases may not be a good idea in the long run. I'd rather keep it simple and standard at the data publishing level, and let consumers access data easily and optimize processing to their need.
RDF has several official, standardised syntaxes, and one of them is Turtle. Using it is not a form of optimisation, just a choice of syntax. Every tool I have ever used for serious RDF work (triple stores, libraries, even OWL tools) supports any of the standard RDF syntaxes *just as well*. I do see that each format has certain advantages over the others (I agree with most arguments that have been put forward). But would it not be better to first take a look at the actual content rather than debating the syntactic formatting now? As I said, this is not the final syntax anyway; the eventual export will be created with different code in a different programming language.
Also, I should not have to run a preprocessing step for filtering out the pieces of data that do not follow the standard...
To the best of our knowledge, there are no such pieces in the current dump. We should try to keep this conversation somewhat related to the actual Wikidata dump that is created by the current version of the Python script on GitHub (I will also upload a dump again tomorrow; currently, you can only get the dump by running the script yourself). I know I suggested that one could parse Turtle in a robust way (which I still think one can), but I am not suggesting for a moment that this should be necessary for using Wikidata dumps in the future. I am committed to fixing any error as it is found, but so far I have not received much input in that direction.
Note that I also understand the need for a format that groups all facts about a subject into one record and serializes them one record per line. It sometimes makes life easier for bulk processing of large datasets. But that's a different discussion.
As I said: advantages and disadvantages. This is why we will probably have all desired formats at some time. But someone needs to start somewhere.
Markus
-- Nicolas Torzec.
On 8/12/13 12:56 PM, Nicolas Torzec wrote:
With respect to the RDF export I'd advocate for:
- an RDF format with one fact per line.
- the use of a mature/proven RDF generation framework.
Yes, keep it simple, use Turtle.
The additional benefit of Turtle is that it addresses a wide data-consumer profile, i.e., one that extends from the casual end-user all the way up to a parser developer.
When producing Turtle, if possible, stay away from prefixes, and consider using relative URIs, which will eliminate complexity and confusion that can arise regarding Linked Data deployment.
Simple rules that have helped me, eternally:
1. Denote entities not of type Web Resource or Document using hash-based HTTP URIs.
2. Denote source documents (the documents comprising the data being published) using relative URIs via <>.
3. Stay away from prefixes (they confuse casual end-users).
BTW -- I suspect some might be wondering: isn't this N-Triples? Answer: No, because of the use of relative HTTP URIs to denote documents, which isn't supported by N-Triples.
A Turtle-based, RDF-model-based structured data dump from Wikidata would be a mighty valuable contribution to the Linked Open Data Cloud.
Markus Krötzsch, 03/08/2013 15:48:
(3) Limited language support. The script uses Wikidata's internal language codes for string literals in RDF. In some cases, this might not be correct. It would be great if somebody could create a mapping from Wikidata language codes to BCP47 language codes (let me know if you think you can do this, and I'll tell you where to put it)
These are only a handful, aren't they?
(4) Limited site language support. To specify the language of linked wiki sites, the script extracts a language code from the URL of the site. Again, this might not be correct in all cases, and it would be great if somebody had a proper mapping from Wikipedias/Wikivoyages to language codes.
Apart from the above, doesn't wgLanguageCode in https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php have what you need?
Nemo
Let me top-post a question to the Wikidata dev team:
Where can we find documentation on what the Wikidata internal language codes actually mean? In particular, how do you map the language selector to the internal codes? I noticed some puzzling details:
* Wikidata uses "be-x-old" as a code, but MediaWiki messages for this language seem to use "be-tarask" as a language code. So there must be a mapping somewhere. Where?
* MediaWiki's http://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes provides some mappings. For example, it maps "zh-yue" to "yue". Yet, Wikidata uses both of these codes. What does this mean?
Answers to Nemo's points inline:
On 04/08/13 06:15, Federico Leva (Nemo) wrote:
Markus Krötzsch, 03/08/2013 15:48:
(3) Limited language support. The script uses Wikidata's internal language codes for string literals in RDF. In some cases, this might not be correct. It would be great if somebody could create a mapping from Wikidata language codes to BCP47 language codes (let me know if you think you can do this, and I'll tell you where to put it)
These are only a handful, aren't they?
There are about 369 language codes right now. You can see the complete list in langCodes at the bottom of the file
https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py
Most might be correct already, but it is hard to say. Also, is it okay to create new (sub)language codes for our own purposes? Something like simple English will hardly have an official code, but it would be bad to export it as "en".
(4) Limited site language support. To specify the language of linked wiki sites, the script extracts a language code from the URL of the site. Again, this might not be correct in all cases, and it would be great if somebody had a proper mapping from Wikipedias/Wikivoyages to language codes.
Apart from the above, doesn't wgLanguageCode in https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php have what you need?
Interesting. However, the list there does not contain all 300 sites that we currently find in Wikidata dumps (and it includes some that we do not find there, such as dkwiki, which seem to be outdated). The full list of sites we support is also found in the file I mentioned above, just after the language list (variable siteLanguageCodes).
Markus
Markus Krötzsch, 04/08/2013 12:32:
- Wikidata uses "be-x-old" as a code, but MediaWiki messages for this
language seem to use "be-tarask" as a language code. So there must be a mapping somewhere. Where?
Where I linked it.
provides some mappings. For example, it maps "zh-yue" to "yue". Yet, Wikidata use both of these codes. What does this mean?
Answers to Nemo's points inline:
On 04/08/13 06:15, Federico Leva (Nemo) wrote:
Markus Krötzsch, 03/08/2013 15:48:
(3) Limited language support. The script uses Wikidata's internal language codes for string literals in RDF. In some cases, this might not be correct. It would be great if somebody could create a mapping from Wikidata language codes to BCP47 language codes (let me know if you think you can do this, and I'll tell you where to put it)
These are only a handful, aren't they?
There are about 369 language codes right now. You can see the complete list in langCodes at the bottom of the file
https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py
Most might be correct already, but it is hard to say.
Only a handful are incorrect, unless Wikidata has specific problems (no idea how you reach 369).
Also, is it okay to create new (sub)language codes for our own purposes? Something like simple English will hardly have an official code, but it would be bad to export it as "en".
(4) Limited site language support. To specify the language of linked wiki sites, the script extracts a language code from the URL of the site. Again, this might not be correct in all cases, and it would be great if somebody had a proper mapping from Wikipedias/Wikivoyages to language codes.
Apart from the above, doesn't wgLanguageCode in https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php have what you need?
Interesting. However, the list there does not contain all 300 sites that we currently find in Wikidata dumps (and some that we do not find there, including things like dkwiki that seem to be outdated). The full list of sites we support is also found in the file I mentioned above, just after the language list (variable siteLanguageCodes).
Of course not all wikis are there, that configuration is needed only when the subdomain is "wrong". It's still not clear to me what codes you are considering wrong.
Nemo
On 04/08/13 13:17, Federico Leva (Nemo) wrote:
Markus Krötzsch, 04/08/2013 12:32:
- Wikidata uses "be-x-old" as a code, but MediaWiki messages for this
language seem to use "be-tarask" as a language code. So there must be a mapping somewhere. Where?
Where I linked it.
Are you sure? The file you linked has mappings from site ids to language codes, not from language codes to language codes. Do you mean to say: "If you take only the entries of the form 'XXXwiki' in the list, and extract a language code from the XXX, then you get a mapping from language codes to language codes that covers all exceptions in Wikidata"? This approach would give us:
'als' : 'gsw', 'bat-smg': 'sgs', 'be_x_old' : 'be-tarask', 'crh': 'crh-latn', 'fiu_vro': 'vro', 'no' : 'nb', 'roa-rup': 'rup', 'zh-classical' : 'lzh', 'zh-min-nan': 'nan', 'zh-yue': 'yue'
Each of the codes on the left here also occurs as a language tag in Wikidata, so if we map them, we use the same tag for things that Wikidata has distinct tags for. For example, Q27 has a label for yue but also for zh-yue [1]. It seems wrong to export both of these with the same language tag if Wikidata uses them for different purposes.
Maybe this is a bug in Wikidata and we should just not export texts with any of the above codes at all (since they always are given by another tag directly)?
provides some mappings. For example, it maps "zh-yue" to "yue". Yet, Wikidata use both of these codes. What does this mean?
Answers to Nemo's points inline:
On 04/08/13 06:15, Federico Leva (Nemo) wrote:
Markus Krötzsch, 03/08/2013 15:48:
...
Apart from the above, doesn't wgLanguageCode in https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php have what you need?
Interesting. However, the list there does not contain all 300 sites that we currently find in Wikidata dumps (and some that we do not find there, including things like dkwiki that seem to be outdated). The full list of sites we support is also found in the file I mentioned above, just after the language list (variable siteLanguageCodes).
Of course not all wikis are there, that configuration is needed only when the subdomain is "wrong". It's still not clear to me what codes you are considering wrong.
Well, the obvious: if a language used in Wikidata labels or on Wikimedia sites has an official IANA code [2], then we should use this code. Every other code would be "wrong". For languages that do not have any accurate code, we should probably use a private code, following the requirements of BCP 47 for private use subtags (in particular, they should have a single x somewhere).
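To illustrate what such a mapping could look like, here is a tiny sketch in Python; the concrete codes are only examples taken from this thread, and 'en-x-simple' in particular is an invented private-use tag, not an agreed convention:

    # Sketch: map a Wikidata-internal code to a BCP 47 tag, falling back to a
    # private-use subtag (note the single 'x') when no official code exists.
    KNOWN_EXCEPTIONS = {
        "be-x-old": "be-tarask",      # from the site-id-derived mapping above
        "zh-classical": "lzh",
    }

    def to_bcp47(wikidata_code):
        if wikidata_code in KNOWN_EXCEPTIONS:
            return KNOWN_EXCEPTIONS[wikidata_code]
        if wikidata_code == "simple":     # simple English: no official subtag
            return "en-x-simple"          # private-use extension (example only)
        return wikidata_code              # assume the code is already acceptable

    print(to_bcp47("be-x-old"), to_bcp47("simple"), to_bcp47("de"))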
This does not seem to be done correctly by my current code. For example, we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are IANA language tags, I am not sure that their combination makes sense. The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and it is a language code, not a dialect code). Note that map-bms does not occur in the file you linked to, so I guess there is some more work to do.
Markus
[1] http://www.wikidata.org/wiki/Special:Export/Q27
[2] http://www.iana.org/assignments/language-subtag-registry/language-subtag-reg...
Small update: I went through the language list at
https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py...
and added a number of TODOs to the most obvious problematic cases. Typical problems are:
* Malformed language codes ('tokipona')
* Correctly formed language codes without any official meaning (e.g., 'cbk-zam')
* Correctly formed codes with the wrong meaning (e.g., 'sr-ec': Serbian from Ecuador?!)
* Language codes with redundant information (e.g., 'kk-cyrl' should be the same as 'kk' according to IANA, but we have both)
* Use of macrolanguages instead of languages (e.g., "zh" is not "Mandarin" but just "Chinese"; I guess we mean Mandarin; less sure about Kurdish ...)
* Language codes with incomplete information (e.g., "sr" should be "sr-Cyrl" or "sr-Latn", both of which already exist; same for "zh" and "zh-Hans"/"zh-Hant", but also for "zh-HK" [is this simplified or traditional?]).
I invite any language experts to look at the file and add comments/improvements. Some of the issues should possibly also be considered on the implementation side: we don't want two distinct codes for the same thing.
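As a rough aid for such a review, a simplified syntax check along the lines of the BCP 47 grammar can at least flag the malformed codes; this is only a reduced well-formedness test in Python (no variants, no registry lookup), so codes like 'sr-ec' still pass even though their meaning is wrong:

    import re

    # Reduced BCP 47-style syntax: language (2-3 letters), optional extlangs,
    # optional script (4 letters), optional region (2 letters or 3 digits),
    # optional private-use part introduced by a single 'x'. Case-insensitive,
    # no variants or extensions; good enough to spot obviously malformed codes.
    BCP47ISH = re.compile(
        r"^[a-z]{2,3}(-[a-z]{3}){0,3}(-[a-z]{4})?"
        r"(-(?:[a-z]{2}|[0-9]{3}))?(-x(-[0-9a-z]{1,8})+)?$",
        re.IGNORECASE,
    )

    for code in ["en", "tokipona", "sr-ec", "kk-cyrl", "zh-HK", "be-x-old"]:
        print(code, bool(BCP47ISH.match(code)))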
Cheers,
Markus
On 04/08/13 16:35, Markus Krötzsch wrote:
On 04/08/13 13:17, Federico Leva (Nemo) wrote:
Markus Krötzsch, 04/08/2013 12:32:
- Wikidata uses "be-x-old" as a code, but MediaWiki messages for this
language seem to use "be-tarask" as a language code. So there must be a mapping somewhere. Where?
Where I linked it.
Are you sure? The file you linked has mappings from site ids to language codes, not from language codes to language codes. Do you mean to say: "If you take only the entries of the form 'XXXwiki' in the list, and extract a language code from the XXX, then you get a mapping from language codes to language codes that covers all exceptions in Wikidata"? This approach would give us:
'als' : 'gsw', 'bat-smg': 'sgs', 'be_x_old' : 'be-tarask', 'crh': 'crh-latn', 'fiu_vro': 'vro', 'no' : 'nb', 'roa-rup': 'rup', 'zh-classical' : 'lzh' 'zh-min-nan': 'nan', 'zh-yue': 'yue'
Each of the values on the left here also occur as language tags in Wikidata, so if we map them, we use the same tag for things that Wikidata has distinct tags for. For example, Q27 has a label for yue but also for zh-yue [1]. It seems to be wrong to export both of these with the same language tag if Wikidata uses them for different purposes.
Maybe this is a bug in Wikidata and we should just not export texts with any of the above codes at all (since they always are given by another tag directly)?
provides some mappings. For example, it maps "zh-yue" to "yue". Yet, Wikidata use both of these codes. What does this mean?
Answers to Nemo's points inline:
On 04/08/13 06:15, Federico Leva (Nemo) wrote:
Markus Krötzsch, 03/08/2013 15:48:
...
Apart from the above, doesn't wgLanguageCode in https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
have what you need?
Interesting. However, the list there does not contain all 300 sites that we currently find in Wikidata dumps (and some that we do not find there, including things like dkwiki that seem to be outdated). The full list of sites we support is also found in the file I mentioned above, just after the language list (variable siteLanguageCodes).
Of course not all wikis are there, that configuration is needed only when the subdomain is "wrong". It's still not clear to me what codes you are considering wrong.
Well, the obvious: if a language used in Wikidata labels or on Wikimedia sites has an official IANA code [2], then we should use this code. Every other code would be "wrong". For languages that do not have any accurate code, we should probably use a private code, following the requirements of BCP 47 for private use subtags (in particular, they should have a single x somewhere).
This does not seem to be done correctly by my current code. For example, we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are IANA language tags, I am not sure that their combination makes sense. The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and it is a language code, not a dialect code). Note that map-bms does not occur in the file you linked to, so I guess there is some more work to do.
Markus
[1] http://www.wikidata.org/wiki/Special:Export/Q27 [2] http://www.iana.org/assignments/language-subtag-registry/language-subtag-reg...
Hi Purodha,
thanks for the helpful hints. I have implemented most of these now in the list on git (this is also where you can see the private codes I have created where needed). I don't see a big problem in changing the codes in future exports if better options become available (it's much easier than changing codes used internally).
One open question that I still have is what it means if a language that usually has a script tag appears without such a tag (zh vs. zh-Hans/zh-Hant or sr vs. sr-Cyrl/sr-Latn). Does this really mean that we do not know which script is used under this code (either could appear)?
The other question is about the duplicate language tags, such as 'crh' and 'crh-Latn', which both appear in the data but are mapped to the same code. Maybe one of the codes is just phased out and will disappear over time? I guess the Wikidata team needs to answer this. We also have some codes that mean the same according to IANA, namely kk and kk-Cyrl, but which are currently not mapped to the same canonical IANA code.
Finally, I wondered about Norwegian. I gather that no.wikipedia.org is in Norwegian Bokmål (nb), which is how I map the site now. However, the language data in the dumps (not the site data) uses both "no" and "nb". Moreover, many items have different texts for nb and no. I wonder if both are still Bokmål, and there is just a bug that allows people to enter texts for nb under two language settings (for descriptions this could easily be a different text, even if in the same language). We also have nn, and I did not check how this relates to no (same text or different?).
Cheers, Markus
On 05/08/13 15:41, P. Blissenbach wrote:
Hi Markus,
Our code 'sr-ec' is at this moment effectively equivalent to 'sr-Cyrl'; likewise, our code 'sr-el' is currently effectively equivalent to 'sr-Latn'. Both might change once dialect codes of Serbian are added to the IANA subtag registry at http://www.iana.org/assignments/language-subtag-registry/language-subtag-reg...
Our code 'nrm' is not being used for the Narom language as ISO 639-3 does, see: http://www-01.sil.org/iso639-3/documentation.asp?id=nrm We rather use it for the Norman / Nourmaud, as described in http://en.wikipedia.org/wiki/Norman_language The Norman language is recognized by the Linguist List and many others, but is as of yet not present in ISO 639-3. It should probably be suggested for addition. We should probably map it to a private code meanwhile.
Our code 'ksh' is currently being used to represent a superset of what it stands for in ISO 639-3. Since ISO 639 lacks a group code for Ripuarian, we use the code of the only Ripuarian variety (of dozens) having a code to represent the whole lot. We should probably suggest adding a group code to ISO 639, and at least the dozen+ Ripuarian languages that we are using, and map 'ksh' to a private code for Ripuarian meanwhile.
Note also that for the ALS/GSW and the KSH Wikipedias, page titles are not guaranteed to be in the languages of those Wikipedias. They are often in German instead. Details can be found in their respective page titling rules. Moreover, for the ksh Wikipedia, unlike some other multilingual or multidialectal Wikipedias, texts are not, or quite often incorrectly, labelled as belonging to a certain dialect.
See also: http://meta.wikimedia.org/wiki/Special_language_codes
Greetings -- Purodha
*Sent:* Sunday, 04 August 2013, 19:01 *From:* "Markus Krötzsch" markus@semantic-mediawiki.org *To:* "Federico Leva (Nemo)" nemowiki@gmail.com *Cc:* "Discussion list for the Wikidata project." wikidata-l@lists.wikimedia.org *Subject:* [Wikidata-l] Wikidata language codes (Was: Wikidata RDF export available)
Small update: I went through the language list at
https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py...
and added a number of TODOs to the most obvious problematic cases. Typical problems are:
- Malformed language codes ('tokipona')
- Correctly formed language codes without any official meaning (e.g.,
'cbk-zam')
- Correctly formed codes with the wrong meaning (e.g., 'sr-ec': Serbian
from Ecuador?!)
- Language codes with redundant information (e.g., 'kk-cyrl' should be
the same as 'kk' according to IANA, but we have both)
- Use of macrolanguages instead of languages (e.g., "zh" is not
"Mandarin" but just "Chinese"; I guess we mean Mandarin; less sure about Kurdish ...)
- Language codes with incomplete information (e.g., "sr" should be
"sr-Cyrl" or "sr-Latn", both of which already exist; same for "zh" and "zh-Hans"/"zh-Hant", but also for "zh-HK" [is this simplified or traditional?]).
I invite any language experts to look at the file and add comments/improvements. Some of the issues should possibly also be considered on the implementation side: we don't want two distinct codes for the same thing.
Cheers,
Markus
The language code "no" is the metacode for Norwegian, and nowiki was in the beginning used for both Norwegian Bokmål, Riksmål and Nynorsk. The later split of and made nnwiki, but nowiki continued as before. After a while all Nynorsk content was migrated. Now nowiki has content in Bokmål and Riksmål, first one is official in Norway and the later is an unofficial variant. After the last additions to Bokmål there are very few forms that are only legal n Riksmål, so for all practical purposes nowiki has become a pure Bokmål wiki.
I think all content in Wikidata should use either "nn" or "nb", and all existing content with "no" as language code should be folded into "nb". It would be nice if "no" could be used as an alias for "nb", as this is the de facto situation now, but it is probably not necessary and could create a discussion with the Nynorsk community.
The site code should be "nowiki" as long as the community does not ask for a change.
jeblad
On 8/6/13, Markus Krötzsch markus@semantic-mediawiki.org wrote:
Hi Purodha,
thanks for the helpful hints. I have implemented most of these now in the list on git (this is also where you can see the private codes I have created where needed). I don't see a big problem in changing the codes in future exports if better options become available (it's much easier than changing codes used internally).
One open question that I still have is what it means if a language that usually has a script tag appears without such a tag (zh vs. zh-Hans/zh-Hant or sr vs. sr-Cyrl/sr-Latn). Does this really mean that we do not know which script is used under this code (either could appear)?
The other question is about the duplicate language tags, such as 'crh' and 'crh-Latn', which both appear in the data but are mapped to the same code. Maybe one of the codes is just phased out and will disappear over time? I guess the Wikidata team needs to answer this. We also have some codes that mean the same according to IANA, namely kk and kk-Cyrl, but which are currently not mapped to the same canonical IANA code.
Finally, I wondered about Norwegian. I gather that no.wikipedia.org is in Norwegian Bokmål (nb), which is how I map the site now. However, the language data in the dumps (not the site data) uses both "no" and "nb". Moreover, many items have different texts for nb and no. I wonder if both are still Bokmål, and there is just a bug that allows people to enter texts for nb under two language settings (for descriptions this could easily be a different text, even if in the same language). We also have nn, and I did not check how this relates to no (same text or different?).
Cheers, Markus
On 10/08/13 11:07, John Erling Blad wrote:
The language code "no" is the metacode for Norwegian, and nowiki was in the beginning used for both Norwegian Bokmål, Riksmål and Nynorsk. The later split of and made nnwiki, but nowiki continued as before. After a while all Nynorsk content was migrated. Now nowiki has content in Bokmål and Riksmål, first one is official in Norway and the later is an unofficial variant. After the last additions to Bokmål there are very few forms that are only legal n Riksmål, so for all practical purposes nowiki has become a pure Bokmål wiki.
I think all content in Wikidata should use either "nn" or "nb", and all existing content with "no" as language code should be folded into "nb". It would be nice if "no" could be used as an alias for "nb", as this is the de facto situation now, but it is probably not necessary and could create a discussion with the Nynorsk community.
The site code should be "nowiki" as long as the community does not ask for a change.
Thanks for the clarification. I will keep "no" to mean "no" for now.
What I wonder is: if users choose to enter a "no" label on Wikidata, what is the language setting that they see? Does this say "Norwegian (any variant)" or what? That's what puzzles me. I know that a Wikipedia can allow multiple languages (or dialects) to coexist, but in the Wikidata language selector I thought you can only select "real" languages, not "language groups".
Markus
On 8/6/13, Markus Krötzsch markus@semantic-mediawiki.org wrote:
Hi Purodha,
thanks for the helpful hints. I have implemented most of these now in the list on git (this is also where you can see the private codes I have created where needed). I don't see a big problem in changing the codes in future exports if better options become available (it's much easier than changing codes used internally).
One open question that I still have is what it means if a language that usually has a script tag appears without such a tag (zh vs. zh-Hans/zh-Hant or sr vs. sr-Cyrl/sr-Latn). Does this really mean that we do not know which script is used under this code (either could appear)?
The other question is about the duplicate language tags, such as 'crh' and 'crh-Latn', which both appear in the data but are mapped to the same code. Maybe one of the codes is just phased out and will disappear over time? I guess the Wikidata team needs to answer this. We also have some codes that mean the same according to IANA, namely kk and kk-Cyrl, but which are currently not mapped to the same canonical IANA code.
Finally, I wondered about Norwegian. I gather that no.wikipedia.org is in Norwegian Bokmål (nb), which is how I map the site now. However, the language data in the dumps (not the site data) uses both "no" and "nb". Moreover, many items have different texts for nb and no. I wonder if both are still Bokmål, and there is just a bug that allows people to enter texts for nb under two language settings (for descriptions this could easily be a different text, even if in the same language). We also have nn, and I did not check how this relates to no (same text or different?).
Cheers, Markus
On 05/08/13 15:41, P. Blissenbach wrote:
Hi Markus, Our code 'sr-ec' is at this moment effectively equivalent to 'sr-Cyrl', likewise is our code 'sr-el' currently effectively equivalent to 'sr-Latn'. Both might change, once dialect codes of Serbian are added to the IANA subtag registry at http://www.iana.org/assignments/language-subtag-registry/language-subtag-reg... Our code 'nrm' is not being used for the Narom language as ISO 639-3 does, see: http://www-01.sil.org/iso639-3/documentation.asp?id=nrm We rather use it for the Norman / Nourmaud, as described in http://en.wikipedia.org/wiki/Norman_language The Norman language is recognized by the linguist list and many others but as of yet not present in ISO 639-3. It should probably be suggested to be added. We should probaly map it to a private code meanwhile. Our code 'ksh' is currently being used to represent a superset of what it stands for in ISO 639-3. Since ISO 639 lacks a group code for Ripuarian, we use the code of the only Ripuarian variety (of dozens) having a code, to represent the whole lot. We should probably suggest to add a group code to ISO 639, and at least the dozen+ Ripuarian languages that we are using, and map 'ksh' to a private code for Ripuarian meanwhile. Note also, that for the ALS/GSW and the KSH Wikipedias, page titles are not guaranteed to be in the languages of the Wikipedias. They are often in German instead. Details to be found in their respective page titleing rules. Moreover, for the ksh Wikipedia, unlike some other multilingual or multidialectal Wikipedias, texts are not, or quite often incorrectly, labelled as belonging to a certain dialect. See also: http://meta.wikimedia.org/wiki/Special_language_codes Greetings -- Purodha *Gesendet:* Sonntag, 04. August 2013 um 19:01 Uhr *Von:* "Markus Krötzsch" markus@semantic-mediawiki.org *An:* "Federico Leva (Nemo)" nemowiki@gmail.com *Cc:* "Discussion list for the Wikidata project." wikidata-l@lists.wikimedia.org *Betreff:* [Wikidata-l] Wikidata language codes (Was: Wikidata RDF export available) Small update: I went through the language list at
https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py...
and added a number of TODOs to the most obvious problematic cases. Typical problems are (a rough sketch of how such cases could be detected and remapped follows the list):
- Malformed language codes ('tokipona')
- Correctly formed language codes without any official meaning (e.g., 'cbk-zam')
- Correctly formed codes with the wrong meaning (e.g., 'sr-ec': Serbian from Ecuador?!)
- Language codes with redundant information (e.g., 'kk-cyrl' should be the same as 'kk' according to IANA, but we have both)
- Use of macrolanguages instead of languages (e.g., "zh" is not "Mandarin" but just "Chinese"; I guess we mean Mandarin; less sure about Kurdish ...)
- Language codes with incomplete information (e.g., "sr" should be "sr-Cyrl" or "sr-Latn", both of which already exist; same for "zh" and "zh-Hans"/"zh-Hant", but also for "zh-HK" [is this simplified or traditional?]).
I invite any language experts to look at the file and add comments/improvements. Some of the issues should possibly also be considered on the implementation side: we don't want two distinct codes for the same thing.
Cheers,
Markus
On 04/08/13 16:35, Markus Krötzsch wrote:
On 04/08/13 13:17, Federico Leva (Nemo) wrote:
Markus Krötzsch, 04/08/2013 12:32:
- Wikidata uses "be-x-old" as a code, but MediaWiki messages for this
language seem to use "be-tarask" as a language code. So there must be
a
mapping somewhere. Where?
Where I linked it.
Are you sure? The file you linked has mappings from site ids to
language
codes, not from language codes to language codes. Do you mean to say: "If you take only the entries of the form 'XXXwiki' in the list, and extract a language code from the XXX, then you get a mapping from language codes to language codes that covers all exceptions in Wikidata"? This approach would give us:
'als' : 'gsw', 'bat-smg': 'sgs', 'be_x_old' : 'be-tarask', 'crh': 'crh-latn', 'fiu_vro': 'vro', 'no' : 'nb', 'roa-rup': 'rup', 'zh-classical' : 'lzh', 'zh-min-nan': 'nan', 'zh-yue': 'yue'
The values on the left here also occur as language tags in Wikidata, so if we map them, we use the same tag for things that Wikidata has distinct tags for. For example, Q27 has a label for yue but also for zh-yue [1]. It seems to be wrong to export both of these with the same language tag if Wikidata uses them for different purposes.
Maybe this is a bug in Wikidata and we should just not export texts with any of the above codes at all (since they always are given by another tag directly)?
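For reference, the extraction described above could look roughly like this. This is only a sketch, not the actual wda code; the exceptions dict simply reproduces the list quoted above, with underscores already converted:

    # Sketch: derive a language code from a site id of the form 'XXXwiki'.
    EXCEPTIONS = {
        'als': 'gsw', 'bat-smg': 'sgs', 'be-x-old': 'be-tarask',
        'crh': 'crh-latn', 'fiu-vro': 'vro', 'no': 'nb', 'roa-rup': 'rup',
        'zh-classical': 'lzh', 'zh-min-nan': 'nan', 'zh-yue': 'yue',
    }

    def site_language(site_id):
        if not site_id.endswith('wiki'):
            raise ValueError('only plain Wikipedia site ids are handled here')
        code = site_id[:-len('wiki')].replace('_', '-')
        return EXCEPTIONS.get(code, code)

    assert site_language('be_x_oldwiki') == 'be-tarask'
    assert site_language('nowiki') == 'nb'
    assert site_language('map_bmswiki') == 'map-bms'  # questionable, see below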
- MediaWiki's http://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes provides some mappings. For example, it maps "zh-yue" to "yue". Yet, Wikidata uses both of these codes. What does this mean?
Answers to Nemo's points inline:
On 04/08/13 06:15, Federico Leva (Nemo) wrote: Markus Krötzsch, 03/08/2013 15:48:
...
Apart from the above, doesn't wgLanguageCode in https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php have what you need?
Interesting. However, the list there does not contain all 300 sites that we currently find in Wikidata dumps (and some that we do not find there, including things like dkwiki that seem to be outdated). The full list of sites we support is also found in the file I mentioned above, just after the language list (variable siteLanguageCodes).
Of course not all wikis are there; that configuration is needed only when the subdomain is "wrong". It's still not clear to me what codes you are considering wrong.
Well, the obvious: if a language used in Wikidata labels or on Wikimedia sites has an official IANA code [2], then we should use this code. Every other code would be "wrong". For languages that do not have any accurate code, we should probably use a private code, following the requirements of BCP 47 for private use subtags (in particular, they should have a single x somewhere).
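For the codes without an accurate IANA counterpart, the private-use tags could be built along these lines. This is a sketch only; the concrete choices (e.g., for 'ksh') are illustrative and not necessarily what the list on git uses:

    # Sketch: build BCP 47 private-use tags for codes with no accurate IANA code.
    PRIVATE = {
        'tokipona': 'x-tokipona',    # no ISO 639 code at all
        'ksh': 'mis-x-ripuarian',    # illustrative: 'ksh' is used for the whole
                                     # Ripuarian group; 'mis' = uncoded languages
    }

    def private_tag(code):
        tag = PRIVATE.get(code)
        if tag is None:
            # generic fallback; each private-use subtag may be at most 8 characters
            parts = [p[:8] for p in code.lower().replace('_', '-').split('-') if p]
            tag = 'x-' + '-'.join(parts)
        return tag

    assert private_tag('tokipona') == 'x-tokipona'
    assert private_tag('cbk-zam') == 'x-cbk-zam'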
This does not seem to be done correctly by my current code. For example, we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are IANA language tags, I am not sure that their combination makes sense. The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and it is a language code, not a dialect code). Note that map-bms does not occur in the file you linked to, so I guess there is some more work to do.
Markus
http://www.iana.org/assignments/language-subtag-registry/language-subtag-reg...
You can't use "no" as a language in ULS, but you can use setlang and uselang with "no" if I remember correct. All messages are aliased to "nb" if the language is "nb". Also at nowiki will the messages for "nb" be used, and this is an accepted solution. Previously no.wikidata.org redirected with a setlang=no and that created a lot of confusion as we then had to different language codes depending on how the page was opened. There are also bots that use the site id to generate a language code and that will create a "no" language code.
On 8/10/13, Markus Krötzsch markus@semantic-mediawiki.org wrote:
On 10/08/13 11:07, John Erling Blad wrote:
The language code "no" is the metacode for Norwegian, and nowiki was in the beginning used for both Norwegian Bokmål, Riksmål and Nynorsk. The later split of and made nnwiki, but nowiki continued as before. After a while all Nynorsk content was migrated. Now nowiki has content in Bokmål and Riksmål, first one is official in Norway and the later is an unofficial variant. After the last additions to Bokmål there are very few forms that are only legal n Riksmål, so for all practical purposes nowiki has become a pure Bokmål wiki.
I think all content in Wikidata should use either "nn" or "nb", and all existing content with "no" as language code should be folded into "nb". It would be nice if "no" could be used as an alias for "nb", as this is the de facto situation now, but it is probably not necessary and could create a discussion with the Nynorsk community.
The site code should be "nowiki" as long as the community does not ask for a change.
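If the folding John Erling Blad suggests were done, a minimal sketch could look like this, assuming (hypothetically) that label texts arrive as a dict keyed by Wikidata language code:

    def fold_norwegian(labels):
        """Fold 'no' texts into 'nb', keeping an existing 'nb' text if both are set."""
        result = dict(labels)
        if 'no' in result:
            result.setdefault('nb', result['no'])
            del result['no']
        return result

    assert fold_norwegian({'no': 'Norge', 'nn': 'Noreg'}) == {'nb': 'Norge', 'nn': 'Noreg'}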
Thanks for the clarification. I will keep "no" to mean "no" for now.
What I wonder is: if users choose to enter a "no" label on Wikidata, what is the language setting that they see? Does this say "Norwegian (any variant)" or what? That's what puzzles me. I know that a Wikipedia can allow multiple languages (or dialects) to coexist, but in the Wikidata language selector I thought you can only select "real" languages, not "language groups".
Markus
John Erling Blad, 12/08/2013 01:30:
You can't use "no" as a language in ULS, but you can use setlang and uselang with "no" if I remember correct. All messages are aliased to "nb" if the language is "nb". Also at nowiki will the messages for "nb" be used, and this is an accepted solution. Previously no.wikidata.org redirected with a setlang=no and that created a lot of confusion as we then had to different language codes depending on how the page was opened. There are also bots that use the site id to generate a language code and that will create a "no" language code.
This answers the question indirectly: as far as I know, language-dependent content can, currently, be entered only in your interface language. However, both no and nb are available in preferences and you may also encounter https://bugzilla.wikimedia.org/show_bug.cgi?id=37459
Nemo
Markus Krötzsch, 04/08/2013 17:35:
Are you sure? The file you linked has mappings from site ids to language codes, not from language codes to language codes. Do you mean to say: "If you take only the entries of the form 'XXXwiki' in the list, and extract a language code from the XXX, then you get a mapping from language codes to language codes that covers all exceptions in Wikidata"?
Yes. You said Wikidata just uses the subdomain, and the subdomain is contained in the database names used by the config. Sorry for leaving implicit the removal of the wik* suffix and the conversion from _ to -.
This approach would give us:
'als' : 'gsw', 'bat-smg': 'sgs', 'be_x_old' : 'be-tarask', 'crh': 'crh-latn', 'fiu_vro': 'vro', 'no' : 'nb', 'roa-rup': 'rup', 'zh-classical' : 'lzh', 'zh-min-nan': 'nan', 'zh-yue': 'yue'
Each of the values on the left here also occur as language tags in Wikidata, so if we map them, we use the same tag for things that Wikidata has distinct tags for. For example, Q27 has a label for yue but also for zh-yue [1]. It seems to be wrong to export both of these with the same language tag if Wikidata uses them for different purposes.
Maybe this is a bug in Wikidata and we should just not export texts with any of the above codes at all (since they always are given by another tag directly)?
Sorry, I don't know why both can appear. I would have said that one is a sitelink and the other some value added on-wiki with the correct language code (entry label?), but my limited JSON reading skills seem to indicate otherwise.
[...]
Well, the obvious: if a language used in Wikidata labels or on Wikimedia sites has an official IANA code [2],
(And all of them are supposed to, except rare exceptions with pre-2006 wikis.)
then we should use this code. Every other code would be "wrong". For languages that do not have any accurate code, we should probably use a private code, following the requirements of BCP 47 for private use subtags (in particular, they should have a single x somewhere).
This does not seem to be done correctly by my current code. For example, we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are IANA language tags, I am not sure that their combination makes sense. The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and it is a language code, not a dialect code). Note that map-bms does not occur in the file you linked to, so I guess there is some more work to do.
Indeed, that appears to be one of the exceptions. :) I don't know how it should be tracked; you could file a bug in MediaWiki>Internationalisation asking to find a proper code for this language. What was unclear to me is why you implied there were many such cases; that would surprise me.
Nemo