Hello, thank you for providing Wikidata RDF. The effort is very much appreciated.
I found that Wikidata RDF in N-Triples format has an encoding issue with non-ASCII characters.
Because the RDF 1.1 N-Triples format no longer has an 'ASCII only' restriction, it is desirable to write characters directly, as in the Turtle format, rather than using Unicode escape sequences (\uHHHH).
If, for some reason, escaping is necessary, it should be '\u' followed by the Unicode code point, not the UTF-8 octets. For example, the Kanji character '位' should be '\u4F4D', not '\u00E4\u00BD\u008D'. The current Wikidata RDF NT output uses the latter form, which turns non-ASCII characters into garbage.
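To make the difference concrete, here is a minimal sketch in Python. It is not the Wikidata or EasyRdf code; both function names are hypothetical and only reproduce the two escaping behaviours described above.

```python
def escape_by_octets(text: str) -> str:
    """Problematic behaviour: each UTF-8 byte is escaped as if it were a code point."""
    return "".join(
        chr(b) if b < 0x80 else "\\u%04X" % b
        for b in text.encode("utf-8")
    )


def escape_by_code_point(text: str) -> str:
    """Expected behaviour: each Unicode code point is escaped exactly once."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                # ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)    # BMP: \uHHHH
        else:
            out.append("\\U%08X" % cp)    # beyond the BMP: \UHHHHHHHH
    return "".join(out)


print(escape_by_octets("位"))      # \u00E4\u00BD\u008D
print(escape_by_code_point("位"))  # \u4F4D
```

A parser that reads the first form decodes three separate characters (U+00E4, U+00BD, U+008D) instead of the single character U+4F4D.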
I hope this issue is not very difficult to fix.
cheers,
Hello Kanzaki,
thank you for the report!
Can you provide a link to the RDF document in question? Are you talking about the RDF dumps (which are generated by a third party), or the (incomplete) RDF we return from URLs like https://www.wikidata.org/entity/Q42.nt?
We use a (cropped version of) EasyRdf. Do you perhaps know if this problem still exists in the latest version of EasyRdf?
-- daniel
Hello Daniel (Cc. Nicholas Humfrey), thank you for the prompt reply.
I found the problem in the NT returned from Wikidata URIs like the one you provided below. I also confirmed that the latest version (0.9.0) of the EasyRdf N-Triples serialiser has the same problem.
It could be fixed by a minor patch to Serialiser/Ntriples.php:
- in escapeString(), use mb_strlen and mb_substr instead of strlen and string array access. Maybe no utf8_decode is needed.
- in unicodeCharNo(), do not use utf8_encode
- in escapedChar(), return $c instead of "\uXXXX" for $c >= 127
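The idea behind the patch can be sketched as follows. This is an illustration in Python, not the actual EasyRdf code: `escape_literal` and its `ascii_only` flag are hypothetical names, and the set of characters that are always escaped is my reading of the RDF 1.1 N-Triples grammar. Python's `str` iterates by code point, which plays the role that mb_substr would play in PHP.

```python
def escape_literal(text: str, ascii_only: bool = False) -> str:
    """Escape a literal value for N-Triples, working on code points, not bytes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        elif cp < 0x20 or cp == 0x7F:
            out.append("\\u%04X" % cp)    # remaining control characters
        elif cp < 0x80 or not ascii_only:
            out.append(ch)                # emit the character directly
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)    # optional ASCII-only output, BMP
        else:
            out.append("\\U%08X" % cp)    # optional ASCII-only output, beyond BMP
    return "".join(out)


print(escape_literal('位 "rank"'))             # 位 \"rank\"
print(escape_literal("位", ascii_only=True))   # \u4F4D
```

With the default setting, non-ASCII characters are written as-is, which corresponds to returning $c in escapedChar(). The ascii_only branch is only there for consumers that still expect escaped output.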
I hope this helps. cheers,
2014-12-05 21:14 GMT+09:00 Daniel Kinzler daniel.kinzler@wikimedia.de:
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
On 05.12.2014 15:14, KANZAKI Masahide wrote:
It could be fixed by a minor patch to Serialiser/Ntriples.php:
- in escapeString(), use mb_strlen and mb_substr instead of strlen and string array access. Maybe no utf8_decode is needed.
- in unicodeCharNo(), do not use utf8_encode
- in escapedChar(), return $c instead of "\uXXXX" for $c >= 127
Hey, thanks for digging in!
Would you report that issue to EasyRdf at https://github.com/njh/easyrdf/issues? You seem to know that code better than me by now :) Please cross-link https://phabricator.wikimedia.org/T76854.
-- daniel
On 05.12.2014 12:18, KANZAKI Masahide wrote:
Hello, thank you for providing Wikidata RDF. The effort is very much appreciated.
I found that Wikidata RDF in N-Triples format has an encoding issue with non-ASCII characters.
This issue is now being tracked under https://phabricator.wikimedia.org/T76854. If you have any further comments, please provide them there.
-- daniel