[Wikidata-l] Unicode escapes in Wikidata RDF N-Triples


KANZAKI Masahide

5 Dec 2014

4:48 p.m.

Hello, thank you for providing Wikidata RDF. The effort is very much appreciated.

I found that Wikidata RDF in N-Triples format has an encoding issue with non-ASCII characters.

Because the RDF 1.1 N-Triples format no longer has an 'ASCII only' restriction, it is preferable to write characters directly, as in Turtle, rather than using Unicode escape sequences (\uHHHH).

If, for some reason, escaping is necessary, it should be '\u' followed by the Unicode code point, not the UTF-8 octets. For example, the Kanji character '位' should be '\u4F4D', not '\u00E4\u00BD\u008D'. The current Wikidata RDF NT uses the latter form, which turns non-ASCII characters into garbage.
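
To make the difference concrete, here is a minimal Python sketch (not the EasyRdf code, which is PHP) that contrasts the two behaviours. The function names are hypothetical and chosen only for illustration, and the sketch ignores the other characters N-Triples requires escaping (quotes, backslashes, line breaks).

```python
def escape_code_points(text: str) -> str:
    """Escape each non-ASCII character by its Unicode code point."""
    out = []
    for ch in text:  # iterating a str yields code points, not bytes
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            # Characters outside the BMP use the 8-digit \U form.
            out.append(f"\\U{cp:08X}")
    return "".join(out)


def escape_utf8_octets(text: str) -> str:
    """Reproduce the faulty output: treat each UTF-8 byte as a code point."""
    out = []
    for byte in text.encode("utf-8"):
        out.append(chr(byte) if byte < 0x80 else f"\\u{byte:04X}")
    return "".join(out)


print(escape_code_points("位"))  # \u4F4D
print(escape_utf8_octets("位"))  # \u00E4\u00BD\u008D
```

The second function shows how a three-byte UTF-8 sequence ends up as three unrelated escapes when a serialiser walks the string byte by byte.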

I hope this issue is not very difficult to fix.

cheers,

-- @prefix : http://www.kanzaki.com/ns/sig# . <> :from [:name "KANZAKI Masahide"; :nick "masaka"; :email "mkanzaki@gmail.com"].

Daniel Kinzler

5 Dec 2014

5:44 p.m.

Hello Kanzaki,

thank you for the report!

Can you provide a link to the RDF document in question? Are you talking about the RDF dumps (which are generated by a third party), or the (incomplete) RDF we return from URLs like https://www.wikidata.org/entity/Q42.nt?

We use a (cropped version of) EasyRdf. Do you perhaps know if this problem still exists in the latest version of EasyRdf?

-- daniel

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.

KANZAKI Masahide

7:44 p.m.

Hello Daniel (Cc. Nicholas Humfrey), thank you for the prompt reply.

I found the problem in the NT returned from a Wikidata URI like the one you mentioned. I also confirmed that the latest version (0.9.0) of the EasyRdf N-Triples serialiser has the same problem.

It could be fixed by a minor patch to Serialiser/Ntriples.php:

- in escapeString(), use mb_strlen and mb_substr instead of strlen and string array access. utf8_decode is probably not needed.
- in unicodeCharNo(), do not call utf8_encode
- in escapedChar(), return $c instead of "\uXXXX" for $c >= 127
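
The approach described in the list above (walk the string by character rather than by byte, and pass non-ASCII characters through unescaped) can be sketched in Python as follows. This is an illustration under assumptions, not the actual PHP patch: `escape_literal` and `_ECHAR` are hypothetical names, and the set of escaped characters follows the RDF 1.1 N-Triples grammar for string literals as I understand it.

```python
# Characters that have a short escape form inside an N-Triples string literal.
_ECHAR = {
    '"': '\\"',
    "\\": "\\\\",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def escape_literal(text: str) -> str:
    """Escape a literal value, leaving non-ASCII characters as they are."""
    out = []
    for ch in text:  # per character, the role mb_substr plays in PHP
        if ch in _ECHAR:
            out.append(_ECHAR[ch])
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            # Remaining control characters are written as \uXXXX.
            out.append(f"\\u{ord(ch):04X}")
        else:
            out.append(ch)  # includes every character above U+007F
    return "".join(out)


print(escape_literal('位 "rank"\n'))
```

The output must then be written as UTF-8 for the unescaped characters to survive.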

I hope this helps. cheers,

-- @prefix : http://www.kanzaki.com/ns/sig# . <> :from [:name "KANZAKI Masahide"; :nick "masaka"; :email "mkanzaki@gmail.com"].

Daniel Kinzler

7:50 p.m.

Hey, thanks for digging in!

Would you report that issue to EasyRdf at https://github.com/njh/easyrdf/issues? You seem to know that code better than I do by now :) Please cross-link https://phabricator.wikimedia.org/T76854.

-- daniel


Daniel Kinzler

7:40 p.m.

This issue is now being tracked under https://phabricator.wikimedia.org/T76854. If you have any further comments, please provide them there.

-- daniel

