XML and Unicode chars in tag names


Strainu

14 Jul 2013

11:57 a.m.

Hi,

I know this is probably the wrong list to ask, but I don't know a better one and I know we have a very good i18n team so I'm hoping someone here can help me.

I'm trying to parse the following XML (abridged for brevity):

<?xml version="1.0" encoding="UTF-8"?>
<județ>
  <siruta>47</siruta>
  <nume>Județul Bacău</nume>
</județ>

Every validator I've tried marks an error on the ț in the tag named județ. However, the XML spec [1] says this is actually correct:

document ::= prolog element Misc*
element ::= EmptyElemTag | STag content ETag
STag ::= '<' Name (S Attribute)* S? '>'
Name ::= NameStartChar (NameChar)*
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

ț is #x163 [2], thus should be in the interval [#xF8-#x2FF].
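The range check above can be done mechanically. The following is a minimal Python sketch that transcribes the NameStartChar and NameChar productions quoted above and tests a candidate tag name against them. The function and constant names (`is_xml_name`, `NAME_START_RANGES`, and so on) are made up for this illustration and do not come from any library.

```python
# Code point ranges transcribed from the NameStartChar production quoted above.
NAME_START_RANGES = [
    (0x3A, 0x3A),  # ":"
    (0x41, 0x5A),  # A-Z
    (0x5F, 0x5F),  # "_"
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Additional ranges that NameChar allows after the first character.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),      # "-" and "."
    (0x30, 0x39),      # 0-9
    (0xB7, 0xB7),      # middle dot
    (0x300, 0x36F),    # combining diacritical marks
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls inside any (lo, hi) range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_xml_name(name):
    """Check a string against Name ::= NameStartChar (NameChar)*."""
    return (
        bool(name)
        and is_name_start_char(name[0])
        and all(is_name_char(ch) for ch in name[1:])
    )


# U+0163 lies in [#xF8-#x2FF], so the name is accepted by these productions.
print(is_xml_name("jude\u0163"))
```

Note that this only reflects the productions as quoted here; a parser built against a different edition of the specification may apply a different table.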

I have reached page 10 on google searching for "can the xml tags contain utf8 letters" and "xml tags utf-8", but I found nothing relevant. Am I missing something here? Is there any way around this?

Thanks, Strainu

[1] http://www.w3.org/TR/xml/
[2] http://ro.wikipedia.org/wiki/Wikipedia:Diacritice#Date_tehnice


MZMcBride

14 Jul

3:30 p.m.

Strainu wrote:

...

I'm trying to parse the following XML (abridged for brevity):

<?xml version="1.0" encoding="UTF-8"?>

<județ> <siruta>47</siruta> <nume>Județul Bacău</nume> </județ>

Every validator I've tried marks an error on the ț in the tag named județ.

Hi.

This list is a fine place to ask. :-)

Are you having trouble with validation or parsing? Validators can simply be wrong. Which validators are you using? And which parsers are you using?

Can you be more specific about what you're trying to do (feel free to link to or include sample code) and the tools you're trying to do it with?

MZMcBride

Strainu

4:23 p.m.

2013/7/14 MZMcBride z@mzmcbride.com:

...

Strainu wrote:

...
I'm trying to parse the following XML (abridged for brevity):

<?xml version="1.0" encoding="UTF-8"?>

<județ> <siruta>47</siruta> <nume>Județul Bacău</nume> </județ>

Every validator I've tried marks an error on the ț in the tag named județ.

Hi.

This list is a fine place to ask. :-)

Hi,

...

Are you having trouble with validation or parsing? Validators can simply be wrong. Which validators are you using? And which parsers are you using?

I'm having trouble with both. I used the W3C validator [1], which wasn't designed for arbitrary XML files but can still find a good number of errors, and xmlvalidation.com [2]. On the parsing side, I tried Python's lxml; the output is available at [3].

...

Can you be more specific about what you're trying to do (feel free to link to or include sample code) and the tools you're trying to do it with?

Well, I have a PHP website which gathers public data about Romania's administrative units, which I then try to export in programming-friendly formats (CSV, JSON, XML). The workflow is: extract the data from the database, put it in a PHP array, then use this array to generate all the output formats. You have an example of such an array at [4] (since my initial email I've worked around the diacritics problem, but I'm still searching for a solution). For converting to XML I have a custom array_walk function [5].

I know that some potential reusers are heavy XML fans, so I wanted to give them an easy way to reuse the data. Having the XML tags/JSON keys with diacritics is not a must have, but is definitely a very nice feature, because those keys could be used directly as labels when printing the data somewhere.

Regards, Strainu

[1] http://validator.w3.org/
[2] http://www.xmlvalidation.com/
[3] https://gist.github.com/mgax/f6a3edc5b4883b3377e8
[4] https://github.com/strainu/despresate/blob/master/include/sat_functions.php#...
[5] https://github.com/strainu/despresate/blob/master/include/common.php#L57

John

4:35 p.m.

I'm a Python programmer, and your whole approach to strings/Unicode needs help. The encoding issue you have isn't due to the library but rather to coder error. If you want to jump on IRC, I can talk you through the issues.

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Bjoern Hoehrmann

5:21 p.m.

* Strainu wrote:

...

I know this is probably the wrong list to ask, but I don't know a better one and I know we have a very good i18n team so I'm hoping someone here can help me.

I'm trying to parse the following XML (abridged for brevity):

I18N issues are far easier to debug with access to the actual bytes that demonstrate the problem. Copying and pasting text into an email adds and obscures potential problems. You should also always give the exact error messages you are receiving and not your interpretation of them.

I am guessing your file is not actually UTF-8 encoded.

My first thought was that you might be using a character that had been forbidden in the first through fourth edition of the XML specification that got allowed in the fifth edition which added many characters to the definition of legal names. U+0163 however has always been allowed.

The second thought was that your eyes might be deceiving you and you've actually got a `t` followed by a U+0327 combining cedilla and that character is not allowed. That also cannot be the problem because U+0327 has always been allowed.

I then made a minimal test case, `<` followed by U+0163 and `/>` making sure the document is UTF-8 encoded and loaded that in a browser that I know checks for illegal characters in names.

data:application/xml,%3c%c5%a3%2f%3e

That worked fine so your problem description is incorrect or incomplete. I would recommend having the `xmllint` frontend to libxml2 around and do `xmllint example.xml`. That, too, works fine for my test case.
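To see which bytes and code points the data: URL test case above actually contains, the payload can be percent-decoded. This is a small Python sketch using only the standard library; the helper name `decode_data_url_payload` is made up for this example, and it assumes a plain (non-base64) data: URL.

```python
from urllib.parse import unquote_to_bytes

# The test case quoted above.
DATA_URL = "data:application/xml,%3c%c5%a3%2f%3e"


def decode_data_url_payload(url):
    """Return the raw bytes after the first comma of a non-base64 data: URL."""
    _, _, payload = url.partition(",")
    return unquote_to_bytes(payload)


raw = decode_data_url_payload(DATA_URL)
text = raw.decode("utf-8")  # raises UnicodeDecodeError if the bytes are not UTF-8

print(raw)
print([f"U+{ord(ch):04X}" for ch in text])
```

The two bytes `c5 a3` are the UTF-8 encoding of U+0163, so the document is `<`, U+0163, `/>`, as described.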

I take it from your later mail that you are getting `UnicodeEncodeError` in Python. You asked Python to encode U+0219 using the `ascii` codec and Python is telling you that U+0219 cannot be encoded using that codec. You have to check what kind of string `fromstring` expects (byte string or character string or what) and then check how to create such a string in Python from a literal in the source code. You might need a u'' string and call .encode('utf-8') on it.
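The distinction between character strings and byte strings can be sketched as follows. This assumes Python 3 (where every string literal is already a character string, so no u'' prefix is needed) and uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml; the sample document is the abridged one from the original question, written with U+0163.

```python
import xml.etree.ElementTree as ET

# A character string (str). \u0163 is t with cedilla, \u0103 is a with breve.
xml_text = (
    "<jude\u0163>"
    "<siruta>47</siruta>"
    "<nume>Jude\u0163ul Bac\u0103u</nume>"
    "</jude\u0163>"
)

# Encoding with the ascii codec fails for any character above U+007F.
ascii_error = None
try:
    xml_text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
    print("ascii codec failed:", exc.reason)

# Encoding explicitly as UTF-8 yields bytes that match the XML declaration.
declaration = '<?xml version="1.0" encoding="UTF-8"?>'
xml_bytes = (declaration + xml_text).encode("utf-8")

root = ET.fromstring(xml_bytes)
print(root.tag == "jude\u0163", root.findtext("siruta"))
```

Passing bytes together with a matching encoding declaration keeps the decoding step inside the parser, which avoids any implicit conversion through a default codec.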

-- Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Strainu

6:07 p.m.

2013/7/15 Bjoern Hoehrmann derhoermi@gmx.net:

...

I18N issues are far easier to debug with access to the actual bytes that demonstrate the problem. Copying and pasting text into an email adds and obscures potential problems. You should also always give the exact error messages you are receiving and not your interpretation of them.

Hi Bjoern,

Thanks for your extensive answer. I will keep that in mind. The actual url used for testing is below.

...

I am guessing your file is not actually UTF-8 encoded.

This doesn't seem to be the case:

...

wget "http://despresate.strainu.ro/judet.php?id=15&f=xml&t=all&commune..." -O 1.xml

2013-07-15 00:37:58 (178 KB/s) - `1.xml' saved [31081]

...

enca -L none 1.xml

Universal transformation format 8 bits; UTF-8

...

file -bi 1.xml

application/xml; charset=utf-8
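A similar check can be done in Python by attempting a strict UTF-8 decode of the raw bytes. This is only a sketch, and the helper name `is_valid_utf8` is made up for this example. A successful decode shows the bytes are well-formed UTF-8, not that UTF-8 was the intended encoding.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Bytes produced by a UTF-8 encoder pass the check.
print(is_valid_utf8("jude\u021b".encode("utf-8")))

# The same word in ISO-8859-2 (using U+0163) ends in byte 0xFE, which is never valid in UTF-8.
print(is_valid_utf8("jude\u0163".encode("iso-8859-2")))
```

For a downloaded file such as 1.xml, the bytes would be read with `open("1.xml", "rb").read()` before calling the helper.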

...

I then made a minimal test case, `<` followed by U+0163 and `/>` making sure the document is UTF-8 encoded and loaded that in a browser that I know checks for illegal characters in names.

data:application/xml,%3c%c5%a3%2f%3e

That worked fine so your problem description is incorrect or incomplete. I would recommend having the `xmllint` frontend to libxml2 around and do `xmllint example.xml`. That, too, works fine for my test case.

xmllint works for me too (for 1.xml). Still, Firefox insists there is a problem in the xml file, but Chromium is ok with the same file.

...

I take it from your later mail that you are getting `UnicodeEncodeError` in Python. You asked Python to encode U+0219 using the `ascii` codec and Python is telling you that U+0219 cannot be encoded using that codec. You have to check what kind of string `fromstring` expects (byte string or character string or what) and then check how to create such a string in Python from a literal in the source code. You might need a u'' string and call .encode('utf-8') on it.

Correct, that was simply not utf8, my mistake. Reading directly from the file (including the http url) works here too.

Still, it seems to me that Unicode character support in tag names is sketchy. Would you recommend that I go ahead with those names, or would it be wiser, for the sake of reusers, to keep to the ASCII letters?

Thanks all for your help, Strainu


Bjoern Hoehrmann

6:29 p.m.

* Strainu wrote:

...

...
wget "http://despresate.strainu.ro/judet.php?id=15&f=xml&t=all&commune..." -O 1.xml

2013-07-15 00:37:58 (178 KB/s) - `1.xml' saved [31081]

That uses U+021B and not U+0163. U+021B was not allowed in element type names in the fourth edition of the XML 1.0 specification (but is allowed now in the fifth edition).
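The two characters look nearly identical on screen, so it helps to list the code points a name really contains. A minimal Python sketch using the standard `unicodedata` module follows; the helper name `describe_non_ascii` is made up for this example.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (code point, Unicode name) pairs for each distinct non-ASCII character."""
    seen = []
    for ch in text:
        if ord(ch) > 0x7F and ch not in seen:
            seen.append(ch)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in seen]


# Same-looking tag name, two different code points.
print(describe_non_ascii("jude\u0163"))
print(describe_non_ascii("jude\u021b"))
```

The first call reports LATIN SMALL LETTER T WITH CEDILLA and the second LATIN SMALL LETTER T WITH COMMA BELOW, which is the difference that matters for the fourth-edition name rules.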

...

Still, it seems to me that Unicode character support in tag names is sketchy. Would you recommend that I go ahead with those names, or would it be wiser, for the sake of reusers, to keep to the ASCII letters?

If you stick to the characters allowed in the fourth edition only, see http://www.w3.org/TR/2006/REC-xml-20060816/#NT-Name, you should have only the usual problems (like using the non-ascii characters in source code meant to process documents of this kind and failing due to i18n issues in their programming environment).
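If the ASCII-only route is chosen instead, tag names can be derived from the labels by stripping diacritics. The following Python sketch is one possible way to do that, not something proposed in this thread; the helper name `ascii_fold` is made up, and it relies on Unicode compatibility decomposition (NFKD), which splits the Romanian letters into a base letter plus a combining mark.

```python
import unicodedata


def ascii_fold(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop anything that still has no ASCII equivalent after decomposition.
    return stripped.encode("ascii", "ignore").decode("ascii")


# Both spellings of the letter fold to a plain "t".
print(ascii_fold("jude\u0163"))
print(ascii_fold("jude\u021b"))
print(ascii_fold("Jude\u021bul Bac\u0103u"))
```

The original label with diacritics could still be carried in element content or an attribute, so reusers who want display text do not lose it.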
