2013/7/15 Bjoern Hoehrmann derhoermi@gmx.net:
I18N issues are far easier to debug with access to the actual bytes that demonstrate the problem. Copying and pasting text into an email adds and obscures potential problems. You should also always give the exact error messages you are receiving and not your interpretation of them.
Hi Bjoern,
Thanks for your extensive answer. I will keep that in mind. The actual URL used for testing is below.
I am guessing your file is not actually UTF-8 encoded.
This doesn't seem to be the case:
wget "http://despresate.strainu.ro/judet.php?id=15&f=xml&t=all&commune..." -O 1.xml
2013-07-15 00:37:58 (178 KB/s) - `1.xml' saved [31081]
enca -L none 1.xml
Universal transformation format 8 bits; UTF-8
file -bi 1.xml
application/xml; charset=utf-8
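For anyone without `enca` or `file` at hand, roughly the same check can be sketched in Python by attempting a strict decode. This is only an illustration: the helper name `is_valid_utf8` is made up here, and the sample byte strings are stand-ins, not the contents of 1.xml.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# U+0163 encoded as UTF-8 is the two bytes C5 A3.
good = is_valid_utf8(b"<\xc5\xa3/>")
# The byte FE never appears in well-formed UTF-8.
bad = is_valid_utf8(b"<\xfe/>")
print(good, bad)
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the helper.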
I then made a minimal test case, `<` followed by U+0163 and `/>`, making sure the document is UTF-8 encoded, and loaded that in a browser that I know checks for illegal characters in names.
data:application/xml,%3c%c5%a3%2f%3e
That worked fine, so your problem description is incorrect or incomplete. I would recommend having the `xmllint` frontend to libxml2 around and running `xmllint example.xml`. That, too, works fine for my test case.
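The minimal test case can also be reproduced outside a browser. The sketch below percent-decodes the payload of the data: URL and parses it, with Python's standard `xml.etree.ElementTree` standing in for the browser and for `xmllint`. That substitution is an assumption of this example, and the parser may not apply exactly the same name checks.

```python
from urllib.parse import unquote_to_bytes
import xml.etree.ElementTree as ET

# Payload of the data: URL above, still percent-encoded.
payload = unquote_to_bytes("%3c%c5%a3%2f%3e")

# The raw bytes are '<', C5 A3 (U+0163 in UTF-8), '/', '>'.
document = payload.decode("utf-8")

# Parsing the bytes directly; with no XML declaration the parser assumes UTF-8.
root = ET.fromstring(payload)
print(root.tag == "\u0163")
```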
xmllint works for me too (for 1.xml). Still, Firefox insists there is a problem in the XML file, while Chromium is OK with the same file.
I take it from your later mail that you are getting `UnicodeEncodeError` in Python. You asked Python to encode U+0219 using the `ascii` codec and Python is telling you that U+0219 cannot be encoded using that codec. You have to check what kind of string `fromstring` expects (byte string or character string or what) and then check how to create such a string in Python from a literal in the source code. You might need a u'' string and to call .encode('utf-8') on it.
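A minimal sketch of that failure and the fix, written for Python 3, where every `str` literal is already a character string and no u'' prefix is needed. It assumes the `fromstring` in question is `xml.etree.ElementTree.fromstring`, which the thread does not confirm, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element named U+0163 with U+0219 as its text.
text = "<\u0163>\u0219</\u0163>"

# Encoding with the ascii codec fails, since neither character is ASCII.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 gives a byte string the parser accepts.
data = text.encode("utf-8")
root = ET.fromstring(data)
print(ascii_failed, root.tag == "\u0163", root.text == "\u0219")
```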
Correct, that string was simply not UTF-8, my mistake. Reading directly from the file (including from the HTTP URL) works here too.
Still, it seems to me that support for Unicode characters in tag names is sketchy. Would you recommend that I go ahead with those names, or would it be wiser, for the sake of reusers, to keep to ASCII letters?
Thanks all for your help, Strainu
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l