* Strainu wrote:
I know this is probably the wrong list to ask, but I don't know a better one and I know we have a very good i18n team so I'm hoping someone here can help me.
I'm trying to parse the following XML (abridged for brevity):
I18N issues are far easier to debug with access to the actual bytes that demonstrate the problem. Copying and pasting text into an email can both introduce and obscure problems. You should also always give the exact error messages you are receiving, not your interpretation of them.
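One way to capture the actual bytes is a hex dump, which survives email intact. This is only a sketch in Python 3; the helper name `hex_dump` and the sample bytes are made up for illustration.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of two-digit hex values, `width` bytes per row."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        rows.append(" ".join(f"{byte:02x}" for byte in chunk))
    return "\n".join(rows)

# Illustrative sample: `<`, U+0163 encoded as UTF-8, `/>`.
sample = b"<\xc5\xa3/>"
print(hex_dump(sample))  # 3c c5 a3 2f 3e
```

For a real file, read it in binary mode (`open(path, "rb").read()`) so that no decoding happens before the dump.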
I am guessing your file is not actually UTF-8 encoded.
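A quick way to test that guess, assuming Python 3 is at hand, is to try decoding the raw bytes as UTF-8. The function name `is_valid_utf8` is hypothetical, and ISO-8859-2 is used below only as an example of an encoding the file might really be in.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# U+0163 in UTF-8 is the two bytes C5 A3.
utf8_bytes = "<\u0163/>".encode("utf-8")
# In ISO-8859-2 the same character is the single byte FE,
# which can never appear in valid UTF-8.
latin2_bytes = "<\u0163/>".encode("iso-8859-2")

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin2_bytes))  # False
```

A clean decode does not prove the file is UTF-8, but a failed one proves it is not.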
My first thought was that you might be using a character that was forbidden in names by the first through fourth editions of the XML specification and only became legal in the fifth edition, which added many characters to the definition of legal names. U+0163, however, has always been allowed.
My second thought was that your eyes might be deceiving you: you might actually have a `t` followed by U+0327 (combining cedilla), and that character might not be allowed. That cannot be the problem either, because U+0327 has always been allowed.
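To tell the precomposed character from the `t` plus combining cedilla sequence without trusting your eyes, you can list the code points. A small sketch in Python 3 using the standard `unicodedata` module; the helper name `describe` is made up.

```python
import unicodedata

def describe(text: str) -> list:
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

precomposed = "\u0163"   # single code point
decomposed = "t\u0327"   # `t` followed by the combining cedilla

for label, value in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, describe(value))

# The two render alike but are different strings until normalized.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The same `describe` call works on text read from the actual file once it has been decoded.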
I then made a minimal test case, `<` followed by U+0163 and `/>`, made sure the document was UTF-8 encoded, and loaded it in a browser that I know checks for illegal characters in names:
data:application/xml,%3c%c5%a3%2f%3e
That worked fine, so your problem description is incorrect or incomplete. I would recommend keeping `xmllint`, the command-line frontend to libxml2, around and running `xmllint example.xml`. That, too, works fine for my test case.
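The same minimal test case can be repeated from Python 3. This sketch decodes the percent-encoded payload of the data URI above and hands the bytes to `xml.etree.ElementTree`, which I am assuming is an acceptable stand-in here; it is a different parser from the browser and from libxml2, so it complements those checks rather than replacing them.

```python
from urllib.parse import unquote_to_bytes
import xml.etree.ElementTree as ET

# The payload part of the data URI, percent-decoded to raw bytes.
payload = unquote_to_bytes("%3c%c5%a3%2f%3e")
print(payload)  # b'<\xc5\xa3/>'

# With no encoding declaration, the parser treats the bytes as UTF-8.
root = ET.fromstring(payload)
print(root.tag == "\u0163")  # True
```

If this parses but the real file does not, the difference between the two inputs is where to look next.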
I take it from your later mail that you are getting a `UnicodeEncodeError` in Python. You asked Python to encode U+0219 using the `ascii` codec, and Python is telling you that U+0219 cannot be encoded using that codec. You have to check what kind of string `fromstring` expects (a byte string, a character string, or something else) and then check how to create such a string in Python from a literal in the source code. You might need a `u''` string and to call `.encode('utf-8')` on it.
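The failure and one possible fix can be sketched as follows. Assumptions on my part: Python 3, where every string literal is already a character string and the `u''` prefix is optional, and that the `fromstring` in question is `xml.etree.ElementTree.fromstring`. The element name `name` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A character string containing U+0219 as element content.
text = "<name>\u0219</name>"

# Encoding with the ascii codec fails: U+0219 is outside the ASCII range.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(type(exc).__name__, exc.encoding)  # UnicodeEncodeError ascii

# Encoding explicitly as UTF-8 yields a byte string the parser accepts.
data = text.encode("utf-8")
root = ET.fromstring(data)
print(root.text == "\u0219")  # True
```

The point is to make the encoding step explicit, so that Python never falls back to the `ascii` codec on its own.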