Re: [Wikitech-l] XML and Unicode chars in tag names

14 Jul 2013


      * Strainu wrote:
...
I know this is probably the wrong list to ask, but I don't know a
better one and I know we have a very good i18n team so I'm hoping
someone here can help me.
I'm trying to parse the following xml (abridged for brevity):
I18N issues are far easier to debug with access to the actual bytes that
demonstrate the problem. Copying and pasting text into an email can both
introduce and hide problems. You should also always give the exact error
messages you are receiving and not your interpretation of them.
I am guessing your file is not actually UTF-8 encoded.
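
One way to test that guess is to look at the raw bytes rather than at pasted text. The following is a minimal sketch, assuming Python 3; the helper name `utf8_problem` is hypothetical, and ISO 8859-2 is used only as an example of a non-UTF-8 encoding that contains U+0163.

```python
def utf8_problem(data: bytes):
    """Return None if data is valid UTF-8, else describe the first bad byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"byte 0x{data[exc.start]:02x} at offset {exc.start}: {exc.reason}"
    return None

# The same three characters, saved in two different encodings.
good = "<\u0163/>".encode("utf-8")       # U+0163 becomes the bytes C5 A3
bad = "<\u0163/>".encode("iso-8859-2")   # U+0163 becomes the single byte FE

print(good.hex(), utf8_problem(good))
print(bad.hex(), utf8_problem(bad))
```

For a real file, read it in binary mode (`open(path, "rb").read()`) and pass the result to the helper, so that no decoding happens before the check.
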
My first thought was that you might be using a character that was
forbidden in the first through fourth editions of the XML specification
but allowed in the fifth edition, which added many characters to the
definition of legal names. U+0163, however, has always been allowed.
The second thought was that your eyes might be deceiving you and you've
actually got a `t` followed by a U+0327 combining cedilla and that
character is not allowed. That also cannot be the problem because U+0327
has always been allowed.
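
To see which of the two forms a string really contains, the code points can be listed one by one. This is a small sketch, assuming Python 3 and the standard `unicodedata` module; the helper name `describe` is hypothetical.

```python
import unicodedata

def describe(text):
    """List each code point in text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

precomposed = "\u0163"   # one code point
decomposed = "t\u0327"   # base letter plus combining cedilla

print(describe(precomposed))
print(describe(decomposed))

# The two forms look the same on screen and are canonically equivalent,
# but they are different code point sequences until normalized.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```
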
I then made a minimal test case: `<` followed by U+0163 and `/>`. I made
sure the document was UTF-8 encoded and loaded it in a browser that I
know checks for illegal characters in names.
data:application/xml,%3c%c5%a3%2f%3e
That worked fine, so your problem description is incorrect or incomplete.
I would recommend keeping the `xmllint` frontend to libxml2 around and
running `xmllint example.xml`. That, too, works fine for my test case.
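
The same minimal test case can also be fed to a Python parser. The sketch below assumes Python 3 and the standard library's `xml.etree.ElementTree`, which may not be the parser you are using; it decodes the percent-escaped bytes from the data URL above and parses them as they are.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote_to_bytes

# The payload of the data: URL, turned back into the raw UTF-8 bytes.
raw = unquote_to_bytes("%3c%c5%a3%2f%3e")

# Passing bytes lets the parser do the decoding itself (UTF-8 by default).
root = ET.fromstring(raw)

print(raw.hex())
print(ascii(root.tag))
```
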
I take it from your later mail that you are getting `UnicodeEncodeError`
in Python. You asked Python to encode U+0219 using the `ascii` codec and
Python is telling you that U+0219 cannot be encoded using that codec.
You have to check what kind of string `fromstring` expects (byte string
or character string or what) and then check how to create such a string
in Python from a literal in the source code. You might need a u'' string
and have to call `.encode('utf-8')` on it.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
