Hi all,
When exporting a page via /wiki/Special:Export/Blabla, some characters are not encoded correctly as UTF-8, for example \x0092 (single opening quote).
For an example see
http://en.wikipedia.org/wiki/Special:Export/User:Lunchboxhero
(Note that this is not my user page)
Regards, Stephan
On Sun, 2004-03-21 at 15:10, Stephan Walter wrote:
> when exporting a page by using /wiki/Special:Export/Blabla, there are some characters that won't be encoded correctly as UTF-8, for example \x0092 (single opening quote).
> For an example see
> http://en.wikipedia.org/wiki/Special:Export/User:Lunchboxhero
Well, that would be because en is an ISO 8859-1 wiki, and 0x92 doesn't mean "single opening quote" in ISO 8859-1. (It does mean that in Microsoft CP1252.) So it's functioning as designed.
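To make the difference between the two code pages concrete, here is a minimal Python sketch (Python is my choice for illustration, not what the wiki software uses). It decodes the single byte 0x92 under each encoding discussed in this thread. Note that CP1252 maps 0x92 to U+2019, which Unicode names RIGHT SINGLE QUOTATION MARK, while ISO 8859-1 maps it to a C1 control character.

```python
# The byte in question, taken from the thread.
raw = b"\x92"

# CP1252 maps 0x92 to U+2019 (a curly single quote).
cp1252_char = raw.decode("cp1252")

# ISO 8859-1 maps 0x92 to the C1 control character U+0092.
latin1_char = raw.decode("latin-1")

# In UTF-8 a lone 0x92 is a continuation byte with no lead byte,
# so it cannot be decoded on its own.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(hex(ord(cp1252_char)), hex(ord(latin1_char)), valid_utf8)
# prints: 0x2019 0x92 False
```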
Abigail Brady wrote:
> Well, that would be because en is an ISO 8859-1 wiki, and 0x92 doesn't mean "single opening quote" in ISO 8859-1. (It does mean that in Microsoft CP1252.) So it's functioning as designed.
en may be an ISO-8859 wiki, but the XML header of the exported articles says the XML is UTF-8:
<?xml version="1.0" encoding="UTF-8" ?>
<mediawiki version="0.1" xml:lang="en">
<page>
....
So IMHO there shouldn't be any non-UTF8 characters in the XML file.
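One way an exporter could keep that promise is to transcode the stored text before writing it under a UTF-8 declaration. The following is only a sketch under assumptions: the function name to_utf8 and the sample bytes are hypothetical, and treating the stored bytes as CP1252 first (falling back to ISO 8859-1) is my guess at a sensible policy, not what the export code actually does.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode legacy 8-bit text as UTF-8 (hypothetical helper)."""
    try:
        text = raw.decode(source_encoding)
    except UnicodeDecodeError:
        # CP1252 leaves a few byte values undefined; ISO 8859-1
        # accepts every byte, so it serves as a last resort.
        text = raw.decode("latin-1")
    return text.encode("utf-8")


# Sample page text containing the 0x92 byte from the thread.
page_text = b"It\x92s a test"
converted = to_utf8(page_text)
print(converted)  # prints: b'It\xe2\x80\x99s a test'
```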
Regards, Stephan
Abigail Brady morwen@evilmagic.org writes:
> On Sun, 2004-03-21 at 17:25, Stephan Walter wrote:
>> So IMHO there shouldn't be any non-UTF8 characters in the XML file.
> Well, like they say, garbage in, garbage out...
No, where XML is concerned that's wrong. Run a parser on the file you are creating and, if the parser signals an error, take appropriate action.
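As a rough illustration of that check, here is a sketch using the XML parser from Python's standard library. The helper name is_well_formed and the two sample documents are made up for this example; a real exporter would decide for itself what the appropriate action is when the check fails.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


header = b'<?xml version="1.0" encoding="UTF-8" ?>'

# Declared as UTF-8 but containing a raw 0x92 byte, as in the export.
bad = header + b"<page>It\x92s</page>"

# The same text with the quote properly encoded as UTF-8.
good = header + b"<page>It\xe2\x80\x99s</page>"

print(is_well_formed(bad), is_well_formed(good))
```

The first document should be rejected, because the parser trusts the declared encoding and finds a byte sequence that is not valid UTF-8.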
Of course, the problem is the wiki file format (maybe you want to address this issue?); this file format is such a pain to me that I'm prepared to leave the project some time soon.
wikitech-l@lists.wikimedia.org