All,
I've been struggling to track this down for a few hours. This file is a SQL dump, and its header says it's UTF-8:
http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz
but:
$ isutf8 zh-langlinks.sql
zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8 code
$ head -204 zh-langlinks.sql | tail -1 | head -c 520750 | tail -c 50 | hexdump -C
00000000  64 69 61 3a 43 6f f6 72  64 69 6e 61 74 69 65 20  |dia:Co.rdinatie |
00000010  65 78 74 65 72 6e 65 20  70 75 62 6c 69 63 69 74  |externe publicit|
00000020  65 69 74 2f 69 6e 74 65  72 6e 61 74 69 6f 6e 61  |eit/internationa|
00000030  61 6c                                             |al|
00000032
There might be other occurrences, but one is enough to make my import scripts crash, so... you guys are warned.
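To check whether there are other occurrences, something along these lines might help. It is only a minimal sketch: the function name `find_invalid_utf8` and the error-handler name `record_offsets` are made up for illustration, and it works on bytes already read into memory rather than on the dump file itself.

```python
import codecs


def find_invalid_utf8(data: bytes) -> list[int]:
    """Return the byte offsets in `data` where UTF-8 decoding fails."""
    offsets = []

    def record(err):
        # Remember where the decoder gave up, then resume after the bad bytes.
        offsets.append(err.start)
        return ("\ufffd", err.end)

    # "record_offsets" is a hypothetical handler name chosen for this sketch.
    codecs.register_error("record_offsets", record)
    data.decode("utf-8", errors="record_offsets")
    return offsets


# The first bytes from the hexdump above: 0xf6 sits at offset 6.
sample = b"dia:Co\xf6rdinatie"
bad_offsets = find_invalid_utf8(sample)
```

For the real dump one would presumably open the file in binary mode and call this per line, adding the line's starting offset to each result.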
The issue is that the bad character was added in 2004, see
https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385
before there were aggressive checks for that sort of thing. Garbage in, garbage out... Neither the dump producer scripts nor the db table inserts are going to alter data they are given on the grounds that it's bad utf8. At least this particular instance is easy to fix for any zhwiki editor.
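Since the dump will keep carrying such bytes as they are, an import script may want to tolerate them instead of crashing. The following is a rough sketch under a couple of assumptions: `decode_lenient` is a hypothetical helper, not part of any dump tooling, and reading the stray 0xf6 as an ISO-8859-1 "ö" is a guess based on the hexdump in the original report.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8; on failure, substitute U+FFFD for invalid bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("utf-8", errors="replace")


# The 50 bytes shown in the hexdump of the original report.
raw = b"dia:Co\xf6rdinatie externe publiciteit/internationaal"

# 0xf6 can never appear in valid UTF-8, so strict decoding fails here.
lenient = decode_lenient(raw)

# Assumption: the byte was typed as ISO-8859-1, where 0xf6 is "ö".
guessed = raw.decode("latin-1")
```

Replacing the byte loses the original character, so this only keeps the import running; fixing the page on zhwiki is still the real cure.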
Ariel
On Sat, 05-01-2013 at 19:58 +0100, Mathieu Poumeyrol wrote:
All,
I've been struggling to track this down for a few hours. This file is a SQL dump, and its header says it's UTF-8:
http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz
but:
$ isutf8 zh-langlinks.sql
zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8 code
$ head -204 zh-langlinks.sql | tail -1 | head -c 520750 | tail -c 50 | hexdump -C
00000000  64 69 61 3a 43 6f f6 72  64 69 6e 61 74 69 65 20  |dia:Co.rdinatie |
00000010  65 78 74 65 72 6e 65 20  70 75 62 6c 69 63 69 74  |externe publicit|
00000020  65 69 74 2f 69 6e 74 65  72 6e 61 74 69 6f 6e 61  |eit/internationa|
00000030  61 6c                                             |al|
00000032
There might be other occurrences, but one is enough to make my import scripts crash, so... you guys are warned.
Ariel T. Glenn, 08/01/2013 09:26:
The issue is that the bad character was added in 2004, see
https://zh.wikipedia.org/w/index.php?title=Wikipedia:%E6%96%B0%E9%97%BB%E7%A8%BF/2004%E5%B9%B42%E6%9C%88_%28%E7%AE%80%29&action=edit&oldid=386385
I've requested removal and revdeletion:
https://zh.wikipedia.org/w/index.php?diff=24435408&oldid=691618
Mathieu, please follow the discussion in case they ask questions about UTF-8 or about why this matters to you, neither of which I'd be able to answer...
Thanks,
Nemo
xmldatadumps-l@lists.wikimedia.org