All,
I've been struggling to track this down for a few hours. This file is a SQL dump, and
its header says it's UTF-8:
http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz
but:
$ isutf8 zh-langlinks.sql
zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8 code
$ head -204 zh-langlinks.sql | tail -1 | head -c 520750 | tail -c 50 | hexdump -C
00000000  64 69 61 3a 43 6f f6 72  64 69 6e 61 74 69 65 20  |dia:Co.rdinatie |
00000010  65 78 74 65 72 6e 65 20  70 75 62 6c 69 63 69 74  |externe publicit|
00000020  65 69 74 2f 69 6e 74 65  72 6e 61 74 69 6f 6e 61  |eit/internationa|
00000030  61 6c                                             |al|
00000032
There might be other occurrences, but one is enough to make my import scripts crash, so...
you guys are warned.
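
For anyone who wants to check a dump for further occurrences, here is a minimal
Python sketch that lists every invalid UTF-8 sequence rather than stopping at the
first one. The function name find_invalid_utf8 and the limit parameter are my own
invention, not part of any tool. For what it's worth, the lone byte f6 in the
hexdump is "ö" in ISO-8859-1, which would fit a Dutch title such as "Coördinatie",
but that is only a guess about where the byte came from.

```python
def find_invalid_utf8(lines, limit=10):
    """Scan an iterable of byte lines and report invalid UTF-8 sequences.

    Returns a list of (line_number, byte_offset, offending_bytes) tuples,
    where byte_offset counts from the start of the whole input.
    `limit` (a hypothetical knob) caps how many problems are collected.
    """
    problems = []
    offset = 0  # byte offset of the current line within the input
    for line_no, line in enumerate(lines, start=1):
        pos = 0
        while pos < len(line):
            try:
                line[pos:].decode("utf-8")
                break  # the rest of this line is valid
            except UnicodeDecodeError as err:
                # err.start / err.end are relative to the slice we decoded
                start, end = pos + err.start, pos + err.end
                problems.append((line_no, offset + start, line[start:end]))
                if len(problems) >= limit:
                    return problems
                pos = end  # resume right after the bad bytes
        offset += len(line)
    return problems


if __name__ == "__main__":
    import sys

    # Open in binary mode so Python does not try to decode for us.
    with open(sys.argv[1], "rb") as dump:
        for line_no, byte_offset, bad in find_invalid_utf8(dump):
            print(f"line {line_no}, byte offset {byte_offset}: {bad!r}")
```

Saved as, say, check_utf8.py (again a made-up name), it would be run as
"python3 check_utf8.py zh-langlinks.sql". Re-slicing the line after each error
copies the remainder, so a line with very many bad bytes will be slow; for a
quick check with a small limit that should not matter.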
--
K.