All,
I've been struggling to track this down for a few hours. This file is a SQL dump, and the header says it is UTF-8:
http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz
but:
$ isutf8 zh-langlinks.sql
zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8 code
$ head -204 zh-langlinks.sql | tail -1 | head -c 520750 | tail -c 50 | hexdump -C
00000000  64 69 61 3a 43 6f f6 72  64 69 6e 61 74 69 65 20  |dia:Co.rdinatie |
00000010  65 78 74 65 72 6e 65 20  70 75 62 6c 69 63 69 74  |externe publicit|
00000020  65 69 74 2f 69 6e 74 65  72 6e 61 74 69 6f 6e 61  |eit/internationa|
00000030  61 6c                                             |al|
00000032
There might be other occurrences, but one is enough to make my import scripts crash, so... you guys are warned.
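
P.S. For anyone who wants to check their own copy for more of these, here is a minimal Python sketch that lists every invalid UTF-8 sequence in a chunk of bytes rather than stopping at the first one. The function name find_invalid_utf8 and the limit parameter are my own inventions, not part of any tool. The lone f6 byte in the hexdump above looks like a Latin-1 "ö" (as in "Coördinatie"), but that is only my guess at how it got there.

```python
def find_invalid_utf8(data: bytes, limit: int = 10):
    """Return up to `limit` (byte_offset, offending_bytes) pairs found in data."""
    hits = []
    pos = 0
    while pos < len(data) and len(hits) < limit:
        try:
            # Decode the remainder; success means nothing else is broken.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice we decoded.
            start, end = pos + exc.start, pos + exc.end
            hits.append((start, data[start:end]))
            pos = end  # resume scanning right after the bad sequence
    return hits


# The first bytes from the hexdump above: "dia:Co", then 0xf6, then "rdinatie".
sample = b"dia:Co\xf6rdinatie"
for offset, bad in find_invalid_utf8(sample):
    print(f"offset {offset}: {bad.hex()} (as Latin-1: {bad.decode('latin-1')!r})")
```

To run it over the dump, read the file in binary mode line by line and add each line's starting byte offset to the offsets returned. Re-slicing the buffer after every hit is wasteful on long lines with many errors, so treat this as a diagnostic rather than something to put in an import pipeline.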