When I download the dump http://download.wikimedia.org/svwiki/20060808/svwiki-20060808-templatelinks....
and uncompress it, I find violations of UTF-8 (apparently remainders of ISO-8859-1) in these records:
0xf6 in (141524,10,'F\xf6rfattarstub'), 0xf6 in (154217,10,'Geografistub-Gr\×f6nland'), 0xe4 in (147111,10,'Japanskt_L\xe4n'), 0xe4 in (145703,10,'Motorv\xe4gar_i_Sverige'), 0xc4 in (125122,10,'RA\xc4'), 0xe5 in (146360,10,'Sk\xe5despelarstub'), 0xf6 and 0xd6 in (160822,10,'S\xf6dra_\xd6sterbotten'), 0xe4 in (145703,10,'TrafikplatsLandsv\xe4g'),
I could still import this SQL dump into mysql (4.0), but when I open the SQL dump file in GNU Emacs (22.0.50) it doesn't go into Unicode mode as it does for a clean UTF-8 file.
I've found no errors in some other files I've looked at.
This is a total of 9 violations in 8 records referring to 7 different pages (page ID 145703 appears twice) out of 330,000 records, so no real reason for panic. These seven page IDs are not present in the page.sql dump, so apparently stale link records that should have been removed from the database. If I run the inner join:
select page_namespace, page_title, tl_namespace, tl_title from page, templatelinks where page_id = tl_from;
the result is clean UTF-8. But the result is 2076 rows shorter than the templatelinks table:
select count(*) from page, templatelinks where page_id=tl_from; 328349
select count(*) from templatelinks; 330425
wikitech-l@lists.wikimedia.org