Unclean UTF-8 in dump, due to stale templatelinks - Wikitech-l

16 Aug 2006


      When I download the dump
http://download.wikimedia.org/svwiki/20060808/svwiki-20060808-templatelinks....
and uncompress it, I find violations of UTF-8 (apparently 
remainders of ISO-8859-1) in these records:
0xf6 in    (141524,10,'F\xf6rfattarstub'),
0xf6 in    (154217,10,'Geografistub-Gr\×f6nland'),
0xe4 in    (147111,10,'Japanskt_L\xe4n'),
0xe4 in    (145703,10,'Motorv\xe4gar_i_Sverige'),
0xc4 in    (125122,10,'RA\xc4'),
0xe5 in    (146360,10,'Sk\xe5despelarstub'),
0xf6 and 0xd6 in  (160822,10,'S\xf6dra_\xd6sterbotten'),
0xe4 in    (145703,10,'TrafikplatsLandsv\xe4g'),
I could still import this SQL dump into mysql (4.0), but when I 
open the SQL dump file in GNU Emacs (22.0.50) it doesn't go into 
Unicode mode as it does for a clean UTF-8 file.
I've found no errors in some other files I've looked at.
This is a total of 9 violations in 8 records referring to 7 
different pages (page ID 145703 appears twice) out of 330,000 
records, so no real reason for panic.  These seven page IDs are 
not present in the page.sql dump, so apparently stale link records 
that should have been removed from the database.  If I run the 
inner join:
select page_namespace, page_title, tl_namespace, tl_title
  from page, templatelinks where page_id = tl_from;
the result is clean UTF-8.  But the result is 2076 rows shorter 
than the templatelinks table:
select count(*)
  from page, templatelinks where page_id=tl_from;
  328349
select count(*)
  from templatelinks;
  330425
-- 
  Lars Aronsson (lars@aronsson.se)
  Aronsson Datateknik - http://aronsson.se