I finally got my Linux box's big drive cleared off and a backup dump of en imported, so I can get ready to run some conversion tests. First, quick statistics from checking for the presence of high characters in the 2004-06-16 dump:
10.4% of cur entries need their page content fixed.
1.9% of cur entries need their titles fixed.
Smaller portions are affected by their comment fields or usernames.
[Exact proportion of old entry text can't be checked easily due to compression.]
1.7% of old revisions need their titles fixed.
Smaller portions are affected by their comment fields or usernames.
1.8% of watchlist entries need their titles fixed.
0.4% of registered usernames need to be fixed.
0.7% of images need to be renamed.
1.4% of images need their upload comments fixed.
(This is not an exhaustive list of fields needing conversion.)
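For anyone curious, the check behind these numbers is basically just a scan for bytes outside the 7-bit ASCII range. A minimal sketch of the idea in Python (not the actual script I ran over the dump; sample data is made up):

    # A row needs conversion if its latin-1 text contains any byte >= 0x80.
    def needs_conversion(raw):
        """True if this latin-1 field contains bytes outside 7-bit ASCII."""
        return any(b >= 0x80 for b in raw)

    def high_char_fraction(rows):
        """Fraction of rows (an iterable of bytes values) that need recoding."""
        rows = list(rows)
        flagged = sum(1 for r in rows if needs_conversion(r))
        return flagged / len(rows) if rows else 0.0

    # Example: one of three sample cur_text values contains a high byte.
    sample = [b"plain ascii text", b"caf\xe9 (latin-1 e-acute)", b"more ascii"]
    print("%.1f%%" % (100 * high_char_fraction(sample)))   # -> 33.3%

Any row that trips this test has to be recoded; everything else can be left alone.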
This makes it pretty clear that a 'sparse' conversion, updating only what actually needs changing, should speed things up tremendously over the basic 'dump everything, convert, and load it back in' approach we used on fr.
Less than 2% of titles & usernames need to be fixed; this step can be done relatively quickly on all affected tables (cur, old, brokenlinks, categorylinks, watchlist, user, image, oldimage) to provide consistency for queries which must key on *_title or *_user_text and thus can't tolerate different tables holding different forms of the same data.
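As a rough illustration, the per-field logic for that sparse pass could look something like the sketch below; the UPDATE loop over the tables listed above is assumed and not shown, and none of this is the real conversion code:

    # Recode a latin-1 field to UTF-8 only when it actually contains high bytes,
    # so pure-ASCII rows (the vast majority) are never rewritten.
    def recode_title(latin1_bytes):
        """Return the UTF-8 form of a latin-1 title, or None if no change is needed."""
        if all(b < 0x80 for b in latin1_bytes):
            return None                                     # ASCII-only: skip the UPDATE
        return latin1_bytes.decode("latin-1").encode("utf-8")

    # e.g. latin-1 b"Z\xfcrich" becomes UTF-8 b"Z\xc3\xbcrich"; plain ASCII is untouched.
    assert recode_title(b"Z\xfcrich") == b"Z\xc3\xbcrich"
    assert recode_title(b"London") is None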
It should be possible, as some have suggested, to use either heuristics or explicit marking to do run-time conversion of cur_text and old_text, and perhaps cur_comment, old_comment, and similar bits. In this case we'd want to do the conversion at data load time, since we need the real encoding for parsing to match up to titles. This would avoid downtime for converting the 10.4% of cur_text material that needs it (45,862 rows), but it requires changes to MediaWiki itself that would need to be coded and tested.
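To make the heuristic option concrete, here's a small sketch, assuming a stored value is either already UTF-8 (new or converted rows) or still latin-1 (legacy rows); a decode attempt usually tells them apart, though a latin-1 string that happens to be valid UTF-8 would be misread, which is one argument for explicit marking instead. Where exactly MediaWiki would hook this in is left open here:

    def load_text(raw):
        try:
            return raw.decode("utf-8")      # already UTF-8: use as-is
        except UnicodeDecodeError:
            return raw.decode("latin-1")    # legacy latin-1 row: recode on the fly

    # A legacy latin-1 row and an already-converted row come out identical:
    assert load_text(b"caf\xe9") == load_text(b"caf\xc3\xa9") == "caf\xe9"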
The remaining latin-1 wikis will have a rather higher incidence of high characters than English does, but should still benefit from this approach by skipping the bulk text recoding.
I'd hoped to have some conversion test results by now but had some false starts with the database setup that used up the weekend. :( I'll try to get the code ready and running in the next few days.
-- brion vibber (brion @ pobox.com)
This is perhaps only tangentially related, but I've been wondering for a while whether downtime for various upgrades / database conversions could be minimised by using some kind of "changelog mode". The idea being that a copy could be made of the database and sent somewhere else for processing [I'm not sure where, since processing two copies of en: on one server doesn't seem very likely, but that's a detail for later]; meanwhile, the original copy is still fully editable, but creates a log of all edits (or rather, actions that change the DB) made since the copy was taken. Then, once the bulk of the conversion is done, the main db goes into read-only mode while the new version is "patched" with the changes from the log (obviously, the code which was altering the DB would then have to alter the logs to match, but in many cases this shouldn't be too hard, since the changes themselves aren't usually complex, just time-consuming).
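To make the idea a bit more concrete, here's a minimal sketch of what such a changelog and replay step might look like; none of this exists in MediaWiki, and all the names are invented:

    # Rough sketch of "changelog mode": while the copied database is being
    # converted elsewhere, every DB-changing action on the live wiki is also
    # appended to a log; once the conversion finishes, the site goes read-only
    # briefly and the log is replayed against the converted copy.
    import json, time

    class ChangeLog:
        def __init__(self, path):
            self.path = path

        def record(self, action, **fields):
            """Append one DB-changing action (edit, move, delete, ...) to the log."""
            entry = {"ts": time.time(), "action": action, "fields": fields}
            with open(self.path, "a") as f:
                f.write(json.dumps(entry) + "\n")

        def replay(self, apply):
            """Re-apply every logged action, in order, via the supplied callback."""
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    apply(entry["action"], entry["fields"])

The conversion code would presumably supply the apply callback, translating each logged action into the equivalent change on the already-converted data.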
I'm not sure how feasible this is, or even how much would be gained in practice, but when en: was converted to 1.3 it seemingly had to be locked for essentially a whole day while its links table was rebuilt - my hunch is that it would take far less than a day to "replay" a day's worth of edits, if there were a feature to log and replay them.
What does anyone with more experience think? Just a pipe dream, or something worth considering as an investment for the future?