I finally got the big drive on my Linux box cleared off and a backup dump
of en imported, so I can get ready to run some conversion tests. First,
some quick statistics from checking for the presence of high characters
in the 2004-06-16 dump:
10.4% of cur entries need their page content fixed.
1.9% of cur entries need their titles fixed.
Smaller proportions need their comment fields or usernames fixed.
[Exact proportion of old entry text can't be checked easily due to
compression.]
1.7% of old revisions need their titles fixed.
Smaller proportions need their comment fields or usernames fixed.
1.8% of watchlist entries need their titles fixed.
0.4% of registered usernames need to be fixed.
0.7% of images need to be renamed.
1.4% of images need their upload comments fixed.
(This is not an exhaustive list of fields needing conversion.)
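For concreteness, the check behind these numbers can be sketched in
Python (a hypothetical helper, not the actual script used; it assumes
each field is available as a raw Latin-1 byte string):

```python
# A byte above 0x7F is a "high character": its representation changes
# when the field is recoded from Latin-1 to UTF-8, so the row needs
# conversion. Pure-ASCII rows are byte-identical in both encodings.
def has_high_chars(raw: bytes) -> bool:
    return any(b > 0x7F for b in raw)

def high_char_fraction(rows) -> float:
    """rows: iterable of raw byte strings for one field.
    Returns the fraction of rows that would need conversion."""
    total = affected = 0
    for raw in rows:
        total += 1
        if has_high_chars(raw):
            affected += 1
    return affected / total if total else 0.0
```

Running something like this once per field (cur_text, cur_title,
cur_comment, ...) yields the percentages listed above.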
This makes it pretty clear that a 'sparse' conversion, one that updates
only what actually needs updating, should speed things up tremendously
over the basic 'dump everything, convert, and load it back in' approach
we used on fr.
Less than 2% of titles & usernames need to be fixed; this step can be
done relatively quickly on all affected tables (cur, old, brokenlinks,
categorylinks, watchlist, user, image, oldimage) to provide consistency
for queries which must key on *_title or *_user_text and thus can't
tolerate different tables containing different forms of the data.
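The sparse step amounts to recoding only the rows that contain high
bytes and leaving the ASCII-only majority untouched. A minimal sketch
(hypothetical function names, with rows modeled as (key, bytes) pairs):

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    # Latin-1 maps every byte directly to the Unicode code point of the
    # same value, so this decode step can never fail.
    return raw.decode('latin-1').encode('utf-8')

def sparse_convert(rows):
    """Yield (key, converted_value) only for rows that actually change.
    Pure-ASCII rows encode to identical bytes and are skipped, which is
    where the big speedup over a full dump/convert/reload comes from."""
    for key, raw in rows:
        converted = latin1_to_utf8(raw)
        if converted != raw:
            yield key, converted
```

Each yielded pair would become a single targeted UPDATE rather than a
full-table rewrite.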
It should be possible, as some have suggested, to use either heuristics
or explicit marking to do run-time conversion of cur_text and old_text,
and perhaps cur_comment, old_comment, and similar fields. In this case
we'd still want to do the title conversion at data load time, since we
need the real encoding for parsing to match up to titles. This would
avoid downtime for converting the 10.4% of cur_text material that needs
it (45,862 rows), but it requires changes to MediaWiki itself that need
to be coded and tested.
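The heuristic variant could look something like this (a sketch, not
tested MediaWiki code): text that decodes cleanly as UTF-8 is assumed
already converted, and anything else is treated as legacy Latin-1.

```python
def runtime_decode(raw: bytes) -> str:
    """Heuristic run-time conversion for mixed-encoding text fields.
    Caveat: rare Latin-1 byte sequences that happen to form valid UTF-8
    would be misread, which is the argument for explicit marking."""
    try:
        # Valid UTF-8 (including plain ASCII): pass through unchanged.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Invalid as UTF-8: assume unconverted legacy Latin-1.
        return raw.decode('latin-1')
```

Explicit marking (e.g. a per-row flag recording the encoding) avoids
that ambiguity at the cost of a schema change.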
The remaining latin-1 wikis will have a rather higher incidence of high
characters than English does, but they should still benefit from this
approach by skipping the bulk text recoding.
I'd hoped to have some conversion test results by now but had some false
starts with the database setup that used up the weekend. :( I'll try to
get the code ready and running in the next few days.
-- brion vibber (brion @ pobox.com)