I finally got my Linux box's big drive cleared off and a backup dump of en imported, so I can get ready to run some conversion tests. First, quick statistics from checking for the presence of high characters in the 2004-06-16 dump:
10.4% of cur entries need their page content fixed.
1.9% of cur entries need their titles fixed.
Smaller portions are affected by their comment fields or usernames.
[Exact proportion of old entry text can't be checked easily due to compression.]
1.7% of old revisions need their titles fixed.
Smaller portions are affected by their comment fields or usernames.
1.8% of watchlist entries need their titles fixed.
0.4% of registered usernames need to be fixed.
0.7% of images need to be renamed.
1.4% of images need their upload comments fixed.
(This is not an exhaustive list of fields needing conversion.)
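For anyone curious, the check behind these numbers is basically just a scan for bytes outside the 7-bit ASCII range. A minimal sketch of the idea in Python (not the actual script I ran over the dump; sample data is made up):

    # A row needs conversion if its latin-1 text contains any byte >= 0x80.
    def needs_conversion(raw):
        """True if this latin-1 field contains bytes outside 7-bit ASCII."""
        return any(b >= 0x80 for b in raw)

    def high_char_fraction(rows):
        """Fraction of rows (an iterable of bytes values) that need recoding."""
        rows = list(rows)
        flagged = sum(1 for r in rows if needs_conversion(r))
        return flagged / len(rows) if rows else 0.0

    # Example: one of three sample cur_text values contains a high byte.
    sample = [b"plain ascii text", b"caf\xe9 (latin-1 e-acute)", b"more ascii"]
    print("%.1f%%" % (100 * high_char_fraction(sample)))   # -> 33.3%

Any row that trips this test has to be recoded; everything else can be left alone.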
This makes it pretty clear that a 'sparse' conversion, updating only what actually needs changing, should speed things up tremendously over the basic 'dump everything, convert, and load it back in' approach we used on fr.
Less than 2% of titles & usernames need to be fixed; this step can be done relatively quickly on all affected tables (cur, old, brokenlinks, categorylinks, watchlist, user, image, oldimage) to provide consistency for queries which must key on *_title or *_user_text and thus can't tolerate different tables holding different forms of the same data.
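As a rough illustration, the per-field logic for that sparse pass could look something like the sketch below; the UPDATE loop over the tables listed above is assumed and not shown, and none of this is the real conversion code:

    # Recode a latin-1 field to UTF-8 only when it actually contains high bytes,
    # so pure-ASCII rows (the vast majority) are never rewritten.
    def recode_title(latin1_bytes):
        """Return the UTF-8 form of a latin-1 title, or None if no change is needed."""
        if all(b < 0x80 for b in latin1_bytes):
            return None                                     # ASCII-only: skip the UPDATE
        return latin1_bytes.decode("latin-1").encode("utf-8")

    # e.g. latin-1 b"Z\xfcrich" becomes UTF-8 b"Z\xc3\xbcrich"; plain ASCII is untouched.
    assert recode_title(b"Z\xfcrich") == b"Z\xc3\xbcrich"
    assert recode_title(b"London") is None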
It should be possible, as some have suggested, to use either heuristics or explicit marking to do run-time conversion of cur_text and old_text, and perhaps cur_comment, old_comment, and similar bits. In this case we'd want to do the conversion at data load time, since we need the real encoding for parsing to match up to titles. This would avoid downtime for converting the 10.4% of cur_text material that needs it (45,862 rows), but it requires changes to MediaWiki itself that would need to be coded and tested.
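To make the heuristic option concrete, here's a small sketch, assuming a stored value is either already UTF-8 (new or converted rows) or still latin-1 (legacy rows); a decode attempt usually tells them apart, though a latin-1 string that happens to be valid UTF-8 would be misread, which is one argument for explicit marking instead. Where exactly MediaWiki would hook this in is left open here:

    def load_text(raw):
        try:
            return raw.decode("utf-8")      # already UTF-8: use as-is
        except UnicodeDecodeError:
            return raw.decode("latin-1")    # legacy latin-1 row: recode on the fly

    # A legacy latin-1 row and an already-converted row come out identical:
    assert load_text(b"caf\xe9") == load_text(b"caf\xc3\xa9") == "caf\xe9"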
The remaining latin-1 wikis will have a rather higher incidence of high characters than English does, but should still benefit from this approach by skipping the bulk text recoding.
I'd hoped to have some conversion test results by now but had some false starts with the database setup that used up the weekend. :( I'll try to get the code ready and running in the next few days.
-- brion vibber (brion @ pobox.com)
This is perhaps only tangentially related, but I've been wondering for a while whether downtime for various upgrades / database conversions could be minimised by using some kind of "changelog mode". The idea being that a copy could be made of the database and sent somewhere else for processing [I'm not sure where, since processing two copies of en: on one server doesn't seem very likely, but that's a detail for later]; meanwhile, the original copy is still fully editable, but creates a log of all edits (or rather, actions that change the DB) made since the copy was taken. Then, once the bulk of the conversion is done, the main db goes into read-only mode while the new version is "patched" with the changes from the log (obviously, the code which was altering the DB would then have to alter the logs to match, but in many cases this shouldn't be too hard, since the changes themselves aren't usually complex, just time-consuming).
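To make the idea a bit more concrete, here's a minimal sketch of what such a changelog and replay step might look like; none of this exists in MediaWiki, and all the names are invented:

    # Rough sketch of "changelog mode": while the copied database is being
    # converted elsewhere, every DB-changing action on the live wiki is also
    # appended to a log; once the conversion finishes, the site goes read-only
    # briefly and the log is replayed against the converted copy.
    import json, time

    class ChangeLog:
        def __init__(self, path):
            self.path = path

        def record(self, action, **fields):
            """Append one DB-changing action (edit, move, delete, ...) to the log."""
            entry = {"ts": time.time(), "action": action, "fields": fields}
            with open(self.path, "a") as f:
                f.write(json.dumps(entry) + "\n")

        def replay(self, apply):
            """Re-apply every logged action, in order, via the supplied callback."""
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    apply(entry["action"], entry["fields"])

The conversion code would presumably supply the apply callback, translating each logged action into the equivalent change on the already-converted data.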
I'm not sure how feasible this is, or even how much would be gained in practice, but when en: was converted to 1.3 it seemingly had to be locked for essentially a whole day while its links table was rebuilt - my hunch is that it would take far less than a day to "replay" a day's worth of edits, if there were a feature to log and replay them.
What does anyone with more experience think? Just a pipe dream, or something worth considering as an investment for the future?