Jim,
Haha - yeah, I think we'll have to do that. In our case, when there
are duplicates, the one we'll want to keep is the one which is not a
redirect, and which has a certain threshold of content (measured in
characters most likely).
Makes sense. You may want to keep a log of nuked content (or actually,
rename the nuked pages to some random title and log that).
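A rough sketch of how the colliding titles could be listed before deciding which copy to keep. It assumes the standard MediaWiki `page` table with no table prefix, and uses utf8_general_ci purely as a stand-in for whichever collation you end up selecting. The length threshold is left out, since `page_len` counts bytes rather than characters.

```sql
-- Sketch only: list titles that would collide once the column is compared
-- with a case-insensitive utf8 collation.
-- Assumptions: table is named `page` (no prefix); utf8_general_ci is a
-- placeholder for the collation actually chosen.
SELECT
    page_namespace,
    -- Reinterpret the stored bytes as utf8 without transcoding them.
    CONVERT(CAST(page_title AS BINARY) USING utf8)
        COLLATE utf8_general_ci AS folded_title,
    COUNT(*)                    AS copies,
    SUM(page_is_redirect = 0)   AS non_redirects,
    MAX(page_len)               AS longest_len  -- bytes, not characters
FROM page
GROUP BY page_namespace, folded_title
HAVING COUNT(*) > 1;
```

Groups where non_redirects is exactly 1 are the easy cases; the rest are the ones that would probably need a human to pick.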
If we get something that works reliably, there's a good chance I could
release it into the wikimedia svn as either an extension or
maintenance script. It may have to be an extension to allow a user to
interactively select among good candidates.
You are awesome :)
FYI, currently our page_title field is declared as follows:
`page_title` varchar(255) character set latin1 collate latin1_bin
NOT NULL default '',
OK, the first thing you want to do is convert it to VARBINARY.
The next thing you want to do is convert it to VARCHAR utf8 with your
selected collation.
Converting to VARBINARY first will force MySQL to forget the latin1 crap (in
this case it is a huge lie, and a PITA eventually).
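In SQL the two steps would look roughly like this. This is a sketch, not something tested against your schema: it assumes the table is MediaWiki's `page` with no prefix, and utf8_general_ci is only a placeholder for your selected collation. Try it on a copy of the database first.

```sql
-- Step 1: drop the (wrong) latin1 label. Going to VARBINARY keeps the stored
-- bytes exactly as they are; nothing is transcoded.
ALTER TABLE page
    MODIFY page_title VARBINARY(255) NOT NULL DEFAULT '';

-- Step 2: relabel the same bytes as utf8 with the collation you want.
-- utf8_general_ci is an assumption here; substitute your own choice.
ALTER TABLE page
    MODIFY page_title VARCHAR(255)
        CHARACTER SET utf8 COLLATE utf8_general_ci
        NOT NULL DEFAULT '';
```

If the new collation is case-insensitive, step 2 will most likely fail on the unique index over namespace and title until the duplicates discussed above are dealt with.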
Not sure what effect that has on your advice since we're not using utf8 :/
You are using utf8, it is just tagged as latin1, which is generally a
bad idea, but we somehow manage to tolerate that without committing
seppuku.
Domas