[Mediawiki-l] Encoding issues

Juliano F. Ravasi ml at juliano.info
Fri Sep 25 15:04:59 UTC 2009


I think it is a good idea to keep posts to this list in English. Since
the overwhelming majority of posts are in English, I guess that all
subscribers can understand and write at least a little English, but
the same is not true of the many other languages spoken around the
world. If anyone posts in another language, the conversation is
automatically restricted: people who may know the answer won't be able
to help, and the chances of having your problem fixed are reduced.


Back on topic:

In my experience, there are many different ways to export and import
data from and to MySQL databases, and many, many of them are broken when
it comes to binary data or non-ASCII text. Many hosting providers use
phpMyAdmin or some variant to export MySQL databases for backup. Do not
use that!

The safe way to back up databases and reimport them somewhere else is
to use the command-line tools.

To export:
    mysqldump -uUSER -p DATABASE > FILENAME.sql

To import:
    mysql -uUSER -p DATABASE < FILENAME.sql

I also recommend checking, with the 'locale' (Linux) or 'env' (other)
command, that the terminal locales on both systems are compatible. If
in doubt, add "LANG=C LC_ALL=C" before each command to force a common
locale on both systems.
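
As a minimal sketch of the locale advice above: the helper below is
hypothetical (the name with_c_locale is my own, not part of any tool)
and simply prefixes a command with the two variables, so you do not
have to retype them each time.

```shell
# Run any command under the neutral "C" locale without changing the
# locale of the rest of the shell session.
# with_c_locale is a hypothetical helper name used for illustration.
with_c_locale() {
    env LANG=C LC_ALL=C "$@"
}

# Example usage (USER, DATABASE and FILENAME are placeholders):
#   with_c_locale mysqldump -uUSER -p DATABASE > FILENAME.sql
#   with_c_locale mysql -uUSER -p DATABASE < FILENAME.sql
```

Using 'env' keeps the assignment scoped to that single command.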

Also, use md5sum or sha1sum to check that the sql file wasn't damaged
during transport. When transferring the file, transfer it as
binary/image and don't let the FTP software (if you are using any)
detect that it "looks like text". Gzipping the file before transfer is a
good way to avoid this problem.
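
The compress-and-verify steps above could look roughly like this. It is
only a sketch: the function names pack_dump and unpack_dump are
hypothetical, and it assumes gzip and sha1sum are installed on both
systems.

```shell
# On the source system: compress the dump and record a checksum beside it.
# Creates FILE.gz and FILE.gz.sha1 (the uncompressed FILE is replaced).
pack_dump() {
    gzip "$1" || return 1
    # Work inside the file's directory so the checksum file stores a bare
    # file name and still verifies after being copied elsewhere.
    ( cd "$(dirname "$1")" &&
      sha1sum "$(basename "$1").gz" > "$(basename "$1").gz.sha1" )
}

# On the destination system: verify the checksum, then decompress.
# Refuses to decompress if the checksum does not match.
unpack_dump() {
    ( cd "$(dirname "$1")" &&
      sha1sum -c "$(basename "$1").sha1" ) || return 1
    gunzip "$1"
}

# Example usage (FILENAME is a placeholder):
#   pack_dump FILENAME.sql        # then transfer both resulting files
#   unpack_dump FILENAME.sql.gz   # on the other side
```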

Do not try to edit the sql file between export and import, especially if
your editor thinks it knows how to handle files with mixed binary/text
data. If you still want to edit the sql file, do not touch the /*!...*/
comments near the beginning and the end of the file; those comments tell
the importer how character data is to be handled. This is precisely
where phpMyAdmin and other similar tools fail to produce usable backups.
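
If you only want to look at those comments, you can do so without
opening the file in an editor. The following is a sketch: the function
name show_charset_hints is hypothetical, and the pattern assumes the
usual mysqldump form of these lines (for example
"/*!40101 SET NAMES utf8 */;").

```shell
# Print, with line numbers, the /*!...*/ statements in a dump that deal
# with character sets or collations. The file is only read, never changed.
show_charset_hints() {
    # -a: treat the file as text even if it contains binary data.
    # LC_ALL=C: match bytes, so odd encodings in the data cannot confuse grep.
    LC_ALL=C grep -a -n -E \
        '^/\*![0-9]+ SET .*(NAMES|CHARACTER_SET|COLLATION)' "$1"
}

# Example usage (FILENAME is a placeholder):
#   show_charset_hints FILENAME.sql
```

If this prints nothing for a dump, the dump probably does not carry the
character set information the importer needs.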

Regards,
Juliano.


Javier Bezos wrote:
> Hi all,
> 
> We have hired an external service to update our system from
> 1.11 to 1.14. After many delays (which explains why 1.14 and
> not 1.15), now it's a mess because at many places accented
> characters looks as if they were unencoded UTF-8 characters
> (ie, ó is not an unencoded ó, but the two UTF-8 encoded
> chars Ã and ³). Examples are:
> 
> http://www.wikilengua.org/index.php/Propiedad:Norma_UNE_(Terminesp)
> http://www.wikilengua.org/index.php/Special:UnusedImages
> 
> Mainly in order to complain, any idea of why this mess? Is
> there a way fix it?
> 
> (Semantic MW has stopped working properly, too :-().
> 
> Thanx
> Javier Bezos


-- 
Juliano F. Ravasi ·· http://juliano.info/
5105 46CC B2B7 F0CD 5F47 E740 72CA 54F4 DF37 9E96

"A candle loses nothing by lighting another candle." -- Erin Majors

* NOTE: Don't try to reach me through this address, use "contact@" instead.


