[Mediawiki-l] iso-8859-1 conversion to UTF-8 failed during upgrade from 1.4.0 to 1.5.2

Andre Oliveira da Costa costa at tecgraf.puc-rio.br
Wed Nov 16 14:18:22 UTC 2005


Hi Brion,

I was finally able to dedicate some time to this. Below I post more detailed
info.

On Wed, 09 Nov 2005 14:23:33 -0800
Brion Vibber <brion at pobox.com> wrote:

> Andre Oliveira da Costa wrote:
> > On Wed, 09 Nov 2005 13:43:10 -0800
> > Brion Vibber <brion at pobox.com> wrote:
> > 
> >> Andre Oliveira da Costa wrote:
> >>> Ok, got it. Any chance this iso-8859-1 --> utf-8 problem will get fixed on
> >>> future releases?
> >> Well, it worked fine for us, so I'm not sure what there is to fix?
> > 
> > Mmmh... "Houston, we got a problem" ;-)
> > 
> > Not sure why it didn't work for me, then. Does my description of the problem
> > on bug #3898 [http://bugzilla.wikimedia.org/show_bug.cgi?id=3898] give you
> > any clue about what could have failed on my upgrade (and, judging by some of
> > the replies, on others' as well)? If not, what additional info can I provide
> > to help you track down the problem?
> 
> * Operating system and version

FC4, all latest updates applied, kernel 2.6.14-1.1637_FC4

> * PHP version

PHP 5.0.4

> * PHP configuration

Should I send /etc/php.ini directly to your email address? It's quite
big to be posted here...

> * Which PHP modules are installed

php-pear-5.0.4-10.5
php-pgsql-5.0.4-10.5
php-jpgraph-1.19-1.2.fc4.rf
php-snmp-5.0.4-10.5
php-mbstring-5.0.4-10.5
php-mysql-5.0.4-10.5

(I don't use all of them, some -- like snmp -- have been installed for
testing purposes; actually, I don't really _use_ PHP at home, apache
is off by default)

> * MySQL version

MySQL 4.1.14

> * MySQL configuration

~ cat /etc/my.cnf
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Default to using old password format for compatibility with mysql 3.x
# clients (those using the mysqlclient10 compatibility package).
old_passwords=1
     
[mysql.server]
user=mysql
basedir=/var/lib

[mysqld_safe]
err-log=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

> * LocalSettings.php

I don't recall doing any tweaking to it, but I can forward it to you if you need it.

> * Any other modifications or settings on MediaWiki

I haven't applied any modifications to this test setup. The 1.4.0 installation
has some modifications, but they are related to access rights, not content.
Still, if this is relevant, I can provide them to you.

> * Exact, step-by-step procedure you used for running the upgrade

- extract content from mediawiki-1.5.2.tar.gz at /var/www/html
- make symlink from /var/www/html/mediawiki-1.5.2 to /var/www/html/wiki
  PS: [wiki] = /var/www/html/wiki
- create 'wikidb' DB on MySQL
- create 'wikiuser' with full rights on wikidb.*
- import MediaWiki 1.4.0 DB dump (with mysql -u wikiuser -h localhost -p
  wikidb < wikidb-dump.sql)
- copy [wiki]/AdminSettings.sample to [wiki]/AdminSettings.php, configure 
  it with proper MySQL root user/password
- copy LocalSettings.php from current (1.4.0) installation to [wiki] dir
- run 'php -f maintenance/upgrade1_5.php' from dir [wiki] (I captured its
  output in case it helps)
- run 'php -f maintenance/update.php' (captured output of this one as well)
- chown -R apache:apache [wiki]

After this, [host]/wiki is online, but content doesn't display right
-- latin-1 chars are replaced by commas.

Investigating the problem more thoroughly, I think I now have a better
understanding of what's happening: the commas are just Firefox's way of telling
me it found chars incompatible with the page encoding (sorry for the false
alarm, I should have realized this). In this case, latin-1 chars were left
unconverted during the translation, while the page encoding is utf-8. If I
force the browser to use latin-1 encoding, the page displays fine -- well,
almost: some content seems to be missing.
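
To illustrate the effect, here is a minimal Python sketch (Python is only used
for illustration; it is not part of MediaWiki or of the upgrade scripts). It
assumes the stored bytes are latin-1 and shows what a UTF-8 decoder makes of
them, which is presumably what the browser ran into:

```python
# Text as a latin-1 (ISO-8859-1) wiki would have stored it.
title = "Padrões de Programação"
latin1_bytes = title.encode("latin-1")

# Decoding those bytes as UTF-8 fails on every accented char; with
# errors="replace" each undecodable byte becomes U+FFFD, the replacement
# character.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Decoding the same bytes as latin-1 (i.e. forcing the browser encoding)
# recovers the original text.
as_latin1 = latin1_bytes.decode("latin-1")

print(as_utf8)
print(as_latin1)
```

The bytes themselves are intact; only the declared encoding is wrong for them.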

Eg. this page:

http://shadow/wiki/index.php?title=Padr%F5es_de_Programa%E7%E3o&action=edit

which should point to a page titled "Padrões de Programação" appears as
missing (i.e. link appears as ...&action=edit). If I go to the special "dead
end pages" page, I see this link there:

http://shadow/wiki/index.php?title=Padr%C3%B5es_de_Programa%C3%A7%C3%A3o
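
The two links differ only in the encoding used before percent-escaping the
title. A small Python sketch (again just an illustration, using the standard
urllib.parse.quote function) reproduces both forms from the same title:

```python
from urllib.parse import quote

# Page title in MediaWiki URL form (spaces become underscores).
title = "Padrões_de_Programação"

# Percent-escape the latin-1 bytes: one byte per accented char.
latin1_form = quote(title, encoding="latin-1")

# Percent-escape the UTF-8 bytes: two bytes per accented char.
utf8_form = quote(title, encoding="utf-8")

print(latin1_form)  # Padr%F5es_de_Programa%E7%E3o
print(utf8_form)    # Padr%C3%B5es_de_Programa%C3%A7%C3%A3o
```

So the first link still carries the old latin-1 title, while the "dead end
pages" link carries the converted utf-8 one.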

If I follow it, the "missing" content is there, with latin-1 chars (so we're
back to the "commas" issue again). The page title is utf-8, but the rest of
the content is latin-1.

Judging by this, it seems the upgrade1_5.php script did convert URLs
(and consequently page titles) from latin-1 to utf-8, but some or all of
the page content was not converted.
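
One way to check for leftover latin-1 text is to test whether the stored bytes
are valid UTF-8. The Python sketch below is only a heuristic of my own, not
what upgrade1_5.php does; the function names is_valid_utf8 and to_utf8 are
made up for this example:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(raw: bytes) -> bytes:
    """Leave valid UTF-8 untouched; otherwise treat raw as latin-1."""
    if is_valid_utf8(raw):
        return raw
    return raw.decode("latin-1").encode("utf-8")


unconverted = "Programação".encode("latin-1")   # like the page content
converted = "Programação".encode("utf-8")       # like the page titles

print(is_valid_utf8(unconverted))
print(to_utf8(unconverted) == converted)
print(to_utf8(converted) == converted)
```

Plain ASCII passes through unchanged, since it is valid in both encodings.
The check is not foolproof: a few latin-1 byte sequences happen to be valid
UTF-8 as well, so it can only flag text that is definitely not UTF-8.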

I hope this helps you guys understand what happened -- and fix the upgrade
script so that I can migrate to 1.5.2 =)

> * If possible, sample data files

Well, I would be happy to provide configuration files for MySQL and
PHP, and also output from conversion scripts. Just let me know if you
guys would like to take a look at them, and where I should send them. I
can also provide some content if it will help.

> Chars being replaced by commas is not something I've ever seen.

You're 100% right, it was a silly oversight. My bad, sorry for the confusion.
Still, there is indeed some problem with the latin-1 --> utf-8 conversion
process...

Any ideas? Anything else I could provide?

Best,

Andre

-- 
Andre Oliveira da Costa
(costa at tecgraf.puc-rio.br)


