Hi all,
This is the background. A year ago we hired a person to update and add some improvements we needed, with an estimated time of 8 weeks. After months of delays and a lot of problems, we've fired him. Unfortunately, I'm afraid the database has been damaged or not properly set up.
Now, two questions--
1) In the short term, I'd like to fix an issue with searches. The site is in Spanish but it's unable to find words with diacriticals in the article body (in the title they are found). I've rebuilt the search index, with no luck. Fortunately, I had access to old dumps and I've noticed a difference -- formerly tables were like
CREATE TABLE `wl_archive` ( `ar_namespace` int(11) NOT NULL default '0', `ar_title` varchar(255) character set latin1 collate latin1_bin NOT NULL default '', [...] ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
and now they are like
SET @saved_cs_client = @@character_set_client; SET character_set_client = utf8; CREATE TABLE `wl_archive` ( `ar_namespace` int(11) NOT NULL default '0', `ar_title` varchar(255) character set latin1 collate latin1_bin NOT NULL default '', [...] ) ENGINE=InnoDB DEFAULT CHARSET=utf8; SET character_set_client = @saved_cs_client;
2) In the medium term, and with the background above, which is the best approach to have a "correct" database and system again.
Thanx Javier -------------------------------- www.wikilengua.org www.tex-tipografia.com
Hi Javier,
The short answer is that I can't help you...I can only tell you what I did and maybe someone else will give a better reply so that I can learn something, too.
Javier Bezos wrote:
CREATE TABLE `wl_archive` ( `ar_namespace` int(11) NOT NULL default '0', `ar_title` varchar(255) character set latin1 collate latin1_bin NOT NULL default '', [...] ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
and now they are like
SET @saved_cs_client = @@character_set_client; SET character_set_client = utf8; CREATE TABLE `wl_archive` ( `ar_namespace` int(11) NOT NULL default '0', `ar_title` varchar(255) character set latin1 collate latin1_bin NOT NULL default '', [...] ) ENGINE=InnoDB DEFAULT CHARSET=utf8; SET character_set_client = @saved_cs_client;
This is the exact problem that I had. I had a database with a mix of English and Japanese and could see everything fine. I then checked both the database and the system and found out they were both set to "latin1". If so, I don't know why it had worked for so long...
So, what I did was change the database and the MySQL system to utf8. But then the database also has to be converted and this failed miserably for me. I followed these steps:
http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL
which did not work (more specifically, the "Convert dump" step). In the end, I re-typed the Japanese (which wasn't a lot). So maybe you might have better luck or someone else can help you. Or maybe you can make some sense of the above link and succeed where I failed. :-)
Good luck...
Ray
El 09/03/10 10:20, Raymond Wan wrote:
Javier Bezos wrote:
CREATE TABLE `wl_archive` ( `ar_namespace` int(11) NOT NULL default '0', `ar_title` varchar(255) character set latin1 collate latin1_bin NOT NULL default '', [...] ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
and now they are like
SET @saved_cs_client = @@character_set_client; SET character_set_client = utf8; CREATE TABLE `wl_archive` ( `ar_namespace` int(11) NOT NULL default '0', `ar_title` varchar(255) character set latin1 collate latin1_bin NOT NULL default '', [...] ) ENGINE=InnoDB DEFAULT CHARSET=utf8; SET character_set_client = @saved_cs_client;
This is the exact problem that I had. I had a database with a mix of English and Japanese and could see everything fine. I then checked both the database and the system and found out they were both set to "latin1". If so, I don't know why it had worked for so long...
So, what I did was change the database and the MySQL system to utf8. But then the database also has to be converted and this failed miserably for me. I followed these steps:
http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL
which did not work (more specifically, the "Convert dump" step).
That's wrong. Even when the database shows charset=latin1, mediawiki stores utf8 inside. So you shouldn't have tried to convert the content, that would have broken it.
In the end, I re-typed the Japanese (which wasn't a lot). So maybe you might have better luck or someone else can help you. Or maybe you can make some sense of the above link and succeed where I failed. :-)
Good luck...
Ray
Javier, you need to what's the data representation and set to one (either latin1 or utf8), I think you had some tables latin1 and others utf8 now. With $wgDBmysql5 set accordingly.
Without knowing the exact way your db is broken, it'd be hard to fix it. What search are you using? The internal one? Maybe you can share a one-revision sql dump?
PS: Update to 1.15.2, you are vulnerable to CSS image inclusion.
Thank yoy for your interest and answers.
Platonides:
So you shouldn't have tried to convert the content, that would have broken it.
Very likely this is what they did.
Without knowing the exact way your db is broken, it'd be hard to fix it. What search are you using? The internal one? Maybe you can share a one-revision sql dump?
And your crystall ball? ;-) Well, as to the problem with the search I've discovered the key for the searchindex table is generated with U8<hex> while the table has u8<hex>. So, if I replace in SeachMySQL.php
$searchon = $this->db->strencode( $searchon );
by
$searchon = $this->db->strencode( $wgContLang->lc ($searchon) );
words are found. I'm using this as a workaround, but I'm investigating the origin of the problem (any hints?).
PS: Update to 1.15.2, you are vulnerable to CSS image inclusion.
Thank you. Please, could you give a pointer explaining it?
Cheers Javier
On Wed, Mar 10, 2010 at 9:57 AM, Javier Bezos jbezos@fundeu.es wrote:
PS: Update to 1.15.2, you are vulnerable to CSS image inclusion.
Thank you. Please, could you give a pointer explaining it?
Cheers Javier
MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
http://lists.wikimedia.org/pipermail/mediawiki-announce/2010-March/000088.ht...
Which was also cross-posted to this list and wikitech-l.
-Chad
http://lists.wikimedia.org/pipermail/mediawiki-announce/2010-March/000088.ht...
Which was also cross-posted to this list and wikitech-l.
Thank you, Chad, but what I actually wanted was an explanation of the bug. If the problem are external images, I presume a link has to be given somehow, but in our wiki only wiki-like links (ie, [http://...]) are allowed, so I was interested in the details. I presume the fix is this one:
http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=63424
Javier
Javier Bezos wrote:
http://lists.wikimedia.org/pipermail/mediawiki-announce/2010-March/000088.ht...
Which was also cross-posted to this list and wikitech-l.
Thank you, Chad, but what I actually wanted was an explanation of the bug. If the problem are external images, I presume a link has to be given somehow, but in our wiki only wiki-like links (ie, [http://...]) are allowed, so I was interested in the details. I presume the fix is this one:
http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=63424
Javier
Yes. The problem is not the external images feature, but that you could embed an image inside CSS with mediawiki not noticing that it was an external link. Per the advisory, all installs since 1.5 are affected.
I wrote:
$searchon = $this->db->strencode( $searchon ); by $searchon = $this->db->strencode( $wgContLang->lc ($searchon) );
words are found. I'm using this as a workaround, but I'm investigating the origin of the problem (any hints?).
Found. The problem is in the database. For some reason they have decided the table searchindex should have a collation using latin1 :-(. So I repeat my second question:
- In the medium term, and with the background above, which
is the best approach to have a "correct" database and system again.
Cheers Javier
When we have had problems it was easier to dump database to sql statements, drop database, recreate with media wiki setup with all correct settings and restore the data without drop table instructions.
This worked for us in the past anyway.
On Thu, Mar 11, 2010 at 9:14 AM, Javier Bezos jbezos@fundeu.es wrote:
I wrote:
$searchon = $this->db->strencode( $searchon ); by $searchon = $this->db->strencode( $wgContLang->lc ($searchon) );
words are found. I'm using this as a workaround, but I'm investigating the origin of the problem (any hints?).
Found. The problem is in the database. For some reason they have decided the table searchindex should have a collation using latin1 :-(. So I repeat my second question:
- In the medium term, and with the background above, which
is the best approach to have a "correct" database and system again.
Cheers Javier
MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Hi,
Platonides wrote:
So, what I did was change the database and the MySQL system to utf8. But then the database also has to be converted and this failed miserably for me. I followed these steps:
http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL
which did not work (more specifically, the "Convert dump" step).
That's wrong. Even when the database shows charset=latin1, mediawiki stores utf8 inside. So you shouldn't have tried to convert the content, that would have broken it.
Oh...I see -- thanks for this!
Then I guess there are two combinations: Mediawiki with latin1 MySQL ; Mediawiki with UTF MySQL. What are the advantages / disadvantages of either choice?
I *guess* that if someone were to login to mysql directly, and did a SELECT, then the UTF would look like gibberish. Likewise when a dump is done of the data. Of course, neither "problem" affects Mediawiki's functionality...
Any other pros/cons?
Thanks!
Ray
Raymond Wan wrote:
Oh...I see -- thanks for this!
Then I guess there are two combinations: Mediawiki with latin1 MySQL ; Mediawiki with UTF MySQL. What are the advantages / disadvantages of either choice?
I *guess* that if someone were to login to mysql directly, and did a SELECT, then the UTF would look like gibberish. Likewise when a dump is done of the data. Of course, neither "problem" affects Mediawiki's functionality...
Any other pros/cons?
Thanks!
Ray
MediaWiki offers you three character sets for MySQL: * MySQL 4.1/5.0 binary * MySQL 4.1/5.0 UTF-8 * MySQL 4.0 backwards-compatible UTF-8
In the three modes MediaWiki is storing utf-8 characters. It all depends on how MySQL treats them.
In "backwards-compatible UTF-8" mysql thinks it's latin1. The data will "look wrong" and if you don't provide --default-character-set for mysqldump 4.1 and newer, it will corrupt the text (it will "helpfully" transform it to utf-8). This is the only one which works with mysql 4.0, and it supports the full unicode.
UTF-8 uses MySQL support for UTF-8, which currentyl limits you to the Basic Multilingual Plane. The data will "look right". The indexes will be larger.
With binary, it works almost like backwards utf-8, but mysql will treat it as opaque data and won't mess with it. Representation will be messy. You have the full unicode.
Hi,
Platonides wrote:
MediaWiki offers you three character sets for MySQL:
- MySQL 4.1/5.0 binary
- MySQL 4.1/5.0 UTF-8
- MySQL 4.0 backwards-compatible UTF-8
...
Thank you for your explanation! I think I understand the differences between the 3 options. Also, looking at the installation instructions, it seems I was asked that before and probably just clicked "OK". I installed Mediawiki 3-4 years ago and have been upgrading since...so maybe back then I wasn't asked? I honestly don't remember...
Is there a way to find out what it is set at? Since I've forced mysql to use UTF-8, I'm not sure if I've made a mistake...but since it's retrieving Japanese characters fine, it must be ok... I'm just concerned that I've broken something that will show up later on...
Thanks a lot!
Ray
Raymond Wan wrote:
Platonides wrote:
MediaWiki offers you three character sets for MySQL:
- MySQL 4.1/5.0 binary
- MySQL 4.1/5.0 UTF-8
- MySQL 4.0 backwards-compatible UTF-8
...
Thank you for your explanation! I think I understand the differences between the 3 options. Also, looking at the installation instructions, it seems I was asked that before and probably just clicked "OK". I installed Mediawiki 3-4 years ago and have been upgrading since...so maybe back then I wasn't asked? I honestly don't remember...
Is there a way to find out what it is set at? Since I've forced mysql to use UTF-8, I'm not sure if I've made a mistake...but since it's retrieving Japanese characters fine, it must be ok... I'm just concerned that I've broken something that will show up later on...
Thanks a lot!
Ray
Look at your table definitions. If the CREATE TABLE contains DEFAULT CHARSET=binary or DEFAULT CHARSET=utf8, you are using the first two, and $wgDBmysql5 should be set. Else you are using the third one and $wgDBmysql5 should be false.
The three options are supported, if it works, don't fix it :)
Hi,
Platonides wrote:
Look at your table definitions. If the CREATE TABLE contains DEFAULT CHARSET=binary or DEFAULT CHARSET=utf8, you are using the first two, and $wgDBmysql5 should be set. Else you are using the third one and $wgDBmysql5 should be false.
Ah, thanks -- I'll take a look for it -- thanks!
The three options are supported, if it works, don't fix it :)
Yes...good point -- I should have asked on this list before attempting to "fix" something that wasn't broken!
Thanks for the advice!
Ray
mediawiki-l@lists.wikimedia.org