I have set up the test site at http://test-de.wikipedia.com/. I had to remove an INSERT from newiki.sql because it was causing an error. Of course, I lost the exact text of the error somewhere, but it basically said that there was already an entry for Main_Page (the entry I removed looked like it may have been a discussion forum for the actual main page). However, it was a fairly large chunk, so take a look.
A diff to my modified newiki.sql is attached.
I have set up the test site at http://test-de.wikipedia.com/,
Thanks Jason! I informed the German Wikipedians and hope they'll start playing around with it soon.
I wrote down some bugs I found on http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete+Fehler
But that's in German, and I don't think anybody but Magnus will understand it. So shall I post a (badly) translated report to wikitech-l from time to time?
* The search for words with umlauts doesn't seem to work. The script doesn't find them, and when I search for more than one word (including one with umlauts) I get MySQL errors.
* The conversion didn't correct the upper and lower case problem in article titles. I thought someone made a script for this?
* And can't we prevent the loss of article histories when converting to the new format?
BTW, maybe it's possible to keep an unlimited revision history (and import the old ones) when the script is faster again, or when we change to the new server.
More complaints coming soon ... ;-)
Kurt
On mar, 2002-05-21 at 06:28, Kurt Jansson wrote:
I have set up the test site at http://test-de.wikipedia.com/,
Thanks Jason! I informed the German Wikipedians and hope they'll start playing around with it soon.
I wrote down some bugs I found on http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete+Fehler
But that's in German, and I don't think anybody but Magnus will understand it. So shall I post a (badly) translated report to wikitech-l from time to time?
- The search for words with umlauts doesn't seem to work. The script
doesn't find them,
Hmm, looks like the search needs another drubbing. I believe it's chopping up words at the boundaries of non-ASCII characters; if your word is big enough on either end (übersetzung, terroranschläge) this does a good enough job anyway, but it's hardly the way it should work!
I'll try it on my own machine later today; if it works there, I'll send some MySQL character set config tweaks I've made that may help.
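To illustrate what I mean by chopping, here's a standalone sketch (not the actual special_dosearch.php code; it assumes a PCRE build where \w covers only ASCII letters, digits and underscore):

  <?php
  // Split a search string into "words" the way a \W-based tokenizer does.
  // If \w is only [A-Za-z0-9_], every non-ASCII byte acts as a word boundary.
  $query = "Berlin Fläche";
  $words = preg_split('/\W+/', $query, -1, PREG_SPLIT_NO_EMPTY);
  print_r($words);
  // Typically prints: Array ( [0] => Berlin [1] => Fl [2] => che )
  // "Fläche" is chopped at the ä, and both pieces then fall under the
  // four-character fulltext minimum.
  ?>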
and when I search for more than one word (including one with umlauts) I get MySQL errors.
I'm not getting that; this seems to work for me. (Try for example "terroranschläge staaten".)
Note of course that the dreaded 4-character minimum is still in place.
- The conversion didn't correct the upper and lower case problem in
article titles. I thought someone made a script for this?
Working on it...
- And can't we prevent the loss of article histories when converting to
the new format?
BTW, maybe it's possible to keep an unlimited revision history (and import the old ones) when the script is faster again, or when we change to the new server.
We've been doing this for ages already. (But nobody's yet run the recovery script which will add these to the *English* version, which was converted with an older, more primitive conversion script.)
Which articles are missing histories? I checked a few at random and so far, so good.
More complaints coming soon ... ;-)
Vielen Dank!
-- brion vibber (brion @ pobox.com)
On Tue, May 21, 2002 at 10:38:18AM -0700, Brion L. VIBBER wrote:
On mar, 2002-05-21 at 06:28, Kurt Jansson wrote:
I have set up the test site at http://test-de.wikipedia.com/,
Thanks Jason! I informed the German Wikipedians and hope they'll start playing around with it soon.
I wrote down some bugs I found on http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete+Fehler
But that's in German, and I don't think anybody but Magnus will understand it. So shall I post a (badly) translated report to wikitech-l from time to time?
- The search for words with umlauts doesn't seem to work. The script
doesn't find them,
Hmm, looks like the search needs another drubbing. I believe it's chopping up words at the boundaries of non-ASCII characters; if your word is big enough on either end (übersetzung, terroranschläge) this does a good enough job anyway, but it's hardly the way it should work!
It doesn't (unless you have changed this in my code; I didn't check). The parsing function uses the Perl regular expression \w to decide whether something is legal in a search word or not. If it thinks a character is illegal it gives an error, so you should have noticed that. If the result is empty and you didn't get a syntax error, there can be two problems:

- MySQL doesn't index that character. I wasn't able to find out exactly which characters are indexed by the fulltext index and which are not. I know that characters with umlauts get indexed (search for "Gödel", for example), but I wouldn't know about the Eszett (ß), for example. The easiest way to find out is probably simply trying, e.g. with a quick test like the one sketched below.

- The special characters in the articles were written using entities. The search doesn't know that an entity like &ouml; and a literal ö are the same.
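For that "simply trying", something like this throwaway script would do (the connection details and table name are made up, and it uses the plain mysql_* functions):

  <?php
  // Throwaway test: which of these words does a MyISAM fulltext index find?
  mysql_connect('localhost', 'wikiuser', 'secret');
  mysql_select_db('wikidb');
  mysql_query("CREATE TABLE ft_test (txt TEXT, FULLTEXT (txt))");
  mysql_query("INSERT INTO ft_test VALUES ('Gödel Fläche Straße Berlin')");
  // A few filler rows, so the test words stay below the 50% "too common"
  // threshold of natural-language fulltext search.
  mysql_query("INSERT INTO ft_test VALUES ('filler one')");
  mysql_query("INSERT INTO ft_test VALUES ('filler two')");
  mysql_query("INSERT INTO ft_test VALUES ('filler three')");

  foreach (array('Gödel', 'Fläche', 'Straße', 'Berlin') as $w) {
      $res = mysql_query("SELECT COUNT(*) FROM ft_test WHERE MATCH (txt) AGAINST ('"
                         . mysql_escape_string($w) . "')");
      $row = mysql_fetch_row($res);
      echo $w . ": " . ($row[0] ? "indexed" : "not found") . "\n";
  }
  ?>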
Btw, if you have questions about the parse function for the search, just ask. I'm very busy at the moment so I don't have time to do any real programming, but answering a few questions should not be a problem.
-- Jan Hidders
On mar, 2002-05-21 at 06:28, Kurt Jansson wrote:
I have set up the test site at http://test-de.wikipedia.com/,
Thanks Jason! I informed the German Wikipedians and hope they'll start playing around with it soon.
I wrote down some bugs I found on http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete+Fehler
But that's in German, and I don't think anybody but Magnus will understand it. So shall I post a (badly) translated report to wikitech-l from time to time?
The list is growing! Brion, how's your German? :-)
and when I search for more than one word (including one with umlauts) I get MySQL errors.
I'm not getting that; this seems to work for me. (Try for example "terroranschläge staaten".)
Note of course that the dreaded 4-character minimum is still in place.
Okay, my word fell under the minimum because of the chopping. "Berlin Fläche" gives me
Warning: Supplied argument is not a valid MySQL result resource in /home/wiki-de-php/fpw/special_dosearch.php on line 324
Warning: Supplied argument is not a valid MySQL result resource in /home/wiki-de-php/fpw/special_dosearch.php on line 326
Warning: Supplied argument is not a valid MySQL result resource in /home/wiki-de-php/fpw/special_dosearch.php on line 334
Warning: Supplied argument is not a valid MySQL result resource in /home/wiki-de-php/fpw/special_dosearch.php on line 336
Which articles are missing histories? I checked a few at random and so far, so good.
Sorry, forget about this. I was talking nonsense.
More complaints coming soon ... ;-)
Vielen Dank!
And, sorry, I forgot to mention that the new script all in all REALLY KICKS ASS!!! I love it.
Kurt
On mer, 2002-05-22 at 10:36, Kurt Jansson wrote:
I wrote down some bugs I found on http://test-de.wikipedia.com/wiki/wikipedia:Beobachtete+Fehler
But that's in German, and I don't think anybody but Magnus will understand it. So shall I post a (badly) translated report to wikitech-l from time to time?
The list is growing! Brion, how's your German? :-)
Nicht so gut, but I'll try to muddle through the list mit mein Wörterbuch und verstehen ein bisse...
Okay, my word fell under the minimum because of the chopping. "Berlin Fläche" gives me
Warning: Supplied argument is not a valid MySQL result resource in /home/wiki-de-php/fpw/special_dosearch.php on line 324
If I actually make the software check return codes and report useful error messages, I see:
You have an error in your SQL syntax near 'MATCH (cur_ind_title) AGAINST ("Fläche") ) AND cur_t' at line 3
Looks like mismatched parentheses. I've replaced (what I think is) the rest of the \w regexps with [\w\x80-\xff], and it looks improved, but as much of the search code is a mystery to me, I can't guarantee anything!
(Some changes to special_dosearch.php now in CVS.)
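For the curious, the flavour of the change (simplified; not the literal diff that went into CVS):

  <?php
  $searchterm = "Berlin Fläche";

  // Before: \W treats every non-ASCII byte as a word boundary,
  // so "Fläche" gets chopped into "Fl" and "che".
  $words = preg_split('/\W+/', $searchterm, -1, PREG_SPLIT_NO_EMPTY);

  // After: any byte >= 0x80 counts as part of a word too, so Latin-1
  // and UTF-8 letters stay glued to their ASCII neighbours.
  $words = preg_split('/[^\w\x80-\xff]+/', $searchterm, -1, PREG_SPLIT_NO_EMPTY);
  ?>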
Which articles are missing histories? I checked a few at random and so far, so good.
Sorry, forget about this. I was talking nonsense.
(Whew!)
More complaints coming soon ... ;-)
Vielen Dank!
And, sorry, I forgot to mention that the new script all in all REALLY KICKS ASS!!! I love it.
Oh, that's so much nicer than "it's too slow!" or "that's a well-meaning feature but poorly implemented" or "make sure everything works 100% before putting it online". You're my new favorite user! ;)
-- brion vibber (brion @ pobox.com)
On Wed, May 22, 2002 at 11:14:06AM -0700, Brion L. VIBBER wrote:
If I actually make the software check return codes and report useful error messages, I see:
You have an error in your SQL syntax near 'MATCH (cur_ind_title) AGAINST ("Fläche") ) AND cur_t' at line 3
Looks like mismatched parentheses. I've replaced (what I think is) the rest of the \w regexps with [\w\x80-\xff], and it looks improved, but as much of the search code is a mystery to me, I can't guarantee anything!
It looks OK to me. Having said that, can I, as one of the contributors of the search code, make a small protest here? [But feel free to ignore me, because I've been away for too long without due notice.] I checked what \w actually matches on my system here, and it matches the following character codes (decimal):
48 - 57    ('0' - '9')
65 - 90    ('A' - 'Z')
95         ('_')
97 - 122   ('a' - 'z')
170, 181, 192 - 214, 216 - 246, 248 - 255
So that includes all the German characters (or has our encoding scheme changed?), and it should have worked with simply \w (and it did, as I said: searching for "Gödel" went fine). This is important because (1) our error reporting should be as tight as possible and not just give an empty search result if the user types a character that isn't indexed, and (2) if something really did go wrong before, then your quick fix is now hiding a bug that may come back to haunt us later.
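(The check itself is trivial, if anyone wants to see what their own build does; the result depends on how PCRE was built and on the locale:)

  <?php
  // Print every single-byte character code (0-255) that \w matches here.
  for ($i = 0; $i < 256; $i++) {
      if (preg_match('/^\w$/', chr($i))) {
          echo $i . " ";
      }
  }
  echo "\n";
  ?>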
-- Jan Hidders
On mer, 2002-05-22 at 14:43, Jan.Hidders wrote:
On Wed, May 22, 2002 at 11:14:06AM -0700, Brion L. VIBBER wrote:
If I actually make the software check return codes and report useful error messages, I see:
You have an error in your SQL syntax near 'MATCH (cur_ind_title) AGAINST ("Fläche") ) AND cur_t' at line 3
Looks like mismatched parentheses. I've replaced (what I think is) the rest of the \w regexps with [\w\x80-\xff], and it looks improved, but as much of the search code is a mystery to me, I can't guarantee anything!
It looks OK to me. Having said that, can I, as one of the contributors of the search code, make a small protest here? [But feel free to ignore me, because I've been away for too long without due notice.] I checked what \w actually matches on my system here, and it matches the following character codes (decimal):
48 - 57    ('0' - '9')
65 - 90    ('A' - 'Z')
95         ('_')
97 - 122   ('a' - 'z')
170, 181, 192 - 214, 216 - 246, 248 - 255
So that includes all the German characters (or has our encoding scheme changed?)
If you'll recall, we're switching all the wikipedias to UTF-8 (if we don't do it now, we'll just end up doing it in a few years and it'll be more painful). The above covers some, but not all, of the UTF-8 sequences that encode valid letters.
In an ideal world, locale settings would apply (and work correctly and consistently!) and \w would match everything necessary, but... (PHP 4.1.0 has some sort of special UTF-8 mode for regexps that might or might not be useful here.)
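If someone wants to play with that mode, it's the 'u' pattern modifier; note that plain \w may still only cover ASCII in UTF-8 mode, so the sketch below spells out Unicode letter/digit properties instead (this assumes a PCRE build with Unicode property support, which I haven't verified on our server):

  <?php
  $query = "Berlin Fläche";   // assumed to be UTF-8 encoded

  // The /u modifier makes the pattern walk over UTF-8 characters rather
  // than raw bytes; \p{L} and \p{N} match any Unicode letter or digit.
  $words = preg_split('/[^\p{L}\p{N}_]+/u', $query, -1, PREG_SPLIT_NO_EMPTY);
  print_r($words);
  // Should print: Array ( [0] => Berlin [1] => Fläche )
  ?>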
and it should have worked with simply \w (and it did, as I said: searching for "Gödel" went fine). This is important because (1) our error reporting should be as tight as possible and not just give an empty search result if the user types a character that isn't indexed, and (2) if something really did go wrong before, then your quick fix is now hiding a bug that may come back to haunt us later.
Oh, I don't doubt that at all. :)
-- brion vibber (brion @ pobox.com)
On Wed, May 22, 2002 at 03:34:08PM -0700, Brion L. VIBBER wrote:
If you'll recall, we're switching all the wikipedias to UTF-8 (if we don't do it now, we'll just end up doing it in a few years and it'll be more painful).
Uh oh. I remember we had that discussion, and as I already said then: moving to UTF-8 breaks the fulltext search. This is because the indexing algorithm assumes we use Latin-1, bases its collation on that, and chooses which characters to index accordingly. For the German characters this will probably more or less work out, but if you go beyond that (why else use UTF-8?) you will get into severe trouble.
Do you realize how serious this situation is?
In an ideal world, locale settings would apply (and work correctly and consistently!) and \w would match everything necessary, but... (PHP 4.1.0 has some sort of special UTF-8 mode for regexps that might or might not be useful here.)
It's not so much PHP that is the problem as it is MySQL. Perhaps we should consider moving to PostgreSQL, which really supports UTF-8 and is a better database anyway (some special pages could be implemented far more efficiently there).
-- Jan Hidders
On ĵaŭ, 2002-05-23 at 02:21, Jan.Hidders wrote:
On Wed, May 22, 2002 at 03:34:08PM -0700, Brion L. VIBBER wrote:
If you'll recall, we're switching all the wikipedias to UTF-8 (if we don't do it now, we'll just end up doing it in a few years and it'll be more painful).
Uh oh. I remember we had that discussion and as I already said then: moving to UTF-8 breaks the fulltext search.
Fulltext search is already broken in a million ways! It doesn't know character references (the entities for ü, ĉ, į and so on), it can't find partial matches or sound-alikes, it can't find "X" when you search for "Xs" or "Xs" when you search for "X", and it doesn't return *ANY* results for words it thinks are too common...
UTF-8 is the least of our problems; it just means that case-folding is a little trickier (and if we had a decent $*#@%# database, it would take care of that for us).
This is because the indexing algorithm assumes we use Latin-1, bases its collation on that, and chooses which characters to index accordingly. For the German characters this will probably more or less work out, but if you go beyond that (why else use UTF-8?) you will get into severe trouble.
Do you realize how serious this situation is?
Yes, that's *exactly* why we have to break out of the "everything in the world should be Latin-1, oh and by the way even though we have limited support for other character sets -- but not the ones YOU need -- you can only select one single character set for the whole database server! MWOOHAAHAAHAAA!" rut and make it work for the rest of us too.
In an ideal world, locale settings would apply (and work correctly and consistently!) and \w would match everything necessary, but... (PHP 4.1.0 has some sort of special UTF-8 mode for regexps that might or might not be useful here.)
It's not so much PHP that is the problem, as it is MySQL.
The particular problem I was discussing was the regexps, which were a PHP problem. MySQL could do all the magic it wanted; if the words don't get through the regexps in the PHP code they'll never get anywhere in the database's fulltext search.
Perhaps we should consider moving to PostgreSQL, which really supports UTF-8 and is a better database anyway (some special pages could be implemented far more efficiently there).
Oh, it's not like that hasn't been suggested. If anybody knows how to go about switching to Postgres, I sure as heck wouldn't object. I have no emotional attachment to MySQL; as far as I know it's only being used because Magnus was already familiar with it.
-- brion vibber (brion @ pobox.com)
On Thu, May 23, 2002 at 03:42:59AM -0700, Brion L. VIBBER wrote:
On ĵaŭ, 2002-05-23 at 02:21, Jan.Hidders wrote:
Uh oh. I remember we had that discussion and as I already said then: moving to UTF-8 breaks the fulltext search.
Fulltext search is already broken in a million ways! It doesn't know character references (the entities for ü, ĉ, į and so on), it can't find partial matches or sound-alikes, it can't find "X" when you search for "Xs" or "Xs" when you search for "X", and it doesn't return *ANY* results for words it thinks are too common...
All true, although the last problem will be solved when we move to the new MySQL and use the boolean search there. But at least there were well-defined semantics: if a word shows up in the edit text, the search will find it. This is now no longer the case, and no clever PHP programming can solve that.
UTF-8 is the least of our problems; it just means that case-folding is a little trickier (and if we had a decent $*#@%# database, it would take care of that for us).
Case folding is not the only problem. The problem is that the fulltext index does not index certain characters above 128. That means that words containing multibyte characters represented with those bytes will not be indexed. That's a bit harder to explain to the users than the previous problems.
It's not so much PHP that is the problem, as it is MySQL.
The particular problem I was discussing was the regexps, which were a PHP problem. MySQL could do all the magic it wanted; if the words don't get through the regexps in the PHP code they'll never get anywhere in the database's fulltext search.
I know, but what I am saying is that this is something we can solve with PHP programming. If MySQL doesn't index it, then we can let through all we want and it still won't turn up in the search results. Period.
Perhaps we should consider moving to PostgreSQL, which really supports UTF-8 and is a better database anyway (some special pages could be implemented far more efficiently there).
Oh, it's not like that hasn't been suggested. If anybody knows how to go about switching to Postgres, I sure as heck wouldn't object.
Unfortunately, I only have very limited hands-on experience with PostgreSQL and no experience with its fulltext indexing. From the manuals I gather it does have such a thing, but it takes some configuring. On the upside, it seems far more configurable: you can in fact change the parser that decides which words to index, how to collate, et cetera. It was written by Russians, I believe, who are generally more aware of character set problems anyway.
But the problem is that I don't have the time to research this; next month I have a conference, and I have two deadlines for articles coming up.
-- Jan Hidders
On ĵaŭ, 2002-05-23 at 04:52, Jan.Hidders wrote:
On Thu, May 23, 2002 at 03:42:59AM -0700, Brion L. VIBBER wrote:
Fulltext search is already broken in a million ways! It doesn't know character references (the entities for ü, ĉ, į and so on), it can't find partial matches or sound-alikes, it can't find "X" when you search for "Xs" or "Xs" when you search for "X", and it doesn't return *ANY* results for words it thinks are too common...
All true, although the last problem will be solved when we move to the new MySQL and use the boolean search there. But at least there were well-defined semantics: if a word shows up in the edit text, the search will find it. This is now no longer the case, and no clever PHP programming can solve that.
What's not well-defined about "We're having some problems with the search engine right now; if you don't find what you're looking for at first, try capitalizing it and search again." ?
UTF-8 is the least of our problems; it just means that case-folding is a little trickier (and if we had a decent $*#@%# database, it would take care of that for us).
Case folding is not the only problem. The problem is that the fulltext index does not index certain characters above 128. That means that words containing multibyte characters represented with those bytes will not be indexed. That's a bit harder to explain to the users than the previous problems.
If it is in fact having problems indexing chars over 128, install the attached hacked character set definition file and reindex the database.
my.cnf -> /etc/ (or wherever)
Index, custom.conf -> /usr/share/mysql/charsets/ (or wherever)
Run 'myisamchk -r -q' over the tables.
Works great on my machine...
-- brion vibber (brion @ pobox.com)
On Thu, May 23, 2002 at 10:59:13AM -0700, Brion L. VIBBER wrote:
If it is in fact having problems indexing chars over 128, install the attached hacked character set definition file and reindex the database.
Wow! A customized character set! I didn't know that was possible in MySQL. That looks indeed like a solution. I bow before you, oh master.
-- Jan Hidders
PS. I'd still favour a move to PostgreSQL, though, which would probably be a cleaner solution and also has some other benefits. But before we do that I would like to be very, very sure that fulltext indexing (which is, I believe, now the only MySQL-specific thing we use) works well there.
On Thu, May 23, 2002 at 09:33:21PM +0200, Jan.Hidders wrote:
PS. I'd still favour a move to PostgreSQL, though, which would probably be a cleaner solution and also has some other benefits. But before we do that I would like to be very, very sure that fulltext indexing (which is, I believe, now the only MySQL-specific thing we use) works well there.
An interesting alternative might be to do the indexing ourselves. This may sound very inefficient, but let me explain. One method/trick that was used in PgSQL to get a fulltext index was to create something like a separate table
text_index( search_word, document_id )
with a conventional index on the column search_word. Now, every time a row is added to the table that you want to be searchable, you retrieve all the search words (and their suffixes) and add the corresponding pairs to the table text_index. So if the text of the document with id '5' is "Text Indexing", you add the following pairs to the table text_index:
( "text", 5 ) ( "ext", 5 ) ( "xt", 5 ) ( "indexing", 5 ) ( "ndexing", 5 ) ...
If you now want to search for an article containing "dex" you can query the text_index table and use in your WHERE clause something like
"search_word LIKE 'dex%'"
In this case the index on the search_word column is used because the LIKE expression doesn't start with a wildcard. As you can see, this will also give you the partial matches. In PgSQL this is easy to implement because you can define triggers that call certain functions whenever a row is inserted or deleted in a table. We could also do this in MySQL, although we would then have to ensure ourselves that every time an article is created/updated/deleted, the text_index table is also updated.
The nice part is that we would have total control over what is indexed and how it is indexed, in the form of a PHP function that determines which pairs are added to text_index. This function could take care of collation and could even do nifty things like taking entity references into account or letting a-umlaut match ae. The different Wikipedias would probably each have a localized version of this function.
The downside is, of course, some performance cost when articles are updated, plus the extra work of writing this PHP function (although an initial version could be quite simple).
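To make that concrete, a rough sketch of what the indexing side could look like (the schema matches the text_index table above; the function itself, the mysql_* calls and the minimum suffix length are just illustration):

  <?php
  // Assumed schema:
  //   CREATE TABLE text_index (
  //     search_word VARCHAR(255) NOT NULL,
  //     document_id INT NOT NULL,
  //     INDEX (search_word)
  //   );

  // Re-index one document; meant to be called whenever an article is saved.
  function index_document($doc_id, $text) {
      // Throw away any old entries for this document.
      mysql_query("DELETE FROM text_index WHERE document_id = " . intval($doc_id));

      // Extract the words. Collation -- lowercasing, resolving entities,
      // mapping ä to ae and the like -- would go right here.
      $words = preg_split('/[^\w\x80-\xff]+/', strtolower($text),
                          -1, PREG_SPLIT_NO_EMPTY);

      foreach (array_unique($words) as $word) {
          // Store the word and all its suffixes of length >= 2.
          for ($i = 0; $i <= strlen($word) - 2; $i++) {
              $suffix = mysql_escape_string(substr($word, $i));
              mysql_query("INSERT INTO text_index (search_word, document_id) "
                        . "VALUES ('$suffix', " . intval($doc_id) . ")");
          }
      }
  }

  // Searching for "dex" is then an ordinary indexed lookup:
  //   SELECT DISTINCT document_id FROM text_index
  //   WHERE search_word LIKE 'dex%';
  ?>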
-- Jan Hidders
Here's a brief history of why we use MySQL:
Bomis' first database experience was using the software that runs Slashdot. At the time, it required MySQL (other databases are supported now), so we installed it. Our next database project (Nupedia) came shortly thereafter, and we had no reason to learn a whole different database setup. So, when Wikipedia came along and we decided to make it database-driven, it just didn't make sense to change databases, especially since it would mean we would have to convert Nupedia to the new format.
Now, the programmer who wrote Nupedia (for the most part) is no longer with Bomis. I'm not sure what MySQL-specific features, if any, Nupedia uses (or Wikipedia, for that matter). I am not opposed to installing a different database, but I can't really spend a lot of time on converting the software. Of course, it's not really my decision anyway...
Jason Richey
Brion L. VIBBER wrote:
Oh, it's not like that hasn't been suggested. If anybody knows how to go about switching to Postgres, I sure as heck wouldn't object. I have no emotional attachment to MySQL; as far as I know it's only being used because Magnus was already familiar with it.
-----Original Message-----
From: wikitech-l-admin@nupedia.com [mailto:wikitech-l-admin@nupedia.com] On Behalf Of Jason Richey
Sent: Thursday, May 23, 2002 6:58 PM
To: Brion L. VIBBER
Cc: Jan.Hidders; Wikitech-L
Subject: Re: [Intlwiki-l] Re: [Wikitech-l] php-wiki de
Now, the programmer who wrote Nupedia (for the most part) is no longer with Bomis. I'm not sure what MySQL-specific features, if any, Nupedia uses (or Wikipedia, for that matter). I am not opposed to installing a different database, but I can't really spend a lot of time on converting the software. Of course, it's not really my decision anyway...
The current Nupedia software seems pretty unusable at the moment (copyeditor pages, among others). Personally, I doubt it will see the rebirth of Nupedia. As I keep advertising ;) I'm rewriting the whole thing (http://nunupedia.sourceforge.net). I doubt any MySQL-specific queries will be necessary in that project, so there's no need to keep MySQL just for Nupedia's sake.
As for Wikipedia, except for the search function I don't think there's much (if any) MySQL-specific stuff in there. It shouldn't be too hard to change to something else. Maybe we could use Lee's test server for trying it out?
Magnus
Jan.Hidders wrote:
It's not so much PHP that is the problem as it is MySQL. Perhaps we should consider moving to PostgreSQL, which really supports UTF-8 and is a better database anyway (some special pages could be implemented far more efficiently there).
A good opportunity for switching from MySQL to PostgreSQL is coming soon, with the move from our current server to a new and much more powerful server. I'd be happy to install PostgreSQL over there instead. And we'll be giving out developer accounts to several people, including root access where appropriate, so things are going to be a lot easier for the development team.
I propose this as a good time to also switch to Lee's refactored codebase, unless the consensus is that his code is not better than the existing codebase.
--Jimbo
On 5/23/02 1:34 PM, "Jimmy Wales" jwales@bomis.com wrote:
Jan.Hidders wrote:
It's not so much PHP that is the problem as it is MySQL. Perhaps we should consider moving to PostgreSQL, which really supports UTF-8 and is a better database anyway (some special pages could be implemented far more efficiently there).
A good opportunity for switching from MySQL to PostgreSQL is coming soon, with the move from our current server to a new and much more powerful server. I'd be happy to install PostgreSQL over there instead. And we'll be giving out developer accounts to several people, including root access where appropriate, so things are going to be a lot easier for the development team.
I gotta say, I'm hardly convinced that Postgre is that astoundingly better than MySQL for our purposes. As was pointed out, Slashdot runs on MySQL, and it's a damn fast, high-traffic, complicated site.
If you guys want to change to Postgre, go ahead, but it should be recognized that it's a mostly arbitrary decision.
At least that's how I see it.
--tc
On Thu, May 23, 2002 at 04:11:43PM -0400, The Cunctator wrote:
I gotta say, I'm hardly convinced that Postgre is that astoundingly better than MySQL for our purposes. As was pointed out, Slashdot runs on MySQL, and it's a damn fast, high-traffic, complicated site.
For a few special pages (recent changes and lonely pages) I needed nested subqueries, but MySQL doesn't have those. As a consequence I had to write PHP code for some high-volume data processing that could be done far more efficiently by the database.
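For example, something like the lonely-pages query is one nested subquery (table and column names below are only illustrative, not our actual schema), which the MySQL version we run simply refuses to parse:

  <?php
  // "Lonely pages": articles that no other article links to.
  // (cur, cur_title, links, l_to are made-up names for illustration.)
  // PostgreSQL runs this as-is; our MySQL has no subqueries, so the PHP
  // code currently has to fetch both sets and merge them by hand.
  $sql = "SELECT cur_title FROM cur " .
         "WHERE cur_title NOT IN (SELECT l_to FROM links)";
  $res = mysql_query($sql);
  while ($row = mysql_fetch_row($res)) {
      echo $row[0] . "\n";
  }
  ?>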
-- Jan Hidders
On mar, 2002-05-21 at 06:28, Kurt Jansson wrote:
- The conversion didn't correct the upper and lower case problem in
article titles. I thought someone made a script for this?
Okay, I finally got a chance to finish that up. I've just checked in recaseLinks.php, which looks through the link tables with the critical eye of the old capitalization function, counts up the various alternate name forms in use for each page, and renames articles to the form most frequently used in links. It also puts in redirects for other forms that were used, so there shouldn't be breakage.
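In outline, the logic is roughly this (a sketch only, not the actual recaseLinks.php; the helper functions are stand-ins for the real database work):

  <?php
  // For one page: count how the linking articles spell the title, rename
  // the page to the most frequently used form, and leave redirects behind
  // for the other forms so nothing breaks.
  // get_link_spellings(), rename_page() and create_redirect() are
  // placeholders, not real functions in the codebase.
  function recase_page($old_title) {
      $counts = array();
      foreach (get_link_spellings($old_title) as $spelling) {
          if (!isset($counts[$spelling])) {
              $counts[$spelling] = 0;
          }
          $counts[$spelling]++;
      }
      if (!$counts) {
          return;                     // nothing links here; leave it alone
      }

      arsort($counts);                // most frequently used form first
      $best = key($counts);

      if ($best != $old_title) {
          rename_page($old_title, $best);
      }
      foreach ($counts as $form => $n) {
          if ($form != $best) {
              create_redirect($form, $best);
          }
      }
  }
  ?>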
Jason, if you could give this a try on the German test site:
  php recaseLinks.php
  php rebuildLinks.php    # The link tables then need to be rebuilt
(The Esperanto wiki also needs it, but it still needs to be converted again to see if the alternate charset function works as I think it should. I'd be mucho grateful if you could do that, so I can set people loose on bug & message-translation checking!)
-- brion vibber (brion @ pobox.com)