Hi,
With every release of MW since 1.10, I've been patching rebuildtextindex.php to replace line 60 with the following code:
if ( $s->page_namespace != NS_MAIN ) {
    global $wgContLang;
    $title = $wgContLang->getNsText( $s->page_namespace ) . ':' . $s->page_title;
} else {
    $title = $s->page_title;
}
$u = new SearchUpdate( $s->page_id, $title, $revtext );
I don't know the best way to get this into the main code base, but I'm pretty sure the current behaviour is wrong in everyone's eyes, since the namespace information is lost. Ideally SearchUpdate would be refactored to take a namespace parameter, but this change at least allows the namespace to be retrieved intact. I found this problem when adding an extension to index the other namespaces. Or should I be using something else to rebuild the text search?
Kind regards,
Alex
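For readers without a MediaWiki checkout to hand, the prefixing logic in the patch above can be sketched as a standalone function. This is only an illustration: `prefixedTitle`, `NS_MAIN_ID` and the namespace table are hypothetical stand-ins for `$wgContLang->getNsText()` and `NS_MAIN`, and the fallback for unknown namespaces is an assumption, not something the patch does.

```php
<?php
// Hypothetical stand-in for MediaWiki's NS_MAIN constant.
const NS_MAIN_ID = 0;

/**
 * Build "Namespace:Title" for pages outside the main namespace.
 * $namespaceNames is an assumed id => name map; a real wiki would
 * ask its content language object for the namespace text instead.
 */
function prefixedTitle(int $namespaceId, string $pageTitle, array $namespaceNames): string
{
    // Main-namespace pages carry no prefix. Unknown ids also fall back
    // to the bare title (an assumption made for this sketch).
    if ($namespaceId === NS_MAIN_ID || !isset($namespaceNames[$namespaceId])) {
        return $pageTitle;
    }
    return $namespaceNames[$namespaceId] . ':' . $pageTitle;
}

$names = [1 => 'Talk', 12 => 'Help'];
echo prefixedTitle(12, 'Searching', $names), "\n";
echo prefixedTitle(0, 'Fish_pie', $names), "\n";
```

Keeping the namespace id as a separate argument, rather than pre-joining the string, is closer in spirit to the SearchUpdate refactoring suggested above.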
Oh, and while I'm at it, is there any reason why line 161 of \includes\SearchEngine.php cannot be updated to:
return "\x22A-Za-z_'0-9\x80-\xFF\-";
This allows for quoted searches to be passed into the MySQL query engine. Or am I introducing a security hole?
Alex
On Fri, Aug 22, 2008 at 4:18 PM, Alex Powell alexp@exscien.com wrote:
[snip]
-- Alex Powell
Exscien Training Ltd Tel: +44 (0) 1865 876562 Mob: +44 (0) 7717 765210
skype: alexp700 mailto:alexp@exscien.com http://www.exscien.com
Registered in England and Wales 05927635, Unit 10 Wheatley Business Centre, Old London Road, Wheatley, OX33 1XW, England
Alex Powell wrote:
Oh, and while I'm at it, is there any reason why line 161 of \includes\SearchEngine.php cannot be updated to:
return "\x22A-Za-z_'0-9\x80-\xFF\-";
This allows for quoted searches to be passed into the MySQL query engine. Or am I introducing a security hole?
Are you looking at a really old version? We've supported quoted searches for a while.
-- brion
Ah. I branched a version of the search functions at 1.11 and directed them towards a centralized text store. This was done by overriding the Special::SearchPage. I noticed in my code that the legalSearchChars stuff was filtering the double quotes out of the query, which meant that a search for:
"fish pie"
would also return articles that contain only the string "fish and pie", rather than only exact matches for "fish pie". I fixed this by adding \x22 to the legal chars in the core class, which in my 1.11 version was called regardless of any derived class's overridden legalSearchChars(). Possibly this issue is fixed in 1.13, but the line remains the same.
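The effect of the extra \x22 can be shown with a small standalone sketch. It assumes the legal-character string is dropped into a negated regex character class and that every other character is replaced by a space; `filterQuery` is a hypothetical helper written for this illustration, not the MediaWiki function.

```php
<?php
/**
 * Replace every character that is not in $legalChars with a space.
 * Assumption: this mirrors how a legal-character list is typically
 * applied to a search query; it is not copied from MediaWiki.
 */
function filterQuery(string $query, string $legalChars): string
{
    return preg_replace('/[^' . $legalChars . ']/', ' ', $query);
}

// The character list without the double quote, and the variant from
// this thread with \x22 added at the front.
$withoutQuote = "A-Za-z_'0-9\x80-\xFF\-";
$withQuote    = "\x22A-Za-z_'0-9\x80-\xFF\-";

// Without \x22 the quotes are turned into spaces, so the phrase
// degrades into two independent words.
var_dump(filterQuery('"fish pie"', $withoutQuote));

// With \x22 the quotes survive and can reach the search backend.
var_dump(filterQuery('"fish pie"', $withQuote));
```

Whether allowing the quote is safe still depends on the query being escaped before it reaches SQL, which this sketch does not cover.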
BTW the relevance metric for the MySQL search is also quite broken in the 1.13 codebase. Line 191 of SearchMySQL.php needs to be more like:
$match = $this->parseQuery( $filteredTerm, $fulltext );
$m2 = str_replace(" IN BOOLEAN MODE", "", $match);
return "SELECT page_id, page_namespace, page_title, {$m2} as relevance " .
    "FROM masterwiki.$page, masterwiki.$searchindex " .
    "WHERE pid=si_masterid AND $match";
That should give properly ranked results. I got this from the MySQL manual page on full-text search:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
First comment.
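The idea in the snippet above, reusing the MATCH clause without its boolean-mode modifier as a relevance column, can be sketched in isolation. The function name, the default table and column names, and the trailing ORDER BY are assumptions added for this illustration (MySQL does not sort boolean-mode results by relevance on its own); the match clause is assumed to be escaped already.

```php
<?php
/**
 * Build a search query that filters in boolean mode but ranks by the
 * natural-language relevance of the same MATCH clause.
 * $matchClause must already be safely escaped by the caller.
 */
function buildRankedSearchSql(
    string $matchClause,
    string $pageTable = 'page',
    string $indexTable = 'searchindex'
): string {
    // Dropping the modifier gives the natural-language form of the clause.
    $relevance = str_replace(' IN BOOLEAN MODE', '', $matchClause);

    return "SELECT page_id, page_namespace, page_title, {$relevance} AS relevance "
        . "FROM {$pageTable}, {$indexTable} "
        . "WHERE page_id=si_page AND {$matchClause} "
        . "ORDER BY relevance DESC"; // explicit sort: an addition in this sketch
}

$match = "MATCH(si_text) AGAINST('+fish +pie' IN BOOLEAN MODE)";
echo buildRankedSearchSql($match), "\n";
```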
Hope that helps!
Kind regards,
Alex
On Fri, Aug 22, 2008 at 5:29 PM, Brion Vibber brion@wikimedia.org wrote:
[snip]
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Sorry, I meant second :-s
On Sat, Aug 23, 2008 at 8:01 PM, Alex Powell alexp@exscien.com wrote:
[snip]
Alex Powell wrote:
With every release of MW since 1.10, I've been patching rebuildtextindex.php to replace line 60 with the following code:
[snip]
I don't know the best way to get this into the main code base, but I'm pretty sure the current behaviour is wrong in everyone's eyes, since the namespace information is lost. Ideally SearchUpdate would be refactored to take a namespace parameter, but this change at least allows the namespace to be retrieved intact.
Every call to SearchUpdate from Article and Title passes in the title text portion without the namespace, and even if you included the namespace on the title text, SearchUpdate itself discards it in its constructor!
The namespace information is kept in the page table, in the page_namespace field.
At some point we'll want to refactor SearchUpdate along with the search backend classes to ensure everything does its index updates in a cleaner way.
-- brion
Sorry, yes, I have extensions that create the index, and they seem to be getting the full title text. It does look like it was an oversight.
I guess I'll just keep hacking the file!
Cheers,
Alex
On Fri, Aug 22, 2008 at 5:35 PM, Brion Vibber brion@wikimedia.org wrote:
[snip]