At bugzilla:18861 (https://bugzilla.wikimedia.org/show_bug.cgi?id=18861) there is a discussion about how transcluded pages are not seen by the search engine, and I have assumed that this is Wikisource's issue: pages that are transcluded across from the Page: namespace don't show up in a main-namespace search.
An example of the behaviour
The work "Highways and Byways in Sussex" which is proofread in the Page namespace as individual pages, and transcluded into chapters in the main namespace. http://en.wikisource.org/wiki/Highways_and_Byways_in_Sussex
* search in the main namespace http://en.wikisource.org/wiki/Special:Search?search=singleton&prefix=Hig...
* compared with searching the individual pages in the Page namespace http://en.wikisource.org/wiki/Special:Search?search=singleton&prefix=Pag...
I am guessing that the search engine does not transclude pages before it undertakes its indexing. Is someone able to confirm that for me?
Is there any fix that anyone can suggest, or does anyone know where such an issue can be raised beyond Bugzilla? Would a fix lie in the search engine, or in the transclusion process? Thanks.
Regards, Andrew
On 16-12-2010 13:38, Billinghurst wrote:
Hi Andrew,
AFAIK all search engine development stopped when Robert shifted his attention to more important things (study). It would be nice if the WMF could use some of its resources to start improving the search engine again.
Maarten
Billinghurst wrote:
I am guessing that the search engine does not transclude pages before it undertakes its indexing. Is someone able to confirm that for me?
Is there any fix that anyone can suggest, or does anyone know where such an issue can be raised beyond Bugzilla? Would a fix lie in the search engine, or in the transclusion process? Thanks.
Regards, Andrew
This change should fix it. It makes the search engine store the wikitext with the templates expanded, variables replaced, comments stripped... That operation should be fast, as it should be hitting the cache from having just rendered it.
Index: includes/Article.php
===================================================================
--- includes/Article.php    (revision 78601)
+++ includes/Article.php    (working copy)
@@ -3622,7 +3622,7 @@
     * @param $changed Boolean: Whether or not the content actually changed
     */
    public function editUpdates( $text, $summary, $minoredit, $timestamp_of_pagechange, $newid, $changed = true ) {
-       global $wgDeferredUpdateList, $wgMessageCache, $wgUser, $wgEnableParserCache;
+       global $wgDeferredUpdateList, $wgMessageCache, $wgUser, $wgEnableParserCache, $wgParser;

        wfProfileIn( __METHOD__ );
@@ -3674,6 +3674,6 @@
        $u = new SiteStatsUpdate( 0, 1, $this->mGoodAdjustment, $this->mTotalAdjustment );
        array_push( $wgDeferredUpdateList, $u );
-       $u = new SearchUpdate( $id, $title, $text );
+       $u = new SearchUpdate( $id, $title, $wgParser->preprocess( $editInfo->pst, $this->mTitle, $editInfo->popts, $editInfo->revid ) );
        array_push( $wgDeferredUpdateList, $u );
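For illustration, here is a rough sketch of what the deferred search update would then receive, assuming the 1.17-era Parser and Revision APIs. The chapter title is invented, and this is something you might paste into maintenance/eval.php to eyeball the result, not part of the patch itself:

// Pick a main-namespace chapter that is built from Page: transclusions
// (hypothetical title, for illustration only).
$title = Title::newFromText( 'Highways and Byways in Sussex/Chapter 1' );
$rev   = Revision::newFromTitle( $title );
$popts = ParserOptions::newFromUser( $wgUser );

// preprocess() expands templates and template-style transclusions,
// replaces variables and strips comments, but does not render to HTML.
$expanded = $wgParser->preprocess( $rev->getText(), $title, $popts, $rev->getId() );

// $expanded is roughly the text the patched SearchUpdate would index in
// place of the raw chapter wikitext, so the proofread text would become
// findable from the main namespace.
echo strlen( $rev->getText() ) . " -> " . strlen( $expanded ) . "\n";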
This would fix it for the default MySQL search, although I'm not sure what the overhead would be. The Lucene backend uses OAIRepository for incremental updates, and builds whole indexes from XML dumps; thus, the expanded articles would need to be present in *both* of those.
Cheers, r.
2010/12/19 Platonides <Platonides@gmail.com>:
That operation should be fast, as it should be hitting the cache from having just rendered it.
Calling $wgParser->preprocess() won't hit the parser cache, I don't think. I also don't think preprocessed wikitext is cached at all, just the HTML output.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
Calling $wgParser->preprocess() won't hit the parser cache, I don't think. I also don't think preprocessed wikitext is cached at all, just the HTML output.
Roan Kattouw (Catrope)
Not the parser cache, but the preprocessor one; see the preprocessToObj() function in includes/parser/Preprocessor_DOM.php (around line 105). Text bigger than $wgPreprocessorCacheThreshold has its XML serialization stored in the objectcache for a day.
You are right, however, that it isn't the full preprocessed output.
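For reference, the caching described here looks roughly like the following simplified paraphrase of Preprocessor_DOM::preprocessToObj(); it is reconstructed from memory of the 1.17 code, not a verbatim excerpt, and the versioning of the cached value is omitted:

// A method of Preprocessor_DOM; only the caching logic is sketched here.
function preprocessToObj( $text, $flags = 0 ) {
    global $wgMemc, $wgPreprocessorCacheThreshold;

    $xml = false;
    $cacheable = $wgPreprocessorCacheThreshold !== false
        && strlen( $text ) > $wgPreprocessorCacheThreshold;

    if ( $cacheable ) {
        // Keyed on a hash of the raw wikitext, so it only helps when the
        // exact same text is preprocessed again.
        $cacheKey = wfMemcKey( 'preprocess-xml', md5( $text ), $flags );
        $xml = $wgMemc->get( $cacheKey );
    }

    if ( $xml === false ) {
        // Cache miss, or text below the threshold: build the XML
        // serialization of the preprocessor tree and keep it for a day.
        $xml = $this->preprocessToXml( $text, $flags );
        if ( $cacheable ) {
            $wgMemc->set( $cacheKey, $xml, 86400 );
        }
    }

    // The XML is then turned back into a node tree. It is the parse tree of
    // the page's own wikitext, not the fully expanded (template-substituted)
    // output, hence the caveat above that it isn't the full preprocessed output.
    // ...
}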
Is it this parsing issue, or a similar rendering issue, that is also the cause of the book tool not working on transcluded pages at Wikisource?
As per https://bugzilla.wikimedia.org/show_bug.cgi?id=21653
Regards, Andrew
Billinghurst wrote:
Is it this parsing issue, or a similar rendering issue, that is also the cause of the book tool not working on transcluded pages at Wikisource?
As per https://bugzilla.wikimedia.org/show_bug.cgi?id=21653
Regards, Andrew
No, it's a problem with the Collection extension (a problem with /their/ parsing, if you prefer, not a general one).