Is there some established way to ensure that the snippets shown with MediaWiki search results will not include raw Wiki markup? I've looked around and found others noting this issue for both the "Lucene search" extension:
http://stackoverflow.com/questions/778166/mediawiki-lucene-how-to-strip-mark...
...and the default search mechanism:
http://www.mediawiki.org/wiki/Search_issues#.22Search_Results_Show_Markup_Ju...
http://www.searchtools.com/analysis/mediawiki-search/5-mw-search-stinks-show...
Unfortunately, none of these mention a working solution.
Would it be best to use the rendered HTML output of Wiki pages to generate these snippets? That seems like the only way to be sure that you're properly transforming any Wiki text (and handling the syntax additions of any extensions you may have installed, etc.). If so, is there any publicly available code to make that happen?
Thanks, Jamey
On Mon, Sep 14, 2009 at 11:37 AM, Wood, Jamey <Jamey.Wood@nrel.gov> wrote:
> Is there some established way to ensure that the snippets shown with MediaWiki search results will not include raw Wiki markup? I've looked around and found others noting this issue for both the "Lucene search" extension:
> http://stackoverflow.com/questions/778166/mediawiki-lucene-how-to-strip-mark...
> ...and the default search mechanism:
> http://www.mediawiki.org/wiki/Search_issues#.22Search_Results_Show_Markup_Ju...
> http://www.searchtools.com/analysis/mediawiki-search/5-mw-search-stinks-show...
> Unfortunately, none of these mention a working solution.
> Would it be best to use the rendered HTML output of Wiki pages to generate these snippets? That seems like the only way to be sure that you're properly transforming any Wiki text (and handling the syntax additions of any extensions you may have installed, etc). If so, is there any publically-available code to make that happen?
> Thanks, Jamey
In Lucene it is possible to highlight arbitrary text based on the results of a query. First you parse your query and run the search. Next you create a QueryScorer from the query and pass that scorer to a Highlighter. Then you run the text you want to excerpt through an analyzer to get a token stream, and ask the highlighter for the best fragment from it. The text that you pass to the analyzer does not need to be one of the fields of your search hits; it can be totally arbitrary.
MWSearch does contain an analyzer that knows something about wiki markup, but instead of stripping the markup out it just skips it. I don't know whether an analyzer can also be used to remove tokens rather than merely skip them. If it can, you could jump right into the highlighting stage and use that analyzer to preprocess the text before you analyze and highlight it. Otherwise you have to come up with a custom method of getting rid of the markup. Without writing a custom parser, a first approximation would be to keep periods, commas, apostrophes, etc., while deleting all other non-alphanumeric characters.
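To make that first approximation concrete, here is a rough sketch in plain Java, using only the standard library. The class and method names (SnippetCleaner, stripMarkup) are made up for illustration and are not part of MWSearch or Lucene. Two details are my own assumptions rather than anything established: removed characters are replaced with a space instead of being deleted outright, so neighbouring words are not glued together, and runs of two or more apostrophes are dropped first because those are bold/italic markers rather than real apostrophes.

```java
/** Rough wiki-markup stripper for search snippets (hypothetical helper, not part of MWSearch). */
final class SnippetCleaner {
    private SnippetCleaner() {}

    static String stripMarkup(String wikitext) {
        return wikitext
            // Runs of two or more apostrophes are bold/italic markers; drop them
            // so that ordinary single apostrophes (it's, don't) survive.
            .replaceAll("'{2,}", "")
            // Replace everything except letters, digits, whitespace, and basic
            // punctuation with a space (assumption: a space rather than nothing,
            // so that "[[Target|label]]" does not collapse into one word).
            .replaceAll("[^\\p{L}\\p{N}\\s.,'?!;:-]", " ")
            // Collapse the whitespace left behind by the previous step.
            .replaceAll("\\s+", " ")
            .trim();
    }
}
```

Something like SnippetCleaner.stripMarkup(pageText) would then produce the text you hand to the analyzer and highlighter. It is only an approximation: link targets, template names, and template parameters all survive as plain words, so a real parser (or the rendered HTML) would still give cleaner snippets.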
Good luck.
mediawiki-l@lists.wikimedia.org