On Mon, Sep 14, 2009 at 11:37 AM, Wood, Jamey <Jamey.Wood(a)nrel.gov> wrote:
Is there some established way to ensure that the snippets shown with
MediaWiki search results will not include raw Wiki markup? I've looked
around and found others noting this issue for both the "Lucene search"
extension:
http://stackoverflow.com/questions/778166/mediawiki-lucene-how-to-strip-mar…
...and the default search mechanism:
http://www.mediawiki.org/wiki/Search_issues#.22Search_Results_Show_Markup_J…
http://www.searchtools.com/analysis/mediawiki-search/5-mw-search-stinks-sho…
Unfortunately, none of these mention a working solution.
Would it be best to use the rendered HTML output of Wiki pages to generate
these snippets? That seems like the only way to be sure that you're
properly transforming any Wiki text (and handling the syntax additions of
any extensions you may have installed, etc). If so, is there any
publicly available code to make that happen?
Thanks,
Jamey
In Lucene it is possible to highlight arbitrary text based on the results of
a query. First you parse your query and run the search. Next you create a
QueryScorer from the query and pass that scorer to a highlighter. Then you
run the text you want to excerpt through an analyzer to get a token stream
and ask the highlighter for the best fragment from it. The text that you
pass to the analyzer does not need to be one of the fields of your search
hits; it can be totally arbitrary.
MWSearch does contain an analyzer that knows something about wiki markup,
but instead of stripping the markup out it just skips it. I don't know if you
can also use an analyzer to remove tokens rather than just skip them. If so,
you could jump right into the highlighting stage and use that analyzer to
quickly preprocess the text that you analyze and then highlight. Otherwise
you have to come up with a custom method of getting rid of the markup.
Without writing a custom parser, a first approximation would be to keep
periods, commas, apostrophes, etc., while deleting all other
non-alphanumeric characters.
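Here is a minimal sketch of that first approximation in plain Java, with no
Lucene dependency. The class and method names (MarkupStripper, strip) are
made up for illustration, and the exact set of punctuation to keep is my
guess at what "etc." should cover. One deliberate deviation: dropped
characters are replaced with a space rather than deleted outright, so that
something like [[Main Page|home]] does not collapse into "Main Pagehome".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of MediaWiki, MWSearch, or Lucene.
final class MarkupStripper {

    // Anything that is not a letter, a digit, whitespace, or one of the
    // kept punctuation marks (. , ' ! ? ;) is treated as markup.
    private static final Pattern NOT_KEPT =
            Pattern.compile("[^\\p{L}\\p{N}\\s.,'!?;]");

    // Runs of whitespace, including newlines, get collapsed to one space.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String strip(String wikiText) {
        // Replace every non-kept character with a space so that words
        // separated only by markup stay separated.
        String spaced = NOT_KEPT.matcher(wikiText).replaceAll(" ");
        return WHITESPACE.matcher(spaced).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = "== History ==\nThe [[Lucene]] library is fast, isn't it?";
        // prints: History The Lucene library is fast, isn't it?
        System.out.println(strip(raw));
    }
}
```

The output of strip() is what you would hand to the analyzer and the
highlighter in place of the raw page text. It is only an approximation:
because apostrophes are kept, '''bold''' and ''italic'' quotes survive, and
template names and parameters come through as ordinary words.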
Good luck.