These are the options we use for the more_like_this query:
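$wgCirrusSearchMoreLikeThisConfig = array(
    'min_doc_freq' => 2, // Minimum number of documents (per shard) that need a term for it to be considered
    'max_query_terms' => 25, // Maximum number of terms picked from the source document to build the query
    'min_term_freq' => 2, // Minimum number of times a term must appear in the source document to be considered
    'percent_terms_to_match' => 0.3, // Fraction of the picked terms that a result has to match
    'min_word_len' => 0, // 0 means no minimum word length
    'max_word_len' => 0, // 0 means no maximum word length
);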
We only use the "text" field of the articles - no weighting based on, well, anything. See the text field in
https://en.wikipedia.org/wiki/Barack_Obama?action=cirrusdump for example.
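For anyone who wants to picture what that turns into, here is a rough sketch of the request body those options would produce. It is not the actual CirrusSearch code: buildMoreLikeThisQuery() is a made-up helper, and the use of 'like_text' plus the exact option key names are assumptions that depend on the Elasticsearch version.

```php
<?php
// Hypothetical helper, not part of CirrusSearch: merges the configured
// options into a more_like_this clause restricted to the given fields.
function buildMoreLikeThisQuery( array $config, array $fields, $likeText ) {
    $clause = array_merge(
        array(
            'fields' => $fields,       // we only pass "text" today
            'like_text' => $likeText,  // assumption: text-based variant of the query
        ),
        $config                        // options are passed through unchanged
    );
    return array( 'query' => array( 'more_like_this' => $clause ) );
}

$wgCirrusSearchMoreLikeThisConfig = array(
    'min_doc_freq' => 2,
    'max_query_terms' => 25,
    'min_term_freq' => 2,
    'percent_terms_to_match' => 0.3,
    'min_word_len' => 0,
    'max_word_len' => 0,
);

$query = buildMoreLikeThisQuery(
    $wgCirrusSearchMoreLikeThisConfig,
    array( 'text' ),
    'placeholder text of the source article'
);

// Print the body as JSON to eyeball it.
echo json_encode( $query, JSON_PRETTY_PRINT ), "\n";
```

The point is only that the six options ride along next to a single field list, so swapping either one is a small change.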
Stuff we could do really, really easily:
1. Add URL parameters that override each of those options for easy experimenting.
2. Add URL parameters to use different fields like our weighted all field, the wikitext, or intro paragraphs (don't ask how we extract into paragraphs - it's a horrible hack), or the section headers, or the "secondary" text like the infoboxes and image subtitles.
These are seriously very little work. A couple of hours. A day if we're being really good about testing _and_ someone merges something to core that screws up the tests. If it enables lots of cool experimenting I'm all for doing it.
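To give a feel for how small the two changes above would be, here is a minimal sketch. Everything in it is hypothetical: the function names, the 'mlt_' parameter prefix, the 'mlt_fields' parameter and the example field names other than "text" are invented for illustration and do not exist in CirrusSearch.

```php
<?php
// Hypothetical: override each configured option from a URL parameter
// named <prefix><option>, keeping the default when it is missing or junk.
function overrideMoreLikeThisConfig( array $defaults, array $params, $prefix = 'mlt_' ) {
    $config = $defaults;
    foreach ( $defaults as $name => $default ) {
        $key = $prefix . $name;
        if ( !isset( $params[$key] ) || !is_numeric( $params[$key] ) ) {
            continue; // missing or non-numeric: keep the default
        }
        // Cast to the default's type so percent_terms_to_match stays a float.
        $config[$name] = is_float( $default ) ? (float)$params[$key] : (int)$params[$key];
    }
    return $config;
}

// Hypothetical: pick fields from a comma separated URL parameter, only
// allowing names from a whitelist and falling back to "text".
function chooseMoreLikeThisFields( array $params, array $allowed, $key = 'mlt_fields' ) {
    if ( !isset( $params[$key] ) ) {
        return array( 'text' );
    }
    $fields = array_values( array_intersect( explode( ',', $params[$key] ), $allowed ) );
    return $fields ? $fields : array( 'text' );
}

$defaults = array(
    'min_doc_freq' => 2,
    'max_query_terms' => 25,
    'min_term_freq' => 2,
    'percent_terms_to_match' => 0.3,
    'min_word_len' => 0,
    'max_word_len' => 0,
);

// Simulates ?mlt_min_doc_freq=5&mlt_percent_terms_to_match=0.5&mlt_max_query_terms=lots&mlt_fields=all,bogus
$params = array(
    'mlt_min_doc_freq' => '5',
    'mlt_percent_terms_to_match' => '0.5',
    'mlt_max_query_terms' => 'lots',
    'mlt_fields' => 'all,bogus',
);

$config = overrideMoreLikeThisConfig( $defaults, $params );
// The allowed field names here are illustrative, not the real index mapping.
$fields = chooseMoreLikeThisFields( $params, array( 'text', 'all', 'heading' ) );
```

Whitelisting the fields and ignoring non-numeric values keeps the experiment knobs from turning into a way to send arbitrary input to the search backend.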
Nik