I have an idea for an improvement to the system of redirects, by using pattern-based aliases. We've discussed it a bit on wikien-l where it has some support, so I'm posting here to find out:
a) If it's feasible (i.e., not computationally too expensive)
b) How much work is required to implement it
c) If it was implemented, whether it would be enabled at Wikipedia
d) If anyone is interested in actually implementing it. If not, I may have a go myself.
The problem: Many pages require a largish number of redirects to cope with differences in spelling, optional words, accented characters etc. It's a surprising amount of work to create and maintain these, given that the value of each individual redirect is so low. For example, [[Thomas-François Dalibard]] might be spelt four ways, three of which need a redirect: Thomas-Francois Dalibard, Thomas François Dalibard, Thomas Francois Dalibard.
General solution: Instead of having redirects that point to a page, have the page itself specify aliases which can be used to find it. The aliases are specified as a pattern, like a very cut-down regexp:
    #ALIASES Thomas[-| ]Fran[ç|c]ois Dalibard
The proposed syntax would be as follows (but is debatable):
Foo - matches Foo
[Foo] - matches Foo or blank.
[Foo|Moo] - matches Foo or Moo.
[Foo|Moo|] or [|Foo|Moo] - matches Foo or Moo or blank.
Foo[Moo - matches the literal string Foo[Moo
All whitespace is equivalent to a single space. So "Boo [Foo] [Moo] Woo" matches "Boo Woo", rather than "Boo<space><space><space>Woo" for instance.
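To make the syntax above concrete, here is a minimal sketch in Python that expands a pattern into every title it matches. The function names are made up for illustration, and it assumes brackets are never nested, which the proposal doesn't address either way.

```python
import itertools
import re

def parse_alias(pattern):
    """Split a pattern into segments; each segment is a list of alternatives."""
    segments = []
    i = 0
    while i < len(pattern):
        # A "[" only opens a group if a closing "]" follows somewhere.
        close = pattern.find("]", i) if pattern[i] == "[" else -1
        if close != -1:
            options = pattern[i + 1:close].split("|")
            if len(options) == 1:
                options.append("")  # [Foo] means "Foo or blank"
            segments.append(options)
            i = close + 1
        else:
            # Literal character, including an unmatched "[" as in Foo[Moo
            segments.append([pattern[i]])
            i += 1
    return segments

def expand_alias(pattern):
    """Return the set of titles matched by a pattern (hypothetical helper)."""
    titles = set()
    for combo in itertools.product(*parse_alias(pattern)):
        # All whitespace is equivalent to a single space.
        titles.add(re.sub(r"\s+", " ", "".join(combo)).strip())
    return titles
```

With this sketch, expand_alias("Boo [Foo] [Moo] Woo") yields four titles, one of which is "Boo Woo" with a single space.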
When a user searches for a term (using "Go"), MediaWiki would perform a normal query first, and if that fails, do an alias-based search. Thus:
- Search term matches no real pages, no aliases: takes you to some search results.
- Search term matches one real page, no aliases: takes you to real page.
- Search term matches one real page, some aliases: takes you to real page. (Arguably gives you a "did you mean...?" banner, but not critical)
- Search term matches one alias, no real page: takes you to page.
- Search term matches several aliases, no real page: either an automatically generated disambiguation page, or shows you search results with the matching aliases shown first.
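The cases above could be sketched roughly as follows. This is only an illustration using in-memory data: the function name, the return values and the shape of the alias lookup (a dict from alias title to the pages claiming it) are all my assumptions, not anything in MediaWiki.

```python
def resolve(term, pages, alias_index):
    """Decide where a "Go" search should land.

    pages: set of real page titles (hypothetical stand-in for the page table).
    alias_index: dict mapping an alias title to the set of pages that claim it.
    """
    # A real page always wins, whether or not aliases also match.
    if term in pages:
        return ("page", term)
    targets = sorted(alias_index.get(term, ()))
    if len(targets) == 1:
        # Exactly one alias matches: go straight to that page.
        return ("page", targets[0])
    if targets:
        # Several aliases match: disambiguate (or list them first in results).
        return ("disambiguation", targets)
    # Nothing matches: fall back to ordinary search results.
    return ("search", term)
```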
An automatically generated disambiguation page could make use of some other hypothetical keyword like {{disambig|A 19th century novelist best known for ...}}. So embedding in search results might be simpler, and would work well if it could be forced to show the first sentence or two from the article.
Unresolved issues:
* Since pattern matching is prone to abuse, the total number of titles that a single alias pattern can match should be restricted in some way, perhaps to 10 or 20. The best way to handle an excessively broad pattern (e.g., [A|b|c|d|e][A|b|c|d|e] etc) is left as an open question. Possibilities include silently failing, noisily failing (with an error message in the rendered text), a special page for bad aliases...
* Whether there should be just one #ALIASES statement, or whether several would be allowed. Allowing several would be much more beginner friendly - they could simply state all the intended redirects explicitly.
* The role of redirects once this system is in place. One possible implementation would simply create and destroy redirects as required. In any case, they would still be needed for some licensing issues.
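On the first point, the number of expansions can be bounded without generating them, since it is just the product of the number of alternatives in each bracket group. A rough sketch follows, again assuming no nested brackets. The limit of 20 and the function names are placeholders of mine.

```python
import re

MAX_ALIASES = 20  # hypothetical limit; the proposal suggests "perhaps 10 or 20"

def count_expansions(pattern):
    """Upper bound on the number of titles a pattern can match."""
    count = 1
    # Each group runs from a "[" to the next "]"; an unmatched "[" is literal.
    for group in re.findall(r"\[([^\]]*)\]", pattern):
        options = group.split("|")
        # [Foo] has a single option but matches two things: Foo or blank.
        count *= max(len(options), 2)
    return count

def is_acceptable(pattern):
    """True if the pattern stays within the (assumed) limit."""
    return count_expansions(pattern) <= MAX_ALIASES
```

It is only an upper bound because duplicate alternatives or whitespace collapsing can make two expansions identical. The abusive example above counts as 5 x 5 = 25 and would be rejected under a limit of 20.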
Possible implementation: Without knowing the MediaWiki DB schema at all, I speculated on a possible implementation that would be a good tradeoff between size and speed. Two new tables are needed:
AliasesRaw would contain a constantly updated list of the actual alias patterns used in articles. Each time an article is saved, this table would be updated if necessary.
AliasesExpanded would contain expansions of these aliases, either fully or partially. So an expansion of
    #ALIASES [City of ][Greater ]Melbourne[, Victoria| (Australia)]
to 5 characters would lead to three rows:
    "City ", "of [Greater ]Melbourne[, Victoria| (Australia)]"
    "Great", "er Melbourne[, Victoria| (Australia)]"
    "Melbo", "urne[, Victoria| (Australia)]"
This means that if a user searches for "Greater Melbourne", then the search process would go something like:
- Look for an article called Greater Melbourne, GREATER MELBOURNE, greater melbourne (as at present) - assume this fails.
- Look up "Great" in the AliasesExpanded table. Now iterate over the matching rows, looking for one whose remaining pattern matches the rest of the search term.
Obviously the number of characters stored in the expanded aliases could be tuned.
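Here is a simplified sketch of the prefix lookup, with a plain dict standing in for the AliasesExpanded table. To keep it short it assumes the aliases have already been fully expanded, and it stores each full title under its prefix instead of the unexpanded remainder of the pattern. So it shows the lookup step, not the space saving. All the names are hypothetical.

```python
from collections import defaultdict

PREFIX_LEN = 5  # the tunable number of characters used as the key

def build_index(expanded):
    """expanded: dict mapping a page title to its fully expanded alias titles."""
    index = defaultdict(list)
    for page, titles in expanded.items():
        for title in titles:
            # Key each alias by its first PREFIX_LEN characters.
            index[title[:PREFIX_LEN]].append((title, page))
    return index

def lookup(index, term):
    """Return the pages whose aliases match the search term exactly."""
    candidates = index.get(term[:PREFIX_LEN], [])
    # Iterate over the rows sharing the prefix and keep the exact matches.
    return sorted({page for title, page in candidates if title == term})
```

With the Melbourne example, the keys come out as "City ", "Great" and "Melbo", and a search for "Greater Melbourne" only needs to scan the rows stored under "Great".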
I look forward to any comments,
Steve