I have an idea for an improvement to the system of redirects, by using pattern-based aliases. We've discussed it a bit on wikien-l where it has some support, so I'm posting here to find out:
a) If it's feasible (i.e., not computationally too expensive)
b) How much work is required to implement it
c) If it were implemented, whether it would be enabled at Wikipedia
d) If anyone is interested in actually implementing it. If not, I may have a go myself.
The problem: Many pages require a largeish number of redirects, to cope with differences in spelling, optional words, accented characters etc. It's a surprising amount of work to create and maintain these, when the value of each individual redirect is so low. For example, [[Thomas-François Dalibard]] might be spelt four ways, three of which require a redirect: Thomas-Francois Dalibard, Thomas François Dalibard, Thomas Francois Dalibard.
General solution: Instead of having redirects that point to a page, have the page itself specify aliases which can be used to find it. This is specified as a pattern, like a very cut-down regexp: #ALIASES Thomas[-]Fran[ç|c]ois Dalibard
The proposed syntax would be as follows (but is debatable):
Foo - matches Foo
[Foo] - matches Foo or blank.
[Foo|Moo] - matches Foo or Moo.
[Foo|Moo|] or [|Foo|Moo] - matches Foo or Moo or blank.
Foo[Moo - matches the literal string Foo[Moo
All whitespace is equivalent to a single space. So "Boo [Foo] [Moo] Woo" matches "Boo Woo", rather than "Boo<space><space><space>Woo" for instance.
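To make the proposed semantics concrete, here is a minimal sketch of the expansion in Python. It assumes non-nested brackets and treats an unmatched "[" as a literal, per the rules above; the function name and approach are mine, not part of the proposal:

```python
import itertools
import re

def expand_pattern(pattern):
    """Expand a cut-down alias pattern into every title it matches."""
    # Split into literal text and [a|b|...] alternation groups.
    parts = re.split(r'(\[[^\[\]]*\])', pattern)
    choices = []
    for part in parts:
        if part.startswith('[') and part.endswith(']'):
            alts = part[1:-1].split('|')
            if len(alts) == 1:
                # A lone [Foo] matches Foo or blank.
                alts.append('')
            choices.append(alts)
        else:
            # Literal text, including any unmatched '['.
            choices.append([part])
    titles = set()
    for combo in itertools.product(*choices):
        # All whitespace is equivalent to a single space.
        titles.add(re.sub(r'\s+', ' ', ''.join(combo)).strip())
    return titles
```

For instance, expand_pattern("[Greater ]Melbourne[, Victoria]") yields the four titles "Greater Melbourne, Victoria", "Greater Melbourne", "Melbourne, Victoria" and "Melbourne", and "Boo [Foo] [Moo] Woo" with both groups blank collapses to "Boo Woo".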
When a user searches for a term (using "Go"), MediaWiki would perform a normal query first, and if that fails, do an alias-based search. Thus:
- Search term matches no real pages, no aliases: takes you to some search results.
- Search term matches one real page, no aliases: takes you to the real page.
- Search term matches one real page, some aliases: takes you to the real page. (Arguably gives you a "did you mean...?" banner, but not critical.)
- Search term matches one alias, no real page: takes you to that page.
- Search term matches several aliases, no real page: either an automatically generated disambiguation page, or shows you search results with the matching aliases shown first.
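Those rules amount to a small decision function. A sketch in Python, where pages and alias_index are hypothetical stand-ins for the real title and alias tables:

```python
def resolve_go(query, pages, alias_index):
    """Resolve a "Go" query per the rules above.

    pages: set of real page titles.
    alias_index: dict mapping a search term to the list of page
    titles whose alias pattern matches it.
    """
    if query in pages:
        return ("page", query)           # a real page always wins
    hits = alias_index.get(query, [])
    if len(hits) == 1:
        return ("page", hits[0])         # one alias: go straight there
    if len(hits) > 1:
        return ("disambiguation", hits)  # several aliases: disambiguate
    return ("search_results", None)      # nothing matched
```

So resolve_go("Greater Melbourne", {"Melbourne"}, {"Greater Melbourne": ["Melbourne"]}) lands on the Melbourne page, while a query matching several aliases falls through to disambiguation.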
An automatically generated disambiguation page could make use of some other hypothetical keyword like {{disambig|A 19th century novelist best known for ...}}. So embedding in search results might be simpler, and would work well if it could be forced to show the first sentence or two from the article.
Unresolved issues:
* Since pattern matching is prone to abuse, the total number of matching aliases should be restricted in some way, perhaps to 10 or 20. The best way to handle an excessively broad query (eg, [A|b|c|d|e][A|b|c|d|e] etc) is left as an open question. Possibilities include silently failing, noisily failing (with an error message in the rendered text), a special page for bad aliases...
* Whether there should be just one #ALIASES statement, or whether multiple would be allowed. Allowing several would be much more beginner-friendly - editors could simply state all the intended redirects explicitly.
* The role of redirects once this system is in place. One possible implementation would simply create and destroy redirects as required. In any case, they would still be needed for some licensing issues.
Possible implementation: Without knowing the MediaWiki DB schema at all, I speculated on a possible implementation that would be a good tradeoff between size and speed. Two new tables are needed:
AliasesRaw would contain a constantly updated list of the actual alias patterns used in articles. Each time an article is saved, this would possibly be updated. AliasesExpanded would contain expansions of these aliases, either fully or partially. So an expansion of #ALIASES [City of ][Greater ]Melbourne[, Victoria| (Australia)] to 5 characters would lead to three rows:
"City ", "of [Greater ]Melbourne[, Victoria| (Australia)]"
"Great", "er Melbourne[, Victoria| (Australia)]"
"Melbo", "urne[, Victoria| (Australia)]"
This means that if a user searches for "Greater Melbourne", the search process would go something like:
- Look for an article called Greater Melbourne, GREATER MELBOURNE, greater melbourne (as at present) - assume this fails.
- Look up "Great" in the AliasesExpanded table, then iterate over the matching results, finding one that matches.
Obviously the number of characters stored in the expanded aliases could be tuned.
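A simplified sketch of that prefix-keyed table and lookup, in Python. For brevity it stores fully expanded titles rather than the partially expanded remainders described above, and the sorted list stands in for a DB index; names are mine:

```python
from bisect import bisect_left

N = 5  # number of stored prefix characters; tunable, as noted above

def build_rows(expanded_titles):
    """Build AliasesExpanded-style rows keyed on the first N chars."""
    return sorted((title[:N], title) for title in expanded_titles)

def lookup(rows, query):
    """Binary-search the prefix, then check each candidate in full
    (the 'iterate over the matching results' step)."""
    key = query[:N]
    i = bisect_left(rows, (key, ""))
    hits = []
    while i < len(rows) and rows[i][0] == key:
        if rows[i][1] == query:
            hits.append(rows[i][1])
        i += 1
    return hits
```

With rows built from ["City of Greater Melbourne", "Greater Melbourne", "Melbourne"], lookup(rows, "Greater Melbourne") only ever inspects the rows keyed "Great", which is the point of the scheme.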
I look forward to any comments, Steve
On 10/24/07, Steve Bennett stevagewp@gmail.com wrote:
Possible implementation: Without knowing the MediaWiki DB schema at all, I speculated on a possible implementation that would be a good tradeoff between size and speed. Two new tables are needed:
<snip>
No need for the complex setup you envisage. For mysql, at least, we could create a new table 'article_aliases', and "select aa_page from article_aliases where 'my_title' like aa_alias". Of course, we'd need to do some built-in, potentially expensive checking on the aliases that would be originally introduced, like checking if any other pages match the regex (if so, block the alias), and if the article title itself matches the regex (if not, block the alias).
I have no idea how portable this would be to postgres and other database engines, but it could potentially work as an extension.
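A minimal runnable sketch of that single-table idea, using SQLite in place of MySQL. The table and column names (article_aliases, aa_page, aa_alias) come from the message above; note the stored patterns here use SQL LIKE wildcards, not the bracket syntax from the proposal:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article_aliases (aa_page TEXT, aa_alias TEXT)")
conn.executemany(
    "INSERT INTO article_aliases VALUES (?, ?)",
    [("Melbourne", "%Greater Melbourne%"),
     ("Melbourne", "Melbourne, Victoria")])

# The search: match the incoming title against every stored pattern.
rows = conn.execute(
    "SELECT aa_page FROM article_aliases WHERE ? LIKE aa_alias",
    ("City of Greater Melbourne",),
).fetchall()
print(rows)  # [('Melbourne',)]
```

This also illustrates the cost concern: with no usable index, the LIKE comparison scans every alias row for each failed title lookup.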
Hoi, I am afraid that as the number of articles grows, the existence of redirects becomes increasingly problematic, because more and more disambiguation will be needed. Existing redirects are not considered when disambiguation is implemented. Redirects ARE problematic, and by automagically creating a vast number of additional redirects it becomes even more of a nightmare. Thanks, GerardM
On 10/24/07, Andrew Garrett andrew@epstone.net wrote:
On 10/24/07, Steve Bennett stevagewp@gmail.com wrote:
Possible implementation: Without knowing the MediaWiki DB schema at all, I speculated on a possible implementation that would be a good tradeoff between size and speed. Two new tables are needed:
<snip>
No need for the complex setup you envisage. For mysql, at least, we could create a new table 'article_aliases', and "select aa_page from article_aliases where 'my_title' like aa_alias". Of course, we'd need to do some built-in, potentially expensive checking on the aliases that would be originally introduced, like checking if any other pages match the regex (if so, block the alias), and if the article title itself matches the regex (if not, block the alias).
I have no idea how portable this would be to postgres and other database engines, but it could potentially work as an extension.
-- Andrew Garrett
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
[replies to several messages here] On 10/24/07, GerardM gerard.meijssen@gmail.com wrote:
I am afraid that as the number of articles grows, the existence of redirects becomes increasingly problematic, because more and more disambiguation will be needed. Existing redirects are not considered when disambiguation is implemented. Redirects ARE problematic, and by automagically creating a vast number of additional redirects it becomes even more of a nightmare.
I assume you're talking about the possibility of implementing aliases as genuine redirects: yes, that would cause problems. However, if they were "automagically created", presumably they can be "automagically destroyed". I haven't really thought through this idea much - it scares me.
If not, there is no problem, and the existence of aliases will in fact tend to reduce the number of redirects. A given article with 5 redirects would probably be replaced by just the article and 5 aliases, which must be much less expensive to store - a maximum of 6 table entries in my scheme, with no article text to consider.
Brianna wrote:
For anyone who is considering implementing something like this, please give some thought to how it could work in a multilingual context (defining the equivalent name in other languages)
I don't quite understand - are you talking about interwiki links? Or do you mean non-Latin character sets? Could you be more specific, perhaps with an example problem?
and also for categories (that might be pushing it...).
Do you mean, aliases to categories? Would be nice, but even redirects to categories don't work properly yet. One problem at a time, I think :)
Andrew wrote:
No need for the complex setup you envisage. For mysql, at least, we could create a new table 'article_aliases', and "select aa_page from article_aliases where 'my_title' like aa_alias". Of course, we'd need to do some built-in, potentially expensive checking on the aliases that would be originally introduced, like checking if any other pages match the regex (if so, block the alias), and if the article title itself matches the regex (if not, block the alias).
Mysql supports regexp-based matches? If so, cool. I only know SQL Server which, last time I checked, only supports wildcards, which wouldn't be strong enough. The main reason for my complex scheme is that the two endpoints seem expensive:
- Zero-expansion endpoint: every incoming query that doesn't match any real article titles has to be compared against a very large number of aliases - expensive on query time.
- Complete-expansion endpoint: every alias pattern has to be fully expanded into all the possible matching queries. Say there were 3 million pages (en wiki including non-articles?) with an average of 5 aliases (presumably there will be more aliases than there currently are redirects, because they're so easy to make), that's 15 million entries in a table. That seems expensive, but perhaps not?
Then again, the table would have to be somewhere between 3 million and 15 million entries in that case anyway, so....
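The back-of-envelope numbers, spelled out (the figures are the assumed ones from above, not measurements):

```python
# Complete-expansion endpoint: rows in the expanded table.
pages = 3_000_000
aliases_per_page = 5
full_expansion_rows = pages * aliases_per_page
print(full_expansion_rows)  # 15000000

# Why a cap on matching aliases is needed: one abusive group like
# [A|b|c|d|e] repeated k times multiplies out geometrically.
alternatives, repeats = 5, 6
print(alternatives ** repeats)  # 15625 expansions from a single line
```

A single six-group pattern already dwarfs the proposed 10-20 match limit, which is the argument for rejecting overly broad patterns at save time.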
My lack of experience with large databases on serious hardware forces me to shut up now and revert to my original appeal for someone more knowledgeable to handle the feasibility side of things :)
Steve
On 24/10/2007, Steve Bennett stevagewp@gmail.com wrote:
Brianna wrote:
For anyone who is considering implementing something like this, please give some thought to how it could work in a multilingual context (defining the equivalent name in other languages)
I don't quite understand - are you talking about interwiki links? Or do you mean, non-latin character sets? Could you be more specific, perhaps with a example problem?
No not interwiki links. Consider Commons galleries. I think so far you have considered it in relation to aliases within a single language only, right?
and also for categories (that might be pushing it...).
Do you mean, aliases to categories? Would be nice, but even redirects to categories don't work properly yet. One problem at a time, I think :)
Yes, well... :) perhaps: consider it for category namespaces as well. or consider it for all (content-ish) namespaces.
cheers Brianna
On 10/24/07, Brianna Laugher brianna.laugher@gmail.com wrote:
No not interwiki links. Consider Commons galleries. I think so far you have considered it in relation to aliases within a single language only, right?
Oh, I see. Yes I had, but it should work equally across languages, particularly if nested patterns are supported. The same goes for names of foreign things on Wikipedia, like Turin/Torino, and even concepts like Mulled wine/Vin chaud/Glühwein. This latter one could be implemented either as:
#ALIASES [Mulled wine|Vin chaud|Glühwein|Glu[e]hwein]
or #ALIASES Mulled wine #ALIASES Vin chaud #ALIASES [Glühwein|Glu[e]hwein]
Convention will probably dictate which is preferable.
IMHO with the number of second-language speakers of English, it would make sense to have even more foreign language aliases to English words, but that's a separate issue.
Is there a benefit, perhaps, in explicitly marking which language the redirect applies to? Not sure. In an ideal world, you could have a situation where a Commons user with an English GUI would have "Worms" auto-disambiguated to the animal, whereas a German user would be taken to the town...but that's probably a bit further down the track.
Steve
On Wed, Oct 24, 2007 at 05:06:31PM +1000, Steve Bennett wrote:
[replies to several messages here] On 10/24/07, GerardM gerard.meijssen@gmail.com wrote:
I am afraid that as the number of articles grows, the existence of redirects becomes increasingly problematic, because more and more disambiguation will be needed. Existing redirects are not considered when disambiguation is implemented. Redirects ARE problematic, and by automagically creating a vast number of additional redirects it becomes even more of a nightmare.
I assume you're talking about the possibility of implementing aliases as genuine redirects: yes, that would cause problems. However, if they were "automagically created", presumably they can be "automagically destroyed". I haven't really thought through this idea much - it scares me.
If not, there is no problem, and the existence of aliases will in fact tend to reduce the number of redirects. A given article with 5 redirects would probably be replaced by just the article and 5 aliases, which must be much less expensive to store - a maximum of 6 table entries in my scheme, with no article text to consider.
I'd like to recommend that anyone who thinks aliases are a Pretty Neat Idea go locate some coverage of the Come From statement in the programming language Intercal, and muse upon the potential similarities.
Cheers, -- jra
On 10/25/07, Jay R. Ashworth jra@baylink.com wrote:
I'd like to recommend that anyone who thinks aliases are a Pretty Neat Idea go locate some coverage of the Come From statement in the programming language Intercal, and muse upon the potential similarities.
Heh, the thought occurred to me too. However, in Intercal, the chaos arises from a program in execution unexpectedly and suddenly leaping from any point in the code to an arbitrary and unconnected function. In this proposal, the leaping always takes place at exactly the same point: when a user has entered a query, and that query has not matched the name of an actual page in the database.
Then again, I hadn't thought through the behaviour of what happens when you [[link]] to an aliased term. Logically the behaviour ought to be:
- If a real page exists, link to that.
- Otherwise, if a single alias matches, link to that.
- Otherwise, link to an automatic disambiguation page.
This actually presents a few complexities, as links themselves are stored in a links table, and would have to be updated if the aliases change. It's also not clear whether the third case above should be a red or blue link. Similarly, if a user links to [[John X Smith]], but the actual page is [[John Xavier Smith]] with an alias, what should happen exactly?
Some other issues that also occur to me:
- does template transclusion work on an alias? If not, why not?
- does it work in other namespaces? Which namespace does the alias apply in? Does this mean that [[Wikipedia:What Wikipedia is not]] cannot have a line like #ALIASES WP:NOT? If aliases are applied in the main namespace, how do we stop people using them in user pages etc?
Suggestions welcome.
Steve
On Thu, Oct 25, 2007 at 11:00:39AM +1000, Steve Bennett wrote:
On 10/25/07, Jay R. Ashworth jra@baylink.com wrote:
I'd like to recommend that anyone who thinks aliases are a Pretty Neat Idea go locate some coverage of the Come From statement in the programming language Intercal, and muse upon the potential similarities.
Heh, the thought occurred to me too. However, in Intercal, the chaos arises from a program in execution unexpectedly and suddenly leaping from any point in the code to an arbitrary and unconnected function. In this proposal, the leaping always takes place at exactly the same point: when a user has entered a query, and that query has not matched the name of an actual page in the database.
Sure, but my point is more that "It's Magic!".
Magic violates the Principle of Least Astonishment.
Then again, I hadn't thought through the behaviour of what happens when you [[link]] to an aliased term. Logically the behaviour ought to be:
- If a real page exists, link to that
- Otherwise, if a single alias matches, link to that.
- Otherwise, link to an automatic disambiguation page.
Will those be extensible, as category pages are? Based on the disambigs *I've* seen, assuming you can *do* that automatically may not be all that safe.
This actually presents a few complexities, as links themselves are stored in a links table, and would have to be updated if the aliases change. It's also not clear whether the third case above should be a red or blue link.
How is that handled with :Category:?
Similarly, if a user links to [[John X Smith]], but the actual page is [[John Xavier Smith]] with an alias, what should happen exactly?
Well, that's the same question as "What happens with redirects now", which I'm always on the wrong side of; right?
Some other issues that also occur to me:
- does template transclusion work on an alias? If not, why not?
Would it work on a redirect? If so, why shouldn't it work on an alias?
- does it work in other namespaces? Which namespace does the alias apply in? Does this mean that [[Wikipedia:What Wikipedia is not]] cannot have a line like #ALIASES WP:NOT? If aliases are applied in the main namespace, how do we stop people using them in user pages etc?
Yeah; there are *lots* of potential pitfalls, aren't there?
Are they design? Or merely implementation? Given that they don't seem to be problems for redirects, I suspect they're implementation.
Is there a way to get the good parts of this idea while sticking with redirects as the actual implementation?
Cheers, -- jra
On 10/25/07, Jay R. Ashworth jra@baylink.com wrote:
Sure, but my point is more that "It's Magic!".
Magic violates the Principle of Least Astonishment.
What I've been proposing honestly doesn't strike me as particularly magical or astonishing. But creating bona-fide redirects now looks like it has other advantages.
Then again, I hadn't thought through the behaviour of what happens when you
[[link]] to an aliased term. Logically the behaviour ought to be:
- If a real page exists, link to that
- Otherwise, if a single alias matches, link to that.
- Otherwise, link to an automatic disambiguation page.
Will those be extensible, as category pages are? Based on the disambigs *I've* seen, assuming you can *do* that automatically may not be all that safe.
In what way are category pages extensible? You mean in the brief text at the top? I was envisaging automatic (probably better called dynamic) disambiguation pages as being completely generated on the fly. If you wanted to tweak something, you would replace it with a real disambiguation page. There are problems with this proposal.
This actually presents a few complexities, as links themselves are stored in a links table, and would have to be updated if the aliases change. It's also not clear whether the third case above should be a red or blue link.
How is that handled with :Category:?
I'm not sure what analogy you're making exactly, but an interesting, weird and possibly relevant thing does happen with categories: linking to a category which contains articles, but does not itself exist as a "page" shows as a red, but functional link.
Some other issues that also occur to me:
- does template transclusion work on an alias? If not, why not?
Would it work on a redirect? If so, why shouldn't it work on an alias?
Yeah, there's no problem transcluding {{clr}} which redirects to {{-}}. Why not? Perhaps because the potential for damage (malicious or otherwise) is greater.
Yeah; there are *lots* of potential pitfalls, aren't there?
Are they design? Or merely implementation? Given that they don't seem to be problems for redirects, I suspect they're implementation.
Is there a way to get the good parts of this idea while sticking with redirects as the actual implementation?
I'll put my thinking cap on. There's a bit of a problem in terms of trying to make whatever feature "fit in" with the existing MediaWiki feature set and general look and feel, behaviour etc. Is it ok to break that by using lots of javascript to list and edit redirects? Is it ok to write to a page other than the one the user is looking at? Is it ok to pop open a new window to facilitate the user editing multiple pages at once? Is it ok to generate code for a disambiguation page and ask the user to review it?
All of these things would be novel.
Steve
On Thu, Oct 25, 2007 at 02:28:07PM +1000, Steve Bennett wrote:
Will those be extensible, as category pages are? Based on the disambigs *I've* seen, assuming you can *do* that automatically may not be all that safe.
In what way are category pages extensible? You mean in the brief text at the top? I was envisaging automatic (probably better called dynamic) disambiguation pages as being completely generated on the fly. If you wanted to tweak something, you would replace it with a real disambiguation page. There are problems with this proposal.
This actually presents a few complexities, as links themselves are stored in a links table, and would have to be updated if the aliases change. It's also not clear whether the third case above should be a red or blue link.
How is that handled with :Category:?
I'm not sure what analogy you're making exactly, but an interesting, weird and possibly relevant thing does happen with categories: linking to a category which contains articles, but does not itself exist as a "page" shows as a red, but functional link.
Yep, that was what I was talking about. It's red unless someone's made it "actually be a page" by putting content on it... even if there are items there which you will see when you click the redlink.
Weirded me out the first time I noticed it.
Some other issues that also occur to me:
- does template transclusion work on an alias? If not, why not?
Would it work on a redirect? If so, why shouldn't it work on an alias?
Yeah, there's no problem transcluding {{clr}} which redirects to {{-}}. Why not? Perhaps because the potential for damage (malicious or otherwise) is greater.
Yeah; there are *lots* of potential pitfalls, aren't there?
Such was my instinct, yes. Takes me a while to back up those instincts, sometimes, though...
Are they design? Or merely implementation? Given that they don't seem to be problems for redirects, I suspect they're implementation.
Is there a way to get the good parts of this idea while sticking with redirects as the actual implementation?
I'll put my thinking cap on. There's a bit of a problem in terms of trying to make whatever feature "fit in" with the existing MediaWiki feature set and general look and feel, behaviour etc. Is it ok to break that by using lots of javascript to list and edit redirects? Is it ok to write to a page other than the one the user is looking at? Is it ok to pop open a new window to facilitate the user editing multiple pages at once? Is it ok to generate code for a disambiguation page and ask the user to review it?
My instinct on this one is "Installed Base". The basic structure of MW is well known on a sufficiently wide scale that fundamental changes to it -- which I feel this is -- merit fairly deep study.
Cheers, -- jra
On 10/25/07, Steve Bennett stevagewp@gmail.com wrote:
Is it ok to break that by using lots of javascript to list and edit redirects?
Not unless there's a good fallback.
Is it ok to write to a page other than the one the user is looking at?
Sure, why not? I'm not sure you'd want to, though.
Is it ok to pop open a new window to facilitate the user editing multiple pages at once?
Absolutely not. Allow the user to open new windows by Shift-click, middle-click, etc. exactly if they choose to.
Is it ok to generate code for a disambiguation page and ask the user to review it?
Yes, but that strikes me as not the best way to go about it. I'm thinking that disambiguation and redirect pages should work more like category pages than content pages.
On 10/26/07, Simetrical Simetrical+wikilist@gmail.com wrote:
On 10/25/07, Steve Bennett stevagewp@gmail.com wrote:
Is it ok to break that by using lots of javascript to list and edit redirects?
Not unless there's a good fallback.
What if the fallback is "edit redirects by hand, as is done currently"? In other words, is it ok to add a new feature that requires javascript, if it is effectively an optional feature?
Is it ok to write to a page other than the one the user is looking at?
Sure, why not? I'm not sure you'd want to, though.
Well for instance, creating or updating a redirect would require modifying a page to add the #REDIRECT text...
Yes, but that strikes me as not the best way to go about it. I'm thinking that disambiguation and redirect pages should work more like category pages than content pages.
How would this work? Any ideas? I can't see that dynamic disambiguation pages could ever fully replace manual disambiguation pages, in the same way that categories don't fully supplant lists. Do we want a mechanism whereby both manual and dynamic disambiguation could take place for the same query?
Steve
On 10/26/07, Steve Bennett stevagewp@gmail.com wrote:
What if the fallback is "edit redirects by hand, as is done currently"? In other words, is it ok to add a new feature that requires javascript, if it is effectively an optional feature?
I would say no: currently just about every MW feature that exists with JavaScript exists without it as well, to the extent reasonably possible, and that's good. JavaScript should only be required where it absolutely must be, namely dynamic calculations or adjustments of page elements, and things in that vein.
But it's not my decision. Ask Brion, if you want to know.
Is it ok to write to a page other than the one the user is looking at?
Sure, why not? I'm not sure you'd want to, though.
Well for instance, creating or updating a redirect would require modifying a page to add the #REDIRECT text...
I suppose so, if that's the way you're going to implement it. It would be the simplest, yes, in certain respects.
How would this work? Any ideas? I can't see that dynamic disambiguation pages could ever fully replace manual disambiguation pages, in the same way that categories don't fully supplant lists. Do we want a mechanism whereby both manual and dynamic disambiguation could take place for the same query?
Well, as with categories, we could allow arbitrary article text in an introduction sort of thing. If, unlike with categories, we also allowed arbitrary text to accompany each disambig item, and possibly custom ordering of some kind, I see no reason at all why manual disambig pages would need to exist.
On Thu, 25 Oct 2007 11:00:39 +1000, Steve Bennett wrote:
Some other issues that also occur to me:
- does template transclusion work on an alias? If not, why not?
- does it work in other namespaces? Which namespace does the alias apply in? Does this mean that [[Wikipedia:What Wikipedia is not]] cannot have a line like #ALIASES WP:NOT? If aliases are applied in the main namespace, how do we stop people using them in user pages etc?
Suggestions welcome.
Steve
I'd suggest that it not work on transclusion. I don't see much benefit to it.
Actually, if it did work with transclusion, then you'd probably need the aliases list to load the dumps. Otherwise, you couldn't find the templates if you don't have that list to resolve them with, since you can't generate the complete list unless you have the templates to parse them from!
I think it would be ideal if this could be made to work on the search/go form, without affecting things like links & transclusion, which would potentially make the data so much harder to work with. -Steve
On Thu, Oct 25, 2007 at 05:46:30PM -0400, Steve Sanbeg wrote:
I'd suggest that it not work on transclusion. I don't see much benefit to it.
Actually, if it did work with transclusion, then you'd probably need the aliases list to load the dumps. Otherwise, you couldn't find the templates if you don't have that list to resolve them with; since you can't generate the complete list unless have the templates to parse them from!
I think it would be ideal if this could be made to work on the search/go form, without affecting things like links & transclusion, which would potentially make the data so much harder to work with.
I can't say quite why, but this response feels to me like it takes this idea even closer to "we don't rewrite the URL on redirects, and there's a really good reason why" (which I don't remember right now, even though Brion or Tim have explained it to me at least twice). :-)
Cheers, -- jra
On 10/24/07, Steve Bennett stevagewp@gmail.com wrote:
Andrew wrote:
No need for the complex setup you envisage. For mysql, at least, we could create a new table 'article_aliases', and "select aa_page from article_aliases where 'my_title' like aa_alias". Of course, we'd need to do some built-in, potentially expensive checking on the aliases that would be originally introduced, like checking if any other pages match the regex (if so, block the alias), and if the article title itself matches the regex (if not, block the alias).
On thinking about this some more, a single table should do it, with fields page_id, alias_pattern, alias_expanded. Then saving a page is conceptually:
DELETE * from aliases where page_id = @page_id
INSERT aliases (page_id, alias_pattern, alias_expanded) SELECT @page_id, pattern, expanded FROM #temp_aliases
And searching is just: SELECT page_id FROM aliases WHERE alias_pattern = @query
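A runnable sketch of that save/search cycle, using SQLite and the table and column names above (the pattern-expansion step itself is assumed to happen elsewhere):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aliases (page_id INTEGER, "
             "alias_pattern TEXT, alias_expanded TEXT)")

def save_page(page_id, pattern, expanded_aliases):
    # On save: drop the page's old rows, then insert the fresh
    # expansion (the DELETE + INSERT pair above).
    conn.execute("DELETE FROM aliases WHERE page_id = ?", (page_id,))
    conn.executemany(
        "INSERT INTO aliases VALUES (?, ?, ?)",
        [(page_id, pattern, exp) for exp in expanded_aliases])

def search(query):
    # Exact match against the expanded form (the final SELECT above).
    return [row[0] for row in conn.execute(
        "SELECT page_id FROM aliases WHERE alias_expanded = ?", (query,))]

save_page(42, "[Greater ]Melbourne", ["Greater Melbourne", "Melbourne"])
print(search("Greater Melbourne"))  # [42]
```

Because the search is an exact equality on alias_expanded, an ordinary index on that column makes lookups cheap, which is the advantage over matching raw patterns at query time.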
I guess the pattern itself might not even have to be stored.
Steve
For anyone who is considering implementing something like this, please give some thought to how it could work in a multilingual context (defining the equivalent name in other languages) and also for categories (that might be pushing it...).
thanks, Brianna
On 24/10/2007, Steve Bennett stevagewp@gmail.com wrote:
On Wed, 24 Oct 2007 15:30:44 +1000, Steve Bennett wrote:
General solution: Instead of having redirects that point to a page, have the page itself specify aliases which can be used to find it. This is specified as a pattern, like a very cut-down regexp: #ALIASES Thomas[-]Fran[ç|c]ois Dalibard
The syntax looks ambiguous with a numbered list. It should have braces, i.e.:
{{#ALIASES:Thomas[-]Fran[ç|c]ois Dalibard}}
XML should also work: <aliases>Thomas[-]Fran[ç|c]ois Dalibard</aliases>
It should be feasible to implement an extension that would store the aliases, maybe by generating a new kind of redirect if the page doesn't exist.
On 10/25/07, Steve Sanbeg ssanbeg@ask.com wrote:
The syntax looks ambiguous with a numbered list. It should have braces, i.e.:
In the same way that #REDIRECT foo is "ambiguous". That's the analogy I was thinking of when I suggested this syntax. I'm not familiar with the reasons for our different types of syntax: #REDIRECT, {{defaultsort}}, __NOTOC__, <ref>...
{{#ALIASES:Thomas[-]Fran[ç|c]ois Dalibard}}
If that's the best way, then sure. I was slightly concerned that the | would be misinterpreted.
XML should also work:
<aliases>Thomas[-]Fran[ç|c]ois Dalibard</aliases>
I guess. It's a bit wordier.
Steve
On Thu, 25 Oct 2007 11:04:02 +1000, Steve Bennett wrote:
On 10/25/07, Steve Sanbeg ssanbeg@ask.com wrote:
The syntax looks ambiguous with a numbered list. It should have braces, i.e.:
In the same way that #REDIRECT foo is "ambiguous". That's the analogy I was thinking of when I suggested this syntax. I'm not familiar with the reasons for our different types of syntax: #REDIRECT, {{defaultsort}}, __NOTOC__, <ref>...
#REDIRECT is disambiguated with a flag in the DB, which separates redirects from real articles. That way, if you see #REDIRECT in an ordinary article, it's unambiguously a list item. This won't work if you want to put the alias in some random place in the text.
__NOTOC__ is a magic word in the parser. Things like {{defaultsort}} and <ref> can be implemented as extensions, without hacking the core code, which would certainly make life easier when implementing this.
{{#ALIASES:Thomas[-]Fran[ç|c]ois Dalibard}}
If that's the best way, then sure. I was slightly concerned that the | would be misinterpreted.
True, you'd need to find a way around that; either turn off that splitting, use a different character, or reassemble the string, and live without the missing whitespace.
XML should also work:
<aliases>Thomas[-]Fran[ç|c]ois Dalibard</aliases>
I guess. It's a bit wordier.
For a single alias, yes. For multiple aliases, it would seem clearer to just list them one per line in an <aliases> block than the alternatives with parser syntax. And this would give you a block of unbroken text, so no need to worry about the parser messing it up before you see it.
I'm not sure which would end up being the best, but it seems like either could work. It's definitely better to implement it as an extension than to hack completely new syntax into the parser.
-Steve
Steve
On 10/25/07, Steve Sanbeg ssanbeg@ask.com wrote:
__NOTOC__ is a magic word in the parser. Things like {{defaultsort}} and <ref> can be implemented as extensions, without hacking the core code, which would certainly make life easier when implementing this.
You really think this is plausibly implementable as an extension? Tons of core code will have to be modified to take into account two possible types of redirects. That's aside from the fact that again, I see no reason at all to make two separate systems to handle what amounts to the same problem.
Simetrical skrev:
Tons of core code will have to be modified to take into account two possible types of redirects. That's aside from the fact that again, I see no reason at all to make two separate systems to handle what amounts to the same problem.
I think that the only reasonable approach would be to have one *technical* implementation (the existing one) of the individual redirects themselves.
That is, the existing redirect links table, including the corresponding page/revision/text-article is good enough.
One of the main points with the Aliases/Synonyms idea is how these redirects are created/maintained (on Edit-Save), and from this perspective (the user's POV) it is not exactly the same problem.
But introducing several implementations of the lower-level (technical) concept of the redirects themselves really should be avoided. It probably wouldn't be needed anyway, not even for a very sophisticated Alias/Synonym solution (as seen from the user's perspective).
Regards,
// Rolf Lampa
On Fri, 26 Oct 2007 14:28:30 +0200, Rolf Lampa wrote:
Simetrical skrev:
Tons of core code will have to be modified to take into account two possible types of redirects. That's aside from the fact that again, I see no reason at all to make two separate systems to handle what amounts to the same problem.
I think that the only reasonable approach would be to have one *technical* implementation (the existing one) of the individual redirects themselves.
That is, the existing redirect links table, including the corresponding page/revision/text-article is good enough.
One of the main points with the Aliases/Synonyms idea is how these redirects are created/maintained (on Edit-Save), and from this perspective (the user's POV) it is not exactly the same problem.
But introducing several implementations of the lower-level (technical) concept of the redirects themselves really should be avoided. It probably wouldn't be needed anyway, not even for a very sophisticated Alias/Synonym solution (as seen from the user's perspective).
Regards,
// Rolf Lampa
Yes, that's how I see it as well. It would likely need a few tweaks to the existing redirect system, etc. And an alias extension that works with that.
-Steve
Simetrical skrev:
Tons of core code will have to be modified to take into account two possible types of redirects. That's aside from the fact that again, I see no reason at all to make two separate systems to handle what amounts to the same problem.
Again, I fully agree, from the technical side of things.
At the lowest level of abstraction, that is, in the form of a UML class diagram one could more clearly point out what, really, the Alias/Synonym idea "is", technically speaking.
Given the following symbols for associations:
------->  : Navigable only one way
<>------  : "Has"
<.>-----  : "Owns" (filled diamond)
[class]   : Class (name)
== #1 - Existing Redirects ==
We could describe the existing Redirect concept, an article with the redirect clause: '#REDIRECT [[RedirectsTo]]' like so:
[ Page ]                      [ Page ]
[      ]<-1-RedirectsTo------ [      ]
This diagram shows a "single link", that is, a one way relation (we can disregard the internal 'redirect' table in the db for now), of a separate page (the redirect page) pointing at a target page.
It is important to note that, from a users perspective, the association is "navigable", that is, it goes only one way in that you cannot manage the link from inside the *target* page - which is not always very convenient for the users.
== #2 - Owned Aliases/Synonyms ==
If the association had been an Alias/Synonym instead, the first change we need to make in the diagram is that the association would be made "Owned" (managed) from the target page (but technically it would still remain being a "pure" redirect), like so:
[ Page ]                      [ Page ]
[      ]<.><-1-RedirectsTo--- [      ]
Of course it should be possible to have several Aliases/Synonyms, and thus the multiplicity of the relation should be changed from one (1) to many (*):
== #3 - Multiple Owned Aliases/Synonyms ==
[ Page ]                        [ Page ]
[      ]<.><-*-OwnedRedirects-- [      ]
In this way an article can have many Aliases/Synonyms (= Redirects) which are *Owned* (managed) by the target page. That would be an improvement of the existing concept. And there's no problem with also keeping the existing solution (#1) alongside the Owned list of Aliases/Synonyms. The final diagram, capturing both concepts, could look like this:
[ Page ]                        [ Page ]
[      ]<-1-RedirectsTo-------- [      ]
[      ]                        [ Page ]
[      ]<.><-*-OwnedRedirects-- [      ]
Meaning: A redirect can be a freestanding redirect (current solution) with a link "RedirectsTo---1->" to one other page. A redirect can also be owned by the target page, thus serving as a "listItem" in the OwnedRedirects-list.
The redirect itself could carry information about whether it is "owned" or not.
The class-association Redirect (= existing table) could hold the (additional) attribute IsOwned to distinguish which links can be manually edited (the current solution) and which links can be modified only from the page on which they're defined (perhaps a back link from the redirect page would be good).
[ Page ]-1-rd_from---------[ Redirect ]
[      ]                   [ IsOwned: ]
[      ]-1-rd_title--------[          ]
When attempts are made to create multiple similar Redirects, the system simply refuses to (auto)create duplicate entries in the Redirect table, in order to maintain "one redirect = one target".
This would be a significant difference/improvement for the users, but (probably) a rather small change technically, I think.
Regards,
// Rolf Lampa
On Thu, 25 Oct 2007 20:29:44 -0400, Simetrical wrote:
On 10/25/07, Steve Sanbeg ssanbeg@ask.com wrote:
__NOTOC__ is a magic word in the parser. Things like {{defaultsort}} and <ref> can be implemented as extensions, without hacking the core code, which would certainly make life easier when implementing this.
You really think this is plausibly implementable as an extension? Tons of core code will have to be modified to take into account two possible types of redirects. That's aside from the fact that again, I see no reason at all to make two separate systems to handle what amounts to the same problem.
At this point, I don't see why not. That's not to say that it won't require some kind of modifications to the core, but that I don't see the need to add a new ambiguity to the markup when it could be implemented with a consistent syntax.
Basically, I see the need for:
1) an article save/delete hook to extract and create/remove the auto-redirects
2) a hook to render the alias markup - presumably a noop, which would make our concerns about the | being misinterpreted in a parser function moot.
3) a way to access the aliases; either a hook when getting the titles or doing the search. Maybe also some way to disable transclusion of, or linking to, the alias.
I see redirects and aliases as solving distinct problems; a way to rename pages with minimal disruption, and a way to augment search. Sure, there's some overlap, and they'd share a lot of their internals, but that doesn't make them the same.
Most significantly, I don't see the advantage to template aliasing, and the worst case of potentially needing to parse millions of records to find (or determine nonexistence of) a template makes me uneasy. If aliases are called frequently in the data, that would lock the data more tightly into the software.
On 10/24/07, Steve Bennett stevagewp@gmail.com wrote:
a) If it's feasible (ie, is not computationally too expensive)
It looks so. Ultimately I'm not seeing it as much different from current redirects, implementationally.
b) How much work is required to implement it
Probably a reasonable amount.
c) If it was implemented, whether it would be enabled at Wikipedia
I don't see why not.
Instead of having redirects that point to a page, have the page itself specify aliases which can be used to find it. This is specified as a pattern, like a very cut-down regexp: #ALIASES Thomas[-]Fran[ç|c]ois Dalibard
The proposed syntax would be as follows (but is debatable):
Foo - matches Foo
[Foo] - matches Foo or blank.
[Foo|Moo] - matches Foo or Moo.
[Foo|Moo|] or [|Foo|Moo] - matches Foo or Moo or blank.
Foo[Moo - matches the literal string Foo[Moo
This would essentially be like regexes, but defined *without* the operation of iteration: only catenation and union are allowed. This is a large benefit because it means there are a finite number of possible patterns, and so they can be stored in enumerated form.
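To make the "enumerated form" idea concrete, here is a minimal Python sketch (not MediaWiki code; the function name `expand` and all details are assumptions) that expands a cut-down pattern with only catenation and union into the full set of titles it matches, normalizing whitespace as proposed:

```python
import itertools
import re

def expand(pattern):
    """Expand a cut-down alias pattern (no nesting, no repetition)
    into every concrete title it matches."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            close = pattern.find("]", i)
            if close == -1:               # unmatched '[' is a literal, e.g. Foo[Moo
                parts.append([pattern[i:]])
                break
            body = pattern[i + 1:close]
            # [Foo] means Foo or blank; [Foo|Moo] means exactly those alternatives
            alts = body.split("|") if "|" in body else [body, ""]
            parts.append(alts)
            i = close + 1
        else:
            nxt = pattern.find("[", i)
            if nxt == -1:
                nxt = len(pattern)
            parts.append([pattern[i:nxt]])
            i = nxt
    titles = ("".join(combo) for combo in itertools.product(*parts))
    # all whitespace is equivalent to a single space
    return sorted({re.sub(r"\s+", " ", t).strip() for t in titles})
```

For example, expanding the Melbourne pattern from later in the thread, `[City of ][Greater ]Melbourne[, Victoria| (Australia)]`, yields eight titles; since there is no * operator, the set is always finite.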
All whitespace is equivalent to a single space. So "Boo [Foo] [Moo] Woo" matches "Boo Woo", rather than "Boo<space><space><space>Woo" for instance.
Generally speaking I would like to see titles that differ only up to compression of whitespace to be considered identical. If this were the case, the searchable forms of all titles would be whitespace-normalized, and this point would be resolved automatically. Until then, I suggest that this aspect of it be brushed under the carpet for aliases as for anything.
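The whitespace-compression rule itself is tiny; a sketch of the "searchable form" idea (the helper name is hypothetical):

```python
import re

def normalize_title(title):
    """Collapse runs of whitespace to a single space so that titles
    differing only in spacing compare equal."""
    return re.sub(r"\s+", " ", title).strip()
```

With this applied to both stored titles and queries, "Boo [Foo] [Moo] Woo" with both options blank matches "Boo Woo" rather than "Boo&lt;space&gt;&lt;space&gt;&lt;space&gt;Woo", with no special casing in the alias code.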
- Search term matches one real page, some aliases: takes you to real page.
(Arguably gives you a "did you mean...?" banner, but not critical)
- Search term matches one alias, no real page: takes you to page.
- Search term matches several aliases, no real
page: either an automatically generated disambiguation page, or shows you search results with the matching aliases shown first.
I see. Possibly this is better than having the aliases be unique, yes.
Unresolved issues:
- Since pattern matching is prone to abuse, the total number of matching
aliases should be restricted in some way, perhaps to 10 or 20. The best way to handle an excessively broad query (eg, [A|b|c|d|e][A|b|c|d|e] etc) is left as an open question. Possibilities include silently failing, noisily failing (with error message in rendered text), a special page for bad aliases...
It can create exponential database rows in the length of the alias string, yes, so that needs to be dealt with -- if we're doing explicit storage, anyway. I think 20 is probably too low.
- The role of redirects once this system is in place. One possible
implementation would simply create and destroy redirects as required. In any case, they would still be needed for some licensing issues.
Why?
Possible implementation: Without knowing the MediaWiki DB schema at all, I speculated on a possible implementation that would be a good tradeoff between size and speed. Two new tables are needed:
AliasesRaw would contain a constantly updated list of the actual alias patterns used in articles. Each time an article is saved, this would possibly be updated. AliasesExpanded would contain expansions of these aliases, either fully or partially. So an expansion of #ALIASES [City of ][Greater ]Melbourne[, Victoria| (Australia)] to 5 characters would lead to three rows:
"City ", "of [Greater ]Melbourne[, Victoria| (Australia)]"
"Great", "er Melbourne[, Victoria| (Australia)]"
"Melbo", "urne[, Victoria| (Australia)]"
This means that if a user searches for "Greater Melbourne", then the search process would go something like:
- Look for an article called Greater Melbourne, GREATER MELBOURNE, greater
melbourne (as present) - assume this fails.
- Look up "Great" in the AliasesExpanded table. Now iterate over the
matching results, finding one that matches.
Obviously the number of characters stored in the expanded aliases could be tuned.
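To illustrate the tuning idea, here is a rough Python sketch of that two-step lookup: the first 5 characters of the query select candidate rows, and the unexpanded remainder of each pattern is checked against the rest of the query. The dict stands in for the AliasesExpanded table, and all names here are assumptions, not proposed schema:

```python
import re

PREFIX_LEN = 5

def matches(rest_pattern, rest_query):
    """Check the un-expanded remainder of an alias pattern against the
    remainder of the query, translating the cut-down syntax to a regex.
    Assumes no nesting in patterns."""
    def option(m):
        body = m.group(1)
        alts = body.split("|") if "|" in body else [body, ""]
        return "(?:" + "|".join(re.escape(a) for a in alts) + ")"
    regex = ""
    pos = 0
    for m in re.finditer(r"\[([^\[\]]*)\]", rest_pattern):
        regex += re.escape(rest_pattern[pos:m.start()]) + option(m)
        pos = m.end()
    regex += re.escape(rest_pattern[pos:])
    return re.fullmatch(regex, rest_query) is not None

# Stand-in for the AliasesExpanded table: prefix -> [(remainder, page), ...]
aliases_expanded = {
    "Great": [("er Melbourne[, Victoria| (Australia)]", "Melbourne")],
    "Melbo": [("urne[, Victoria| (Australia)]", "Melbourne")],
}

def lookup(query):
    prefix, rest = query[:PREFIX_LEN], query[PREFIX_LEN:]
    return [page for (remainder, page) in aliases_expanded.get(prefix, ())
            if matches(remainder, rest)]
```

The prefix keeps the candidate set small (and indexable) while the full-pattern check stays cheap, since only the handful of rows sharing the prefix ever get matched in detail.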
I don't understand this. Why don't you simply create an alias table that mirrors the redirect table, like
alias_to
alias_namespace
alias_title
and every time a set of aliases is created for an article, just add the appropriate rows to that table? Then some special-case logic would be added to appropriate classes and methods to deal with aliases, and in particular, any method of the form "create an object corresponding to the named article, following redirects" would take aliases into account. (Actually, you seem to have caught on to this point in your last post, written after I wrote that.)
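A sketch of that simpler design, using sqlite3 purely as a stand-in for MediaWiki's MySQL schema (only the alias_* column names come from the post; the table layout, function names, and the PRIMARY KEY enforcing "one alias = one target" are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE article_alias (
        alias_namespace INTEGER NOT NULL,
        alias_title     TEXT    NOT NULL,
        alias_to        TEXT    NOT NULL,
        PRIMARY KEY (alias_namespace, alias_title)
    )""")

def save_aliases(page, titles, ns=0):
    """On article save: replace the page's alias rows with the freshly
    expanded set (pattern expansion itself happens elsewhere)."""
    conn.execute("DELETE FROM article_alias WHERE alias_to = ?", (page,))
    conn.executemany(
        "INSERT OR IGNORE INTO article_alias VALUES (?, ?, ?)",
        [(ns, t, page) for t in titles])

def resolve(title, ns=0):
    """Exact-match, indexed lookup -- no table scan, unlike a LIKE query."""
    row = conn.execute(
        "SELECT alias_to FROM article_alias "
        "WHERE alias_namespace = ? AND alias_title = ?", (ns, title)).fetchone()
    return row[0] if row else None
```

Because the expanded titles are stored rather than the patterns, resolving a "Go" query is a single primary-key lookup, the same cost as resolving a normal redirect.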
Of course, that wouldn't be quite enough. There would be all sorts of things expecting particular behavior of redirects, and so this would create a fair amount of backwards incompatibility, and generally confuse things. Ideally I would like to see a proposal that merges redirects and aliases altogether: do we want them to have a corresponding page entry or not? They shouldn't be treated as distinct.
What we're looking for is a way to easily create and maintain redirects, not some totally new feature, and despite my suggestions above and below, I think that's how the problem should be posed. A special page to easily manage all redirects to a page, including to batch-create and -delete* them, is probably the best way to handle this. Grouping on this redirects page by category would be a good feature to have, for instance, and category management from it as well. But to start with, reversible batch creation and deletion is all that's needed.
*(Unprivileged users should indeed ideally be allowed to delete redirects in general if they have no substantial content, as currently they can during moves. However, history and easy reversibility needs to be built into this before it can be deployed, needless to say.)
On 10/24/07, Andrew Garrett andrew@epstone.net wrote:
No need for the complex setup you envisage. For mysql, at least, we could create a new table 'article_aliases', and "select aa_page from article_aliases where 'my_title' like aa_alias". Of course, we'd need to do some built-in, potentially expensive checking on the aliases as they are originally introduced, like checking if any other pages match the regex (if so, block the alias), and if the article title itself matches the regex (if not, block the alias).
And you'd have to scan the table every time you want to check if an alias exists for a given string. Probably not a great idea.
On 10/25/07, Simetrical Simetrical+wikilist@gmail.com wrote:
This would essentially be like regexes, but defined *without* the operation of iteration: only catenation and union are allowed. This is a large benefit because it means there are a finite number of possible patterns, and so they can be stored in enumerated form.
Yes, I'm undecided whether nesting (aka iteration) is a good idea or not. Quite possibly it's a good idea to force people to explicitly state all the variations they intend. If iteration/nesting is not allowed, then multiple #ALIASES statements *should* be allowed, imho, for readability.
All whitespace is equivalent to a single space. So "Boo [Foo]
[Moo] Woo" matches "Boo Woo", rather than "Boo<space><space><space>Woo" for instance.
Generally speaking I would like to see titles that differ only up to compression of whitespace to be considered identical. If this were the case, the searchable forms of all titles would be whitespace-normalized, and this point would be resolved automatically. Until then, I suggest that this aspect of it be brushed under the carpet for aliases as for anything.
I think that's what I was trying to say. :)
- Search term matches one real page, some aliases: takes you to real page.
(Arguably gives you a "did you mean...?" banner, but not critical)
- Search term matches one alias, no real page: takes you to page.
- Search term matches several aliases, no real
page: either an automatically generated disambiguation page, or shows
you
search results with the matching aliases shown first.
I see. Possibly this is better than having the aliases be unique, yes.
Yeah. Ultimately, it's helpful for the reader if they *can* search for "J Smith". Obviously they don't expect it to be unique, but if that's all they have to go on, it's better than nothing.
It can create exponential database rows in the length of the alias
string, yes, so that needs to be dealt with -- if we're doing explicit storage, anyway. I think 20 is probably too low.
The right number is probably easy to come up with if someone can decide how big the table can be. I just don't have a feel for whether 1 million, 10 million, 100 million rows is "too many".
- The role of redirects once this system is in place. One possible
implementation would simply create and destroy redirects as required. In
any
case, they would still be needed for some licensing issues.
Why?
Because when articles get merged, one is turned into a redirect with the history of all the edits that were made. If we kill that redirect, we lose that history, including attribution. Ergo, non-compliance with GFDL.
aliases into account. (Actually, you seem to have caught on to this
point in your last post, written after I wrote that.)
Heh, yeah. I don't do much DB programming these days.
Of course, that wouldn't be quite enough. There would be all sorts of
things expecting particular behavior of redirects, and so this would create a fair amount of backwards incompatibility, and generally confuse things. Ideally I would like to see a proposal that merges redirects and aliases altogether: do we want them to have a corresponding page entry or not? They shouldn't be treated as distinct.
That would be even better, but I wasn't that ambitious. Do you have any ideas? Even better would be something that redefines the concept of disambiguation, which is again, a huge amount of manpower to set up and maintain.
One problem that just occurred to me is what happens when one query matches two aliases *and* a disambiguation page. Every possible outcome looks bad:
- Just show the disambiguation page (with two missing entries)
- Show a list of aliased pages plus the disambiguation page (what, I have to choose whether I want a real page or a disambiguation page?)
- Attempt to jam the alias links somewhere in the disambiguation page (possibly duplicating actual links, or possibly requiring every disambiguation page to be updated with an <aliases> section).
Just like with the category/list dilemma, it doesn't seem possible to create a fully dynamic disambiguation page that will be "as good as" a hand-edited one. But long term, it would be a very valuable thing if we could come close.
What we're looking for is a way to easily create and maintain
redirects, not some totally new feature, and despite my suggestions above and below, I think that's how the problem should be posed. A special page to easily manage all redirects to a page, including to batch-create and -delete* them, is probably the best way to handle this. Grouping on this redirects page by category would be a good feature to have, for instance, and category management from it as well. But to start with, reversible batch creation and deletion is all that's needed.
Are you thinking in terms of a special GUI, or a wikitext language feature? Say you used the #ALIASES idea, but it constructed actual pages with #REDIRECT text. Those pages could be marked with an "automatically generated" flag, so they would be killed when the corresponding #ALIASES text was modified.
Now, however, you have a different problem with ambiguous redirects: the user adds an #ALIASES tag pointing at the current page, but the redirect already exists and points somewhere else. What happens?
*(Unprivileged users should indeed ideally be allowed to delete
redirects in general if they have no substantial content, as currently they can during moves. However, history and easy reversibility needs to be built into this before it can be deployed, needless to say.)
Steve
On 10/24/07, Steve Bennett stevagewp@gmail.com wrote:
Yes, I'm undecided whether nesting (aka iteration) is a good idea or not. Quite possibly it's a good idea to force people to explicitly state all the variations they intend. If iteration/nesting is not allowed, then multiple #ALIASES statements *should* be allowed, imho, for readability.
I was assuming nesting would be allowed (although it might quickly run into alias number limits, of course). Iteration is the term my formal languages book uses for the * operator, indicating "repeat the preceding item zero or more times". Next time I'll just say "the * operator" instead of trying to be fancy. :)
Anyway, we surely don't want the * operator, as I remarked, so we don't need full regular expressions, is the point.
Because when articles get merged, one is turned into a redirect with the history of all the edits that were made. If we kill that redirect, we lose that history, including attribution. Ergo, non-compliance with GFDL.
Of course, the correct way to fix this is to actually implement real merging. For now, why don't people just do a history merge, by deleting one page, moving another on top of it, and undeleting? Is that viewed as too confusing in terms of the history display?
That would be even better, but I wasn't that ambitious. Do you have any ideas? Even better would be something that redefines the concept of disambiguation, which is again, a huge amount of manpower to set up and maintain.
Well, I'm not considering disambiguations for now. They're conceptually related, but I'd view it as a different discussion. One thing at a time.
Are you thinking in terms of a special GUI, or a wikitext language feature?
As I said, a special page.
On 10/25/07, Simetrical Simetrical+wikilist@gmail.com wrote:
I was assuming nesting would be allowed (although it might quickly run into alias number limits, of course). Iteration is the term my formal languages book uses for the * operator, indicating "repeat the preceding item zero or more times". Next time I'll just say "the * operator" instead of trying to be fancy. :)
Heh, ok.
Of course, the correct way to fix this is to actually implement real
merging. For now, why don't people just do a history merge, by deleting one page, moving another on top of it, and undeleting? Is that viewed as too confusing in terms of the history display?
That requires admin rights, which most people don't have - whereas replacing a page with a redirect can be done by anyone. It's also presumably virtually impossible to undo.
As I said, a special page.
Perhaps another tab, "aliases"? We should get away from calling them "redirects" in any case.
Steve
_______________________________________________
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
On 10/24/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Of course, the correct way to fix this is to actually implement real merging. For now, why don't people just do a history merge, by deleting one page, moving another on top of it, and undeleting? Is that viewed as too confusing in terms of the history display?
Yes, especially if the topic of the article being "merged" is different than that of the article it's being merged to, and especially if anybody wants to "unmerge" it at a later date (in the face of new information about the narrower topic).
—C.W.
Simetrical & Steve Bennett quotes intermixed:
Foo[Moo - matches the literal string Foo[Moo
You don't need to. [ will never appear in a title, as [ is reserved for [[links]]. It also avoids the need for \ as an escape character (see the long thread about an escape char).
I'd prefer the syntax to be
[Foo] matches Foo or nothing (also match blank?)
{Foo|Bar} matches Foo or Bar.
[Foo|Bar] matches Foo, Bar or nothing (or blank?)
Which is an already existing syntax on program parameters. Making [] also match blank could be an acceptable extension.
a) If it's feasible (ie, is not computationally too expensive)
It looks so. Ultimately I'm not seeing it as much different from current redirects, implementationally.
b) How much work is required to implement it
Probably a reasonable amount.
c) If it was implemented, whether it would be enabled at Wikipedia
I don't see why not.
Instead of having redirects that point to a page, have the page itself specify aliases which can be used to find it. This is specified as a pattern, like a very cut-down regexp: #ALIASES Thomas[-]Fran[ç|c]ois Dalibard
The proposed syntax would be as follows (but is debatable):
Foo - matches Foo
[Foo] - matches Foo or blank.
[Foo|Moo] - matches Foo or Moo.
[Foo|Moo|] or [|Foo|Moo] - matches Foo or Moo or blank.
This would essentially be like regexes, but defined *without* the operation of iteration: only catenation and union are allowed. This is a large benefit because it means there are a finite number of possible patterns, and so they can be stored in enumerated form.
All whitespace is equivalent to a single space. So "Boo [Foo] [Moo] Woo" matches "Boo Woo", rather than "Boo<space><space><space>Woo" for instance.
Generally speaking I would like to see titles that differ only up to compression of whitespace to be considered identical. If this were the case, the searchable forms of all titles would be whitespace-normalized, and this point would be resolved automatically. Until then, I suggest that this aspect of it be brushed under the carpet for aliases as for anything.
Agree. Are multiple spaces really used? If not, there could be a flag to join them.
- Search term matches one real page, some aliases: takes you to real page.
(Arguably gives you a "did you mean...?" banner, but not critical)
- Search term matches one alias, no real page: takes you to page.
- Search term matches several aliases, no real
page: either an automatically generated disambiguation page, or shows you search results with the matching aliases shown first.
I see. Possibly this is better than having the aliases be unique, yes.
- Since pattern matching is prone to abuse, the total number of matching
aliases should be restricted in some way, perhaps to 10 or 20. The best way to handle an excessively broad query (eg, [A|b|c|d|e][A|b|c|d|e] etc) is left as an open question. Possibilities include silently failing, noisily failing (with error message in rendered text), a special page for bad aliases...
It can create exponential database rows in the length of the alias string, yes, so that needs to be dealt with -- if we're doing explicit storage, anyway. I think 20 is probably too low.
If we're going to store them enumerated, it's trivial to stop at $wgMaxNumberOfAliases.
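Stopping at a limit is easy because the number of expansions can be computed without enumerating them: it is simply the product of each bracket group's alternative count. A sketch (the limit constant is a stand-in for the proposed $wgMaxNumberOfAliases setting):

```python
def count_expansions(pattern):
    """Number of titles a cut-down alias pattern expands to,
    computed without expanding it."""
    total = 1
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            close = pattern.find("]", i)
            if close == -1:
                break                 # unmatched '[' is literal to end of string
            body = pattern[i + 1:close]
            # [Foo] means Foo or blank (2 options); otherwise count the alternatives
            total *= len(body.split("|")) if "|" in body else 2
            i = close + 1
        else:
            i += 1
    return total

MAX_ALIASES = 20  # stand-in for $wgMaxNumberOfAliases

def check(pattern):
    """Reject excessively broad patterns before any rows are written."""
    n = count_expansions(pattern)
    if n > MAX_ALIASES:
        raise ValueError(f"pattern expands to {n} aliases, limit is {MAX_ALIASES}")
    return n
```

For instance, the abusive example upthread, [A|b|c|d|e][A|b|c|d|e], counts to 25 and is rejected before a single row is generated, so the check is O(pattern length) regardless of how large the expansion would be.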
I don't understand this. Why don't you simply create an alias table that mirrors the redirect table, like
alias_to
alias_namespace
alias_title
What about adding a new field to the redirect table to differentiate between aliases and redirects?
Why not auto delete redirects?
Because when articles get merged, one is turned into a redirect with the history of all the edits that were made. If we kill that redirect, we lose that history, including attribution. Ergo, non-compliance with GFDL.
You could autodelete them if they were autocreated. The creator would be preserved somewhere on the history page. The redirects you talk about are "hard" redirects, created by placing a #REDIRECT statement on a page.
We could create with this system "soft redirects", which are shown as such when checking the redirect table, with fake history (autoredirect from Foo). If one is edited, the #REDIRECT is automatically preloaded and saving then makes it hard. Problem: Complicates the process for little benefit.
What we're looking for is a way to easily create and maintain redirects, not some totally new feature, and despite my suggestions above and below, I think that's how the problem should be posed. A special page to easily manage all redirects to a page, including to batch-create and -delete* them, is probably the best way to handle this. Grouping on this redirects page by category would be a good feature to have, for instance, and category management from it as well. But to start with, reversible batch creation and deletion is all that's needed.
A Special page changing the tables seems better than changing the page, which only has the benefit of the integrated history. A special page allows rejecting pages, showing (readable) lists of aliases, selecting based on language... But splitting from traditional redirects also means more work for integration.
"Platonides" Platonides@gmail.com wrote in message news:ffq2a2$gr$1@ger.gmane.org...
Simetrical & Steve Bennett quotes intermixed:
Foo[Moo - matches the literal string Foo[Moo
You don't need to. [ will never appear in a title, as [ is reserved for [[links]]. It also avoids the need for \ as an escape character (see the long thread about an escape char).
I'd prefer the syntax to be
[Foo] matches Foo or nothing (also match blank?)
{Foo|Bar} matches Foo or Bar.
[Foo|Bar] matches Foo, Bar or nothing (or blank?)
Why not use the following:
[Foo] matches Foo
[Foo|] matches Foo or nothing (or blank?)
[Foo|Bar] matches Foo or Bar.
[Foo|Bar|] matches Foo, Bar or nothing (or blank?)
Then there is only one syntax.
- Mark Clements (HappyDog)
Steve Bennett skrev:
I have an idea for an improvement to the system of redirects, by using pattern-based aliases. <...>
The problem: Many pages require a largeish number of redirects, to cope with differences in spelling, optional words, accented characters etc. It's a surprising amount of work to create and maintain these, when the value of each individual redirect is so low. For example, [[Thomas-François Dalibard]] might be spelt four ways, each requiring a redirect: Thomas-Francois Dalibart, Thomas François Dalibard, Thomas Francois Dalibard.
General solution:
Soundex.
Title variants, very often due to differences in spelling, are an old problem which was solved a long time ago, long before computers came about. The (old) solution was based on the fact that sound subsumes differences in spelling etc, hence "Soundex":
http://en.wikipedia.org/wiki/Soundex#History
Different spellings can be automatically translated to and from a Soundex scheme, on the fly. Soundex is "machine readable" as well as writeable, and can be computed by search engines instead of being stored (Redirects would still take care of "conceptual" differences in titles, which require human interaction).
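For reference, classic American Soundex is short enough to sketch here (a simplified Python version, assuming an ASCII alphabetic name; note that it maps both Robert and Rupert to R163, the collision pointed out downthread):

```python
def soundex(name):
    """Classic American Soundex: first letter plus three digits.
    Vowels separate codes; H and W are transparent."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit
    name = name.upper()
    first = name[0]
    result = []
    prev = codes.get(first)
    for ch in name[1:]:
        if ch in "HW":
            continue                  # transparent: equal codes across H/W merge
        code = codes.get(ch)
        if code is None:              # vowel (A, E, I, O, U, Y) separates codes
            prev = None
            continue
        if code != prev:
            result.append(code)
        prev = code
    return (first + "".join(result) + "000")[:4]
```

Since the code is computed from the name alone, a search engine can derive it at query time rather than storing anything, which is the "computed instead of stored" property mentioned above.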
Soundex is different and it would not exactly be creating a YARR system (Yet Another Redirect Redirect).
Regards,
// Rolf Lampa
Rolf Lampa wrote:
Soundex.
Title variants, very often due to differences in spelling, are an old problem which was solved a long time ago, long before computers came about. The (old) solution was based on the fact that sound subsumes differences in spelling etc, hence "Soundex":
http://en.wikipedia.org/wiki/Soundex#History
Different spellings can be automatically translated to and from a Soundex scheme, on the fly. Soundex is "machine readable" as well as writeable, and can be computed by search engines instead of being stored (Redirects would still take care of "conceptual" differences in titles, which require human interaction).
Soundex is different and it would not exactly be creating a YARR system (Yet Another Redirect Redirect).
Soundex is not exact, though; according to the example in the article, both Robert and Rupert reduce to the same Soundex string, but [[Robert Everett]] and [[Rupert Everett]] are two very different people. Also, Soundex makes no provision for handling accented letters vs. their plain equivalents, or a common foreign spelling or name for an article titled with the English word (deviled eggs and oeufs mimosa was an example given earlier in the thread).
I agree that perhaps a system of phonetically similar searches needs to be implemented for the searchbox or MediaWiki search in general, but there are more advanced algorithms than Soundex available, and that still doesn't address all the issues covered by the alias proposal.
--en.wp Darkwind
RLS wrote:
Rolf Lampa wrote:
Soundex.
<...>
(Redirects would still take care of "conceptual" differences in titles, which require human interaction).
Soundex is different and it would not exactly be creating a YARR system (Yet Another Redirect Redirect).
Soundex is not exact, though;
Exactly, not exact. Soundex would deal with phonetics, and redirects with synonyms.
I agree that perhaps a system of phonetically similar searches needs to be implemented for the searchbox or MediaWiki search in general, but there are more advanced algorithms than Soundex available,
Probably. I mention Soundex just to identify the basic idea.
and that still doesn't address all the issues covered by the alias proposal.
Not all of them, no. Redirects would still be essential.
What doesn't seem very meaningful, though, is to invent another term (Alias) for essentially the same concept as redirects, especially when the only difference may end up being how, who and when they are created. It is that part I mean by YARR: yet another redirect system on top of the existing redirect system.
If one wants to deal with aliases in an essentially different way than redirects, then I'd rather see something like "Synonyms", which would also be placed at the very start of a text, since it could be very useful both for human reading and for search indexing!
If Synonyms are explicitly tagged or marked up (in the article text, as opposed to Redirects, which are defined outside of the article), then they could even be picked up by the HTML parser, which could include them (the synonyms) in the keywords etc. Also, the internal indexer could return these among the results for an entirely different search word!
That would be powerful.
Regards,
On 10/26/07, Rolf Lampa rolf.lampa@rilnet.com wrote:
What doesn't seem very meaningful, though, is to invent another term (Alias) for essentially the same concept as redirects, especially when the only difference may end up being how, who and when they are created. It is that part I mean by YARR: yet another redirect system on top of the existing redirect system.
I think I have come to agree with this.
If one want to deal with aliases in essentially a different way than
redirects then I'd rather see something like "Synonyms", which also would be placed at the very start of a text, since it could be very useful both for human reading and for search indexing!
Right. Except I call "synonyms" "aliases" :) The point, I think, is to get away from manually having to set up "term X is a hard link to term Y", and towards something softer like "if the user searches for something like X, Y or Z, the software will help them find A, if that's what they're really looking for".
If Synonyms are explicitly tagged or marked up (in the article text,
as opposed to Redirects, which are defined outside of the article), then they could be regarded even by the HTML-parser, including them (the synonyms) in the keywords etc. Also the internal Indexer could return these among the results for an entirely different search word!
Yeah, that's a nice advantage I hadn't thought of. With a bit of luck we might even distinguish concepts like "This article is about X" and "This article has a section about Y".
Steve
On 10/26/07, Rolf Lampa rolf.lampa@rilnet.com wrote:
Soundex.
Title variants, very often due to differences in spelling, are an old problem which was solved a long time ago, long before computers came about. The (old) solution was based on the fact that sound abstracts over differences in spelling etc., hence "Soundex":
Heh. No. Soundex is awful. There might be something better by now, but not Soundex. Anything but that. In a previous job I briefly flirted with it to perform name matching but it (or the SQL Server implementation at least) is useless - it collapses any name down to 4 consonants, making Steve and Stove identical, for instance.
Anyway a Soundex-like tool might be useful to complement or improve searching, but the situation I'm describing here is when you know exactly what search terms you want to reach, but it's a lot of effort to create all those redirects.
Steve
Steve Bennett wrote:
On 10/26/07, Rolf Lampa rolf.lampa@rilnet.com wrote:
Soundex.
Title variants, very often due to differences in spelling, are an old problem which was solved a long time ago, long before computers came about. The (old) solution was based on the fact that sound abstracts over differences in spelling etc., hence "Soundex":
Heh. No. Soundex is awful. There might be something better by now,
Probably.
but not Soundex. Anything but that. In a previous job I briefly flirted with it to perform name matching but it (or the SQL Server implementation at least) is useless - it collapses any name down to 4 consonants, making Steve and Stove identical, for instance.
Soundex is of course not a replacement for either Redirects or Aliases. Apart from that, Soundex and its derivatives keep getting better and better.
Anyway a Soundex-like tool might be useful to complement or improve searching,
Correct. And this is why I think it's a bit unfortunate that the entire WP is saturated with phonetic redirects (which seem to be a big part of the redirects). The phonetic part should have been taken care of "at the root of the tree", that is, in the search mechanism.
but the situation I'm describing here is when you know exactly what search terms you want to reach, but it's a lot of effort to create all those redirects.
Aliases are at risk of only creating another YARR, since an Alias is just that: a Redirect. Moreover, when you say that you "know exactly" what terms you would like to be associated with an article, then that alias cannot, in principle, be automagically created; instead an alias will always require your explicit definition. Which IS a good idea, but technically that is already supported through the existing redirects.
However, there is a difference: the Aliases would, as opposed to the existing redirects, be defined inside the article instead of outside, and that opens up interesting perspectives, especially if changing the term to *Synonyms* instead of Aliases. I like the term "Synonyms" better because it implies also supporting human reading with more information (more than "aliases" does).
Synonyms should (for the same reasons as you have given for Aliases - and redirects) have their own unique markup. That would make machine reading possible, which means that the HTML parser could autogenerate keywords, and other text indexers could prepare for presenting search results based on these synonyms as well.
Therefore, in summary, I suggest Soundex (or modern derivatives thereof, perhaps as part of the search mechanism - entirely automated, though), and the concept of Synonyms to support a wider range of applications than Aliases implies (the term "alias" is rather abstract and not very meaningful to most people). With an appropriate implementation* of a Synonyms concept, parsers and both internal and external indexers could benefit from this information, while at the same time it would potentially increase the informational value for human reading as well, especially if displayed** near the top of the article.
Finally, Synonyms, and Soundex-like solutions for the search mechanism, are different enough compared to Redirects that they would not make for just YARR, as I pointed out in the previous post.
Regards,
// Rolf Lampa
* Synonyms could still be stored as Redirects, in the same table, perhaps with an extra state field identifying them as "InlineSynonyms".
** Perhaps special rendering for Synonyms, kind of like the Category rendering at the bottom of the pages, but near the top instead.
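As a sketch of how footnote * and the indexing idea might fit together: a hypothetical `#SYNONYMS` line (the markup name and the pipe-separated format are both invented here, by analogy with `#REDIRECT`) could be parsed out of the article text and fed both to the redirect table and to the HTML head as keywords:

```python
import re

def extract_synonyms(article_text):
    """Pull terms from a hypothetical '#SYNONYMS a | b | c' line
    in the article wikitext; returns [] if no such line exists."""
    m = re.search(r"^#SYNONYMS\s+(.+)$", article_text, re.M)
    return [t.strip() for t in m.group(1).split("|")] if m else []

def keywords_meta(title, article_text):
    """Emit the HTML keywords tag the parser could add to the page,
    combining the article title with its declared synonyms."""
    terms = [title] + extract_synonyms(article_text)
    return '<meta name="keywords" content="%s">' % ", ".join(terms)
```

Because the synonyms live inline, the same parse could also populate an "InlineSynonyms"-flagged row in the existing redirect table, as footnote * suggests, so nothing else in the software would need to change.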
On 10/26/07, Rolf Lampa rolf.lampa@rilnet.com wrote:
Correct. And this is why I think it's a bit unfortunate that the entire WP is saturated with phonetic redirects (which seem to be a big part of the redirects). The phonetic part should have been taken care of "at the root of the tree", that is, in the search mechanism.
You can't solve everything in search, at the moment, because links require actual destinations. Arguably, it would be better if the linking process went something like:
- Write "...[[Gary Smith]]..." in some wikitext.
- Press preview (or perhaps even save).
- All ambiguous links (anything that doesn't point to an actual, non-dab page) are highlighted somehow.
- For each link, choose from amongst a small number of possible real locations ("Did you mean ''Gary Smith (footballer), Scottish footballer'', or ''Gary Smith (Kittian footballer), footballer from St Kitts and Nevis''...?"). Links that point to redirects would be automatically updated but shown for approval.
- Save with perfect, non-ambiguous links.
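A rough sketch of the preview-time check described above, with a toy page table standing in for the real database (the `pages` dict and all the names are invented for illustration; candidate lookup is a naive prefix match, where the real thing would use the search index):

```python
import re

# Toy stand-in for the page table: title -> page kind.
pages = {
    "Gary Smith": "disambig",
    "Gary Smith (footballer)": "article",
    "Gary Smith (Kittian footballer)": "article",
}

def is_ambiguous(title):
    """A link is ambiguous if its target is missing or a dab page."""
    kind = pages.get(title)
    return kind is None or kind == "disambig"

def candidates(title):
    """Suggest real articles for an ambiguous target (prefix match)."""
    return [t for t, k in pages.items()
            if k == "article" and t.startswith(title)]

def check_links(wikitext):
    """Return {link target: suggestions} for every ambiguous
    [[...]] link found in the wikitext."""
    report = {}
    for target in re.findall(r"\[\[([^\]|]+)", wikitext):
        if is_ambiguous(target):
            report[target] = candidates(target)
    return report
```

At preview time the editor would be shown `report` and asked to pick a concrete target for each entry, so only unambiguous links ever get saved.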
You know, if we assume that wikitext links always point to the actual page, then that does make life easier, because we can use the same searching mechanism at the time when a user searches for a page ("Hey, I'm looking for info on Gary Smith"), as when an editor is linking to an article. That's a massive plus.
Then our problem basically boils down to: how do we implement the best search ever, by using human-edited hints?
However, there is a difference, the Aliases would, as opposed to the existing redirects, be defined inside of the article instead of outside, and that opens up interesting perspectives, especially if
Yes. Redirects are painful partially because they're external. Centralised management is good.
changing the term to *Synonyms* instead of Aliases. I like the term
"Synonyms" better because it implies supporting also human reading with more info (more than aliases does).
It strikes me as a bit too fuzzy, personally, and could lead to people adding a lot of terms that no one would actually be searching for. Nicknames, epithets, insults, etc. But it will do for now, if you like.
Therefore, in summary, I suggest Soundex (or modern derivations thereof, perhaps as part of the search mechanism - entirely automated though), and the concept of Synonyms to support a wider range of application than Aliases implies (the term "alias" is rather abstract and not very meaningful to most people). With an appropriate
You keep bringing up Soundex. I'm not sure how it's useful, other than as a last ditch resort. "Uh, we don't have a page called John Barrnes. We don't have a disambiguation page called John Barrnes. We don't even have any pages with synonyms of John Barrnes. Any chance you meant John Barnes?" Let's just leave Soundex out for the moment.
implementation* of a Synonyms concept, parsers and both internal and
external Indexers could benefit from this info while at the same time it would potentially increase the informational value for human reading as well, especially if displayed** near the top of the article.
Yes, I like the idea of an article showing you explicitly "This article covers the following topics".
Though that itself raises the question of how to handle a topic which is dealt with in several places, such as when you have a summary on one page and a detailed account on another. A smart dynamic disambiguation system could deal with this: search for "US History" and get "United States (summary)", "History of the United States (detailed)" plus links to portals etc.
Steve
Steve Bennett wrote:
On 10/26/07, Rolf Lampa rolf.lampa@rilnet.com wrote:
<snip>
Then our problem basically boils down to: how do we implement the best search ever, by using human-edited hints?
You actually hint at the solution (or at least at the problem domain) when saying "using human-edited hints". Humans are the problem, or the potential. Good results can only be achieved by motivating people to do a good job, and the task must be understood.
This is also the actual reason why I'm a bit hesitant about the term "Alias" (since that abstract word doesn't mean much to most people; it doesn't give them a "guideline", so to speak). We can mean the same thing with a word, but now I'm talking about the pedagogical aspect: motivating and explaining the (intended) concept to editors. In that context Synonyms means something, it means more; it kind of intuitively gives most people at least some idea about what kind of /relevant/ keywords to add to a Synonym list.
And there you go: the best possible human-edited hints require that people understand the basic idea, and when it seems meaningful to them they
# do it
# eagerly
# & properly
:)
Yes. Redirects are painful partially because they're external. Centralised management is good.
Perhaps both the traditional Redirects and the "Inline Aliases/Synonyms" are useful.
<snip>
Therefore, in summary, I suggest Soundex (or modern derivations thereof, perhaps as part of the search mechanism - entirely automated though), and the concept of Synonyms to support a wider range of application than Aliases implies (the term "alias" is rather abstract and not very meaningful to most people). With an appropriate
You keep bringing up Soundex. I'm not sure how it's useful, other than as a last ditch resort.
Well, yes, let's drop Soundex for now. Alias/Synonyms is more interesting and relevant to your original post (apart from hinting at how/why the existing Redirect concept is "saturated" with phonetic problems, which could have been/can be solved with less human effort).
But now for the Alias/Synonyms idea. =)
Regards,
// Rolf Lampa
On Fri, Oct 26, 2007 at 01:29:45PM +0200, Rolf Lampa wrote:
This is also the actual reason why I'm a bit hesitant about the term "Alias" (since that abstract word doesn't mean much to most people; it doesn't give them a "guideline", so to speak). We can mean the same thing with a word, but now I'm talking about the pedagogical aspect: motivating and explaining the (intended) concept to editors. In that context Synonyms means something, it means more; it kind of intuitively gives most people at least some idea about what kind of /relevant/ keywords to add to a Synonym list.
Though I don't think the word synonym accurately reflects what's going on here: synonyms are words which mean much the same as other words.
We really are talking about aliases: also known as -- different spellings (usually) for the same words. At least, in the pattern-driven examples I've seen mentioned upthread.
Cheers, -- jra
Jay R. Ashworth wrote:
On Fri, Oct 26, 2007 at 01:29:45PM +0200, Rolf Lampa wrote:
This is also the actual reason why I'm a bit hesitant about the term "Alias" (since that abstract word doesn't mean much to most people; it doesn't give them a "guideline", so to speak). We can mean the same thing with a word, but now I'm talking about the pedagogical aspect: motivating and explaining the (intended) concept to editors. In that context Synonyms means something, it means more; it kind of intuitively gives most people at least some idea about what kind of /relevant/ keywords to add to a Synonym list.
Though I don't think the word synonym accurate reflects what's going on here:
No, not exactly the same; hence my attempt to focus on the possible drawbacks of aliases, by trying to narrow in on something positive, something that hopefully wouldn't be as risky as "aliases" sounds to my ears.
Synonyms is a "narrower" concept than aliases, I really didn't imply that they were the same.
synonyms are words which mean much the same as other words.
Exactly, that's what Synonyms are good for. "Same as" is your word for it, which is also my point here: Synonyms implies not giving the impression that an article covers everything, or something that doesn't really comply with the title.
Instead a Synonym (usually) covers just the concept someone is looking for, although with other, but relevant, words. In other words: same meaning, same semantics.
We really are talking about aliases: also known as --
"Also known as" is a Synonym (in the context of an article title).
Other uses of a broad alias concept would NOT add value, which is my point.
different spellings (usually) for the same words.
Different spellings should not, in general, be dealt with manually (as is done now, using Redirects), since that can be automated.
At least, in the pattern-driven examples I've seen mentioned upthread.
Yes, more ideas exist about what kind of information to define as Aliases, but some of those ideas really aren't good at all, in that they'd intend to manually define what's already in the text - namely the text. That part, presenting keywords from the text, should be handled by smart indexers and stemmers. As usual.
The text is already there, and it's indexed by more or less smart search mechanisms; section headers are also existing information which can be used, or given rank, by indexers. In short, automation could provide just the kind of stuff which people would tend to define, manually, as an alias... if alias is meant to be perceived as a B R O A D concept. Which would be really bad.
Hence my suggestion to go for Synonyms instead. Synonyms is not a broad concept, which is a good thing. Further, it is graspable, and it would provide just that which machines cannot provide: relevant and non-ambiguous semantics.
And this is where the whole idea of aliases (if a too-broad interpretation is allowed) is at risk of becoming what I just said, a redundant YARR, simply because adding alias definitions would only saturate the articles further, overlapping what's already in:
1. the text (indexed, more or less cleverly, using stemming etc.)
2. section headers
3. categories
4. existing redirects
(5.) keywords (a new approach to how to interpret categories?)
It does not add value to put manual effort into widening or diversifying the meaning of a title or topic, or of already existing subtitles, with aliases, without very strict guidelines.
It does not, for example, add any value to start manually listing any existing subtopics of an article as a list of aliases, because you will find such info using any silly search engine, with the main title/keywords having the highest rank and the rest, the body text, lower ranks, and the two presented together even on the same search result page (that is, the semantic connection is already made). And so on.
Synonyms, on the other hand, add value, because people usually search for concepts or "problem domains", just the thing machines aren't very good at figuring out.
Categories/keywords are for broadening the title/topic, and that already exists. And the Redirects should (preferably) not deal with misspellings; redirects should instead be Synonyms. Fix that first. And then add to it this new idea, to let users add/manage Synonyms/Redirects directly from the target page.
For more detailed & fine-grained *manual* specification of article semantics there are other existing solutions, like the Semantic Web/Wiki.
Regards,
// Rolf Lampa
"Rolf Lampa" rolf.lampa@rilnet.com wrote in message news:fftcic$ebu$1@ger.gmane.org...
Jay R. Ashworth wrote:
On Fri, Oct 26, 2007 at 01:29:45PM +0200, Rolf Lampa wrote:
We really are talking about aliases: also known as --
"Also known as" is a Synonym (in the context of an article title).
Other uses of a broad alias concept would NOT add value, which is my point.
different spellings (usually) for the same words.
Different spellings should not, in general, be dealt with manually (like it is dealt with now, using Redirects), since it can be automated.
At least, in the pattern-driven examples I've seen mentioned upthread.
Yes, more ideas exist about what kind of information to define as Aliases, but some of those ideas really aren't good at all, in that they'd intend to manually define what's already in the text - namely the text. That part, presenting keywords from the text, should be handled by smart indexers and stemmers. As usual.
Let's be honest here. To users of Wikipedia, the name you choose will not make any difference. If there is a perceived problem and a feature exists that solves that problem, then it will be used to solve that problem - even if it is not what the feature was intended for.
For example, I very much doubt that redirect pages would exist if page transclusion had been invented first.
If our search indexing is good enough to deal with sound-alikes then great, but if not (as is currently the case), then redirects/synonyms/aliases/whatever you call it, will be used to make these redirects manually (as is currently the case, and as will continue to be the case if the new feature is added first, whichever name you choose).
- Mark Clements (HappyDog)
Hoi, "Sounds alike" is a feature that will prove exceedingly problematic. Have an Irishman, a Brit, an Australian, someone from Louisiana and a Canadian pronounce the same words, and then determine whether the words still sound alike. The notion that the written word in a language like English defines the pronunciation is wrong; at best it gives an approximation.
Also, when you program something like this, it will at best give you some success within one language. When you compare across languages, the pronunciation of the individual characters and combinations changes even more.
Thanks, GerardM
On 10/29/07, Mark Clements gmane@kennel17.co.uk wrote:
"Rolf Lampa" rolf.lampa@rilnet.com wrote in message news:fftcic$ebu$1@ger.gmane.org...
Jay R. Ashworth wrote:
On Fri, Oct 26, 2007 at 01:29:45PM +0200, Rolf Lampa wrote:
We really are talking about aliases: also known as --
"Also known as" is a Synonym (in the context of an article title).
Other uses of a broad alias concept would NOT add value, which is my point.
different spellings (usually) for the same words.
Different spellings should not, in general, be dealt with manually (like it is dealt with now, using Redirects), since it can be automated.
At least, in the pattern-driven examples I've seen mentioned upthread.
Yes, more ideas exist about what kind of information to define as Aliases, but some of those ideas really aren't good at all, in that they'd intend to manually define what's already in the text - namely the text. That part, presenting keywords from the text, should be handled by smart indexers and stemmers. As usual.
Let's be honest here. To users of Wikipedia, the name you choose will not make any difference. If there is a perceived problem and a feature exists that solves that problem, then it will be used to solve that problem - even if it is not what the feature was intended for.
For example, I very much doubt that redirect pages would exist if page transclusion had been invented first.
If our search indexing is good enough to deal with sound-alikes then great, but if not (as is currently the case), then redirects/synonyms/aliases/whatever you call it, will be used to make these redirects manually (as is currently the case, and as will continue to be the case if the new feature is added first, whichever name you choose).
- Mark Clements (HappyDog)
On 10/29/07, GerardM gerard.meijssen@gmail.com wrote:
"Sounds alike" is a feature that will prove exceedingly problematic. Have an Irishman, a Brit, an Australian, someone from Louisiana and a Canadian pronounce the same words and then determine if the words still sound alike. The notion that the written word in a language like English defines the pronunciation is wrong, at best it gives an approximation.
I think you're exaggerating. Certainly, there's a difference between "r-ful" and "r-less" speech, and the character of vowels changes, and there are even slight differences in which vowels are distinguished (most Americans pronounce "pa" and "paw" identically, while most Australians pronounce "poor" and "pour" identically), but these aren't major issues. However, sound-matching just isn't the solution here: we're not primarily concerned with helping people find an article if they don't know how to spell it, we're more concerned with getting people to the right article when either they can't spell it on their keyboard, or there are many ways it could be spelt, or even different words corresponding to the same article.
Does sound matching help at all in the "Nice" case? No. Not unless we really think someone is going to type "Neece" when looking for the French city. Does it help in the mulled wine/vin chaud/gluehwein case? No, again, except in exceptional instances like someone desperately typing "van show" or "glue vine". It may be of some mild benefit in improving a good search algorithm even further, but it's certainly not the essence of a solution here.
However, I confess to being a bit stuck in my brainstorming here. To summarise the chain of reasoning so far:
* I started this thread with a suggestion for a way to augment manual redirects with lightweight pattern-based aliases.
* Then we realised that redirects are required to make existing articles work, not just for searching.
* Having both redirects and another system would be kludgy and complex.
* So I propose attempting to do away with almost all redirects, by making disambiguation happen at save time, and thus only saving real links to real, unambiguous pages.
However, this major paradigm shift will cause a lot of upheaval, development effort etc. What are the benefits? Is it worth it? What problem are we trying to solve exactly?
Steve
On 10/29/07, Steve Bennett stevagewp@gmail.com wrote:
However, I confess to being a bit stuck in my brainstorming here. To summarise the chain of reasoning so far:
- I started this thread with a suggestion for a way to augment manual
redirects with lightweight pattern-based aliases.
- Then we realised that redirects are required to make existing articles
work, not just for searching.
- Having both redirects and another system would be kludgy and complex.
- So I propose attempting to do away with almost all redirects, by making
disambiguation happen at save time, and thus only saving real links to real, unambiguous pages.
However, this major paradigm shift will cause a lot of upheaval, development effort etc. What are the benefits? Is it worth it? What problem are we trying to solve exactly?
Well, that would solve this problem
http://en.wikipedia.org/wiki/Wikipedia:Disambiguation_pages_with_links
Which is very real. I think this is a good solution. We shouldn't think about this in terms of making disambiguation pages and redirects easier. We should get rid of the problem completely if possible.
Steve Bennett wrote:
- Then we realised that redirects are required to make existing articles
work, not just for searching.
- Having both redirects and another system would be kludgy and complex.
Sorry for this long post, but perhaps part of the difficulty so far is that different concepts, from "different levels of abstraction", make it difficult to clear things up. Short version: Redirects are primarily a *technical* concept, while you suggested a use case (which may, or may not, cope, or interfere, with the aforementioned technical concept). So to speak.
In other words: don't make, at an early stage, any assumptions at all about Redirects' to-be-or-not-to-be, until the use case (from the user's perspective) is very well defined, and, optionally, until the Redirect concept is very clearly understood as a very good and generic abstraction, useful as a (technical) design concept (in any complex system, really).
It's always good with attempts, and attitudes, which aim to get rid of a problem entirely, preferably before it even arises. But I think that the existing Redirects really serve a purpose, and it is a technical concept rather than a "use case" that should be exposed to the users. This may cause some confusion if not kept apart.
Therefore, don't confuse Redirects per se, with any good ideas about use cases solving real problems for the editors.
In reality Redirects handle, under the hood, real problems, as "freestanding" helpful "guides", which will not go away very easily. This is because some of the problems they solve stem from the way we humans function. Example: people tend not to find the best title for a topic directly. Then, when a better title is found later, someone may already have linked to the first/old title. Then it is convenient to be able to move a page and automatically leave a (freestanding) Redirect behind, so as not to break existing links to the old name/title.
This use of Redirects simply is a good solution. It solves a real practical problem, in real use case(s) (both when moving a page, and in continually catching the now "bad links" in the system). It's just a charmingly good solution.
The Redirect concept actually abstracts a very basic concept in any complex system. The point is that basic concepts imply "useful for many things", which is exactly what *designers* look for when they search for, or design, their tools for solving things. And in such a complex web of information as WP really is, don't remove redirects, because the Redirect is a tool, a good technical "meta solution" to more than one problem. But such (technical) design tools are not necessarily best exposed "naked" to the users (which is currently the case with the present Redirect solution).
So what I am saying is that whether, and how, Redirects are exposed to the users, is quite another matter. You usually do NOT expose abstract "meta solutions" to end users of a system.
I think that this thread is, at least in part, about different use cases useful for editors for this and that. Even if the thread is about a use case, the underlying technical implementation c-a-n be discussed at the same time. But the technical implementation must not be confused with the essence of the Use Case! Well-designed systems often have abstract "meta solutions" which solve, or aid in solving, many *different* problems, meaning that multiple Use Cases may have use for the same meta solutions under the hood. The Redirect concept is such an abstract concept, which can serve many purposes and Use Cases.
Therefore: Define the use case(s) (of this thread)
1. very well,
2. unambiguously (... :),
3. with a name/names which catches the core idea v.e.r.y clearly (that's the purpose of a name, really. A bad name/broad meaning = sloppy specification; a good name = stringent specification).
4. And, do not bother too much about the redirects early in this process.
The only "problem" with redirects may well turn out to be that they are exposed to the users (!), not that they aren't doing a good job under the hood, even for several different purposes.
After having a concrete (Use Case kind of) specification of the need and functionality of the feature, it will also become easier to pick, or develop, the best technical implementation for just that well defined feature.
Regards,
// Rolf Lampa
On 10/30/07, Rolf Lampa rolf.lampa@rilnet.com wrote:
Sorry for this long post, but perhaps part of the difficulty so far is
Not at all. I haven't had time to fully read it and reflect, but will do so in about 4 days.
that different concepts, from "different levels of abstraction" makes
it difficult to clear things up. Short version: Redirects is primarily
Right. The reason is that it's much easier to get consensus (or lack thereof) for a real, concrete change. OTOH it's easy to discuss stuff at an abstract level, but nothing ends up coming of it.
In other words, don't make, in an early stage, any assumptions at all about Redirects' be or not to be, until the use case (from the users perspective) is very well defined, and optionally, till the Redirect concept is very clearly understood, as a very good and generic abstraction useful as a (technical) design concept (in any complex system really).
I have no idea how to go about having a good discussion to solve difficult problems, or to come up with excellent solutions for only moderately difficult discussions. It seems to me that most of the development effort on MediaWiki has come from individuals who themselves have come up with, and then implemented, solutions. Can an email list really suffice for analysing a problem and coming up with a solution?
Therefore, don't confuse Redirects per se, with any good ideas about use cases solving real problems for the editors.
Sure. Redirects are just there already, it's good to look at them to see if they're solving our problems or can be improved.
In reality Redirects handle, under the hood, real problems, as
"freestanding" helpful "guides", which will not go away very easily.
Yep. But I originally started by proposing a mechanism to enhance their use as "guides". We've called this concept "aliases", "synonyms", "hints"... obviously we need something like that. But redirects are clearly not a perfect solution. And as you point out, they're trying to solve more than one problem at once.
This is because some of the problems it solves stem from the way we
humans function. Example: People tend to not always find the best title for a topic directly. Then, when a better title is found later, someone may already have linked to the first/old title. Then it is convenient to be able to move a page, and automatically leave a (freestanding) Redirect behind so as not to break existing links to the old name/title.
There's no obvious reason why the original text couldn't simply be updated, instead of a redirect being created. This often slowly happens anyway, as people or bots replace redirects with their targets.
This use of Redirects simply is a good solution. It solves a real
practical problem, in real use case(s) (both when moving a page, and in continually catching the now "bad links" in the system). It's just a charmingly good solution.
It's charmingly good in the way a 7-year-old who can play violin is charmingly good. I'd still rather listen to a real virtuoso. It was easy to implement, it's conceptually simple... but why stop there?
The Redirect concept actually abstracts a very basic concept in any
complex system. The point is that basic concepts imply "useful for many things", which is exactly what *designers* look for when they search for or design their tools for solving things. And in such a complex web of information as WP really is, don't remove redirects,
I don't think I've ever suggested "removing redirects". I've suggested replacing the actual physical implementation of redirects, but the mechanism should live on, I think.
because Redirect is a tool, a good technical "meta solution" to more
than one problem. But such (technical) design tools are not necessarily best exposed "naked" to the users (which currently is the case with the present Redirect solution).
Yep.
So what I am saying is that whether, and how, Redirects are exposed to
the users, is quite another matter. You usually do NOT expose abstract "meta solutions" to end users of a system.
MediaWiki will always operate this way though. It sort of has to. It's so general purpose it's pretty hard to really hide the implementation details with some sort of abstraction layer. Maybe individual sites could do that, but I don't see how the software could generally do that. Though I would like it if it tried ;)
discussed at the same time. But the technical implementation must not
be confused with the essence of the Use Case! Well designed systems
Yes, you have repeated this point several times. Can you offer some more concrete guidance? Believe it or not, I have studied requirements analysis, and I have some understanding of use cases. I'm skeptical about the chances of performing a genuine use case -> specification -> design -> implementation lifecycle on a product like this though. Especially by email. :) So perhaps if you have some good ideas, you could state them directly.
The only "problem" with redirects may well turn out to be that they are exposed to the users (!),
Don't forget that MediaWiki really doesn't distinguish between the obvious types of users: readers and editors. That's in the nature of the wiki. I don't think it's conceivable to truly "hide" information from the user.
After having a concrete (Use Case kind of) specification of the need and functionality of the feature, it will also become easier to pick, or develop, the best technical implementation for just that well defined feature.
This would be a nice process if we were starting from scratch. But realistically, if we come up with an awesome solution to the general problem that requires masses of new code, it's just not going to happen. We need to focus on what smallish changes we could implement that would cause great benefit. Especially given how many other people would need to get involved to make it happen.
Steve
Steve Bennett skrev:
On 10/30/07, Rolf Lampa rolf.lampa@rilnet.com wrote:
Sorry for this long post, but perhaps part of the difficulty so far is
Not at all. I haven't had time to fully read it and reflect, but will do so in about 4 days.
I think I made the mistake of reading several posts but answering only one of them...
I found your posts dealing more with the essentials (mainly focusing on the Use Case) and less with the implementation under the hood. It was other recent posts which, in my opinion, paid a little too much attention to the Redirects before the actual feature/Use Case was very clearly defined, or for that part, understood.
But perhaps my post gave another impression, that I meant that *you* had the focus in the wrong place; if so, sorry for that, I really didn't mean it that way.
In other words, I think your line of thought is interesting, and it may well end up in something useful for the users, even without too much work under the hood (which is also why I wanted to point out some benefits of the existing generic Redirect concept, especially to those who didn't already realize them).
And yes, if my intention had been to criticize your basic idea, then it definitely would be proper time for me to give better suggestions! =)
Best Regards,
// Rolf Lampa
"GerardM" gerard.meijssen@gmail.com wrote in message news:41a006820710290311o6a7fde6fj94f425e30fd6bec5@mail.gmail.com...
On 10/29/07, Mark Clements gmane@kennel17.co.uk
wrote: "Rolf Lampa" rolf.lampa@rilnet.com wrote in message news:fftcic$ebu$1@ger.gmane.org...
Jay R. Ashworth skrev:
Yes, more ideas exist about what kind of information to define as Aliases, but some of those ideas really aren't good ideas at all, in that they intend to manually define what's already in the text - namely the text itself. That part, presenting keywords from the text, should be handled by smart indexers and stemmers. As usual.
Let's be honest here. To users of Wikipedia, the name you choose will not make any difference. If there is a perceived problem and a feature exists that solves that problem, then it will be used to solve that problem - even if it is not what the feature was intended for.
For example, I very much doubt that redirect pages would exist if page transclusion had been invented first.
If our search indexing is good enough to deal with sound-alikes then great, but if not (as is currently the case), then redirects/synonyms/aliases/whatever you call them will be used to make these redirects manually (as is currently the case, and as will continue to be the case if the new feature is added first, whichever name you choose).
"Sounds alike" is a feature that will prove exceedingly problematic. [rest of comment snipped]
Not really relevant to my point at all. I was saying that it is incorrect to assume that the name you choose will make any difference to the way people use the tool. If it can act like a hammer, and there is no hammer, then it will be used like a hammer.
- Mark Clements (HappyDog)
Steve Bennett wrote:
- Write "...[[Gary Smith]]..." in some wikitext.
- Press preview (or perhaps even save)
- All ambiguous links (anything that doesn't point to an actual, non-dab
page) are highlighted somehow
See bugs 4709 and 8339
Also, WYSIWYG would be helpful for it...
On Fri, Oct 26, 2007 at 10:59:53AM +0200, Rolf Lampa wrote:
Soundex is of course not a replacement for either Redirects or Aliases. Apart from that, Soundex, or its derivations, is getting better and better.
If you're working in English.
I don't believe Soundex per se (and Soundex is someone's trademark, I *think*) works in other character-based languages.
And of course...
Cheers, -- jr 'what about zh?' a
Jay R. Ashworth skrev:
On Fri, Oct 26, 2007 at 10:59:53AM +0200, Rolf Lampa wrote:
Soundex is of course not a replacement for either Redirects or Aliases. Apart from that, Soundex, or its derivations, is getting better and better.
If you're working in English.
I don't believe Soundex per se (and Soundex is someone's trademark, I *think*) works in other character-based languages.
I don't know any details about Soundex per se, but as with any technology, software dealing with phonetics has only a limited range of application. The value, though, is when a technology can help users do away with trivia, so that more (manual) time can be spent on value, the semantics.
Oh, but you already knew that. =)
Regards,
// Rolf Lampa
On 26/10/2007, Steve Bennett stevagewp@gmail.com wrote:
On 10/26/07, Rolf Lampa rolf.lampa@rilnet.com wrote:
Soundex.
Title variants, very often due to differences in spelling, are an old problem which was solved a long time ago, long before computers came about. The (old) solution was based on the fact that the sound stays the same across differences in spelling etc., hence "Soundex":
Heh. No. Soundex is awful. There might be something better by now, but not Soundex. Anything but that. In a previous job I briefly flirted with it to perform name matching but it (or the SQL Server implementation at least) is useless - it collapses any name down to 4 consonants, making Steve and Stove identical, for instance.
Anyway a Soundex-like tool might be useful to complement or improve searching, but the situation I'm describing here is when you know exactly what search terms you want to reach, but it's a lot of effort to create all those redirects.
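The "Steve"/"Stove" collision is easy to reproduce. Here is a compact sketch of classic American Soundex in Python, just for illustration (it assumes alphabetic input and makes no claims about the SQL Server implementation mentioned above):

```python
def soundex(name: str) -> str:
    """Classic American Soundex: keep the first letter, code the remaining
    consonants as digits, drop vowels, collapse adjacent duplicate codes
    (h and w are 'transparent' to the collapsing), then pad/truncate to 4."""
    codes = {c: d for cs, d in
             [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6")]
             for c in cs}
    name = name.lower()
    out = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        d = codes.get(ch, "")
        if d and d != prev:
            out += d
        if ch not in "hw":        # h/w don't reset the duplicate check
            prev = d
    return (out + "000")[:4]
```

Run against "Steve" and "Stove" it yields S310 for both, which is exactly the collapse complained about above: once the vowels go, the two names are indistinguishable.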
There's been a better alternative to Soundex for many years called Metaphone. I think there's even several variants of it these days.
I did some tests with Soundex or Metaphone when I was developing my DidYouMean extension. It's not too hard to use a different normalization algorithm. I also tried anagrams and textonyms.
Andrew Dunbar (hippietrail)
Steve

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Andrew Dunbar wrote:
There's been a better alternative to Soundex for many years called Metaphone. I think there's even several variants of it these days.
I did some tests with Soundex or Metaphone when I was developing my DidYouMean extension.
Thanks for the hint!
Regards,
// Rolf Lampa
I haven't read the entire thread but it seems the extension I was working on for Wiktionary would be relevant here: http://www.mediawiki.org/wiki/Extension:DidYouMean
It normalizes article names and traps page creations, deletions, and moves to maintain a database table of normalized titles.
At every page view and search request this database is queried (unless already cached) and a list of similar titles is suggested to the user.
The English Wiktionary already has a way (using templates) to suggest similar article titles. DidYouMean combines this hand-edited list with its generated list and displays them in the manner expected on Wiktionary.
I had already considered adding a subset of pattern matching to the normalization for finding kinds of similar titles that normalization alone wouldn't find. This would be essential for a Wikipedia solution.
Currently DidYouMean normalizes accented characters to unaccented characters, strips Hebrew and Arabic vowels, normalizes Japanese fullwidth and halfwidth characters to normal width, etc. It also strips spaces, hyphens, apostrophes, periods etc.
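That kind of folding can be sketched in a few lines of Python. This is not the extension's actual code, and `normalize_title` is a hypothetical name; only the Latin-script steps (accent stripping, case folding, punctuation removal) are shown:

```python
import unicodedata

def normalize_title(title: str) -> str:
    """Fold a page title to a coarse lookup key: decompose accented
    characters, drop the combining marks, fold case, and strip spaces,
    hyphens, apostrophes and periods."""
    # NFKD separates base characters from their combining accents
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(c for c in decomposed
                         if not unicodedata.combining(c))
    return "".join(c for c in no_accents.casefold() if c not in " -'.")
```

With this, "Thomas-François Dalibard" and "Thomas Francois Dalibard" fold to the same key, so one stored row in a normalized-titles table would cover all four spellings from the example at the top of the thread.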
Obviously the matching heuristics for Wikipedia would be different. Possibly including word stemming and stoplists but possibly also hand-coded rules in a special page.
It might even be possible to automate page disambiguation to some degree using these methods.
Andrew Dunbar (hippietrail)
On 10/26/07, Andrew Dunbar hippytrail@gmail.com wrote:
I haven't read the entire thread but it seems the extension I was working on for Wiktionary would be relevant here: http://www.mediawiki.org/wiki/Extension:DidYouMean
This looks promising, and is yet another way to solve the general problem
we're slowly formulating. Do you have an installed working copy somewhere we can look at?
Steve
On 26/10/2007, Steve Bennett stevagewp@gmail.com wrote:
On 10/26/07, Andrew Dunbar hippytrail@gmail.com wrote:
I haven't read the entire thread but it seems the extension I was working on for Wiktionary would be relevant here: http://www.mediawiki.org/wiki/Extension:DidYouMean
This looks promising, and is yet another way to solve the general problem
we're slowly formulating. Do you have an installed working copy somewhere we can look at?
Yes: http://wiktionarydev.leuksman.com
Andrew Dunbar (hippietrail)
Steve
On 10/26/07, Andrew Dunbar hippytrail@gmail.com wrote:
Um, example page? I can't see anything...what is supposed to happen? What am I missing?
Steve
On 26/10/2007, Steve Bennett stevagewp@gmail.com wrote:
On 10/26/07, Andrew Dunbar hippytrail@gmail.com wrote:
Um, example page? I can't see anything...what is supposed to happen? What am I missing?
Make two pages that only differ by capitalisation, accents, spacing, hyphenation, apostrophes etc. You will see that both pages have a "See also" link at the top. Add more variations and you will see they all become interlinked... It's a wiki - you can edit it. And it's a test wiki so don't worry about creating meaningless pages - that's just what it's for (-:
Andrew Dunbar (hippietrail)
Steve
On 10/26/07, Andrew Dunbar hippytrail@gmail.com wrote:
Make two pages that only differ by capitalisation, accents, spacing, hyphenation, apostrophes etc. You will see that both pages have a "See also" link at the top. Add more variations and you will see they all become interlinked... It's a wiki - you can edit it. And it's a test wiki so don't worry about creating meaningless pages - that's just what it's for (-:
Ok, I see now. Cool!
However, I think this is only a half solution without disambiguation text. If we have some magic disambiguating text for each page, then the following things can be done automatically:
- Did you mean/see also/"For X, see Y" at the top of pages.
- Dynamically constructed disambiguation pages (though not quite as beautiful as present for big ones, because we wouldn't have a way of grouping thematically)
- Better search results
- Possibly better disambiguating on save, as I described earlier.
- Possibly other great new ideas.
The idea occurs that if we could massively improve search, we might not need hand-maintained disambiguation pages at all. Or at least, only for special cases.
I'm seriously liking the idea of disambiguating at page-save time. Links that point somewhere ambiguous should not be shown blue: maybe they should have a wiggly underline or something.
Ideally, also we would not have pages where the title is ambiguous but it's a real page (eg, [[Nice]]). Any link to [[Nice]] is inherently ambiguous: did the person really intend to link to the French city, or did they actually mean [[Nice (programming language)]]?
This is really a powerful solution: if every link is guaranteed to point to the right place, then we basically don't need redirects. Instead, we just need search hints. And if they're just hints, they can overlap, as I originally described. But the big bonus is that the different query entry points ("Go" button, link, typing URL) all behave the same: attempt to look up the page, if not, search using hints.
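That unified lookup could be sketched roughly like this, assuming hypothetical in-memory tables (the real thing would of course query the database); Python is used only for illustration:

```python
# Hypothetical page set and hint table; names are illustrative only.
PAGES = {"Alfred Deakin", "Nice", "Nice (programming language)"}
HINTS = {
    "alfred deakin": ["Alfred Deakin"],
    "nice": ["Nice", "Nice (programming language)"],
}

def resolve(term):
    """Mimic the proposed 'Go' behaviour: an exact title wins outright,
    a unique hint redirects silently, several hints produce a generated
    disambiguation, and no match falls through to search results."""
    if term in PAGES:
        return ("page", term)
    matches = [t for t in HINTS.get(term.casefold(), []) if t in PAGES]
    if len(matches) == 1:
        return ("page", matches[0])
    if matches:
        return ("disambiguate", matches)
    return ("search", term)
```

The point is that links, typed URLs and the "Go" button would all call the same `resolve` step, so "alfred deakin" behaves identically whether you search for it or link to it.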
The question is, is "no ambiguous links" achievable? Some issues I foresee:
- Obviously it will take a while to find and remove all the ambiguous links (and to train people)
- Currently we consider it acceptable to deliberately link to a redirect, eg for a subject which is currently part of another article, but which should one day be split off. We would need a way to indicate this desire.
- We would need a way to indicate that a link is a deliberate link to [[Nice]], rather than any of the homonyms. Any ideas for syntax?
That done, the three major changes would be:
- All links to redirects would be replaced by the target (some bots do this already)
- Links to pages that are invalid simple link targets (ie, dab pages and [[Nice]] pages) would be shown as "in need of attention"
- At save time, a reminder that there exist pages to be dabbed.
Steve
Steve Bennett wrote:
I'm seriously liking the idea of disambiguating at page-save time. Links that point somewhere ambiguous should not be shown blue: maybe they should have a wiggly underline or something.
Some work on this is long overdue. See http://en.wikipedia.org/wiki/Wikipedia:Disambiguation_pages_with_links for an idea of how bad the current state is. If you include pages like "Nice", as Steve Bennett mentioned, it gets much worse.
Soo Reams
Andrew Dunbar skrev:
It might even be possible to automate page disambiguation to some degree using these methods.
One could view a disambiguation page as a form of "HashListView" for humans.
A disambiguation page nearly corresponds to a hash bucket in hash lists, a slot in which you collect items which end up with the same ("conflicting") hash key generated by some given hash algorithm. And in this slot/bucket you perform your final, more fine-grained search/selection. The concept is generic and just as useful for both humans and computers.
Perhaps one could test how far one can get with automation of disambiguation pages by combining the existing pages with a handcrafted section plus an additional automagically created section below?
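The hash-bucket view is easy to sketch: group all titles by a normalization key and treat any bucket with more than one member as a candidate for an auto-generated disambiguation section (a rough illustration, not a proposed implementation; `str.casefold` stands in for whatever normalizer is chosen):

```python
from collections import defaultdict

def bucket_titles(titles, key):
    """Group titles by a normalization key (the 'hash') and keep only
    the buckets with collisions - the disambiguation candidates."""
    buckets = defaultdict(list)
    for title in titles:
        buckets[key(title)].append(title)
    return {k: v for k, v in buckets.items() if len(v) > 1}

collisions = bucket_titles(
    ["Nice", "NICE", "Nice (programming language)", "Alfred Deakin"],
    key=str.casefold)
```

Here only "Nice" and "NICE" collide; a coarser key (stripping parentheticals, accents, stopwords) would pull more titles into each bucket.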
Regards,
// Rolf Lampa
Actually I think this whole proposal violates the spirit of "don't fix redirects that aren't broken".
Regex searches? Well that's an interesting thought, but only if you mean searching for pages whose text matches a regex entered by the user, but the other way around... what... o.O
—C.W.
On 10/30/07, Charlotte Webb charlottethewebb@gmail.com wrote:
Actually I think this whole proposal violates the spirit of "don't fix redirects that aren't broken".
Redirects take a lot of non-creative, mind-numbing work to set up and
maintain. And there are still massive gaps that redirects don't cover. From that point of view, they're broken.
But anyway, my proposal seems to boil down to: "Only use redirects to find stuff, not to link to."
What would happen if we made redirects show up as red links? If you clicked the red link, you would be taken to the target of the redirect, but it would be sufficiently annoying that you would probably fix the original article. Again, we'd need some way to mark those few deliberate redirect links.
Steve
On 10/29/07, Steve Bennett stevagewp@gmail.com wrote:
But anyway, my proposal seems to boil down to: "Only use redirects to find stuff, not to link to."
What would this help, again?
On 10/30/07, Simetrical Simetrical+wikilist@gmail.com wrote:
On 10/29/07, Steve Bennett stevagewp@gmail.com wrote:
But anyway, my proposal seems to boil down to: "Only use redirects to find stuff, not to link to."
What would this help, again?
That was my question about 2 posts ago. The main benefits I see are these:
* Hard redirects can then be replaced by a lighter, dynamic, pattern-based system like I originally proposed, without breaking anything.
* We can make linking to something work the same way as searching* for it. Currently doing a search* for "alfred deakin" magically finds its way to "Alfred Deakin", but linking to it makes a red link. Linking to "harry potter" is ok though because there's an actual redirect. Searching* for the programming language under "Nice" lets you know instantly that you've made a mistake, but linking to it fails silently. All of this is easily fixed if the software helps/forces you to get your links right at the time you save them, and it can use the same rules as searching*.
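For reference, the pattern syntax from the original proposal needs very little machinery to expand into concrete titles. A rough Python sketch, assuming the cut-down syntax from the top of the thread (no nesting or escaping; the example pattern uses an explicit space alternative `[-| ]`, since a bare `[-]` expanding to blank would fuse the two words):

```python
import itertools
import re

def expand_alias(pattern):
    """Expand the proposed cut-down alias syntax into concrete titles:
    [Foo] matches Foo or blank; [Foo|Moo] matches Foo or Moo; a leading
    or trailing | inside brackets also allows blank."""
    # odd-indexed tokens are bracket contents, even-indexed are literals
    tokens = re.split(r"\[([^\]]*)\]", pattern)
    choices = []
    for i, tok in enumerate(tokens):
        if i % 2 == 0:
            choices.append([tok])
        else:
            alts = tok.split("|")
            if "|" not in tok:
                alts.append("")   # a lone [Foo] means Foo or blank
            choices.append(alts)
    titles = set()
    for combo in itertools.product(*choices):
        # all whitespace is equivalent to a single space
        titles.add(" ".join("".join(combo).split()))
    return titles
```

One alias line like `Thomas[-| ]Fran[ç|c]ois Dalibard` then covers the four spellings that today need four hand-made redirects; whether the expansion happens at save time or the patterns are matched at query time is exactly the implementation question this thread keeps circling.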
Other benefits?
Steve

*Searching - "typing something in the search box then pressing 'Go'."