On 10/24/07, Steve Bennett stevagewp@gmail.com wrote:
a) If it's feasible (ie, is not computationally too expensive)
It looks so. Ultimately I'm not seeing it as much different from current redirects, implementationally.
b) How much work is required to implement it
Probably a reasonable amount.
c) If it was implemented, whether it would be enabled at Wikipedia
I don't see why not.
Instead of having redirects that point to a page, have the page itself specify aliases which can be used to find it. This is specified as a pattern, like a very cut-down regexp: #ALIASES Thomas[-]Fran[ç|c]ois Dalibard
The proposed syntax would be as follows (but is debatable):
Foo - matches Foo
[Foo] - matches Foo or blank.
[Foo|Moo] - matches Foo or Moo.
[Foo|Moo|] or [|Foo|Moo] - matches Foo or Moo or blank.
Foo[Moo - matches the literal string Foo[Moo
This would essentially be like regexes, but defined *without* the operation of iteration: only catenation and union are allowed. This is a large benefit because it means there are a finite number of possible patterns, and so they can be stored in enumerated form.
All whitespace is equivalent to a single space. So "Boo [Foo] [Moo] Woo" matches "Boo Woo", rather than "Boo<space><space><space>Woo" for instance.
Generally speaking, I would like to see titles that differ only in whitespace compression treated as identical. If that were the case, the searchable forms of all titles would be whitespace-normalized, and this point would be resolved automatically. Until then, I suggest that this aspect of it be brushed under the carpet for aliases as for anything else.
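For concreteness, here is a minimal sketch (Python, with invented function names; this is not MediaWiki code) of how such a pattern could be expanded into its finite set of matching titles, including the single-space whitespace normalization just described:

```python
import itertools
import re


def parse_alias(pattern):
    """Split an alias pattern into segments: literals and option groups."""
    segments = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[" and (close := pattern.find("]", i)) != -1:
            options = pattern[i + 1:close].split("|")
            if len(options) == 1:
                options.append("")  # [Foo] matches Foo or blank
            segments.append(options)
            i = close + 1
        else:
            # Literal text runs up to the next "[" (an unmatched "[" is literal).
            nxt = pattern.find("[", i + 1)
            if nxt == -1:
                nxt = len(pattern)
            segments.append([pattern[i:nxt]])
            i = nxt
    return segments


def expand_alias(pattern):
    """Enumerate every title the pattern matches, whitespace-normalized."""
    titles = set()
    for combo in itertools.product(*parse_alias(pattern)):
        # Collapse whitespace runs so "Boo   Woo" and "Boo Woo" coincide.
        titles.add(re.sub(r"\s+", " ", "".join(combo)).strip())
    return titles
```

Because there is no iteration, `itertools.product` over the option groups enumerates the complete, finite set of matching titles.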
- Search term matches one real page, some aliases: takes you to real page.
(Arguably gives you a "did you mean...?" banner, but not critical)
- Search term matches one alias, no real page: takes you to the aliased page.
- Search term matches several aliases, no real
page: either an automatically generated disambiguation page, or shows you search results with the matching aliases shown first.
I see. Possibly this is better than having the aliases be unique, yes.
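The resolution rules above could be sketched roughly like this (Python, with invented names; the real logic would live somewhere in the search path):

```python
def resolve_search(term, real_page, alias_pages):
    """Decide where a search term should land, per the rules above.

    real_page: the page whose actual title matches the term, or None.
    alias_pages: pages with an alias matching the term.
    """
    if real_page is not None:
        # A real title wins over aliases (arguably with a "did you mean...?" banner).
        return ("page", real_page)
    if len(alias_pages) == 1:
        return ("page", alias_pages[0])
    if len(alias_pages) > 1:
        # Several aliases, no real page: auto-disambiguate or rank search results.
        return ("disambiguate", sorted(alias_pages))
    return ("search", term)
```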
Unresolved issues:
- Since pattern matching is prone to abuse, the total number of matching
aliases should be restricted in some way, perhaps to 10 or 20. The best way to handle an excessively broad pattern (e.g., [A|b|c|d|e][A|b|c|d|e] and so on) is left as an open question. Possibilities include silently failing, noisily failing (with an error message in the rendered text), a special page for bad aliases...
It can create a number of database rows exponential in the length of the alias string, yes, so that needs to be dealt with -- if we're doing explicit storage, anyway. I think 20 is probably too low.
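Since iteration is excluded, the blow-up can be bounded before any rows are written: the number of matching titles is just the product of the option-group sizes (an upper bound if options repeat). A sketch in the same vein as before, with invented names:

```python
from math import prod


def expansion_count(pattern):
    """Count the titles an alias pattern would match, without expanding it."""
    sizes = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[" and (close := pattern.find("]", i)) != -1:
            options = pattern[i + 1:close].split("|")
            # [Foo] means Foo-or-blank, so a lone option still gives two choices.
            sizes.append(max(len(options), 2))
            i = close + 1
        else:
            i += 1  # literal characters (including an unmatched "[") multiply by 1
    return prod(sizes)
```

A save could then reject any #ALIASES line whose count exceeds whatever cap is chosen, before touching the database at all.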
- The role of redirects once this system is in place. One possible
implementation would simply create and destroy redirects as required. In any case, they would still be needed for some licensing issues.
Why?
Possible implementation: Without knowing the MediaWiki DB schema at all, I speculated on a possible implementation that would be a good tradeoff between size and speed. Two new tables are needed:
AliasesRaw would contain a constantly updated list of the actual alias patterns used in articles. Each time an article is saved, this would possibly be updated. AliasesExpanded would contain expansions of these aliases, either fully or partially. So an expansion of #ALIASES [City of ][Greater ]Melbourne[, Victoria| (Australia)] to 5 characters would lead to three rows:
"City ", "of [Greater ]Melbourne[, Victoria| (Australia)]"
"Great", "er Melbourne[, Victoria| (Australia)]"
"Melbo", "urne[, Victoria| (Australia)]"
This means that if a user searches for "Greater Melbourne", then the search process would go something like:
- Look for an article called Greater Melbourne, GREATER MELBOURNE, greater melbourne (as at present) - assume this fails.
- Look up "Great" in the AliasesExpanded table. Now iterate over the
matching results, finding one that matches.
Obviously the number of characters stored in the expanded aliases could be tuned.
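The partial-expansion step described above could look roughly like this (a Python sketch with invented names): expand option groups only until the first 5 characters are fixed, and store each (prefix, remainder-pattern) pair as an AliasesExpanded row.

```python
def prefix_rows(pattern, k=5):
    """Partially expand an alias pattern until its first k characters are fixed,
    yielding the (prefix, remainder-pattern) rows of the AliasesExpanded sketch."""
    rows = set()
    stack = [("", pattern)]
    while stack:
        prefix, rest = stack.pop()
        if len(prefix) >= k:
            rows.add((prefix[:k], prefix[k:] + rest))
        elif not rest:
            rows.add((prefix, ""))  # the whole alias is shorter than k characters
        elif rest[0] == "[" and (close := rest.find("]")) != -1:
            options = rest[1:close].split("|")
            if len(options) == 1:
                options.append("")  # [Foo] matches Foo or blank
            for opt in options:
                stack.append((prefix + opt, rest[close + 1:]))
        else:
            # Literal text (including an unmatched "[") runs up to the next "[".
            nxt = rest.find("[", 1)
            if nxt == -1:
                nxt = len(rest)
            stack.append((prefix + rest[:nxt], rest[nxt:]))
    return rows
```

On the Melbourne example this yields exactly the three rows given above, and the lookup side would fetch all rows for the first 5 characters of the search term, then match the remainder-patterns against the rest of it.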
I don't understand this. Why don't you simply create an alias table that mirrors the redirect table, like
alias_to
alias_namespace
alias_title
and every time a set of aliases is created for an article, just add the appropriate rows to that table? Then some special-case logic would be added to appropriate classes and methods to deal with aliases, and in particular, any method of the form "create an object corresponding to the named article, following redirects" would take aliases into account. (Actually, you seem to have caught on to this point in your last post, written after I wrote that.)
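A minimal sketch of that flat table, using SQLite in place of MediaWiki's actual schema and with invented column and function names: aliases are stored fully expanded, so lookups are plain indexed equality rather than a table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE alias (
        alias_to        INTEGER NOT NULL,  -- page id of the target article
        alias_namespace INTEGER NOT NULL,
        alias_title     TEXT    NOT NULL
    );
    -- Non-unique: several pages may legitimately share an alias.
    CREATE INDEX alias_name ON alias (alias_namespace, alias_title);
""")


def save_aliases(page_id, namespace, titles):
    """Replace a page's alias rows, as would happen on every article save."""
    conn.execute("DELETE FROM alias WHERE alias_to = ?", (page_id,))
    conn.executemany(
        "INSERT INTO alias (alias_to, alias_namespace, alias_title) VALUES (?, ?, ?)",
        [(page_id, namespace, title) for title in titles],
    )


def lookup(namespace, title):
    """All pages claiming this title as an alias (an indexed equality lookup)."""
    rows = conn.execute(
        "SELECT alias_to FROM alias WHERE alias_namespace = ? AND alias_title = ?",
        (namespace, title),
    )
    return sorted(row[0] for row in rows)
```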
Of course, that wouldn't be quite enough. There would be all sorts of things expecting particular behavior of redirects, and so this would create a fair amount of backwards incompatibility, and generally confuse things. Ideally I would like to see a proposal that merges redirects and aliases altogether: do we want them to have a corresponding page entry or not? They shouldn't be treated as distinct.
What we're looking for is a way to easily create and maintain redirects, not some totally new feature, and despite my suggestions above and below, I think that's how the problem should be posed. A special page to easily manage all redirects to a page, including to batch-create and -delete* them, is probably the best way to handle this. Grouping on this redirects page by category would be a good feature to have, for instance, and category management from it as well. But to start with, reversible batch creation and deletion is all that's needed.
*(Unprivileged users should indeed ideally be allowed to delete redirects in general if they have no substantial content, as currently they can during moves. However, history and easy reversibility needs to be built into this before it can be deployed, needless to say.)
On 10/24/07, Andrew Garrett andrew@epstone.net wrote:
No need for the complex setup you envisage. For MySQL, at least, we could create a new table 'article_aliases', and "select aa_page from article_aliases where 'my_title' like aa_alias". Of course, we'd need to do some built-in, potentially expensive checking when an alias is first introduced, like checking whether any other page titles match the pattern (if so, block the alias), and whether the article's own title matches it (if not, block the alias).
And you'd have to scan the table every time you want to check if an alias exists for a given string. Probably not a great idea.