On 10/24/07, Steve Bennett <stevagewp(a)gmail.com> wrote:
> a) If it's feasible (i.e., is not computationally too expensive)
It looks so. Ultimately I'm not seeing it as much different from
current redirects, implementationally.
> b) How much work is required to implement it
Probably a reasonable amount.
> c) If it was implemented, whether it would be enabled at Wikipedia
I don't see why not.
> Instead of having redirects that point to a page, have the page itself
> specify aliases which can be used to find it. This is specified as a
> pattern, like a very cut-down regexp: #ALIASES Thomas[-]Fran[ç|c]ois Dalibard
>
> The proposed syntax would be as follows (but is debatable):
>
> Foo - matches Foo
> [Foo] - matches Foo or blank.
> [Foo|Moo] - matches Foo or Moo.
> [Foo|Moo|] or [|Foo|Moo] - matches Foo or Moo or blank.
> Foo\[Moo - matches the literal string Foo[Moo
This would essentially be like regexes, but defined *without* the
operation of iteration: only catenation and union are allowed. This
is a large benefit because it means there are a finite number of
possible patterns, and so they can be stored in enumerated form.
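The finite-expansion property is easy to see in a sketch. This `expand` helper is hypothetical, not anything in MediaWiki, and it ignores the `\[` escape for brevity:

```python
def expand(pattern):
    """Enumerate every title a cut-down alias pattern matches.
    Only concatenation and [a|b|...] union are supported -- no
    iteration -- so the result is always a finite list.
    (Hypothetical sketch; the \\[ escape is ignored for brevity.)"""
    results = [""]
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            opts = pattern[i + 1:end].split("|")
            if len(opts) == 1:
                opts.append("")  # per the spec, [Foo] matches Foo or blank
            results = [r + o for r in results for o in opts]
            i = end + 1
        else:
            results = [r + pattern[i] for r in results]
            i += 1
    return results
```

For example, `expand("[Foo|Moo]")` yields `["Foo", "Moo"]`, and the Melbourne pattern further down expands to exactly eight titles.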
> All whitespace is equivalent to a single space. So "Boo [Foo] [Moo] Woo"
> matches "Boo Woo", rather than "Boo<space><space><space>Woo", for instance.
Generally speaking I would like to see titles that differ only up to
compression of whitespace to be considered identical. If this were
the case, the searchable forms of all titles would be
whitespace-normalized, and this point would be resolved automatically.
Until then, I suggest that this aspect be brushed under the
carpet for aliases, as for everything else.
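The whitespace compression I'd like to see could be a one-liner at title-normalization time (a sketch, not MediaWiki's actual normalization code):

```python
import re

def normalize_title(title):
    """Collapse every run of whitespace to a single space and trim the
    ends, so titles that differ only in whitespace compare equal."""
    return re.sub(r"\s+", " ", title).strip()
```

With this applied to both the expanded aliases and the search term, the "Boo Woo" versus "Boo&lt;space&gt;&lt;space&gt;&lt;space&gt;Woo" question resolves itself.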
> - Search term matches one real page, some aliases: takes you to real page.
>   (Arguably gives you a "did you mean...?" banner, but not critical)
> - Search term matches one alias, no real page: takes you to page.
> - Search term matches several aliases, no real page: either an
>   automatically generated disambiguation page, or shows you search
>   results with the matching aliases shown first.
I see. Possibly this is better than having the aliases be unique, yes.
> Unresolved issues:
>
> * Since pattern matching is prone to abuse, the total number of matching
>   aliases should be restricted in some way, perhaps to 10 or 20. The best
>   way to handle an excessively broad query (e.g., [A|b|c|d|e][A|b|c|d|e],
>   etc.) is left as an open question. Possibilities include silently
>   failing, noisily failing (with an error message in the rendered text),
>   or a special page for bad aliases...
It can create exponential database rows in the length of the alias
string, yes, so that needs to be dealt with -- if we're doing explicit
storage, anyway. I think 20 is probably too low.
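The blow-up is multiplicative, so whatever limit is chosen can be enforced cheaply at save time without materializing any rows. A sketch (the helper names and the limit value are my own guesses; escapes are again ignored for brevity):

```python
import re

def expansion_count(pattern):
    """Number of titles a pattern expands to: the product of its bracket
    groups' option counts (a single-option group [Foo] also matches
    blank, so it counts as 2)."""
    count = 1
    for group in re.findall(r"\[([^\]]*)\]", pattern):
        opts = group.split("|")
        count *= 2 if len(opts) == 1 else len(opts)
    return count

def alias_allowed(pattern, limit=100):
    """Reject overly broad patterns at save time (limit is arbitrary)."""
    return expansion_count(pattern) <= limit
```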
> * The role of redirects once this system is in place. One possible
>   implementation would simply create and destroy redirects as required.
>   In any case, they would still be needed for some licensing issues.
Why?
> Possible implementation:
>
> Without knowing the MediaWiki DB schema at all, I speculated on a possible
> implementation that would be a good tradeoff between size and speed. Two
> new tables are needed:
>
> AliasesRaw would contain a constantly updated list of the actual alias
> patterns used in articles. Each time an article is saved, this would
> possibly be updated.
>
> AliasesExpanded would contain expansions of these aliases, either fully or
> partially. So an expansion of #ALIASES [City of ][Greater ]Melbourne[,
> Victoria| (Australia)] to 5 characters would lead to three rows:
>
> "City ", "of [Greater ]Melbourne[, Victoria| (Australia)]"
> "Great", "er Melbourne[, Victoria| (Australia)]"
> "Melbo", "urne[, Victoria| (Australia)]"
>
> This means that if a user searches for "Greater Melbourne", then the
> search process would go something like:
>
> - Look for an article called Greater Melbourne, GREATER MELBOURNE,
>   greater melbourne (as at present) - assume this fails.
> - Look up "Great" in the AliasesExpanded table. Now iterate over the
>   matching results, finding one that matches.
>
> Obviously the number of characters stored in the expanded aliases could
> be tuned.
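A toy in-memory model of that lookup, with a dict standing in for the AliasesExpanded table. For brevity the rows here hold fully expanded titles, whereas the quoted scheme would store an unexpanded pattern remainder after the prefix:

```python
PREFIX_LEN = 5

def build_prefix_index(aliases):
    """Bucket alias titles by their first PREFIX_LEN characters, a
    stand-in for the AliasesExpanded table.  `aliases` maps an already
    expanded alias title to its target page."""
    index = {}
    for title, page in aliases.items():
        index.setdefault(title[:PREFIX_LEN], []).append((title, page))
    return index

def lookup(index, query):
    """The search step: fetch the bucket for the query's prefix, then
    scan only those few rows for an exact match."""
    return [page for title, page in index.get(query[:PREFIX_LEN], [])
            if title == query]
```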
I don't understand this. Why don't you simply create an alias table
that mirrors the redirect table, like
alias_to
alias_namespace
alias_title
and every time a set of aliases is created for an article, just add
the appropriate rows to that table? Then some special-case logic
would be added to appropriate classes and methods to deal with
aliases, and in particular, any method of the form "create an object
corresponding to the named article, following redirects" would take
aliases into account. (Actually, you seem to have caught on to this
point in your last post, written after I wrote that.)
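A sketch of that simpler scheme, using SQLite in place of MediaWiki's MySQL schema. The table and column names follow the post; everything else (function names, the index) is my own assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE alias (
        alias_to        INTEGER NOT NULL,  -- page_id of the target page
        alias_namespace INTEGER NOT NULL,
        alias_title     TEXT NOT NULL
    )
""")
# An index on (namespace, title) makes resolution a point lookup.
conn.execute("CREATE INDEX alias_ns_title ON alias (alias_namespace, alias_title)")

def save_aliases(page_id, namespace, titles):
    """On article save, replace the page's alias rows with the new set."""
    conn.execute("DELETE FROM alias WHERE alias_to = ?", (page_id,))
    conn.executemany(
        "INSERT INTO alias (alias_to, alias_namespace, alias_title) "
        "VALUES (?, ?, ?)",
        [(page_id, namespace, t) for t in titles])

def resolve_alias(namespace, title):
    """Exact indexed lookup, analogous to following a redirect."""
    row = conn.execute(
        "SELECT alias_to FROM alias "
        "WHERE alias_namespace = ? AND alias_title = ?",
        (namespace, title)).fetchone()
    return row[0] if row else None
```

Deleting and re-inserting on every save keeps the table consistent with the wikitext without any diffing logic.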
Of course, that wouldn't be quite enough. There would be all sorts of
things expecting particular behavior of redirects, and so this would
create a fair amount of backwards incompatibility, and generally
confuse things. Ideally I would like to see a proposal that merges
redirects and aliases altogether: do we want them to have a
corresponding page entry or not? They shouldn't be treated as
distinct.
What we're looking for is a way to easily create and maintain
redirects, not some totally new feature, and despite my suggestions
above and below, I think that's how the problem should be posed. A
special page to easily manage all redirects to a page, including to
batch-create and -delete* them, is probably the best way to handle
this. Grouping on this redirects page by category would be a good
feature to have, for instance, and category management from it as
well. But to start with, reversible batch creation and deletion is
all that's needed.
*(Unprivileged users should indeed ideally be allowed to delete
redirects in general if they have no substantial content, as currently
they can during moves. However, history and easy reversibility needs
to be built into this before it can be deployed, needless to say.)
On 10/24/07, Andrew Garrett <andrew(a)epstone.net> wrote:
> No need for the complex setup you envisage. For MySQL, at least, we
> could create a new table 'article_aliases', and "select aa_page from
> article_aliases where 'my_title' like aa_alias". Of course, we'd need
> to do some built-in, potentially expensive checking on the aliases
> that would be originally introduced, like checking if any other pages
> match the regex (if so, block the alias), and if the article title
> itself matches the regex (if not, block the alias).
And you'd have to scan the table every time you want to check if an
alias exists for a given string. Probably not a great idea.