Hi,
On Monday 19 December 2005 01:35, Lars Aronsson wrote:
> For Google-style page ranking, it is supposedly important to have links from one page to another. If the word "Colombia" is mentioned in the article about "Bogota" but not linked, this relationship will be missed in the ranking. One way to avoid such misses would be for a robot to take the list of article titles, search for their occurrences in the text body of all articles, and insert brackets where they are missing.
>
> No, I don't suggest that such a robot should be used in Wikipedia. For one thing, we do have articles about many common words and for every year in history, but it would not make sense to link every mention of a year or of such common words.
>
> What I would like to ask is whether this kind of text mining is common and has a name. So this is more of a general question about information retrieval (IR) in large text corpora than about Wikipedia. Are there arithmetic rules for when such links should be avoided?
>
> One place where such automatic linking could be interesting is a scanned paper encyclopedia, where no links exist beforehand, e.g. http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work
I used a technique like that in
http://search.cpan.org/~tels/Convert-Wiki-0.05/
which can be used to convert READMEs into wikitext. There are a few rules like "don't link to the same article twice in a paragraph", and you can supply a list of terms you want it to link. However, it is a hack, so any insight into formal rules or techniques would be of interest to me.
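For illustration, the two rules mentioned above (a supplied list of terms, and at most one link per term per paragraph) could be sketched roughly like this in Python. This is not the Convert-Wiki code, only a minimal sketch: the function name `auto_link`, the `skip` parameter and the `[[...]]` link syntax are assumptions made for the example, and `skip` stands in for a stop list of years and common words that should never be linked.

```python
import re

def auto_link(text, terms, skip=()):
    """Wrap the first mention of each term per paragraph in [[...]].

    Hypothetical helper: `terms` is the list of article titles to link,
    `skip` is a stop list (years, common words) that is never linked.
    Matching is case-sensitive and paragraphs are separated by blank lines.
    """
    skip_set = set(skip)
    wanted = [t for t in terms if t not in skip_set]
    if not wanted:
        return text
    # Longest terms first, so a longer title wins over a shorter prefix.
    wanted.sort(key=len, reverse=True)
    # First alternative: an existing [[link]], which is left untouched.
    # Second alternative (group 1): a term not embedded in a longer word.
    pattern = re.compile(
        r"\[\[.*?\]\]|(?<!\w)("
        + "|".join(re.escape(t) for t in wanted)
        + r")(?!\w)"
    )

    def link_paragraph(paragraph):
        seen = set()  # terms already linked earlier in this paragraph

        def replace(match):
            term = match.group(1)
            if term is None:
                # Existing link: remember its target, keep it as it is.
                seen.add(match.group(0)[2:-2].split("|")[0])
                return match.group(0)
            if term in seen:
                return term  # second mention in the paragraph stays plain
            seen.add(term)
            return "[[" + term + "]]"

        return pattern.sub(replace, paragraph)

    return "\n\n".join(link_paragraph(p) for p in text.split("\n\n"))


sample = (
    "Bogota is the capital of Colombia. Colombia was founded in 1810.\n\n"
    "Colombia borders Peru."
)
linked = auto_link(sample, ["Colombia", "Bogota", "1810"], skip=["1810"])
print(linked)
```

Here the year is excluded through the stop list, the second "Colombia" in the first paragraph stays unlinked, and the term is linked again in the next paragraph. A stop list like this is only a crude stand-in for the formal rules asked about above.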
Best wishes,
Tels
-- 
Signed on Mon Dec 19 18:51:35 2005 with key 0x93B84C15.
Visit my photo gallery at http://bloodgate.com/photos/
PGP key on http://bloodgate.com/tels.asc or per email.
"Retsina?" - "Yes, Dad?" - "Mow the lawn." - "All right, Dad."