For Google-style page ranking, it is supposedly important to have links from one page to another. If the word "Colombia" is mentioned in the article about "Bogota" but not linked, this relationship will be missed in the ranking. One way to avoid such misses would be for a robot to take the list of article titles, search for their occurrences in the text body of all articles, and insert brackets where they are missing.
No, I don't suggest that such a robot should be used in Wikipedia. For one thing, we do have articles about many common words and for every year in history, but it would not make sense to make a link for every mention of a year or of such common words.
What I would like to ask is whether this kind of text mining is common and has a name. So this is more of a general question about information retrieval (IR) in large text corpora than about Wikipedia. Are there arithmetic rules for when such links should be avoided?
One place where such automatic linking could be interesting is a scanned paper encyclopedia, where no links exist beforehand, e.g. http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work
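As an illustration only, the robot described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not an existing Wikipedia bot: the function name `autolink` and the `stop_titles` parameter are made up for this example, titles are matched case-sensitively on word boundaries, and years are approximated as titles consisting only of digits.

```python
import re

def autolink(text, titles, stop_titles=frozenset()):
    """Wrap unlinked mentions of known article titles in [[...]].

    Hypothetical helper: `stop_titles` is a caller-supplied set of
    lower-cased titles (common words) that should never be linked.
    """
    # Skip all-digit titles (years) and anything on the stop list.
    wanted = [t for t in titles
              if not t.isdigit() and t.lower() not in stop_titles]
    if not wanted:
        return text
    # Longest titles first, so "New York City" wins over "New York".
    wanted.sort(key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in wanted)
    # The first alternative matches an existing [[...]] link so that
    # its contents are left untouched.
    pattern = re.compile(r"\[\[.*?\]\]|\b(" + alternatives + r")\b")

    def repl(match):
        if match.group(1) is None:   # an existing link: keep as is
            return match.group(0)
        return "[[" + match.group(1) + "]]"

    return pattern.sub(repl, text)

sample = "Bogota is the capital of Colombia since 1819. See [[Colombia]]."
linked = autolink(sample, ["Colombia", "1819", "Capital"],
                  stop_titles={"capital"})
```

Here `linked` links the plain mention of Colombia, leaves the year and the existing link alone, and ignores the stop-listed title.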
Hello,
On Monday 19 December 2005 01:35, Lars Aronsson wrote:
For Google-style page ranking, it is supposedly important to have links from one page to another. If the word "Colombia" is mentioned in the article about "Bogota" but not linked, this relationship will be missed in the ranking. One way to avoid such misses would be for a robot to take the list of article titles, search for their occurrences in the text body of all articles, and insert brackets where they are missing.
No, I don't suggest that such a robot should be used in Wikipedia. For one thing, we do have articles about many common words and for every year in history, but it would not make sense to make a link for every mention of a year or of such common words.
What I would like to ask is whether this kind of text mining is common and has a name. So this is more of a general question about information retrieval (IR) in large text corpora than about Wikipedia. Are there arithmetic rules for when such links should be avoided?
One place where such automatic linking could be interesting is a scanned paper encyclopedia, where no links exist beforehand, e.g. http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work
I used such a technique in
http://search.cpan.org/~tels/Convert-Wiki-0.05/
which can be used to convert READMEs into wikitext. There are a few rules like "don't link to the same article twice in a paragraph", and you can supply a list of terms you want it to link. However, it is a hack, so any insight into formal rules or techniques would be of interest to me.
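The "don't link to the same article twice in a paragraph" rule can be sketched as below. This is not the code from Convert::Wiki (which is a Perl module); it is a small Python illustration under stated assumptions: the function name `link_once_per_paragraph` is hypothetical, paragraphs are assumed to be separated by blank lines, and the input is assumed to contain no links yet.

```python
import re

def link_once_per_paragraph(text, terms):
    """Link only the first mention of each term within each paragraph.

    Hypothetical helper; `terms` is the caller-supplied list of
    terms that may be linked.
    """
    if not terms:
        return text
    # Longest terms first, so longer phrases win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")

    result = []
    for paragraph in text.split("\n\n"):
        seen = set()   # terms already linked in this paragraph

        def repl(match, seen=seen):
            term = match.group(1)
            if term in seen:
                return term            # later mention: leave as plain text
            seen.add(term)
            return "[[" + term + "]]"  # first mention: link it

        result.append(pattern.sub(repl, paragraph))
    return "\n\n".join(result)

readme = "Perl is fun. Perl is old.\n\nPerl again."
wikitext = link_once_per_paragraph(readme, ["Perl"])
```

The `seen` set is reset for every paragraph, so a term is linked again when it first appears in a later paragraph.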
Best wishes,
Tels
-- Signed on Mon Dec 19 18:51:35 2005 with key 0x93B84C15. Visit my photo gallery at http://bloodgate.com/photos/ PGP key on http://bloodgate.com/tels.asc or per email.
"Retsina?" - "Yes, Dad?" - "Mow the lawn." - "All right, Dad."
Lars Aronsson wrote:
No, I don't suggest that such a robot should be used in Wikipedia. For one thing, we do have articles about many common words and for every year in history, but it would not make sense to make a link for every mention of a year or of such common words.
A year ago I developed a web-based "wikifyer" for the German Wikipedia at http://217.160.138.71/development/wikipedia/wikify/index.php (German only). It calculates a list of all article titles that exist in the German Wikipedia and also occur in the article the user has selected, but are not yet linked from that article. The user can then select a set of subjects to be linked automatically and afterwards copy and paste the resulting article text. So the decision about linking still rests with the user, but the user is better supported in knowing what can be linked to, and in doing it quickly.
I haven't worked on that tool for a year for lack of spare time, but I still have the idea of rewriting it from scratch and putting it on the Wikimedia tool server, so that it can use a recent copy of Wikipedia for the list of article titles. Does that sound like a useful thing to you?
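The candidate list such a tool computes could look roughly like this. It is only a sketch, not the code of the wikifyer itself: the names `LINK_RE` and `link_candidates` are invented here, links are assumed to use the `[[Title]]`, `[[Title|label]]` or `[[Title#section]]` forms, and matching is case-sensitive.

```python
import re

# Matches [[Title]], [[Title|label]] and [[Title#section]];
# group 1 captures the link target.
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def link_candidates(article_text, all_titles):
    """Return titles mentioned in the article but not yet linked from it."""
    linked = {m.group(1).strip() for m in LINK_RE.finditer(article_text)}
    # Drop existing links so their text does not count as a plain mention.
    plain = LINK_RE.sub(" ", article_text)
    return sorted(
        title for title in all_titles
        if title not in linked
        and re.search(r"\b" + re.escape(title) + r"\b", plain)
    )

article = ("[[Bogota]] is the capital of Colombia. It lies in the "
           "[[Andes|Andean]] region of South America.")
candidates = link_candidates(
    article, ["Bogota", "Colombia", "Andes", "South America", "Peru"])
```

The function only proposes candidates and leaves the choice to the user, as described above. For a full list of article titles, one search per title would be slow, so a real tool would need a faster multi-pattern lookup.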
Ciao, Michael.
wikitech-l@lists.wikimedia.org