Re: [Wikitech-l] text mining and automatic linking

19 Dec 2005


      -----BEGIN PGP SIGNED MESSAGE-----
Moin,
On Monday 19 December 2005 01:35, Lars Aronsson wrote:
...
For Google-style page ranking, it is supposedly important to have
links from one page to another.  If the word "Colombia" is
mentioned in the article about "Bogota" but not linked, this
relationship will be missed in the ranking.  One way to avoid such
misses would be for a robot to take the list of article titles and
search for their occurrences in the text body of all articles, and
insert brackets where they are missing.
No, I don't suggest that such a robot should be used in Wikipedia.
For one thing, we do have articles about many common words and for
every year in history, but it would not make sense to make a link
for every mention of a year or of such common words.
What I would like to ask is whether this kind of text mining is
common and has a name?  So this is more of a general question
about information retrieval (IR) in large text corpuses than about
Wikipedia.  Are there arithmetic rules for when such links should
be avoided?
One place where such automatic linking could be interesting is a
scanned paper encyclopedia, where no links exist beforehand, e.g.
http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work
I used such a technique for
http://search.cpan.org/~tels/Convert-Wiki-0.05/
which can be used to convert READMEs into wikitext. There are a few rules
like "don't link to the same article twice in a paragraph", and you can
supply a list of terms you want it to link. However, it is a hack, so any
insight into formal rules or techniques would be of interest to me.
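
To make those rules concrete, here is a minimal sketch in Python. It is not
the code from Convert-Wiki (which is Perl), and the names autolink, titles
and skip are made up for illustration. It assumes MediaWiki-style [[...]]
links, case-sensitive matching, and that each call receives one paragraph.

```python
import re

# Matches an existing wiki link such as [[Colombia]] or [[Colombia|the country]].
LINK = r"\[\[[^\]]*\]\]"


def autolink(paragraph, titles, skip=()):
    """Bracket the first unlinked mention of each title in one paragraph.

    titles: the terms we are willing to link (hypothetical input list).
    skip:   titles never to link, e.g. years or very common words.
    """
    skip_set = {s.lower() for s in skip}
    # Longest titles first, so "New York City" wins over "New York".
    wanted = sorted(
        (t for t in titles if t and t.lower() not in skip_set),
        key=len,
        reverse=True,
    )
    if not wanted:
        return paragraph

    # Titles that the paragraph already links to count as "used".
    linked = {
        m.group(0)[2:-2].split("|")[0].strip()
        for m in re.finditer(LINK, paragraph)
    }

    alternation = "|".join(re.escape(t) for t in wanted)
    # Existing links are matched first so their contents are never touched.
    pattern = re.compile(LINK + r"|(?<!\w)(?:" + alternation + r")(?!\w)")

    def replace(match):
        text = match.group(0)
        if text.startswith("[[") or text in linked:
            return text  # existing link, or title already linked once
        linked.add(text)
        return "[[" + text + "]]"

    return pattern.sub(replace, paragraph)


if __name__ == "__main__":
    sample = "Bogota is the capital of Colombia. Colombia borders Panama."
    print(autolink(sample, ["Colombia", "Bogota", "Panama"]))
```

The skip list is only a crude stand-in for the question of when links should
be avoided; a real rule would probably need some measure of how common a
term is across the corpus.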
Best wishes,
Tels
- -- 
 Signed on Mon Dec 19 18:51:35 2005 with key 0x93B84C15.
 Visit my photo gallery at http://bloodgate.com/photos/
 PGP key on http://bloodgate.com/tels.asc or per email.
"Retsina?" - "Ja, Papa?" - "Rasenmähen." - "Is gut, Papa."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iQEVAwUBQ6bzl3cLPEOTuEwVAQEbNgf+OlePwZIJsaAXv/LMhjFqo5mjESQEBaNQ
s/ehk7s5gDPyb2jgES6xPVArzZwd2XAm2x75qq4uHtyPe/KUEMpyWpZw5HKQqXu0
ph4vVM/3Wfv2rF4SWgfO0wq5miRBLyKykQfxPYkVcPLXSjZ4mo6xfGIXAscIM8Qi
x+ppWBntCmlFC2k12gOi3sSivvByRVi7d0rSrZMQFxVrCvjXJHEcvWyO2A42YzFi
qRc1I5pqGP+DwoAaVDNt+JlE+RZqcJCoH3rk5CR6SDD5RxeEbjowZ6cJwzAFhGOf
c/04Kv67DTdp16erYRWmuBvhFKvDNxQANc8TcpyUlumPo59P2H/B5Q==
=10Sk
-----END PGP SIGNATURE-----
