Although some links may not represent strong connections, articles written according to Wikipedia guidelines should contain only a minimal amount of such links (distractions and noise). Every article consists of words and semantic structures. If we partitioned all the articles and analysed the occurrence of these structures statistically, we would see a different distribution for every article. Going further with this analysis, we would notice that the distributions fall into distinct types and can be categorised according to what they have in common.

Let us pick some articles to represent categories, http://en.wikipedia.org/wiki/Mathematics being one of them. The system would then assign to each article a probability of, or closeness to, each category. For example, http://en.wikipedia.org/wiki/Isaac_Newton could score 15% Physics, 13% Mathematics, 11% Famous People... Numerous methods could be applied for the analysis - Bayesian probabilities, PageRank and so on, or their combinations. The percentages are purely illustrative; relevance can be expressed in more than one dimension, as a combination of vectors or functions depending on factors such as time, location, or depth of information span.
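To make the idea concrete, here is a minimal sketch of how category closeness could be scored. It builds a word-frequency profile per category from representative seed articles and compares a new article against each profile with cosine similarity. All names and the seed texts are hypothetical placeholders; a real system would use full article text and a stronger model (naive Bayes, PageRank-weighted features, etc.).

    # Hypothetical sketch: score an article's closeness to hand-picked categories.
    from collections import Counter
    import math

    def word_profile(text):
        """Lower-cased word-frequency vector for a piece of text."""
        return Counter(text.lower().split())

    def cosine(p, q):
        """Cosine similarity between two sparse frequency vectors."""
        dot = sum(p[w] * q[w] for w in p if w in q)
        norm = math.sqrt(sum(v * v for v in p.values())) * \
               math.sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0

    # Invented stand-ins for category seed articles (e.g. the Mathematics article).
    category_profiles = {
        "Mathematics": word_profile("theorem proof number geometry algebra function"),
        "Physics": word_profile("force motion gravity energy optics mechanics"),
        "Famous people": word_profile("born life career work influence biography"),
    }

    # Invented stand-in for the text of the article being classified.
    article = word_profile("Newton developed the laws of motion and gravity "
                           "and contributed to geometry and algebra")

    # Relative closeness to each category, normalised so the scores sum to 1.
    scores = {c: cosine(article, p) for c, p in category_profiles.items()}
    total = sum(scores.values()) or 1.0
    for category, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{category}: {score / total:.0%}")

The normalised percentages are only for readability; as noted above, the underlying relevance measure could just as well stay multi-dimensional.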
Possible applications:
- "See also" suggestions
- Search results
- Experimental semantic navigation
- Analysis of scientific papers
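As a hypothetical illustration of the first application, once every article has such a category vector, "See also" candidates could be ranked by the similarity of those vectors. The vectors below are invented for illustration only.

    # Hypothetical "See also" ranking by similarity of per-article category vectors.
    def dot(u, v):
        return sum(u.get(k, 0.0) * v.get(k, 0.0) for k in u)

    article_vectors = {
        "Isaac_Newton": {"Physics": 0.15, "Mathematics": 0.13, "Famous people": 0.11},
        "Gottfried_Leibniz": {"Mathematics": 0.18, "Philosophy": 0.12, "Famous people": 0.10},
        "Albert_Einstein": {"Physics": 0.22, "Famous people": 0.14},
        "Football": {"Sport": 0.25},
    }

    def see_also(title, vectors, k=2):
        """Return the k articles whose category vectors are closest to `title`'s."""
        target = vectors[title]
        others = ((t, dot(target, v)) for t, v in vectors.items() if t != title)
        return sorted(others, key=lambda tv: -tv[1])[:k]

    print(see_also("Isaac_Newton", article_vectors))
    # e.g. [('Albert_Einstein', 0.0484...), ('Gottfried_Leibniz', 0.0344...)]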