I have been compiling a machine-built lexicon created from link and disambiguation pages in the XML dumps. Oddly, the associations contained in [[ARTICLE_NAME | NAME]] links form a comprehensive "real time" thesaurus of common associations used by current English speakers on Wikipedia, and may well comprise the world's largest and most comprehensive thesaurus, embedded within the mesh of links inside the dumps.
[... snip ...]
The first part of the message discusses a machine-created thesaurus based on these links, which I will post as an XML dump when the program is completed. That part may be of interest moving forward, as it would enable a built-in thesaurus for MediaWiki. The wikitrans program uses this thesaurus created from within the dumps; it could have a lot of applications for translators. I have found it very useful.
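Roughly, the extraction amounts to inverting the piped links, so that each target article collects the labels people use to link to it. A minimal sketch in Python (illustrative only - the regex and the function names here are stand-ins, not the actual program, which I'll post later):

import re
from collections import defaultdict

# Matches piped links of the form [[ARTICLE_NAME|NAME]].
PIPED_LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")

def build_thesaurus(article_texts):
    """article_texts: iterable of raw wikitext strings, e.g. streamed
    out of an enwiki XML dump. Returns a mapping from ARTICLE_NAME to
    the set of NAMEs that editors have used to link to it."""
    synonyms = defaultdict(set)
    for text in article_texts:
        for target, label in PIPED_LINK.findall(text):
            target, label = target.strip(), label.strip()
            # A label identical to its target adds no thesaurus information.
            if label.lower() != target.lower():
                synonyms[target].add(label)
    return synonyms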
Hi Jeff,
I ran a small project a couple of years ago to try and create "missing" redirects and disambiguation pages using this information ( http://en.wikipedia.org/w/index.php?title=User:Nickj/Redirects ) - I'll quickly describe what it did in case it helps anyone who wants to do something similar now.
A list of possible new redirects was created based on piped-link / [[ARTICLE_NAME | LINK_NAME]] usage in articles in the main namespace (using database dumps of enwiki), where (see the sketch after this list):
* all or most of the source LINK_NAME "votes" agreed on what the target ARTICLE_NAME was;
* a certain minimum threshold of votes was crossed (I think it might have been >= 3 votes);
* there was no article currently at [[LINK_NAME]];
* and there was an article currently at ARTICLE_NAME (since redirects that point to non-existent articles should be deleted with extreme prejudice, IMHO).
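In code, the selection boiled down to a simple vote count per LINK_NAME, something like the following (the thresholds and the article_exists lookup are stand-ins for whatever I actually used, not the original code):

import re
from collections import Counter, defaultdict

PIPED_LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")

def suggest_redirects(article_texts, article_exists, min_votes=3, min_agreement=0.8):
    """article_exists: callable(title) -> bool, e.g. a lookup against the
    page table from the same dump. min_votes and min_agreement are
    illustrative values only."""
    votes = defaultdict(Counter)  # LINK_NAME -> Counter of target ARTICLE_NAMEs
    for text in article_texts:
        for target, label in PIPED_LINK.findall(text):
            votes[label.strip()][target.strip()] += 1
    for link_name, targets in votes.items():
        (best_target, best_count), = targets.most_common(1)
        if (best_count >= min_votes                                      # enough votes
                and best_count / sum(targets.values()) >= min_agreement  # votes mostly agree
                and not article_exists(link_name)                        # nothing at [[LINK_NAME]] yet
                and article_exists(best_target)):                        # target article exists
            yield link_name, best_target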
These redirect suggestions were then reviewed by humans; if a reviewer liked one, they added it by clicking on a link (which used a GET request to give a Preview of the result, and which supplied an edit description and the full body contents). This meant that a new redirect could be added with just 2 mouse clicks, using a standard browser. (Using the exact same method today is not currently possible due to http://bugzilla.wikimedia.org/show_bug.cgi?id=3693 , although it is currently possible to use "Show Changes" instead of "Preview" to achieve a very similar result using GET requests.)
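For the curious, such a two-click link can be built with nothing more than URL parameters against MediaWiki's edit form. A sketch, using the standard wp* form field names (wpDiff being the "Show Changes" button; the summary wording here is made up):

from urllib.parse import urlencode

def review_link(link_name, target, base="http://en.wikipedia.org/w/index.php"):
    """Build a GET URL that opens the edit form for [[link_name]] with
    the redirect body and edit summary prefilled, showing a diff of the
    result for human review before saving."""
    params = {
        "title": link_name,
        "action": "submit",
        "wpTextbox1": "#REDIRECT [[%s]]" % target,
        "wpSummary": "Create redirect to [[%s]] (machine-suggested, human-reviewed)" % target,
        "wpDiff": "Show changes",  # wpPreview no longer works over GET (bug 3693)
    }
    return base + "?" + urlencode(params)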
A series of disambiguation pages was also suggested. These were created using the same method, based on [[ARTICLE_NAME | LINK_NAME]] usage, but where the LINK_NAME "votes" did _not_ agree on what the target ARTICLE_NAME was. In those cases, it suggested a disambig page that basically said "LINK_NAME is either [[A]], [[B]] or [[C]]".
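Generating the suggested page text from a disputed vote set is then straightforward. A sketch (the layout and the {{disambig}} tag follow common enwiki convention, rather than anything my script specifically produced):

def suggest_disambig(link_name, targets, article_exists, min_targets=2):
    """targets: the Counter of ARTICLE_NAMEs collected for this LINK_NAME
    when the votes disagreed. Returns suggested wikitext, or None if a
    disambig page is not applicable. min_targets is an assumed threshold."""
    candidates = sorted(t for t in targets if article_exists(t))
    if article_exists(link_name) or len(candidates) < min_targets:
        return None
    lines = ["'''%s''' may refer to:" % link_name]
    lines += ["* [[%s]]" % t for t in candidates]
    lines += ["", "{{disambig}}"]
    return "\n".join(lines)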
Anyone who wanted to give something like this a go (and I'm sure that in 2 years tonnes more links must have been added, which means a lot more raw data to work with) would probably want to have a quick glance over the "Previously Rejected Suggestions" ( http://en.wikipedia.org/wiki/User:Nickj/Redirects#Previously_rejected_sugges... ) to see what people did not like previously.
Oh, and once something like this was done, you could maybe start a thesaurus directly from the redirects themselves, thus helping both the thesaurus people and Wikipedia - win/win :-) And if you wanted to create a truly open thesaurus, you'd probably want to tag the redirects that were worthy of inclusion with something like [[Category:Thesaurus Redirect]], and also tag the ones that weren't worthy of inclusion somehow; that way anyone could build on this data and come up with new and cool ways of using it ;-)
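To make that last point concrete, one possible shape for it (the category name is the hypothetical one above; the redirect targets and category membership would both come straight out of the dumps):

from collections import defaultdict

def thesaurus_from_redirects(redirects, tagged):
    """redirects: dict mapping redirect title -> target article title.
    tagged: set of redirect titles carrying [[Category:Thesaurus Redirect]].
    Returns target -> list of synonyms judged worthy of inclusion."""
    thesaurus = defaultdict(list)
    for title, target in redirects.items():
        if title in tagged:
            thesaurus[target].append(title)
    return thesaurus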
All the best, Nick.