I have been compiling a machine-compiled lexicon built from the link and disambiguation pages in the XML dumps. Oddly, the associations contained in [[ARTICLE_NAME | NAME]] piped links form a comprehensive "real time" thesaurus of the common associations used by current English speakers on Wikipedia, and may well comprise the world's largest and most comprehensive thesaurus, embedded in the mesh of links within the dumps.
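Roughly, the harvesting step looks like the sketch below (Python run against a pages-articles dump; the dump filename and the export-schema namespace URI are placeholders, and the piped-link regex is deliberately naive, ignoring templates, nested links, and namespace prefixes such as File: or Category:):

    import bz2
    import re
    import xml.etree.ElementTree as ET
    from collections import defaultdict

    PIPED_LINK = re.compile(r"\[\[([^\[\]|{}]+)\|([^\[\]|{}]+)\]\]")
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # schema version varies by dump

    def harvest(dump_path):
        """Yield (label, target) pairs from every page text in the dump."""
        with bz2.open(dump_path, "rb") as f:
            for _event, elem in ET.iterparse(f):
                if elem.tag == NS + "text" and elem.text:
                    for target, label in PIPED_LINK.findall(elem.text):
                        yield label.strip(), target.strip()
                elif elem.tag == NS + "page":
                    elem.clear()  # keep memory bounded on multi-gigabyte dumps

    if __name__ == "__main__":
        thesaurus = defaultdict(set)  # surface label -> set of article targets
        for label, target in harvest("enwiki-pages-articles.xml.bz2"):
            thesaurus[label].add(target)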
While going through the dumps and constructing associative link maps of all these expressions, I have noticed a serious issue with embedded links on proper names. It appears there may be a bot running somewhere that takes proper names listed in articles about relationships between people and blindly links them to whatever Wikipedia entry matches the name.
Some of the content could create controversy if I posted examples here, so I will finish the thesaurus compilation first, and folks should then go through the encyclopedia. Articles about movie stars and other "gossipy" subjects seem to have the highest rate of errors linking proper names to unrelated people without proper disambiguation pages. Some of these erroneous links could be interpreted as violations of WP:BLP, and they could be troublesome for the Foundation.
Whoever is running bots that link between articles should check proper-name links against categories and look into this. I found a large number of these errors. They are subtle, but they will most likely surface as people browse through articles unless you can analyze the link targets and relationships in the dumps.
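As a rough illustration of that category check, something along these lines could flag piped links whose label looks like a personal name but whose source article and link target share no categories at all. The endpoint is the standard English Wikipedia API; the name-matching pattern and the zero-overlap rule are only assumptions, a crude heuristic rather than a real BLP test:

    import re
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    NAME_LIKE = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+)+$")  # e.g. "John Smith"

    def categories(title):
        """Return the set of category titles for a page (empty if missing)."""
        params = {
            "action": "query",
            "prop": "categories",
            "titles": title,
            "cllimit": "max",
            "format": "json",
        }
        pages = requests.get(API, params=params, timeout=30).json()["query"]["pages"]
        page = next(iter(pages.values()))
        return {c["title"] for c in page.get("categories", [])}

    def flag_link(source_title, target_title, label):
        """True if a name-like piped link deserves a human look."""
        if not NAME_LIKE.match(label):
            return False
        return not (categories(source_title) & categories(target_title))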
Jeff
On 4/1/07, Jeffrey V. Merkey jmerkey@wolfmountaingroup.com wrote:
I have been compiling a machine-compiled lexicon built from the link and disambiguation pages in the XML dumps. [...]
Hey Jeff, would you mind forwarding me a copy of your extracted data? A long time back I extracted the same data using an instrumented copy of the MediaWiki parser, for the purpose of creating missing redirect pages. I didn't save my work, and getting the data from you would save me from reinventing the wheel all over again.
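For what it's worth, a sketch of that redirect-mining step might look like the following: treat a surface label as a redirect candidate when it overwhelmingly points at one target and no page already exists at that exact title. Here link_counts and existing_titles would come from the dump, and the 0.9 dominance ratio and five-use minimum are arbitrary assumptions:

    from collections import Counter

    def suggest_redirects(link_counts, existing_titles, dominance=0.9, min_uses=5):
        """Yield (label, target) pairs that look like safe redirect candidates."""
        for label, targets in link_counts.items():
            if label in existing_titles:
                continue  # a page (or redirect) already sits at this title
            total = sum(targets.values())
            target, hits = Counter(targets).most_common(1)[0]
            if total >= min_uses and hits / total >= dominance:
                yield label, target

    # Toy data only, to show the shape of the inputs and output.
    counts = {"Big Apple": {"New York City": 42, "Apple Inc.": 1}}
    print(list(suggest_redirects(counts, existing_titles={"New York City"})))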
Thanks.
Gregory Maxwell wrote:
Would you mind forwarding me a copy of your extracted data? [...]
I'll post it today at ftp://ftp.wikigadugi.org. It's very useful.
Jeff