I have been compiling a machine-generated lexicon created from link and disambiguation pages in the XML dumps. Oddly, the associations contained in [[ARTICLE_NAME | NAME]] links form a comprehensive "real time" thesaurus of common associations used by current English speakers in Wikipedia, and perhaps comprise the world's largest and most comprehensive thesaurus on the planet, embedded within the mesh of these links in the dumps.
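The extraction step is simple enough to sketch; something along these lines (an illustrative sketch only, with an assumed dump namespace and regular expression, not the actual program):

# Minimal sketch: build an association map from piped links in an enwiki XML dump.
# Assumes the standard <page>/<title>/<revision>/<text> layout of the dumps;
# illustrative only, not the actual extraction program.
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

PIPED_LINK = re.compile(r'\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]')
NS = '{http://www.mediawiki.org/xml/export-0.3/}'  # namespace varies by dump version

def build_association_map(dump_path):
    """Map each display NAME to the ARTICLE_NAMEs it is piped to, with counts."""
    associations = defaultdict(lambda: defaultdict(int))
    for _, elem in ET.iterparse(dump_path):
        if elem.tag == NS + 'text' and elem.text:
            for article, name in PIPED_LINK.findall(elem.text):
                associations[name.strip()][article.strip()] += 1
            elem.clear()  # keep memory bounded on a multi-gigabyte dump
    return associations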
While going through the dumps and constructing associative link maps of all these expressions, I have noticed a serious issue with embedded links to proper names. It appears there may be a robot running somewhere that associates proper names listed in articles about relationships between people by blindly linking to any entry in Wikipedia that matches a name in an article.
Posting examples of some of the content here could create controversy, so I will complete the thesaurus compilation, and folks should go through the encyclopedia themselves. Articles about movie stars and other "gossipy" articles seem to have the highest rate of errors linking proper names to unrelated people without proper disambiguation pages. These errors could be interpreted as violations of WP:BLP, and some of the erroneous links could be troublesome for the Foundation.
Whoever is running bots that link between articles should look at proper-name links based on categories and check into this. I found a large number of these types of errors. They are subtle, but they will most probably show up when browsing through articles unless you can analyze the link targets and relationships in the dumps.
Jeff
On 01/04/07, Jeffrey V. Merkey jmerkey@wolfmountaingroup.com wrote:
Whoever is running bots that link between articles should look at proper-name links based on categories and check into this. I found a large number of these types of errors. They are subtle, but they will most probably show up when browsing through articles unless you can analyze the link targets and relationships in the dumps.
If I understand the problem, and I'm not sure that I do, then my advice is that the operators of a lot of such bots won't be subscribed to this list; try wikipedia-l.
Rob Church
Whoever is running bots that link between articles should look at proper-name links based on categories and check into this. I found a large number of these types of errors. They are subtle, but they will most probably show up when browsing through articles unless you can analyze the link targets and relationships in the dumps.
Jeff
Try http://en.wikipedia.org/wiki/Wikipedia:Bot_owners%27_noticeboard. --Mets501
Mets501 wrote:
Whoever is running bots that link between articles should look at proper-name links based on categories and check into this. I found a large number of these types of errors. They are subtle, but they will most probably show up when browsing through articles unless you can analyze the link targets and relationships in the dumps.
Jeff
Try http://en.wikipedia.org/wiki/Wikipedia:Bot_owners%27_noticeboard. --Mets501
AFAIK, links to other articles are made by human editors, NOT bots.
I think the issue here is that someone linking to [[Thomas Figby]] might not realize that the article at [[Thomas Figby]] is not about the man currently on trial for the murder of his wife, but rather about Thomas Figby the movie star, when the article he is actually looking for is [[Thomas Lawrence Figby]].
I don't think this has anything to do with bots.
Mark
On 31/03/07, Mohamed Magdy mohamed.m.k@gmail.com wrote:
Mets501 wrote:
Whoever is running bots that link between articles should look at proper-name links based on categories and check into this. I found a large number of these types of errors. They are subtle, but they will most probably show up when browsing through articles unless you can analyze the link targets and relationships in the dumps.
Jeff
Try http://en.wikipedia.org/wiki/Wikipedia:Bot_owners%27_noticeboard. --Mets501
AFAIK, links to other articles are made by human editors, NOT bots.
Mohamed Magdy wrote:
Mets501 wrote:
Whoever is running bots that link between articles should look at proper-name links based on categories and check into this. I found a large number of these types of errors. They are subtle, but they will most probably show up when browsing through articles unless you can analyze the link targets and relationships in the dumps.
Jeff
Try http://en.wikipedia.org/wiki/Wikipedia:Bot_owners%27_noticeboard. --Mets501
AFAIK, links to other articles are made by human editors, NOT bots.
I try not to post to the English Wikipedia or its mailing lists. The problem was reported on Foundation-l, with a courtesy notice about the problem sent to the developers who control the dumps.
Jeff
On 01/04/07, Jeffrey V. Merkey jmerkey@wolfmountaingroup.com wrote:
I try not to post to the English Wikipedia or its mailing lists. The problem was reported on Foundation-l, with a courtesy notice about the problem sent to the developers who control the dumps.
We control the dumps, we don't control the content of the dumps. The issue is, in fact, nothing to do with us.
Rob Church
Rob Church wrote:
On 01/04/07, Jeffrey V. Merkey jmerkey@wolfmountaingroup.com wrote:
I try not to post to the English Wikipedia or its mailing lists. The problem was reported on Foundation-l, with a courtesy notice about the problem sent to the developers who control the dumps.
We control the dumps, we don't control the content of the dumps. The issue is, in fact, nothing to do with us.
The original message was in two parts. The second part seems to be the contentious part. I tend to agree with you that the developers are unconcerned with human errors in the dumps (which has now been verified as a non-bot, non-MediaWiki issue; that was unconfirmed before).
The first part of the message discusses a machine-created thesaurus based upon these links, which I will post as an XML dump when the program is completed. That part may be of interest moving forward, as it would enable a built-in thesaurus for MediaWiki. The wikitrans tool uses this thesaurus, which is created from within the dumps. It could have a lot of applications for translators. I have found it very useful.
Jeff
I have been compiling a machine-generated lexicon created from link and disambiguation pages in the XML dumps. Oddly, the associations contained in [[ARTICLE_NAME | NAME]] links form a comprehensive "real time" thesaurus of common associations used by current English speakers in Wikipedia, and perhaps comprise the world's largest and most comprehensive thesaurus on the planet, embedded within the mesh of these links in the dumps.
[... snip ...]
The first part of the message discusses a machine-created thesaurus based upon these links, which I will post as an XML dump when the program is completed. That part may be of interest moving forward, as it would enable a built-in thesaurus for MediaWiki. The wikitrans tool uses this thesaurus, which is created from within the dumps. It could have a lot of applications for translators. I have found it very useful.
Hi Jeff,
I ran a small project a couple of years ago to try and create "missing" redirects and disambiguation pages using this information ( http://en.wikipedia.org/w/index.php?title=User:Nickj/Redirects ) - I'll quickly describe what it did in case it helps anyone who wants to do something similar now.
A list of possible new redirects was created based on piped-link / [[ARTICLE_NAME | LINK_NAME]] usage in articles in the main namespace (using database dumps of enwiki), where:
* all or most of the source LINK_NAME "votes" agreed on what the target ARTICLE_NAME was;
* and a certain minimum threshold of votes was crossed (I think it might have been >= 3 votes);
* and where there was no article currently at [[LINK_NAME]];
* and where there was an article currently at ARTICLE_NAME (since redirects that point to non-existent articles should be deleted with extreme prejudice, IMHO).
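In rough Python, that selection step would have looked something like the sketch below (a reconstruction from memory; the helper names, the agreement ratio and the exact threshold are guesses, not the original code):

# Rough sketch of the redirect-selection step described above.
# "associations" maps LINK_NAME -> {ARTICLE_NAME: vote count}, as extracted from
# piped links in the dump; existing_titles is the set of titles that already have
# articles. Helper names and the exact thresholds are illustrative assumptions.
MIN_VOTES = 3
AGREEMENT = 0.8  # "all or most" of the votes must agree (exact figure is a guess)

def suggest_redirects(associations, existing_titles):
    suggestions = []
    for link_name, votes in associations.items():
        total = sum(votes.values())
        target, top = max(votes.items(), key=lambda kv: kv[1])
        if (top >= MIN_VOTES
                and top >= AGREEMENT * total
                and link_name not in existing_titles   # nothing at [[LINK_NAME]] yet
                and target in existing_titles):        # target article must exist
            suggestions.append((link_name, target))
    return suggestions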
These redirect suggestions were then reviewed by humans, and if a reviewer liked one, it was added by clicking on a link (which used a GET request to give a Preview of the result, and which supplied an edit description and all the body contents). This meant that a new redirect could be added with just 2 mouse clicks, using a standard browser. (Using the exact same method today is not possible due to http://bugzilla.wikimedia.org/show_bug.cgi?id=3693 , although it is currently possible to use "Show Changes" instead of "Preview" to achieve a very similar result using GET requests.)
A series of disambiguation pages were also suggested, and these suggestions were created using the same methods, based on [[ARTICLE_NAME | LINK_NAME]] usage, but where the LINK_NAME "votes" did _not_ agree on what the target ARTICLE_NAME was. In these cases, it suggested a disambig page that basically said "LINK_NAME is either [[A]], [[B]] or [[C]]".
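The disambiguation half can be sketched in the same style (again illustrative only, not the original code):

# Companion sketch for the disambiguation suggestions: when the votes for a
# LINK_NAME split across several existing targets, propose a simple disambig page.
def suggest_disambigs(associations, existing_titles, min_targets=2):
    pages = {}
    for link_name, votes in associations.items():
        targets = sorted(t for t in votes if t in existing_titles)
        if len(targets) >= min_targets and link_name not in existing_titles:
            lines = ["'''%s''' may refer to:" % link_name]
            lines += ["* [[%s]]" % t for t in targets]
            lines.append("{{disambig}}")
            pages[link_name] = "\n".join(lines)
    return pages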
Anyone who wanted to give something like this a go (and I'm sure that in 2 years there must have been tonnes more links added, which means a lot more raw data to work with) would probably want to have a quick glance over the "Previously Rejected Suggestions" ( http://en.wikipedia.org/wiki/User:Nickj/Redirects#Previously_rejected_sugges... ) to see what people did not like previously.
Oh, and once something like this was done, you could maybe start a thesaurus directly from the redirects themselves, thus helping both the thesaurus people and the Wikipedia - win/win :-) And if you wanted to create a truly open thesaurus, you'd probably want to tag the redirects that were worthy of inclusion with something like [[Category:Thesaurus Redirect]], and you'd probably also want to tag the ones that weren't worthy of inclusion somehow too, and that way anyone could build on this data and come up with new and cool ways of using it ;-)
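As a rough illustration of how those tagged redirects could then be harvested into thesaurus entries (hypothetical category name as suggested above, and deliberately naive parsing):

# Sketch: turn redirects tagged with [[Category:Thesaurus Redirect]] into
# (synonym, canonical title) pairs. The category name is the one suggested
# above; the parsing is naive and purely illustrative.
import re

REDIRECT = re.compile(r'#REDIRECT\s*\[\[([^\[\]|#]+)', re.IGNORECASE)
THESAURUS_TAG = re.compile(r'\[\[Category:Thesaurus Redirect\]\]', re.IGNORECASE)

def thesaurus_entries(pages):
    """pages: iterable of (title, wikitext) for redirect pages."""
    for title, text in pages:
        if THESAURUS_TAG.search(text):
            m = REDIRECT.search(text)
            if m:
                yield title, m.group(1).strip()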
All the best, Nick.