Topic and category analyser


Dávid Tóth

3 Mar 2011

6:12 p.m.

Would it be useful to make a program that would create topic relations for each wikipedia article based on the links and the distribution of semantic structures?


Diederik van Liere

3 Mar

6:21 p.m.

Please elaborate. Diederik

Sent from my iPhone

On 2011-03-03, at 16:12, Dávid Tóth 900102xy@gmail.com wrote:

...

Would it be useful to make a program that would create topic relations for each wikipedia article based on the links and the distribution of semantic structures?

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Paul Houle

4 Mar

10:32 a.m.

On 3/3/2011 7:12 PM, Dávid Tóth wrote:

...

Would it be useful to make a program that would create topic relations for each wikipedia article based on the links and the distribution of semantic structures?

This would be very useful for me.

I'm thinking about attacking this problem by going after the 'low-hanging fruit' first.

To some extent you can assume that

:X :wikiLink :Y -> :X skos:related :Y

but the nature and strength of links is hard to estimate. I've developed a good metric for approximating the importance of topic :X, but I've yet to get a handle on relationship strength. To take an example, there's a link from :Metallica to :Yale_University because

:Metallica :Sued :Yale_University

That's not a very strong connection. Now, if Wikipedia mentioned the "Dead at Cornell" recording which was made when Jerry Garcia had just gotten hooked on opium and the band was playing at its best, we might say

:Grateful_Dead :PlayedAt :Cornell_University

maybe you think that's a stronger connection than the above, maybe you don't. Then again,

:Rod_Serling :TaughtAt :Ithaca_College

is one of the stronger links involving :Ithaca_College in my opinion.
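As a minimal sketch of the baseline rule above (every wiki link becomes a skos:related statement, with no attempt to estimate strength), something like the following would do. This is not code from the thread; the function name and the sample link list are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

def links_to_related(links: Iterable[Tuple[str, str]]) -> Iterator[Triple]:
    """Yield one skos:related triple per distinct link, skipping self-links."""
    seen = set()
    for source, target in links:
        if source == target or (source, target) in seen:
            continue
        seen.add((source, target))
        yield (source, "skos:related", target)

# Hypothetical input: (source article, linked article) pairs.
links = [
    (":Metallica", ":Yale_University"),
    (":Rod_Serling", ":Ithaca_College"),
    (":Metallica", ":Yale_University"),  # duplicate link, emitted only once
]
triples = list(links_to_related(links))
```

Every triple comes out with the same weight, which is exactly the limitation described above: the :Metallica and :Rod_Serling links are indistinguishable at this stage.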

There are two angles I see for extracting better relationships from Wikipedia and these are

(i) databases such as Freebase and DBpedia. In particular, these already have certain relationships semantized, along with other information that can be used to reason about which relationships are possible. For instance,

:Brown_Bear :Sued :Pelican

doesn't make any sense and should be rejected.
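One way to picture this kind of rejection is a type check on the subject and object of a candidate statement. The sketch below is only an illustration: the type labels and the signature for :Sued are invented here, and in practice they would have to come from a database such as Freebase or DBpedia.

```python
# Hypothetical type assignments; a real system would look these up.
TYPES = {
    ":Metallica": "Band",
    ":Yale_University": "Organization",
    ":Brown_Bear": "Animal",
    ":Pelican": "Animal",
}

# Hypothetical signatures: predicate -> (allowed subject types, allowed object types).
SIGNATURES = {
    ":Sued": (
        {"Person", "Band", "Organization"},
        {"Person", "Band", "Organization"},
    ),
}

def is_plausible(subject: str, predicate: str, obj: str) -> bool:
    """Accept a statement only if both ends have a type the predicate allows."""
    signature = SIGNATURES.get(predicate)
    if signature is None:
        return False  # unknown predicate: reject rather than guess
    subject_types, object_types = signature
    return TYPES.get(subject) in subject_types and TYPES.get(obj) in object_types

print(is_plausible(":Metallica", ":Sued", ":Yale_University"))
print(is_plausible(":Brown_Bear", ":Sued", ":Pelican"))
```

Entities with no known type are rejected as well, which is a conservative choice and costs recall.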

(ii) analysis of the text around a link. You will certainly see language patterns that are used frequently, for instance

"A is a B", "C married D", "E was born at F"

You could either find some of these by hand, or you could write something that uses machine learning techniques to discover them. Information of type (i) could be useful here. For instance, we could find a bunch of relationships that exist in Freebase and use these as positive training examples. The trouble I see here is the creation of a good set of negative training examples, which has a few aspects. First, examples that should be positive will slip into the negative sample. Second, attempts to automatically exclude positives will probably also exclude 'near miss' negatives that would be especially important to include in the training set. Third, negatives would generally be 1000 or more times as prevalent as positives, which gives most ML methods Bayesian priors that destroy recall.
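To make the training-set problem concrete, here is a rough sketch that labels candidate pairs against a set of known relationships and then downsamples the negatives. Downsampling is just one common workaround, not something proposed in this thread, and all names and sample data below are hypothetical.

```python
import random

def build_training_set(candidates, known_positive, neg_per_pos=5, seed=0):
    """Label candidate pairs and keep at most neg_per_pos negatives per positive.

    Pairs missing from known_positive are treated as negatives, so true
    relationships that the database lacks will leak into the negative sample.
    """
    positives = [c for c in candidates if c in known_positive]
    negatives = [c for c in candidates if c not in known_positive]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    keep = min(len(negatives), neg_per_pos * len(positives))
    sampled = rng.sample(negatives, keep)
    return [(c, 1) for c in positives] + [(c, 0) for c in sampled]

# Hypothetical data: one known positive among 1001 candidate pairs.
known = {(":Rod_Serling", ":Ithaca_College")}
candidates = [(":Rod_Serling", ":Ithaca_College")] + [
    (f":Person_{i}", ":Ithaca_College") for i in range(1000)
]
training = build_training_set(candidates, known, neg_per_pos=5)
print(len(training))
```

Random sampling does nothing to favour the 'near miss' negatives mentioned above, and it does not remove mislabelled positives; it only tames the class ratio.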

Another issue is that you'll see the patterns

"E was born at F" "[[E]] was born at F" "E was born at [[F]]" "[[E]] was born at [[F]]"

all occur (sometimes the text describing the subject is made a link, sometimes it isn't). Getting good recall then means solving the named entity extraction problem as well. However, making this part of a 'whole system' might create the kind of feedback control loop that's necessary for high-performing A.I.
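A hand-written pattern that tolerates all four linked and unlinked variants could look roughly like this. It is a sketch under strong assumptions: a mention is taken to be a run of capitalized words, piped links such as [[Target|label]] are not handled, and the brackets are not checked for balance.

```python
import re

# A mention is one or more capitalized words, optionally wrapped in [[...]].
MENTION = r"(?:\[\[)?(?P<{name}>[A-Z]\w*(?: [A-Z]\w*)*)(?:\]\])?"

BORN_AT = re.compile(
    MENTION.format(name="subject") + r" was born at " + MENTION.format(name="place")
)

def extract_born_at(text):
    """Return (subject, place) pairs for every 'X was born at Y' match."""
    return [(m.group("subject"), m.group("place")) for m in BORN_AT.finditer(text)]

variants = [
    "E was born at F",
    "[[E]] was born at F",
    "E was born at [[F]]",
    "[[E]] was born at [[F]]",
]
for text in variants:
    print(extract_born_at(text))
```

The capitalized-word heuristic is a crude stand-in for real named entity extraction, which is the hard part noted above.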

The best attack on this, I think, is to pick one particular relationship that you want to extract, particularly one that has a bit of a 'closed world' aspect in that you can presume that that property ought to exist for all members of a type. For instance, we can say that

"any person was born at some location"

but even there you can get into trouble quickly. If you look at :Joan_of_arc, you see that Wikipedia says that she was

"A peasant http://en.wikipedia.org/wiki/Peasant girl born in eastern France"

Note that "A peasant girl" == :Joan_of_arc, and that a more specific birthplace can be found in the infobox.

Alex Brollo

10:49 a.m.

2011/3/4 Paul Houle paul@ontology2.com

Briefly, at the border of OT: I see the magic word ontology in your mail address. :-) :-)

I discovered ontology ... well, it's a long story. Ontological classification is used by the National Cancer Institute to collect data on cancer; and, strange to say, I discovered it as an unexpected result of posting a picture on Commons, a low grade prostatic PIN... then I found that NCI uses SemanticWiki. In other words: from wiki, to wiki again. :-)

My aim with ontologies is much, much simpler: it's simply to create something I called "catwords", i.e. a system of categorization (the wiki system is perfect) that can also be used as a list of keywords. I can't wait for the installation of DynamicPageList on it.source, since the engine I need is simply a good method to get the intersection of categories; but I found that this is not sufficient: some peculiar conventions in categorization are needed too, though they are far from complex.... well, I'll give you news as soon as I get my tool. :-)

Alex

Platonides

3:08 p.m.

Paul Houle wrote:

...

"A peasant http://en.wikipedia.org/wiki/Peasant girl born in eastern France"

you note that "A peasant girl" == :Joan_of_arc and that a more specific birthplace can be found in the infobox.

You will find that the infoboxes are the best article pieces to mine.
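As a rough illustration of why infoboxes are easy to mine, the key/value lines of a template can be pulled out with a single regular expression. The sample wikitext below is made up for the example, not copied from the article, and the pattern ignores nested templates and values that span several lines.

```python
import re

# One "| key = value" line of an infobox template.
FIELD = re.compile(
    r"^\s*\|\s*(?P<key>[\w ]+?)\s*=\s*(?P<value>.*?)\s*$",
    re.MULTILINE,
)

def infobox_fields(wikitext):
    """Map each infobox key to its raw (still wiki-marked-up) value."""
    return {m.group("key"): m.group("value") for m in FIELD.finditer(wikitext)}

# Illustrative sample only.
SAMPLE = """{{Infobox person
| name = Joan of Arc
| birth_place = [[Domrémy]], Kingdom of France
}}"""

fields = infobox_fields(SAMPLE)
print(fields["birth_place"])
```

The value still contains link markup, so the link-handling problem from earlier in the thread comes back at this step.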


participants (5)

Alex Brollo
Diederik van Liere
Dávid Tóth
Paul Houle
Platonides