Hello,
Based on the feedback I received, I have updated my proposal page.
https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc
There are about 20 hours left before the deadline, so any final feedback would be useful. I have also submitted the proposal on the GSoC page.
Regards, Gautham Shankar
Also, a reminder for folks that this and some other proposals still need mentors.
Gautham - thank you for the updated proposal page. I would also solicit feedback in our IRC channel if you can, and connect with interested mentors: https://www.mediawiki.org/wiki/GSOC#Mentor_signup
https://www.mediawiki.org/wiki/MediaWiki_on_IRC
-Greg aka varnent
Hi Gautham,
I think mining Wiktionary is an interesting project. However, on the more practical Lucene side: at some point I tried using WordNet to expand queries, but I found that it introduces too many false positives. The most challenging part, I think, is *context-based* expansion. That is, a simple synonym-based expansion is of no use because it introduces too many meanings that the user didn't have in mind. However, if we could somehow use the words in the query to pick one meaning from a set of possible meanings, that could be really helpful.
You can look into the existing lucene-search source to see how I used WordNet. I think in the end I used it only for very obvious cases (e.g. 11 = eleven, UK = United Kingdom, etc.).
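Roughly, the naive expansion that causes the trouble looks like this (a minimal sketch using NLTK's WordNet interface as a stand-in; the actual lucene-search code is Java and differs in detail):

# Naive synonym expansion: every sense of every query term is added,
# which is exactly where the false positives come from.
# (Requires the WordNet corpus: nltk.download('wordnet').)
from nltk.corpus import wordnet as wn

def expand_naively(query):
    terms = set(query.split())
    for term in query.split():
        for synset in wn.synsets(term):  # ALL senses, not just the intended one
            for name in synset.lemma_names():
                terms.add(name.replace("_", " "))
    return terms

# "bank" alone drags in financial, riverbank, and flight-maneuver
# senses the user never had in mind.
print(sorted(expand_naively("bank charges")))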
Cheers, r.
Hi Robert,
Thank you for your feedback. As you pointed out, query expansion using the WordNet data directly reduces the quality of the search.
I found this research paper very interesting: www.sftw.umac.mo/~fstzgg/dexa2005.pdf They build a TSN (Term Semantic Network) for the given query based on the usage of words in the documents. The expansion words obtained from WordNet are then filtered based on the TSN data.
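In outline, the filtering step might look something like this (a hedged sketch, not the paper's algorithm verbatim; cooccurrence() is a hypothetical corpus statistic and the threshold is arbitrary):

from nltk.corpus import wordnet as wn

def expand_with_tsn(query, cooccurrence, threshold=0.1):
    # Keep a WordNet synonym only if the corpus says it actually
    # co-occurs with the rest of the query -- a crude stand-in for
    # the paper's Term Semantic Network.
    query_terms = query.split()
    expanded = set(query_terms)
    for term in query_terms:
        context = [t for t in query_terms if t != term]
        for synset in wn.synsets(term):
            for name in synset.lemma_names():
                candidate = name.replace("_", " ")
                # cooccurrence(a, b) -> [0, 1]: hypothetical corpus statistic
                if all(cooccurrence(candidate, c) > threshold for c in context):
                    expanded.add(candidate)
    return expanded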
I did not add this detail to my proposal since I thought it deals more with the creation of the wordnet. I would love to implement the TSN concept once the wordnet is complete.
Regards, Gautham Shankar
Hi Robert Stojnic and Gautham Shankar,
I wanted to let Gautham know that he has written a great proposal, and to thank you for the feedback as well.
I wanted to point out that, in my view, the main goal of this multilingual wordnet isn't query expansion, but rather a means to ever greater cross-language capabilities in search and content analytics. A wordnet seme can be further disambiguated by running a topic-map algorithm that considers all the contexts, as you suggest, but that is planned for later, so the wordnet would be a milestone on the way. To clarify further: Gautham's integration will place cross-language-seme wordnet tokens during indexing for words it recognises, allowing the ranking algorithm to use knowledge drawn from all the Wikipedia articles. (For example, one part of the ranking would peek at a featured article on "A" in German, rank it above "B", which is featured in Hungarian, and use them as oracles to rank A >> B >> ... in English, where the picture might otherwise have been X >> Y >> Z >> ... B >> A ...)
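To make the indexing idea concrete, here is a sketch under stated assumptions - seme_id() and the "XLS:" token prefix are made up for illustration, and the real lucene-search analyzers are Java:

def analyze(text, seme_id):
    # Emit each word plus, where the wordnet recognises it, a
    # language-independent seme token at the same position, so a
    # German and an English article on the same concept share terms.
    for position, word in enumerate(text.lower().split()):
        yield position, word
        seme = seme_id(word)  # hypothetical lookup: word -> cross-language seme
        if seme is not None:
            yield position, "XLS:%d" % seme

# Both "Afrika" (de) and "Africa" (en) would then emit the same XLS
# token, letting rankings learned in one language inform another.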
I'll mention in passing that I have begun to develop a dataset for use with Open Relevance, to systematically review and evaluate dramatic changes in relevance caused by changes to the search engine. I will post on this in due course as it matures, since I am working on a number of smaller projects I'd like to demo at Wikimania.
Hello,
Yep, generating the wordnet itself is a challenging and interesting project. I was simply commenting on the Lucene part, i.e. on a possible application.
Currently the Lucene backend works by employing some very general rules (e.g. titles get the highest score, then the first sentence of the article, then the first paragraph, then words occurring in clusters, e.g. within ~20 words, etc.). However, in many cases they fail.
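Schematically, something like this toy scorer (the weights and the proximity helper are invented for illustration; the real rules live in the lucene-search Java code):

def score(query_terms, article):
    # Toy version of the general rules: a title match beats a
    # first-sentence match beats a first-paragraph match, plus a
    # bonus when the query terms cluster within ~20 words of each
    # other in the body.
    s = 0.0
    for term in query_terms:
        if term in article["title"]:
            s += 10.0
        elif term in article["first_sentence"]:
            s += 5.0
        elif term in article["first_paragraph"]:
            s += 2.0
    if terms_cluster(query_terms, article["body"], window=20):  # hypothetical helper
        s += 3.0
    return s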
I found it helpful to run a number of queries and then see when/why the search fails to identify the most relevant article. Where WordNet is concerned, two examples come to mind which are both currently unsolved. One is a query of the type "mao last name" against an article "mao (surname)". If we are lucky, the article will have the words "last name" somewhere in its text and the search won't totally fail; still, it would be nice if the algorithm knew that "last name" == "surname". The other is a query of the type "population of africa" against an article "African population". That is, it would be helpful if the backend knew of language constructs like "x of y" == "x-an y". I wonder if a WordNet type of approach can find those cases as well.
Cheers, Robert
Hello,
Thank you, Oren, for your feedback. I would love to work on the wordnet creation if given the opportunity.
Regarding Robert's mail: yes, I believe that using a wordnet can solve the problem in both of the examples you pointed out.
In the first case, during query expansion the phrase "last name" would yield its synonyms, one of them being "surname". Thus, when the query is run, there will be a hit for the article "mao (surname)".
In the second example, the word "Africa" will be drilled down to derived words like "African". In other cases the root word will be found and searched for as well; here "Africa" is already a root word. So hopefully these expansions should solve the language-construct problems.
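Both links do in fact exist in the English WordNet; a quick check via NLTK's interface (as a stand-in for the wordnet we would build - note the Africa/African connection lives on the adjective side, as a pertainym link):

from nltk.corpus import wordnet as wn

# Case 1: "last name" and "surname" share a synset.
print(wn.synsets("last_name")[0].lemma_names())
# e.g. ['surname', 'family_name', 'cognomen', 'last_name']

# Case 2: the adjective "African" points back to the noun "Africa"
# via a pertainym ("of or relating to") link.
for lemma in wn.synsets("African", pos=wn.ADJ)[0].lemmas():
    print(lemma.pertainyms())
# e.g. [Lemma('africa.n.01.Africa')]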
Again, the key is to filter out the noise that could come from adding unwanted expansion words. For this we will have to find the relevance of the expansion words with respect to the given search query and the existing documents. Maybe the TSN concept that I pointed out in the earlier mail would help in doing so.
Regards, Gautham Shankar