[Foundation-l] LA Times article / Advertising in Wikipedia

Wed Mar 12 19:17:58 UTC 2008

Brian wrote:

>My collaborators and I have enhanced Wikipedia's search engine to use
>pseudo-relevance feedback based on the concepts discovered in a link
>analysis using Latent Dirichlet Allocation. The same engine does reranking
>using the weights assigned to hundreds of features related to the quality of
>an article, assigned by their ability to predict the quality ratings of the
>Wikipedia Editorial Team using the Random Forests classifier. Other

This sounds interesting. Is there a public test site or a paper?

>researchers have used Trust and Reputation metrics to improve Wikipedia's
>search engine. These methods produce significantly better results in terms
>of precision, recall and user happiness, but also incur a significant
>computational cost in terms of both indexing and query-time processing (with
>pseudo-relevance feedback and clustering, each search is actually several

Yes, I think no major search engine uses this kind of advanced query-time
processing, because it is simply too expensive.
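
To make the cost concrete, here is a rough Python sketch of pseudo-relevance
feedback. It is not the engine described in the quoted text: the scoring is a
toy term count, and the names search and search_with_feedback are made up for
illustration. The point is only that every user query triggers at least two
full searches.

```python
from collections import Counter


def search(query_terms, docs):
    """Toy ranking: score each document by how often the query terms occur."""
    scored = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in query_terms)
        if score:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for determinism.
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


def search_with_feedback(query_terms, docs, top_k=2, extra_terms=2):
    """Pseudo-relevance feedback: search, expand the query, search again."""
    first_pass = search(query_terms, docs)  # search no. 1
    counts = Counter()
    for doc_id in first_pass[:top_k]:
        # Assume the top hits are relevant and collect their other words.
        counts.update(w for w in docs[doc_id].lower().split()
                      if w not in query_terms)
    expansion = [word for word, _ in counts.most_common(extra_terms)]
    return search(list(query_terms) + expansion, docs)  # search no. 2


# Hypothetical four-document corpus, for illustration only.
DOCS = {
    "a": "jaguar cat habitat cat",
    "b": "jaguar cat prey",
    "c": "cat habitat",
    "d": "car engine",
}

if __name__ == "__main__":
    print(search(["jaguar"], DOCS))
    print(search_with_feedback(["jaguar"], DOCS))
```

The second call can return documents that never mention the original query
term, which is where the recall gain comes from, but it pays for a second
pass over the index on every query.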

>searches done in the background based on the query). And yet the Foundation
>can't even afford to show snippets on search results, which is the number
>one improvement in usability over a search engine that does not show them.

It is true that it's difficult to be efficient with snippets, and
Lucene's implementation is not very good: it splits the text into equal
parts, which won't find the best snippet, it cannot handle phrases, it
doesn't have an efficient storage mechanism, etc. Snippets can be
implemented efficiently with some effort, but it takes time, since I'm
currently the only person working on our Lucene wrapper. You can go to
the (slow and sometimes offline) test site http://ls2.wikimedia.org/ for
a sneak preview.
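
To illustrate why equal-sized parts miss the best snippet, here is a minimal
Python sketch that compares fixed chunks with a sliding window. It is neither
Lucene's highlighter nor the code on the test site; the function names and
the scoring (distinct query terms first, then total hits) are assumptions
chosen for the example, and it does not handle phrases either.

```python
import re


def tokenize(text):
    """Return (lowercased word, start offset, end offset) for each word."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def fixed_chunks(text, window=12):
    """Split the text into consecutive, equal-sized parts of `window` words."""
    tokens = tokenize(text)
    chunks = []
    for i in range(0, len(tokens), window):
        last = min(i + window, len(tokens)) - 1
        chunks.append(text[tokens[i][1]:tokens[last][2]])
    return chunks


def best_snippet(text, query, window=12):
    """Slide a window over the text and keep the one with the most hits."""
    terms = {word for word, _, _ in tokenize(query)}
    tokens = tokenize(text)
    if not tokens:
        return ""
    best_score, best_start = (-1, -1), 0
    for start in range(0, max(1, len(tokens) - window + 1)):
        hits = [w for w, _, _ in tokens[start:start + window] if w in terms]
        score = (len(set(hits)), len(hits))  # distinct terms, then total hits
        if score > best_score:  # strict comparison keeps the earliest window
            best_score, best_start = score, start
    chunk = tokens[best_start:best_start + window]
    return text[chunk[0][1]:chunk[-1][2]]


if __name__ == "__main__":
    sample = "a b c lucene snippet d e f"
    print(fixed_chunks(sample, window=4))
    print(best_snippet(sample, "lucene snippet", window=4))
```

In the sample, the two query words straddle a chunk boundary, so no fixed
chunk contains both, while the sliding window does. The sketch rescans every
window, so it shows the selection idea and not an efficient implementation.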

>I may be off base here, but I am under the impression that we don't
>implement basic usability improvements in our search engine, such as showing
>snippets and researched back link analysis (as simple as PageRank), because
>we can't afford it.  If that's not the case, snippets are extremely simple
>to enable in Lucene, and here's the SQL query you need to get the links out
>of a MediaWiki database for PageRank. There are lots of Java implementations
>that take a format similar to this. Being an expert in not only SQL but
>MediaWiki's database schema, you can probably optimize this query, but it
>worked for me:

Actually, we already use something similar to PageRank (i.e. backlink
count). However, keep in mind that PageRank was developed for the web:
it assumes that people can link from their sites however they want, so
it is a form of popularity measurement.
On the other hand, links in Wikipedia are dictated by how general a
certain article is, e.g. most links go to years, countries, etc. So
just "turning on" PageRank will not improve results much; it needs
to be combined with various other measures, like where in the document
the words appear (which we currently don't do, mainly because it's
expensive, but will), some context, etc.
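
As a rough illustration of the point about general articles, the Python
sketch below computes both a backlink count and a plain power-iteration
PageRank on a tiny link graph. The graph is invented for the example, the
damping factor of 0.85 is the commonly used default and not something taken
from our setup, and none of this is the production ranking code.

```python
def backlink_counts(links):
    """Count, for every page, how many distinct pages link to it."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for dst in set(targets):
            counts[dst] = counts.get(dst, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of page -> outgoing links."""
    pages = set(links) | {d for targets in links.values() for d in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = set(links.get(page, ()))
            if targets:
                share = damping * rank[page] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for dst in pages:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three biographies that all link to a country.
LINKS = {
    "Albert Einstein": ["Germany", "1905", "Max Planck"],
    "Max Planck": ["Germany", "Albert Einstein"],
    "Niels Bohr": ["Albert Einstein", "1905", "Germany"],
    "Germany": [],
    "1905": [],
}

if __name__ == "__main__":
    counts = backlink_counts(LINKS)
    ranks = pagerank(LINKS)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, counts[page], round(ranks[page], 3))
```

In this toy graph the country article comes out on top under both measures
simply because everything links to it, which is why neither score is much
use on its own without the other signals mentioned above.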

>Having thought it out, I don't think it's true that offloading searches in
>the main article namespace to a third party service provider is in
>contradiction to developing an open tool that is imperfect and can be
>improved to everyone's benefit. The people that use the search engine for
>finding article content will benefit from Google's usability, and the expert
>users that are interested in searching other namespaces will benefit from
>the Lucene based open tool.
>  
>
I think Brion already answered this one...

r.