[WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

Brian Brian.Mingus at colorado.edu
Sat Aug 22 18:25:08 UTC 2009


On Sat, Aug 22, 2009 at 12:24 PM, Brian <Brian.Mingus at colorado.edu> wrote:

> On Sat, Aug 22, 2009 at 12:05 PM, Gwern Branwen <gwern0 at gmail.com> wrote:
>
>>
>> I tried this out the other day; it's a very cool idea, but by and
>> large, it seems that this hacker doesn't have enough CPU power to
>> extract the really good wikilinks, the ones that aren't already linked
>> inside the article. (e.g., if I try it on [[Encyclopedia of the Brethren
>> of Purity]], I have to go all the way down to find a suggestion which
>> isn't already linked by the article.)
>>
>> Perhaps in a decade we'll have enough computing power on the servers
>> that this could be a plugin - we'd then have auto-generated See Alsos,
>> which would be really cool.
>>
>> --
>> gwern
>>
>
> A technique called Latent Dirichlet Allocation can be used to find related
> articles that aren't already linked from the document itself. I did this for
> a class project. Here is an excerpt from the paper, which also shows the
> latent connections it found for the Simple article on hippies.
>
> http://upload.wikimedia.org/wikipedia/meta/2/25/LDA-Wiki-Search.png
>
> I note that Google has released parallel LDA, so it's not feasible to run it
> on all of Wikipedia using an ordinary Beowulf cluster.
> http://code.google.com/p/plda/
>

* Correction: that should read "now feasible", not "not feasible".
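
To make the LDA idea above a bit more concrete, here is a minimal sketch of
how topic mixtures could be used to suggest "See also" candidates that an
article does not already link to. This is not the code from the class project
or from plda. It assumes a tiny made-up corpus, hypothetical helper names
(fit_lda, suggest_see_also), hand-picked hyperparameters, collapsed Gibbs
sampling as the inference method, and cosine similarity between topic
mixtures as the relatedness score.

```python
import math
import random

def fit_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-document topic mixtures."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(n_topics)]   # times word v is assigned to topic k
    topic_total = [0] * n_topics                      # tokens assigned to topic k overall
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed topic mixture for each document; each row sums to 1.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def suggest_see_also(title, titles, theta, links, top_n=3):
    """Rank other articles by topic similarity, skipping ones already linked."""
    i = titles.index(title)
    already = links.get(title, set())
    candidates = [
        (cosine(theta[i], theta[j]), other)
        for j, other in enumerate(titles)
        if other != title and other not in already
    ]
    return [t for _, t in sorted(candidates, reverse=True)[:top_n]]

# Made-up toy corpus (assumption): article titles, tokenised text, existing wikilinks.
titles = ["Hippie", "Woodstock", "Counterculture", "Beowulf cluster", "Parallel computing"]
docs = [
    "hippie peace music festival youth movement".split(),
    "music festival rock concert peace youth".split(),
    "youth movement peace protest culture music".split(),
    "cluster computer node network linux parallel".split(),
    "parallel computer processor node network cluster".split(),
]
links = {"Hippie": {"Woodstock"}}

theta = fit_lda(docs, n_topics=2)
print(suggest_see_also("Hippie", titles, theta, links))
```

On a corpus the size of Wikipedia a single-threaded sampler like this would be
far too slow, which is where a parallel implementation comes in.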

