[Foundation-l] LA Times article / Advertising in Wikipedia

Wed Mar 12 18:43:21 UTC 2008

My collaborators and I have enhanced Wikipedia's search engine to use
pseudo-relevance feedback based on the concepts discovered in a link
analysis using Latent Dirichlet Allocation. The same engine reranks results
using weights on hundreds of features related to article quality, learned
from how well each feature predicts the quality ratings of the Wikipedia
Editorial Team under a Random Forests classifier. Other
researchers have used Trust and Reputation metrics to improve Wikipedia's
search engine. These methods produce significantly better results in terms
of precision, recall and user happiness, but also incur a significant
computational cost in terms of both indexing and query-time processing (with
pseudo-relevance feedback and clustering, each search is actually several
searches done in the background based on the query). And yet the Foundation
can't even afford to show snippets on search results, which is the number
one improvement in usability over a search engine that does not show them.
How can you say that Wikipedia's search engine is open to further
development when the ranking techniques it uses, which are straight out of
1995, would get you a C- in an information retrieval course, and better
technologies are already available? The real limiting factor here is that
we cannot afford to run as many searches as we do while providing an
acceptable level of usability.
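
To make the query-time cost concrete, here is a minimal sketch of
pseudo-relevance feedback in Python. It is not the code of the engine
described above: the toy corpus DOCS, the term-count scoring in search(),
and the parameters top_k and extra_terms are all assumptions made for
illustration. The only point is that one user query turns into two passes
over the index.

```python
from collections import Counter

# Toy corpus standing in for indexed articles (hypothetical data).
DOCS = {
    "Jaguar": "jaguar big cat feline predator rainforest",
    "Jaguar Cars": "jaguar british car manufacturer luxury car",
    "Leopard": "leopard big cat feline predator savanna",
    "Rainforest": "rainforest tropical forest habitat predator",
}

def search(terms, docs=DOCS):
    """Rank titles by the number of query-term occurrences in each document."""
    scores = {}
    for title, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in terms)
        if score:
            scores[title] = score
    # Highest score first; ties broken alphabetically so output is stable.
    return sorted(scores, key=lambda t: (-scores[t], t))

def feedback_search(query, top_k=2, extra_terms=3):
    """Search once, expand the query from the top hits, then search again."""
    terms = query.lower().split()
    first_pass = search(terms)[:top_k]          # background search no. 1
    pool = Counter()
    for title in first_pass:
        pool.update(w for w in DOCS[title].split() if w not in terms)
    ranked = sorted(pool.items(), key=lambda kv: (-kv[1], kv[0]))
    expansion = [word for word, _ in ranked[:extra_terms]]
    return search(terms + expansion), expansion  # background search no. 2

results, expansion = feedback_search("big cat")
print(expansion)
print(results)
```

In this toy run the expanded query also retrieves pages that never matched
the original terms, which is where the recall gain comes from, at the price
of a second search per request.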

You mention that the search engine is useful for its ability to search in
all namespaces. What percentage of searches on Wikipedia's search engine use
any of the other namespaces (i.e., the non-default ones)? Is this number less
than 1%? Likely. Is using Google's search engine for article namespace
searches in contradiction with using the Lucene-based search engine for
searches in other namespaces? Why can't they peacefully coexist?

I may be off base here, but I am under the impression that we don't
implement basic usability improvements in our search engine, such as showing
snippets and well-studied backlink analysis (something as simple as
PageRank), because we can't afford it. If that's not the case, snippets are
extremely simple to enable in Lucene, and here's the SQL query you need to
get the links out of a MediaWiki database for PageRank. There are lots of
Java implementations that take a format similar to this. Being an expert in
not only SQL but also MediaWiki's database schema, you can probably optimize
this query, but it worked for me:

SELECT pl_from, GROUP_CONCAT(DISTINCT page_id ORDER BY page_id)
FROM pagelinks
INNER JOIN page
  ON pagelinks.pl_title = page.page_title
  AND pagelinks.pl_namespace = page.page_namespace
GROUP BY pl_from;
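
Rows of this shape (a source page id plus a comma-separated list of target
page ids) can be fed straight into a PageRank computation. What follows is
a rough sketch in Python, not any particular existing implementation: the
function names parse_rows and pagerank, the damping factor of 0.85, the
fixed iteration count, and the three sample rows are all assumptions chosen
for illustration.

```python
def parse_rows(rows):
    """Turn (pl_from, 'id,id,...') rows into {source_id: [target_id, ...]}."""
    return {int(src): [int(t) for t in targets.split(",") if t]
            for src, targets in rows}

def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over an adjacency dict."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical query output: page 1 links to 2 and 3, 2 to 3, 3 to 1.
rows = [(1, "2,3"), (2, "3"), (3, "1")]
ranks = pagerank(parse_rows(rows))
for page_id in sorted(ranks, key=ranks.get, reverse=True):
    print(page_id, round(ranks[page_id], 4))
```

One caveat when producing the input: MySQL truncates GROUP_CONCAT results
at group_concat_max_len, which defaults to 1024 bytes, so that setting
would need to be raised for pages with many outgoing links.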

Having thought it out, I don't think it's true that offloading searches in
the main article namespace to a third party service provider is in
contradiction to developing an open tool that is imperfect and can be
improved to everyone's benefit. The people who use the search engine for
finding article content will benefit from Google's usability, and the expert
users who are interested in searching other namespaces will benefit from
the Lucene-based open tool.

On Wed, Mar 12, 2008 at 11:38 AM, Brion Vibber <brion at wikimedia.org> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Brian wrote:
> | If we were to make a deal with Google to use their site search by default
> | as the search engine on Wikipedia, we would not only make boatloads of
> | money, but we would save money for not rendering that page billions of
> | times a year.
> |
> | Not to mention that our fairly default installation of Lucene is pretty
> | much awful. What exactly does "Relevance" mean? What about the article was
> | relevant? Presumably showing snippets is so computationally expensive when
> | done billions of times that we can't afford it. We've got a little bit of
> | link analysis in there, but Google's algorithms do a much better job, as
> | they know not only Wikipedia's internal structure, but how it fits in with
> | the rest of the web. Using Google's search engine instead of our stripped
> | down Lucene would be an improvement in usability, make us money and save
> | us money.
>
> Lucene is a low-level indexing library, not a search engine -- the
> search engine we've built around it is very much customized, and a lot
> of new development on it is still ongoing.
>
> Google's general purpose web search will likely never be able to include
> Wikipedia-specific features such as searching for template invocations
> and category intersections -- opportunities we have as long as we're in
> control.
>
> More generally, Wikimedia has a strong commitment to make all our tools
> available to the world for use, reuse, and further development,
> strengthening the public infrastructure with open source software. By
> handing off all responsibility for search to a highly secretive,
> proprietary company, we'd be abandoning that responsibility.
>
> Given the choice of an open tool which is imperfect, but can be improved
> to everyone's benefit, and a closed tool which is pretty good, but is
> kept under lock and key, our mission requires us to choose the open tool.
>
> - -- brion vibber (brion @ wikimedia.org)
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.8 (Darwin)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iEYEARECAAYFAkfYFQQACgkQwRnhpk1wk47rUwCfVaTTkjnKRKcIp25b1s98iBSL
> KkcAnRVMk3qK9DqXnyhm25GDfaYA88/p
> =C58g
> -----END PGP SIGNATURE-----