[Wikide-l] [PRESS] Q&A With Jimmy Wales On Search Wikia

Mathias Schindler mathias.schindler at gmail.com
Sat Dec 30 00:50:03 UTC 2006


http://searchengineland.com/061229-193718.php
Dec. 29, 2006 at 7:37pm Eastern
Q&A With Jimmy Wales On Search Wikia

News came out earlier this week that Wikipedia cofounder Jimmy Wales
had a new project in mind, to build a community-driven "Google-killer"
search engine. I've just finished talking with Jimmy about his plans.
Here's a rundown on his vision and what may come as his Search Wikia
project grows over the course of the next year or two.

Note that in the Q&A, I've had to recreate my questions as best I
remember asking them. I was focused more on getting down Jimmy's
responses.

Q. Since the news emerged, there's been some confusion about Amazon
and Wikipedia in relation to the Search Wikia project. What's the
situation?

We recently completed a funding round with Amazon [for Wikia], but
other than that, they don't have anything to do with the search
project. [The project] is a Wikia project [the for-profit company that
Wales is chairman of], not a Wikipedia project [the separate
community-driven encyclopedia he co-founded].

Q. Was the search project formally announced, or did the Search Wikia
site come online as a result of The Times article discussing it?

It was a combination of both. I've been working on this for a long
time. We didn't actually intend to announce per se just yet, but, me
and my big mouth, the reporter asked me if I ever thought about
search.

Q. It's been said the search engine would launch in the first quarter
of 2007. That's fast. Is that really just when you expect active
development work to begin?

During Q1, we're going to set up a project to get developers involved
with building the site, writing the code and getting the search engine
going. We're going to rely initially on Nutch and Lucene [related
open-source search software that's been developed over the past few
years].

We'll start from scratch on how to apply the Wikipedia principles to
keep it as simple as possible and move forward.

It's just the development starting. We're not producing a
Google-killing search engine in three months. I only wish I were that
good of
a programmer.

We'll have some servers open, some development, maybe a pre-pre-alpha
demo site up. We'd really anticipate it would be a year or two until
we're able to launch a viable search engine.
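
[NOTE: For readers unfamiliar with the software Wales mentions, Lucene
is the open-source Java library that handles indexing and ranking, and
Nutch builds a web crawler on top of it. Below is a minimal,
hypothetical sketch of indexing two fetched pages and querying them
with Lucene; the URLs and text are invented, and exact class and
method names vary across Lucene versions.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class TinySearch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();  // in-memory index
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index two pages a crawler such as Nutch might have fetched.
            try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                String[][] pages = {
                    {"http://example.com/ford-motor",
                     "Ford Motor Company builds cars"},
                    {"http://example.com/gerald-ford",
                     "Gerald Ford was a US president"},
                };
                for (String[] page : pages) {
                    Document doc = new Document();
                    doc.add(new StringField("url", page[0], Field.Store.YES));
                    doc.add(new TextField("body", page[1], Field.Store.YES));
                    writer.addDocument(doc);
                }
            }

            // Run the ambiguous query "ford" against the body field.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits = searcher.search(
                    new QueryParser("body", analyzer).parse("ford"),
                    10).scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println(searcher.doc(hit.doc).get("url"));
                }
            }
        }
    }

The point is that this much really is commodity software; the hard
part Wales describes is everything layered around it.]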

Q. How do you see this improving on what's out there?

There are a lot of things we've learned in the wiki world about how to
get communities involved and engaged in building trusted networks.

A lot of the people who have tried to do this in the past have
stumbled not on technical issues but on community issues ... dmoz [The
Open Directory] was too closed ... that was their response because of
the pressure of spammers ... others have thought in terms of ranking
algorithms. That's not the right approach. The right approach allows
for open dialog and debate and discussion.

Q. How do you envision the community participating? Will they be
selecting sites? Will this leverage material in Wikipedia? Will they
rate sites?

This will be completely independent of Wikipedia.

Exactly how people can be involved is not yet certain. If I had to
speculate, I would say it's several of those things: not just the
community rating URLs but also rating whole web sites, deciding what
to include or not to include, and also the whole algorithm ... That's
a human-type process where we can empower people to guide the spider.

Q. Do you see humans reviewing the most popular queries, perhaps
picking the right answers to come up?

Part of it might be a human review of queries. For the narrow subset
of the really popular queries, I think it's important to apply humans
... If someone types Ford Motor Company, there is a correct answer
for that. There's no reason to beat our brains out to train our
algorithm to do that.

Q. Search engines have actually gotten much better over time with
these types of navigational requests. You don't need humans so much to
make sure the right answer shows up.

Those kinds are not too difficult. The harder one is if you type ford:
did you mean President Ford or the Ford Motor Company?
That's the type of thing where human disambiguation pages like we have
at Wikipedia are helpful.

Q. Search engines already do a lot of this type of stuff. Ask has its
Zoom suggestions, others have clusterings or related searches. Do you
imagine people being forced to make a query refinement choice before
they actually get search results?

If you type ford, you should get some disambiguation terms that humans
have collected, then some search results ... This is one of the places
where I think human intelligence is most important.

[NOTE: For more on query refinement, see some of my past posts such as
Robert Scoble Wants What We Had -- Better Query Refinement. So Do I!,
Hello Natural Language Search, My Old Over-Hyped Search Friend and Why
Search Sucks & You Won't Fix It The Way You Think. The first link in
particular discusses how Microsoft used to have disambiguation created
by editors very similar to what Wales hopes to recreate. Sadly, it was
killed in the quest to chase Google on the algorithmic front.]
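
[NOTE: Mechanically, the human layer Wales describes could be as
simple as a curated table consulted before the index: exact matches on
popular queries return a hand-picked answer or a list of
disambiguation entries, and everything else falls through to the
ranking algorithm. A hypothetical Java sketch, with invented entries:

    import java.util.List;
    import java.util.Map;

    public class CuratedLayer {
        // Hand-picked "correct answers" for popular navigational queries.
        static final Map<String, String> BEST_ANSWER = Map.of(
            "ford motor company", "http://www.ford.com/");

        // Human-collected disambiguation entries for ambiguous queries.
        static final Map<String, List<String>> DISAMBIGUATION = Map.of(
            "ford", List.of("Ford Motor Company",
                            "Gerald Ford (US president)"));

        static void handle(String query) {
            String q = query.toLowerCase().trim();
            if (BEST_ANSWER.containsKey(q)) {
                System.out.println("Curated top result: "
                    + BEST_ANSWER.get(q));
            } else if (DISAMBIGUATION.containsKey(q)) {
                System.out.println("Did you mean: " + DISAMBIGUATION.get(q));
            }
            System.out.println("...then algorithmic results for: " + q);
        }

        public static void main(String[] args) {
            handle("ford");                // disambiguation terms first
            handle("Ford Motor Company");  // the curated answer first
        }
    }

The open question, of course, is who edits the table and how disputes
get resolved, which is exactly where Wales says wiki experience
applies.]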

Q. Are you planning to crawl the entire web, billions and billions of
pages? Or will you go after a subset of important ones?

The number of pages is yet to be determined. Obviously we won't be
doing that initially [gathering everything], but we'll invest in the
hardware. Not to belittle the investment required to do a full crawl
of the web on a regular basis, but I think it's fairly commoditized.

Q. Crawling is one thing. Serving up millions of queries per day is an
entire other issue. Wikipedia handles a lot of traffic, but not at a
Google scale. How's it going with that?

The traffic's not too bad. Servers are getting more and more powerful.
Bandwidth is getting cheaper. It's all pretty much off the shelf. It's
pretty efficient.

Q. Will you be selling ads, and if so, how will that work?

There are no immediate plans to sell ads, so for now we're not too
focused on that. If we don't build something useful, selling ads on it
is sort of a moot point.

Q. Why do this at all? What do you see wrong with search?

For certain types of searches, search engines are very good. But I
still see major failures, where they aren't delivering useful results.
At a deeper, almost political level, I think it's important that we as
a global society have some transparency in search. What are the
algorithms involved? What are the reasons why one site comes up over
another one? [Wales also raised the issue of how ads might
influence regular listings, perhaps search engines trying to keep
commercial sites out of the free listings to make money. From there,
he went on....] Those types of incentives are problematic in search.
The only solution I know to that is to be transparent.

Q. How are you going to keep the community from being gamed? Wikipedia
is very good at keeping out spam, but it's not perfect. And despite
its size, it's dealing with far fewer topics than unique searches that
will happen on any particular day. How do you police all those
searches?

You have to recognize the difference between the way community is
often used on the internet, which is shorthand for millions of people
clicking on some stuff as compared to community in the wiki world,
which is people who actually know each other.

It's one thing to say if you have millions of spammers out there
trying to game and trick an algorithm ... But it's not the number of
queries, it's the web sites themselves. A lot of numbers are thrown
about for sites on the web, but the number of legitimate pages that
are not coming from affiliate sites and spammers is a much more finite
number. It's much easier for a community to ban the bad stuff.
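
[NOTE: A concrete reading of "ban the bad stuff" would be a
community-edited list of banned domains applied as a final filter over
ranked results, with additions and removals debated wiki-style. A
hypothetical sketch, with domain parsing simplified:

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class CommunityFilter {
        // Drop any result whose domain is on the community blocklist.
        static List<String> filter(List<String> rankedUrls,
                                   Set<String> bannedDomains) {
            return rankedUrls.stream()
                .filter(url -> !bannedDomains.contains(domainOf(url)))
                .collect(Collectors.toList());
        }

        // Crude domain extraction, good enough for a sketch.
        static String domainOf(String url) {
            return url.replaceFirst("^https?://", "").split("/")[0];
        }

        public static void main(String[] args) {
            System.out.println(filter(
                List.of("http://spam-farm.example/cheap-pills",
                        "http://real-review.example/tampa-hotels"),
                Set.of("spam-farm.example")));
        }
    }

Everything interesting lives in how that set gets maintained, which is
the community question the next exchange gets into.]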

Q. But what if someone gets into a "good" domain? We've had cases
where bad content gets shoved into "trusted" sites or even places like
university sites. Do you ban those entire domains? How do they get
back in?

At Wikipedia, we'd have a big discussion. [Wales then explained that
people might realize a domain had done something accidentally wrong or
without thinking about spam issues and so might be allowed back in.]

Q. You probably already search a lot, mostly with Google. Is
it not finding what you want already most of the time, without a flood
of spam or crud in your way?

Usually I'm looking for pages on Wikipedia, so they do a good job with
that. It depends on the types of searches you are doing. If you're
doing a factual search, then Wikipedia [in the results] would be good.
In other areas, I think there's a strong commercial incentive. Why is
it bad if I search for tampa hotels?

[NOTE: I then did this search on Google, which we discussed. I noted I
saw plenty of good hotels listed, and that if I clicked through to the
local search results, I got an even better list of hotels.

Wales replied that he's often after reviews of hotels, not the hotels
themselves. That took me back to the original results, where I pointed
out the top listing was from TripAdvisor, exactly the type of review
site he mentioned liking -- and that I often found them listed on
these types of queries.

I also noted that Google even offers refinement categories at the top
of the page similar to the disambiguation he wanted, with lodging
guides as one of the categories. Unfortunately for Google, I didn't
find that the results from that refinement did a good job bringing
back trusted hotel guides.]

Q. Back to transparency. People keep saying they want more of this.
But can you name some exact examples of what you want to see? Do you
want Google to say that using a term in bold text adds X percent of a
score to the ranking criteria? And if you do that, don't you think
spammers will just abuse the recipe that's been published?

If your search relies on some secret factors that you hope people
won't discover, you haven't really come up with a good solution to the
problem.

Q. Microsoft has spent millions of dollars and years of effort trying
to be a Google killer and hasn't made it. You're coming into
this fresh with fewer resources and no real prior experience. Can you
really do it?

I have no idea. I only do whatever sounds like it is fun.

Q. What type of funding do you have behind this?

Wikia's initial round was $4 million from a variety of angels; then
there was a second round from Amazon, but the amount wasn't announced.

Closing Comments

When I first heard of the plans, I was pretty dubious the project
would have much success. For one thing, the idea of an "open source"
search engine taking on the world and providing more transparency is
old news. Consider this from back when Nutch first came out, from New
Scientist in 2003:

    The project "is about providing free technology that should not be
controlled by private, commercial, secretive organisations," says Doug
Cutting, a veteran web search engineer and a Nutch founder.

Three years on, nothing has really changed despite the reasoning behind
such a project being the same. And this was despite Nutch having some
big names behind it.

In 2004, Nutch got another round of attention in an ACM article
looking at how it works. My comment at that time was:

    Interesting read especially for the efforts that are involved to
defeat spam. The argument is that though Nutch is open, revealing
secrets won't hurt because spammers will batter down any defenses, no
matter how tightly protected. OK, so what will stop spam? Nutch hopes
that an open, public discussion may reveal new methods. Perhaps. But
the real test will only come if Nutch is deployed by a major,
highly-trafficked site. Spammers aren't going to bother trying the
defenses of other places. It's not worth the time. That's also a
positive for those considering Nutch. If you operate a small, vertical
site or just want Nutch to be used on your own content, then spam
concerns are much less of an issue.

The spam test simply hasn't happened with Nutch. And every new search
engine project I've seen come along over the years completely
underestimates the spam problem it faces. When I looked at the Search
Wikia site, comments like this almost seemed laughable:

    search actively for spammer sites

    * trying to simulate user typos (i.e. "yaoho.com" rather than
"yahoo.com"); see also: Microsoft's URL Tracer

    * blacklist domains that spam mails link to; actively create
honeypots to attract spam; use a pattern like
<domain-where-we-have-registered>@myhoneypod.com to identify the spam
networks; should the common user get the possibility to register such
a mail address?

Seek out the spam sites? Hey, don't worry -- if you're popular,
they'll find you fast enough. And as you blacklist one, two more
throwaway domains will show up in their place.
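
[NOTE: For what it's worth, the typo-simulation idea in the quoted
notes is at least mechanical. A hypothetical sketch that generates
adjacent-character swaps of a brand name, the kind of variants (like
"yaoho.com") a typo-squatter registers:

    import java.util.LinkedHashSet;
    import java.util.Set;

    public class TypoVariants {
        // Adjacent-character transpositions of a name, e.g. "yahoo" ->
        // ayhoo, yhaoo, yaoho. Real typo models also cover omissions,
        // doubled letters and neighboring-key substitutions.
        static Set<String> transpositions(String name) {
            Set<String> variants = new LinkedHashSet<>();
            char[] c = name.toCharArray();
            for (int i = 0; i + 1 < c.length; i++) {
                char tmp = c[i]; c[i] = c[i + 1]; c[i + 1] = tmp;  // swap
                String v = new String(c);
                if (!v.equals(name)) variants.add(v);  // skip no-ops ("oo")
                tmp = c[i]; c[i] = c[i + 1]; c[i + 1] = tmp;       // undo
            }
            return variants;
        }

        public static void main(String[] args) {
            for (String v : transpositions("yahoo")) {
                System.out.println(v + ".com");  // candidates to check
            }
        }
    }

Generating the candidates is trivial; knowing what to do about the
millions of domains that match is the actual problem.]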

I also tend to think Wales is completely underestimating how crawling
a big chunk of the web, keeping those pages fresh, ranking them
quickly to provide answers and doing so for millions of queries each
day isn't an
off-the-shelf commodity.

Still, I find myself oddly hopeful. I don't think a Google killer will
emerge, but perhaps some new ways for a community to be involved with
search will come out of it. I wouldn't have thought Wikipedia would
work. Certainly it's flawed, but it's also an incredible resource.
Maybe something useful will come from the Search Wikia project.

At the very least, I've long wanted humans to be back in the role of
reviewing queries and actually looking to see if they make sense,
rather than so much reliance on algorithms. Maybe the mere concept of
the Search Wikia project will encourage the major search engines to do
more in this area.