If everything goes as planned, tomorrow's issue of Focus will carry a
somewhat longer article about Wikipedia. At the airport in Milano there
was an older issue in which this article is listed among the three main
topics of the coming week. Last week Focus interviewed and photographed
several people.
In other news:
Just discovered: Orhan Pamuk's article in the Turkish-language
Wikipedia and the story of how it came about. Looks as if the admins
have quite a bit of work....
What really makes me sick are these totalitarian types around here like
schindler, whom one could easily picture under fascism. As soon as even
the slightest criticism of the authorities is voiced, or the trigger
word democracy comes up (a more or less accepted principle even in the
CSU), the big swatter is brought out.
Quod omnes tangit, ab omnibus approbari debet.
This piece of wisdom does not have its origin in a democracy.
Dec. 29, 2006 at 7:37pm Eastern
Q&A With Jimmy Wales On Search Wikia
News came out earlier this week that Wikipedia cofounder Jimmy Wales
had a new project in mind, to build a community-driven "Google-killer"
search engine. I've just finished talking with Jimmy about his plans.
Here's a rundown on his vision and what may come as his Search Wikia
project grows over the course of the next year or two.
Note that in the Q&A, I've had to recreate my questions as best I
remember asking them. I was focused more on getting down Jimmy's
Q. Since the news emerged, there's been some confusion about Amazon
and Wikipedia in relation to Search Wikia project. What's the
We recently completed a funding round with Amazon [for Wikia], but
other than that, they don't have anything to do with the search
project. [The project] is a Wikia project [the for-profit company that
Wales is chairman of], not a Wikipedia project [the separate
community-driven encyclopedia he co-founded].
Q. Was the search project formally announced, or did the Search Wikia
site come online as a result of The Times article discussing it?
It was a combination of them both. I've been working on this for a
long time. We didn't actually intend to announce per se just yet, but
me and my big mouth, the reporter asked me if I ever thought about
Q. It's been said the search engine would launch in the first quarter
of 2007. That's fast. Is that really just when you expect active
development work to begin?
During Q1, we're going to set up a project to get developers involved
with building the site, writing the code and getting the search engine
going. We're going to rely initially on Nutch and Lucene [related
open-source search software that's been developed over the past few
We'll start from scratch on how to apply the Wikipedia principles to
keep it as simple as possible and move forward.
It's just the development starting. We're not producing a Google
killing search engine in three months. I only wish I were that good of
We'll have some servers open, some development, maybe a pre-pre-alpha
demo site up. We'd really anticipate it would be a year or two until
we're able to launch a viable search engine.
Q. How do you see this improving on what's out there?
There are a lot of things that we've learned in the wiki world on how
to get communities involved and engaged to build trusted networks in
A lot of the people who have tried to do this in the past have
stumbled not on technical issues but on community issues ... dmoz [The
Open Directory] was too closed ... that was their response because of
the pressure of spammers ... others have thought in terms of ranking
algorithms. That's not the right approach. The right approach allows
for open dialog and debate and discussion.
Q. How do you envision the community participating? Will they be
selecting sites? Will this leverage material in Wikipedia? Will they
This will be completely independent of Wikipedia.
Exactly how people can be involved is not yet certain. If I had to
speculate about it, I would say it's several of those things, not just
community involved with rating URLs but also community rating for
whole web sites, what to include or not to include and also the whole
algorithm ... That's a human type process that we can empower people
to guide the spider.
Q. Do you see humans reviewing the most popular queries, perhaps
picking the right answers to come up?
Part of it might be a human review of queries. For the narrow subset
of the really popular queries, I think it's important to apply humans
.... if someone types Ford Motor Company, there is a correct answer
for that. There's no reason to beat our brains out to train our
algorithm to do that.
Q. Search engines have actually gotten much better over time with
these types of navigational requests. You don't need humans so much to
make sure the right answer shows up.
Those kinds are not too difficult. The harder one is if you type ford,
did you mean President Ford or do you mean the Ford Motor Company?
That's the type of thing where human disambiguation pages like we have
at Wikipedia are helpful.
Q. Search engines already do a lot of this type of stuff. Ask has its
Zoom suggestions, others have clusterings or related searches. Do you
imagine people being forced to make a query refinement choice before
they actually get search results?
If you type ford, you should get some disambiguation terms that humans
have collected, then some search results....this is one of the places
where I think human intelligence is most important.
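
A minimal sketch of the flow Wales describes here, with human-collected
disambiguation terms shown first and ordinary search results after them.
Everything in it is an assumption for illustration: the DISAMBIGUATION
table, the ResultPage type and the algorithmic_search placeholder are
hypothetical names, not anything Search Wikia has published.

```python
from dataclasses import dataclass, field

# Hypothetical hand-curated table: ambiguous query -> meanings collected by people.
DISAMBIGUATION = {
    "ford": ["Ford Motor Company", "President Ford"],
}


@dataclass
class ResultPage:
    query: str
    refinements: list[str] = field(default_factory=list)  # human-collected terms
    results: list[str] = field(default_factory=list)      # algorithmic results


def algorithmic_search(query: str) -> list[str]:
    """Placeholder standing in for a crawler-backed ranking function."""
    return [f"result for {query!r}"]


def search(query: str) -> ResultPage:
    """Return curated disambiguation terms (if any) alongside normal results."""
    key = query.strip().lower()
    return ResultPage(
        query=query,
        refinements=list(DISAMBIGUATION.get(key, [])),
        results=algorithmic_search(key),
    )


if __name__ == "__main__":
    page = search("Ford")
    print(page.refinements)
    print(page.results)
```

Note that the searcher is not forced to pick a refinement first: queries
with no curated entry simply come back with an empty refinement list and
the usual results.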
[NOTE: For more on query refinement, see some of my past posts such as
Robert Scoble Wants What We Had -- Better Query Refinement. So Do I!,
Hello Natural Language Search, My Old Over-Hyped Search Friend and Why
Search Sucks & You Won't Fix It The Way You Think. The first link in
particular discusses how Microsoft used to have disambiguation created
by editors very similar to what Wales hopes to recreate. Sadly, it was
killed in the quest to chase Google on the algorithmic front.]
Q. Are you planning to crawl the entire web, billions and billions of
pages? Or will you go after a subset of important ones?
The number of pages is yet to be determined. Obviously we won't be
doing that initially [gathering everything], but we'll invest in the
hardware. Not to belittle the investment required to do a full crawl
of the web on a regular basis, but I think it's fairly commoditized.
Q. Crawling is one thing. Serving up millions of queries per day is an
entire other issue. Wikipedia handles a lot of traffic, but not at a
Google scale. How's it going with that?
The traffic's not too bad. Servers are getting more and more powerful.
Bandwidth is getting cheaper. It's all pretty much off the shelf. It's
Q. Will you be selling ads, and if so, how will that work?
There are no immediate plans to sell ads, so for now we're not too
focused on that. If we don't build something useful, selling ads on it
is sort of a moot point.
Q. Why do this at all? What do you see wrong with search?
For certain types of searches, search engines are very good. But I
still see major failures, where they aren't delivering useful results.
I think at a deeper almost political level, I think it's important
that we as a global society have some transparency in search. What are
the algorithms involved? What are the reasons why one site comes up
over another one? [Wales also raised the issue of how ads might
influence regular listings, perhaps search engines trying to keep
commercial sites out of the free listings to make money. From there,
he went on....] Those types of incentives are problematic in search.
The only solution I know to that is to be transparent.
Q. How are you going to keep the community from being gamed? Wikipedia
is very good at keeping out spam, but it's not perfect. And despite
its size, it's dealing with far fewer topics than unique searches that
will happen on any particular day. How do you police all those
You have to recognize the difference between the way community is
often used on the internet, which is short hand for millions of people
clicking on some stuff as compared to community in the wiki world,
which is people who actually know each other.
It's one thing to say if you have millions of spammers out there
trying to game and trick an algorithm .... but it's not the number of
queries. It's the web sites themselves. A lot of numbers are thrown
about for sites on the web, but the number of legitimate pages that
are not coming from affiliate sites and spammers is a much more finite
number. It's much easier for a community to ban the bad stuff.
Q. But what if someone gets into a "good" domain? We've had cases
where bad content gets shoved into "trusted" sites or even places like
university sites. Do you ban those entire domains? How do they get
At Wikipedia, we'd have a big discussion. [Wales then explained that
people might realize a domain had done something accidentally wrong or
without thinking about spam issues and so might be allowed back in.]
Q. You probably already search a lot, probably mostly with Google. Is
it not finding what you want already most of the time, without a flood
of spam or crud in your way?
Usually I'm looking for pages on Wikipedia, so they do a good job with
that. It depends on the types of searches you are doing. If you're
doing a factual search, then Wikipedia [in the results] would be good.
In other areas, I think there's a strong commercial incentive. Why is
it bad if I search for tampa hotels?
[NOTE: I then did this search on Google, which we discussed. I noted I
saw plenty of good hotels listed, and that if I clicked through to the
local search results, I got an even better experience of hotels
Wales replied that he's often after reviews of hotels, not the hotels
themselves. That took me back to the original results, where I pointed
out the top listing was from TripAdvisor, exactly the type of review
site he mentioned liking -- and that I often found them listed on
these types of queries.
I also noted that Google even offers refinement categories at the top
of the page similar to the disambiguation he wanted, with lodging
guides as one of the categories. Unfortunately for Google, I didn't
find that the results from that refinement did a good job bringing
back trusted hotel guides.]
Q. Back to transparency. People keep saying they want more of this.
But can you name some exact examples of what you want to see? Do you
want Google to say that using a term in bold text adds X percent of a
score to the ranking criteria? And if you do that, don't you think
spammers will just abuse the recipe that's been published?
If your search relies on some secret factors that you hope people
won't discover, you haven't really come up with a good solution the
Q. Microsoft has spent millions of dollars and years now of effort to
try and be a Google killer and haven't made it. You're coming into
this fresh with fewer resources and no real prior experience. Can you
really do it?
I have no idea. I only do whatever sounds like it is fun.
Q. What type of funding do you have behind this?
Wikia's initial round was 4 million from a variety of angels, then
there was a second round from Amazon, but the amount wasn't announced.
When I first heard of the plans, I was pretty dubious the project
would have much success. For one thing, the idea of the "open source"
search engine to take on the world and provide more transparency is
old news. Consider this from back when Nutch first came out, out of
New Scientist in 2003:
The project "is about providing free technology that should not be
controlled by private, commercial, secretive organisations," says Doug
Cutting, veteran web search engineer, and a Nutch founder.
Three years on, nothing has really changed despite the reasoning behind
such a project being the same. And this was despite Nutch having some
big names behind it.
In 2004, Nutch got another round of attention in an ACM article
looking at how it works. My comment at that time was:
Interesting read especially for the efforts that are involved to
defeat spam. The argument is that though Nutch is open, revealing
secrets won't hurt because spammers will batter down any defenses, no
matter how tightly protected. OK, so what will stop spam? Nutch hopes
that an open, public discussion may reveal new methods. Perhaps. But
the real test will only come if Nutch is deployed by a major,
highly-trafficked site. Spammers aren't going to bother trying the
defenses of other places. It's not worth the time. That's also a
positive for those considering Nutch. If you operate a small, vertical
site or just want Nutch to be used on your own content, then spam
concerns are much less an issue.
The spam test simply hasn't happened with Nutch. And every new search
engine project I've looked at coming in over the years completely
underestimates the spam problem they face. When I looked at the Search
Wikia site, comments like this almost seemed laughable:
search active for spammer sites
* trying to simulate user-typos (ie. "yaoho.com" rather than
"yahoo.com"); see also: Microsoft's URL Tracer
* blacklist domains, where spammails are linking to; create
actively honeypods to get spam; use a pattern like
<domain-where-we-have-registered>@myhoneypod.com to identify the spam
networks; shell the common user get the possibility to register such a
Seek out the spam sites? Hey, don't worry -- if you're popular,
they'll find you fast enough. And as you blacklist one, two more
throwaway domains will show up in their place.
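
To make the first quoted idea concrete, here is a minimal sketch of
simulating user typos for a known domain. It assumes only one kind of
typo, swapping two adjacent letters, which is how "yaoho.com" arises
from "yahoo.com". The function name transposition_typos is my own
invention, not something from the Search Wikia site or from Microsoft's
URL Tracer, and it assumes a bare domain such as "yahoo.com" with no
subdomain.

```python
def transposition_typos(domain: str) -> set[str]:
    """Return variants of `domain` with two adjacent letters swapped.

    Only the part before the first dot is varied; the suffix is kept as-is.
    """
    name, dot, suffix = domain.partition(".")
    variants = set()
    for i in range(len(name) - 1):
        if name[i] == name[i + 1]:
            # Swapping identical letters would just reproduce the original.
            continue
        swapped = name[:i] + name[i + 1] + name[i] + name[i + 2:]
        variants.add(swapped + dot + suffix)
    return variants


if __name__ == "__main__":
    for candidate in sorted(transposition_typos("yahoo.com")):
        print(candidate)
```

Generating candidates like this is the cheap part. Checking which ones
are registered, and keeping up as they get replaced by new throwaway
domains, is where the real effort would go.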
I also tend to think Wales is completely underestimating how crawling
a big chunk of the web, keeping those pages fresh, ranking them
quickly to provide answers and doing so for millions each day isn't an
Still, I find myself oddly hopeful. I don't think a Google killer will
emerge, but perhaps some new ways for a community to be involved with
search will come out of it. I wouldn't have thought Wikipedia would
work. Certainly it's flawed, but it's also an incredible resource.
Maybe something useful will come from the Search Wikia project.
At the very least, I've long wanted humans to be back in the role of
reviewing queries and actually looking to see if they make sense,
rather than so much reliance on algorithms. Maybe the mere concept of
the Search Wikia project will encourage the major search engines to do
more in this area.
In the last few hours, the Wikimedia Foundation's donation counter
passed the "750,000" dollar mark. That is the conservative figure, in
that it does not yet include money that is, for example, still in the
mail, or yesterday's matching of donations by Virgin Unite (AFAIK).
The donation bar has thereby also passed 50%, which suggests a kind of
halftime. "Officially" there is no longer a fundraising goal of 1.5
million, even if there may once have been one.
Thanks to all donors.
The current Focus (No. 52) now carries the article about Wikipedia on
pages 114 to 116. On the cover it is flagged with the logo and
"Wikipedia.... Manipulationen kratzen am Ruf" (manipulations are
scratching at its reputation). The article itself runs under the Media
section, Internet department, title: "Mehr
Measured against the good standards set by the magazine from Hamburg,
the opening is rather dull and factually open to attack. It claims
there are no editors and no quality control. If the word "paid" were in
there, one could more readily agree. The directmedia DVD is mentioned
(how do they arrive at the idea that it is the second edition?). The
office of the German chapter is mentioned, and Jaron Lanier gets his
upgrade to "computer scientist". Arne gets to explain along the way
what the concept of "stable versions" is. A Prof. Andreas Dengel is
quoted as being full of praise, *despite* our refusal to give any
"quality guarantee".
What made my day was the sentence (...in a few years it will be
possible to work sensibly with the work (Dengel)): "Then the classic
encyclopedia publishers would have to give serious thought to future
business models." Ah, only then. I can't speak for Brockhaus, but as an
encyclopedia publisher one should not put up with the insinuation that
the publishers will do nothing until then,
Florian Langenscheidt, by contrast, is quoted with the promise of
"knowledge with a seal of quality". Unfortunately without a "5 euros
back for every error found" guarantee.
In addition, Larry's Citizendium is referenced.
In Focus style, there are colorful graphics again:
1. "Das Welt-Wissen der Massen" (the world knowledge of the masses):
bar chart of articles by language, German is red, the rest blue.
2. Nielsen NetRatings figures: unique users per month from Germany:
5.4 million in October 200, 11.6 million in November 2006.
3. Box: "Löcher im System" (holes in the system): 1. Kleinfeld 2. DTH
3. Süddeutsche 4. "Wikipedia adopts a false report from the
Tagesspiegel, corrects the error after a few minutes" 5. Borat
4. Box: "Tips: using the free encyclopedia properly"
1. Always read critically
2. Lots of room for trivia (that is not a tip, but a
3. Check the authors' opinions
4. Double-check hot topics
5. Solid material from science (meaning: technology articles are
great, history too)
6. A good starting point for research
I couldn't find any gross errors, but perhaps after Deutsche Welle's
"Manfred Schindler" I am simply hardened...
For a few days now, not all thumbs have been displayed for me, with
the emphasis on "all".
Instead, an [AD] is shown as a placeholder.
This happens in IE 7, Firefox and Opera.
It can't be the firewall (I switched to a different one, so I had none
at all for a few hours. (Some of them do come with an ad blocker.))
Greetings from the Eifel
At http://www.heise.de/tp/r4/artikel/24/24221/1.html you can find the
third part of Stefan Weber's excerpts from his book "Das
Google-Copy-Paste-Syndrom. Wie Netzplagiate Ausbildung und Wissen
gefährden", in which he repeatedly points to Wikipedia and to the bad
habit of copying that is, if not a copyright violation, then
scientifically and morally wrong. I don't entirely share the author's
conclusions (what matters is not where a source comes from, but how it
is used), but we should invite Stefan Weber as a speaker to the next
event on quality and Wikipedia.