[Wikipedia-l] Re: Robots and special pages
Daniel Mayer
maveric149 at yahoo.com
Sat May 18 23:32:52 UTC 2002
On Saturday 18 May 2002 12:01 pm, you wrote:
> Message: 6
> Date: Fri, 17 May 2002 15:37:47 -0700
> From: lcrocker at nupedia.com
> To: wikipedia-l at nupedia.com
> Subject: [Wikipedia-l] Robots and special pages
> Reply-To: wikipedia-l at nupedia.com
>
> A discussion just came up on the tech list that deserves input from
> the list at large: how do we want to restrict access (if at all) to
> robots on wikipedia special pages and edit pages and such?
My two cents (well, maybe a bit more),
On talk pages: OPEN to bots
It's A-OK for bots to index talk pages -- these pages often have interesting
discussion that should be on search engines. Of course, if this ever becomes a
performance issue we could stop bots from indexing them.
On wikipedia pages: OPEN to bots
I STRONGLY feel that wikipedia pages should be open to bots -- remember, we
are also trying to expand our community here, and people do search for those
things on the net.
On user pages: OPEN to bots
I also don't see anything wrong with letting bots crawl all over user pages
-- I occasionally browse the personal home pages of other people who share my
interests. This project isn't just about the articles; it is also about
community building.
On log, history, print and special pages: CLOSED to bots (closed at least for
indexing -- I'm not sure about allowing the 'follow links' function. Would
closing this make bots do their thing faster or slower? Is this at all
important for us to consider? If a bot can index our site quickly, will it do
so more often?)
I think that the wikipedia pages are FAR better at naturally explaining what
the project is about than the log, history and special pages are -- those
pages are far too technical and change too quickly to be useful for any
search performed on a search engine. There is also limited utility in having
direct links to the Printable version of articles -- printable pages don't
have any active wiki links in them, which obscures the fact that the page is
from a wiki.
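As an aside on the indexing-versus-following question above: a robots META tag
can make exactly that distinction, with "noindex,follow" keeping a page out of
search results while still letting a bot walk its links. Here is a rough sketch
of the policy written down in Python -- purely illustrative; the page-type
names and the robots_meta() helper are my own invention, not anything in our
software:

# One possible mapping from page type to robots META value.
# "noindex,follow"   = stay out of search results, but links may be followed;
# "noindex,nofollow" = stay out and don't follow links either.
PAGE_POLICY = {
    "article":   "index,follow",
    "talk":      "index,follow",
    "user":      "index,follow",
    "wikipedia": "index,follow",      # project namespace pages
    "log":       "noindex,follow",
    "history":   "noindex,follow",
    "printable": "noindex,follow",
    "special":   "noindex,follow",
    "edit":      "noindex,nofollow",
}

def robots_meta(page_type):
    """Return the robots META content for a page type, defaulting to the safe side."""
    return PAGE_POLICY.get(page_type, "noindex,nofollow")

# The page-rendering code could then emit, for example:
#   <meta name="robots" content="noindex,follow">   on a history page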
Having history pages in the search results of external search engines is
potentially dangerous, since somebody could easily click into an older
version and save it -- thus reverting the article and unwittingly "earning"
the label of VANDAL (even if they did make a stab at improving the version
they read). Another reason to disallow bot access to the history is that
there is often copyrighted material in the history of pages that has since
been removed from the current article version (it would be nice for an admin
to be able to delete just an older version of an article, BTW).
On Edit links: CLOSED to bots (for indexing and probably for following links)
The edit links REALLY should NOT be allowed to be indexed by any bot: when
somebody searches for something on a search engine, gets a link to our site,
and clicks on it, do we want them to be greeted with an edit window? They
want information -- not an edit window. No wonder we have so many pages with
"Describe the new page here" as their only content.
I've been tracking this for a while, and almost every one of these pages is
created by an IP that never returns to edit again. Many (if not most) of
these "mysteriously" created pages are probably from someone clicking through
from a search engine, becoming puzzled by the edit window, and hitting the
save button in frustration. Heck, I think I may have created a few of these
in my pre-wiki days.
This has become a bit of a maintenance issue for the admins -- we can't
delete these pages fast enough, let alone create stubs for them. If left
unchecked, this could reduce the average quality of wikipedia articles and
make people doubt whether an "active" wiki link really has an article
(or even a stub) behind it.
There could, of course, be a purely technical fix for this: have the
software not recognize newly created blank or "Describe the new page here"
pages as real pages (a Good Idea BTW). But then we would still have
frustrated people who were looking for actual info and who may avoid clicking
through to our site in the future because of a previous "edit window
experience".
Conclusion:
We should try to put our best foot forward when allowing bots to index the
site and only allow indexing of pages that contain information potentially
useful to the person searching.
Edit windows and outdated lists are NOT useful to somebody clicking through
for the first time (Recent Changes might be the only exception: even though
any index of it will be outdated, it is centrally important to the project
and fairly self-explanatory). Links to older versions of articles and to
history pages also set up would-be contributors to be labeled as "vandals"
when they try to edit an older version -- thus turning them away forever.
Let visitors explore a real article first and discover the difference between
an edit window and an actual article -- then they can decide about becoming a
contributor, a visitor, or even a developer for that matter.
maveric149