[WikiEN-l] WP and Deep Web, was Re: Age fabrication and original research

Thu Oct 8 18:40:10 UTC 2009

David Goodman wrote:
> Quite apart from  the incredible range available from a research
> library, the great majority of Wikipedians, even experienced ones, do
> not use even those sources which are made available free from local
> public libraries to residents. Many seem not to even think about using
> anything free on the internet except that reachable through the
> Googles.  If Google News reports a newspaper or magazine behind a pay
> wall, they do not even think of looking for it in other databases or
> web sites  that they may have available.  
David's issue here is something he describes as familiar generally to 
librarians. It does seem to me to be a hybrid of that one (leading the 
horse to the reference library water is not the same as having the horse 
drink), with another one. Tim Berners-Lee is apparently interested in 
the [[Deep Web]], which is to a first approximation what you can't 
Google for, but is out there. One clear cause is online databases, where 
if the web crawler can't think up a good query, the potential web page 
answer won't get reported.

I was thinking about this more obliquely, because of my current 
interests: another couple of causes occur to me. There are texts online 
which are reference material, but need proof-reading (tell me about it) 
before the text is accurate enough for the search term to be there "in 
clear". And (as I found out just now) there are texts online that are 
downloads that are huge files. I've just looked at a PDF that is over 
500 MB. Both these issues are obvious to me as a user of archive.org. 
There is a route for information to migrate onto the Web as

book -> scan -> post to archive.org.

This is fruitful and gets it "out there". It happens that for reference 
information our model is more useful by a factor of at least 1000 (you 
can check the figures for archive.org downloads).

So, the deeper Web needs "dredging" work before such things turn up on 
most people's first page of search engine hits. I'd quite agree with 
David that simply using the "shallow Web" and moving information from 
one part of it to another is not the only thing research for WP should 
be about. It seems to me that during Wikipedia's second decade we'll 
need to become more thoughtful about what is involved. (In Wikisource 
terms, for example, it would be great to see development of that project 
as the "reference Commons", matching the function the Commons serves for 
media files. But that's a potentially divisive idea, since it is already 
a "free library" with its own mission.)

Charles