[Foundation-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

Mon Aug 17 22:33:34 UTC 2009

David Goodman wrote:
> The problem is extraordinarily complex. A database of all "books"
> (and other media) ever published is beyond the joint capabilities of
> everyone interested. There are intermediate entities between "books"
> and "works", and important subordinate entities, such as "article",
> "chapter", and those like "poem" which could be at any of several
> levels.

I've already been in raging arguments at Wikisource about the meaning of 
"work".  The general tendency there has been to treat "work" as 
equivalent to a book or set of related books.  This is highly 
problematical for periodicals, encyclopedias and dictionaries.

I do agree that the problem is complex, but there is resistance on 
the part of many to accept standards that have been developed over a 
long period of time. Before the Category: namespace was made a part of 
Wikipedia there was considerable antipathy to adopting any kind of 
established category system.  Muddling through from square one was the 
preferred option.

> This is not a job for amateurs, unless they are prepared to
> first learn the actual standards of bibliographic description for
> different types of material, and to at least recognize the
> inter-relationships, and the many undefined areas. At research
> libraries, one allows a few years of training for a newcomer with just
> an MLS degree to work with a small subset of this. I have thirty years
> of experience in related areas of librarianship, and I know only
> enough to be aware of the problems.

This does not bode well!  The big factor in Wiki participation and 
success is amateur involvement and crowd sourcing.  What are the PhDs 
doing to bridge the gap?  What efforts are being made to at least bring 
the most significant points to the level of the general contributor?  
Saying that it takes several years to bring an MLS up to speed is not 
good enough.  Knowledge needs to be brought to the level where it is 
most useful.  When I went to school, typing was not introduced as a 
subject until the 10th grade; my son learned keyboarding in the first grade.

Our wiki projects also have a superfluity of people with an IT 
background who also do not do a very good job of bringing information to 
where it belongs, and end up creating a mind-boggling assortment of 
templates of questionable value.  In theory they are trying to bring 
standardization and simplicity to the projects, but just as often 
produce a simplistic and premature narrowing of the way knowledge is 
organized.

> Merging the many thousands of partially correct and incorrect
> sources of available data typically requires the manual
> resolution of each of the tens of millions of instances.

Yes, of course.  There is no magic software that will do it all.  Humans 
need to retain the right to decide the limits of technology.

> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.

The librarians have their work cut out for them.  They can help to build 
a system for the future, or they can let everyone muddle their way into 
a fuck-up.

Ec