[Foundation-l] [Wikisource-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

Wed Aug 12 01:16:26 UTC 2009

Samuel Klein wrote:

> I think we agree on what needs to happen.  The only thing I am 
> not sure of is where you would like to see the work take place.

I'm not so sure we agree.  I think we're talking about two 
different things.

This thread started out with a discussion of why it is so hard to 
start new projects within the Wikimedia Foundation.  My stance is 
that projects like OpenStreetMap.org and OpenLibrary.org are doing 
fine as they are, and there is no need to duplicate their effort 
within the WMF.  The example you gave was this:

> >> >> *A wiki for book metadata, with an entry for every published
> >> >> work, statistics about its use and siblings, and discussion
> >> >> about its usefulness as a citation (a collaboration with
> >> >> OpenLibrary, merging WikiCite ideas)

To me, that sounds exactly like what OpenLibrary already does (or 
could be doing in the near future), so why even set up a new project 
that would collaborate with it?  Later you added:

> >> I could see this happening on Wikisource.

That's when I asked why this couldn't be done inside OpenLibrary.  

I added:

> > (Plus you would have to motivate why a copy of OpenLibrary should
> > go into the English Wikisource and not the German or French one.)

You replied:

> I don't understand what you mean -- English source materials and
> metadata go on en:ws, German on de:ws, &c.  How is this different from
> what happens today?

I was talking about the metadata for all books ever published, 
including the Swedish translations of Mark Twain's works, which 
are part of Mark Twain's bibliography, of the translator's 
bibliography, of American literature, and of Swedish language 
literature.  In OpenLibrary all of these are contained in one 
project.  In Wikisource, they are split into one section for English 
and another section for Swedish.  That division makes sense for 
the contents of the book, but not for the book metadata.

Now you write:

> However, the project I have in mind for OCR cleaning and 
> translation needs to

That is a change of subject. That sounds just like what Wikisource 
(or PGDP.net) is about.  OCR cleaning is one thing, but it is an 
entirely different thing to set up "a wiki for book metadata, with 
an entry for every published work".  So which of these two project 
ideas are we talking about?

Every book ever published means more than 10 million records.  
(It probably means more than 100 million records.) OCR cleaning 
attracts hundreds or a few thousand volunteers, which is 
sufficient to take on thousands of books, but not millions.

Google has already scanned millions of books, but I haven't heard of 
any plans for cleaning all that OCR text.

> Let's take a practical example.  A classics professor I know 
> (Greg Crane, copied here) has scans of primary source materials, 
> some with approximate or hand-polished OCR, waiting to be 
> uploaded and converted into a useful online resource for 
> editors, translators, and classicists around the world.
> 
> Where should he and his students post that material?

On Wikisource.  What's stopping them?

-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se