[Foundation-l] [ol-discuss] [Wikisource-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

Wed Aug 12 09:54:43 UTC 2009

On Tue, Aug 11, 2009 at 9:16 PM, Lars Aronsson <lars at aronsson.se> wrote:

>> Let's take a practical example.  A classics professor I know
>> (Greg Crane, copied here) has scans of primary source materials,
>> some with approximate or hand-polished OCR, waiting to be
>> uploaded and converted into a useful online resource for
>> editors, translators, and classicists around the world.
>>
>> Where should he and his students post that material?
>
> On Wikisource.  What's stopping them?

Greg: does Wikisource seem like the right place to post and revise OCR
to you?  If not, where?  If so, what's stopping you?

> I'm not so sure we agree.  I think we're talking about two
> different things.
>
> This thread started out with a discussion of why it is so hard to
> start new projects within the Wikimedia Foundation.  My stance is
> that projects like OpenStreetMap.org and OpenLibrary.org are doing
> fine as they are, and there is no need to duplicate their effort
> within the WMF.  The example you gave was this:

I agree that there's no point in duplicating existing functionality.
The best solution is probably for OL to include this explicitly in
their scope and add the necessary functionality.  I suggested this on
the OL mailing list in March.
   http://mail.archive.org/pipermail/ol-discuss/2009-March/000391.html

>> >> >> *A wiki for book metadata, with an entry for every published
>> >> >> work, statistics about its use and siblings, and discussion
>> >> >> about its usefulness as a citation (a collaboration with
>> >> >> OpenLibrary, merging WikiCite ideas)
>
> To me, that sounds exactly as what OpenLibrary already does (or
> could be doing in the near time), so why even set up a new project
> that would collaborate with it?  Later you added:

However, this is not what OL or its wiki do now.  And OL is not run by
its community; the community helps support the work of a centrally
directed group.  So there is only so much I feel I can contribute to
the project by making suggestions.  The wiki built into the fiber of
OL is intentionally not used for general discussion.

> I was talking about the metadata for all books ever published,
> including the Swedish translations of Mark Twain's works, which
> are part of Mark Twain's bibliography, of the translator's
> bibliography, of American literature, and of Swedish language
> literature.  In OpenLibrary all of these are contained in one
> project.  In Wikisource, they are split in one section for English
> and another section for Swedish.  That division makes sense for
> the contents of the book, but not for the book metadata.

This is a problem that Wikisource needs to address, regardless of
where the OpenLibrary metadata goes.  It is similar to the Wiktionary
problem of wanting some content - the array of translations of a
single definition - to exist in one place and be transcluded in each
language.

> Now you write:
>
>> However, the project I have in mind for OCR cleaning and
>> translation needs to
>
> That is a change of subject. That sounds just like what Wikisource
> (or PGDP.net) is about.  OCR cleaning is one thing, but it is an
> entirely different thing to set up "a wiki for book metadata, with
> an entry for every published work".  So which of these two project
> ideas are we talking about?

They are closely related.

There needs to be a global authority file for works -- a [set of]
universal identifier[s] for a given work in order for Wikisource (as
it currently stands) to link the German translation of the English
transcription of OCR of the 1998 photos of the 1572 Rotterdam Codex...
to its metadata entry [or entries].

I would prefer for this authority file to be wiki-like, as the
Wikipedia authority file is, so that it supports renames, merges, and
splits with version history and minimal overhead; hence I wish to see
a wiki for this sort of metadata.
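
To make "wiki-like" concrete, here is a rough sketch of what I mean.  It
is only an illustration: the class, method names, and IDs are
hypothetical, it is not how OL or MediaWiki store anything, and splits
are left out for brevity.  The idea is that identifiers stay stable, a
merge leaves a redirect behind so old links still resolve, and every
change lands in an append-only history.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityFile:
    """Toy authority file: stable work IDs, with merges kept as redirects."""
    titles: dict = field(default_factory=dict)     # work ID -> current title
    redirects: dict = field(default_factory=dict)  # merged-away ID -> surviving ID
    history: list = field(default_factory=list)    # append-only log of changes

    def create(self, work_id, title):
        self.titles[work_id] = title
        self.history.append(("create", work_id, title))

    def resolve(self, work_id):
        # Follow redirects so links to merged IDs still reach a live entry.
        while work_id in self.redirects:
            work_id = self.redirects[work_id]
        return work_id

    def rename(self, work_id, new_title):
        work_id = self.resolve(work_id)
        old_title = self.titles[work_id]
        self.titles[work_id] = new_title
        self.history.append(("rename", work_id, old_title, new_title))

    def merge(self, duplicate_id, keep_id):
        duplicate_id, keep_id = self.resolve(duplicate_id), self.resolve(keep_id)
        if duplicate_id == keep_id:
            return  # already the same entry, nothing to do
        old_title = self.titles.pop(duplicate_id)
        self.redirects[duplicate_id] = keep_id
        self.history.append(("merge", duplicate_id, keep_id, old_title))


# Hypothetical usage: two entries for the same work get merged,
# and a link to the old ID still reaches the surviving entry.
af = AuthorityFile()
af.create("w1", "Huckleberry Finn")
af.create("w2", "Adventures of Huckleberry Finn")
af.merge("w1", "w2")
print(af.resolve("w1"))
```

A real authority file would also need splits, per-entry revision
history, and conflict handling, which is much of what a wiki engine
already provides.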

Currently OL does not quite provide this authority file, but it could.
I do not know how easily.

> Every book ever published means more than 10 million records.
> (It probably means more than 100 million records.) OCR cleaning
> attracts hundreds or a few thousand volunteers, which is
> sufficient to take on thousands of books, but not millions.

Focusing efforts on notable works with verifiable OCR, and using the
sorts of helper tools that Greg's paper describes, I do not doubt that
we could effectively clean and publish OCR for all primary sources
that are actively used and referenced in scholarship today (and more
besides).  Though 'we' here is the world - certainly more than a few
thousand volunteers have at least one book they would like to polish.
Most of them are not currently Wikimedia contributors, that much is
certain -- we don't provide any tools to make this work convenient or
rewarding.

> Google scanned millions of books already, but I haven't heard of
> any plans for cleaning all that OCR text.

Well, Google does not believe in distributed human effort.  (This came
up in a recent Knol thread as well.)  I'm not sure that is the best
comparison.

SJ