[Foundation-l] [ol-discuss] Open Library, Wikisource, and cleaning and translating OCR of Classics

Samuel Klein meta.sj at gmail.com
Tue Aug 11 21:13:17 UTC 2009


Onion sourcing.  That would be a nice improvement on simple citation styles.

On Tue, Aug 11, 2009 at 12:10 PM, Gregory Crane<gregory.crane at tufts.edu> wrote:

> There are various layers to this onion. The key element is that books and
> pages are artifacts in many cases. What we really want are the logical
> structures that splatter across pages.

And across and around works...

> First, we have added a bunch of content -- esp. editions of Greek and Latin
> sources -- to the Internet Archive holdings and we are cataloguing editions
> that are in the overall collection, regardless of who put them there. This goes
> well beyond the standard book catalogue records -- we are interested in the
> content not in books per se. Thus, we may add hundreds of records for a

Is there a way to deep link to a specific page-image from one of these
works without removing it from the Internet Archive?
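
For what it's worth, here is a sketch of the sort of link I mean -- the
#page fragment is my guess from watching the Archive's stream reader,
not a documented API, and the identifier below is hypothetical:

  # Sketch: deep link to one leaf of a scanned book in the Archive's
  # BookReader.  The "#page/n<K>" fragment is an assumption from how
  # the stream reader behaves today, not a documented, stable API.
  def archive_page_link(identifier, leaf):
      return "http://www.archive.org/stream/%s#page/n%d/mode/1up" % (
          identifier, leaf)

  # hypothetical identifier:
  # archive_page_link("anabasisofxenophon1900", 12)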

> We would like to have useable etexts from all of these editions -- many of
> which are not yet in our collections. Many of these are in Greek and need a
> lot of work because the OCR is not very good.

So bad OCR for them exists, but no usable etexts?

> To use canonical texts, you need book/chapter/verse markup and you need
> FRBR-like citations ... deep annotations... syntactic analyses, word sense,
> co-reference...

These are nice features, but perhaps you can develop a clean etext
first, and overlay this metadata in parallel or later on.
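
Standoff annotation would let you do exactly that: keep the etext as
one untouched stream, and let each metadata layer point at character
offsets into it.  A minimal sketch of the idea (the layer names and
sample values are invented):

  # Sketch of standoff annotation: the etext stays clean, and each
  # layer of markup refers to it only by character offsets.
  text = "menin aeide thea"

  annotations = [
      {"start": 0,  "end": 5,  "layer": "morphology", "value": "noun, acc. sg."},
      {"start": 12, "end": 16, "layer": "word-sense", "value": "goddess"},
  ]

  for a in annotations:
      print("%s -- %s: %s" % (text[a["start"]:a["end"]], a["layer"], a["value"]))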


> My question is what environments can support contributions at various
> levels. Clearly, proofreading OCR output is standard enough.
>
> If you want to get a sense of what operations need ultimately to be
> supported, you could skim
> http://digitalhumanities.org/dhq/vol/3/1/000035.html.

That's a good question.  What environments currently support OCR
proofreading and translation, and direct links to page-images of the
original source?  This is doable, with no special software or tools,
via wikisource (in multiple languages, with interlanguage links and
crude paragraph alignment) and commons (for page images).  The pages
could also be stored in other repositories such as the Archive, as
long as there is an easy way to link out to them or transclude
thumbnails.  [maybe an InstaCommons plugin for the Internet Archive?]
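
As a concrete example of how little preprocessing the wikisource route
needs: the proofreading extension pairs page N of a DjVu with its own
wiki page, so the only step is splitting the raw OCR at page
boundaries.  A rough sketch (the form-feed separator is an assumption
about the OCR dump; the filenames are invented):

  # Sketch: split a raw OCR dump into one file per scanned page, ready
  # to paste into per-page proofreading pages alongside the DjVu.
  # Assumes pages are separated by form feeds (\f).
  def split_pages(path):
      with open(path) as f:
          return f.read().split("\f")

  for n, page in enumerate(split_pages("xenophon_ocr.txt"), 1):
      out = open("page_%04d.txt" % n, "w")
      out.write(page.strip() + "\n")
      out.close()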

That's quite an interesting article you link to.  I see six main
sets of features/operations described there, and each deserves a
mention in Wikimedia's strategic planning.  Aside from language
analysis, each has significant value for all of the Projects, not
just wikisource.  I've added rough sketches under a few of the
groups below, to show how little machinery the simplest versions
need.


OCR TOOLS
 *  OCR optimization: statistical data, page layout hints
 *  Capturing logical structures from page layout
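
(A toy version of the statistical bullet above -- a lexicon harvested
from already-proofread text, plus edit distance, catches a surprising
share of OCR damage.  The lexicon and one-edit threshold here are
invented:)

  # Sketch: snap an OCR token to the nearest lexicon word, if it is
  # within one edit.  A real lexicon would come from proofread text.
  def edit_distance(a, b):
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          cur = [i]
          for j, cb in enumerate(b, 1):
              cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                             prev[j - 1] + (ca != cb)))
          prev = cur
      return prev[-1]

  LEXICON = ["menin", "aeide", "thea"]

  def correct(token):
      best = min(LEXICON, key=lambda w: edit_distance(token, w))
      return best if edit_distance(token, best) <= 1 else token

  # correct("menln") -> "menin"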

CROSS REFERENCING
 *  Quote, source, and plagiarism identification.
 *  Named entity identification (automatic for some entities?  hints)
 *  Automatic linking (of URLs, abbreviated citations, &c), markup projection
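
(For the quote/plagiarism bullet, word n-gram "shingles" plus set
overlap is the standard cheap first pass; the threshold is invented:)

  # Sketch: flag likely quotation by overlap of word 4-gram shingles.
  def shingles(text, n=4):
      words = text.lower().split()
      return set(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

  def overlap(a, b):
      sa, sb = shingles(a), shingles(b)
      if not sa or not sb:
          return 0.0
      return len(sa & sb) / float(min(len(sa), len(sb)))

  # overlap(source_text, article_text) > 0.1  =>  worth a human look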

TEXT ALIGNMENT
 *  Canonical text services (chapter/verse equivalents)
 *  Version analysis (differences between versions of a text)
 *  Translation alignment
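
(The crude paragraph alignment mentioned earlier can be mechanized the
Gale-Church way: translations tend to have proportional lengths, so a
greedy walk over cumulative length fractions already does a fair job.
Rough sketch only; real aligners also score 1-2 and 2-1 merges:)

  # Sketch: crude paragraph alignment by cumulative length fraction.
  # A repeated index on one side means that paragraph spans several
  # paragraphs on the other side.
  def align(src, tgt):
      pairs, i, j = [], 0, 0
      done_s = done_t = 0
      total_s = float(sum(len(p) for p in src))
      total_t = float(sum(len(p) for p in tgt))
      while i < len(src) and j < len(tgt):
          pairs.append((i, j))
          # fraction of each text consumed once the current paragraph ends
          if (done_s + len(src[i])) / total_s <= (done_t + len(tgt[j])) / total_t:
              done_s += len(src[i])
              i += 1
          else:
              done_t += len(tgt[j])
              j += 1
      return pairs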

TRANSLATION SUPPORT
 *  Automated translation (seed translations, hints for humans)
 *  Translation dictionaries (on mouseover?)
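
(For the dictionary bullet: even a flat glossary gives a useful
word-for-word seed for a human translator.  The entries are invented:)

  # Sketch: word-for-word glossing from a translation dictionary, as
  # a seed/hint for a human translator (glossary entries invented).
  GLOSSARY = {"menin": "wrath", "aeide": "sing", "thea": "goddess"}

  def gloss(line):
      return " ".join(GLOSSARY.get(w, "[%s]" % w) for w in line.lower().split())

  # gloss("menin aeide thea") -> "wrath sing goddess"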

CROSS-LANGUAGE SEARCHING
 *  Cross-referencing across translations
 *  Quote identification across translations
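
(In the cheapest version, both bullets reduce to expanding the query
through a translation dictionary before searching the other language.
Sketch, with an invented two-entry dictionary:)

  # Sketch: cross-language search by expanding an English query into
  # Greek terms through a translation dictionary (entries invented).
  EN_TO_GREEK = {"wrath": "menin", "goddess": "thea"}

  def cross_search(english_query, greek_text):
      terms = [EN_TO_GREEK[w] for w in english_query.lower().split()
               if w in EN_TO_GREEK]
      return [t for t in terms if t in greek_text.lower()]

  # cross_search("wrath goddess", "menin aeide thea") -> ["menin", "thea"]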

LANGUAGE ANALYSIS
 *  Word analysis: word sense discovery, morphology.
 *  Sentence analysis: syntactic, metrical (poetry)



> Greg
>
> John Vandenberg wrote:
>>
>> On Tue, Aug 11, 2009 at 3:00 PM, Samuel Klein<meta.sj at gmail.com> wrote:
>>
>>>
>>> ...
>>> Let's take a practical example.  A classics professor I know (Greg
>>> Crane, copied here) has scans of primary source materials, some with
>>> approximate or hand-polished OCR, waiting to be uploaded and converted
>>> into a useful online resource for editors, translators, and
>>> classicists around the world.
>>>
>>> Where should he and his students post that material?
>>>
>>
>> I am a bit confused.  Are these texts currently hosted at the Perseus
>> Digital Library?
>>
>> If so, they are already a useful online resource. ;-)
>>
>> If they would like to see these primary sources pushed into the
>> Wikimedia community, they would need to upload the images (or DjVu)
>> onto Commons, and the text onto Wikisource where the distributed
>> proofreading software resides.
>>
>> We can work with them to import a few texts in order to demonstrate
>> our technology and preferred methods, and then they can decide whether
>> they are happy with this technology, the community, and the potential
>> for translations and commentary.
>>
>> I made a start on creating a Perseus-to-Wikisource importer about a year
>> ago...!
>>
>> Or they can upload the djvu to the Internet Archive... or a similar
>> repository... and see where it goes from there.
>>
>>
>>>
>>> Wherever they end up, the primary article about each work would
>>> surely link out to the OL and WS pages for that work (where one
>>> exists).
>>>
>>
>> Wikisource has been adding OCLC numbers to pages, and adding links to
>> archive.org when the djvu files came from there (these links contain
>> an archive.org identifier).  There are also links to LibraryThing and
>> Open Library; we have very few rules ;-)
>>
>> --
>> John Vandenberg
>>
>
>


