Onion sourcing. That would be a nice improvement on simple cite styles.
On Tue, Aug 11, 2009 at 12:10 PM, Gregory Crane<gregory.crane(a)tufts.edu> wrote:
There are various layers to this onion. The key element is that books
and pages are artifacts in many cases. What we really want are the
logical structures that splatter across pages.
And across and around works...
First, we have added a bunch of content -- esp. editions of Greek and
Latin sources -- to the Internet Archive holdings and we are
cataloguing editions that are in the overall collection, regardless of
who put them there. This goes well beyond the standard book catalogue
records -- we are interested in the content, not in books per se.
Thus, we may add hundreds of records for a
Is there a way to deep link to a specific page-image from one of these
works without removing it from the Internet Archive?
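For what it's worth, the Archive's BookReader already supports
page-level deep links: appending /page/n<leaf> to an item's details
URL opens the reader at that scan. A minimal sketch -- the identifier
below is hypothetical, and the URL pattern is the one the current
BookReader accepts, not a documented API guarantee:

```python
def ia_page_link(identifier: str, leaf: int) -> str:
    """Deep link into the Internet Archive BookReader at a given scan.

    `identifier` is the item's archive.org identifier; `leaf` is the
    zero-based scan index, shown as "n<leaf>" in the reader's URL.
    """
    return f"https://archive.org/details/{identifier}/page/n{leaf}"

# Hypothetical identifier, for illustration only:
print(ia_page_link("iliadgreekedition00exam", 24))
# https://archive.org/details/iliadgreekedition00exam/page/n24
```

Since the page never leaves the Archive, this gives stable per-page
citation without re-hosting the scans.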
We would like to have usable etexts from all of these editions --
many of which are not yet in our collections. Many of these are in
Greek and need a lot of work because the OCR is not very good.
So bad OCR for them exists, but no usable etexts?
To use canonical texts, you need book/chapter/verse markup and you
need FRBR-like citations ... deep annotations... syntactic analyses,
word sense, co-reference...
These are nice features, but perhaps you can develop a clean etext
first, and overlay this metadata in parallel or later on.
My question is what environments can support contributions at various
levels. Clearly, proofreading OCR output is standard enough.
If you want to get a sense of what operations need ultimately to be
supported, you could skim
http://digitalhumanities.org/dhq/vol/3/1/000035.html.
That's a good question. What environments currently support OCR
proofreading and translation, and direct links to page-images of the
original source? This is doable, with no special software or tools,
via wikisource (in multiple languages, with interlanguage links and
crude paragraph alignment) and commons (for page images). The pages
could also be stored in other repositories such as the Archive, as
long as there is an easy way to link out to them or transclude
thumbnails. [maybe an InstaCommons plugin for the Internet Archive?]
That's quite an interesting article you link to. I see six main
sets of features/operations described there. Each of them deserves a
mention in Wikimedia's strategic planning. Aside from language
analysis, each has significant value for all of the Projects, not just
wikisource.
OCR TOOLS
* OCR optimization: statistical data, page layout hints
* Capturing logical structures from page layout
CROSS-REFERENCING
* Quote, source, and plagiarism identification
* Named entity identification (automatic for some entities? hints)
* Automatic linking (of URLs, abbrev. citations, &c.), markup projection
TEXT ALIGNMENT
* Canonical text services (chapter/verse equivalents)
* Version analysis (comparison between variant versions)
* Translation alignment
TRANSLATION SUPPORT
* Automated translation (seed translations, hints for humans)
* Translation dictionaries (on mouseover?)
CROSS-LANGUAGE SEARCHING
* Cross-referencing across translations
* Quote identification across translations
LANGUAGE ANALYSIS
* Word analysis: word sense discovery, morphology.
* Sentence analysis: syntactic, metrical (poetry)
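On the canonical text services point: the citation scheme associated
with this line of work (CTS URNs, used by Perseus and the Homer
Multitext project) makes chapter/verse references machine-actionable.
A rough parsing sketch, not an official library -- the URN follows
the urn:cts:<namespace>:<work>:<passage> shape, with tlg0012.tlg001
being Homer's Iliad in TLG numbering:

```python
from typing import NamedTuple

class CtsUrn(NamedTuple):
    namespace: str  # e.g. "greekLit"
    work: str       # textgroup.work[.version], e.g. "tlg0012.tlg001"
    passage: str    # citation within the work, e.g. "1.1"

def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN of the form urn:cts:<namespace>:<work>:<passage>."""
    parts = urn.split(":")
    if parts[:2] != ["urn", "cts"] or len(parts) != 5:
        raise ValueError(f"not a CTS URN: {urn!r}")
    return CtsUrn(parts[2], parts[3], parts[4])

iliad_1_1 = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1")
print(iliad_1_1.passage)  # "1.1" -- book 1, line 1 of the Iliad
```

The point of the scheme is that the same passage reference resolves
against any edition or translation of the work, which is what the
text-alignment operations above depend on.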
Greg
John Vandenberg wrote:
On Tue, Aug 11, 2009 at 3:00 PM, Samuel Klein<meta.sj(a)gmail.com> wrote:
...
Let's take a practical example. A classics professor I know (Greg
Crane, copied here) has scans of primary source materials, some with
approximate or hand-polished OCR, waiting to be uploaded and converted
into a useful online resource for editors, translators, and
classicists around the world.
Where should he and his students post that material?
I am a bit confused. Are these texts currently hosted at the Perseus
Digital Library?
If so, they are already a useful online resource. ;-)
If they would like to see these primary sources pushed into the
Wikimedia community, they would need to upload the images (or DjVu)
onto Commons, and the text onto Wikisource where the distributed
proofreading software resides.
We can work with them to import a few texts in order to demonstrate
our technology and preferred methods, and then they can decide whether
they are happy with this technology, the community, and the potential
for translations and commentary.
I made a start on creating a Perseus-to-Wikisource importer about a year
ago...!
Or they can upload the djvu to the Internet Archive... or similar
repositories... and see where it goes from there.
Wherever they end up, the primary article about each work would
surely link out to the OL and WS pages for it (where they exist).
Wikisource has been adding OCLC numbers to pages, and adding links to
archive.org when the djvu files came from there (these links contain
an archive.org identifier). There are also links to LibraryThing and
Open Library; we have very few rules ;-)
--
John Vandenberg