Onion sourcing. That would be a nice improvement on simple cite styles.
On Tue, Aug 11, 2009 at 12:10 PM, Gregory Crane <gregory.crane@tufts.edu> wrote:
There are various layers to this onion. The key element is that books and pages are artifacts in many cases. What we really want are the logical structures that splatter across pages.
And across and around works...
First, we have added a bunch of content -- esp. editions of Greek and Latin sources -- to the Internet Archive holdings, and we are cataloguing editions that are in the overall collection, regardless of who put them there. This goes well beyond the standard book catalogue records -- we are interested in the content, not in books per se. Thus, we may add hundreds of records for a single book.
Is there a way to deep link to a specific page-image from one of these works without removing it from the Internet Archive?
We would like to have useable etexts from all of these editions -- many of which are not yet in our collections. Many of these are in Greek and need a lot of work because the OCR is not very good.
So bad OCR for them exists, but no usable etexts?
To use canonical texts, you need book/chapter/verse markup and you need FRBR-like citations ... deep annotations... syntactic analyses, word sense, co-reference...
These are nice features, but perhaps you can develop a clean etext first, and overlay this metadata in parallel or later on.
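One way to keep those layers separable is to treat book/chapter/verse citations as an overlay on the clean text rather than as inline markup. A minimal sketch in Python of that idea -- every name, identifier, and offset here is invented for illustration, not an existing scheme:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Citation:
        work: str      # FRBR-like work identifier, independent of any edition
        book: int
        chapter: int
        verse: int

    # The overlay maps citations to character offsets in the clean etext,
    # so the citation layer can live alongside the text, added later.
    overlay = {
        Citation("ExampleWork", 1, 1, 1): (0, 57),   # placeholder offsets
    }

    def passage(text: str, cite: Citation) -> str:
        """Return the span of the etext that the citation covers."""
        start, end = overlay[cite]
        return text[start:end]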
My question is what environments can support contributions at various levels. Clearly, proofreading OCR output is standard enough.
If you want to get a sense of what operations need ultimately to be supported, you could skim http://digitalhumanities.org/dhq/vol/3/1/000035.html.
That's a good question. What environments currently support OCR proofreading and translation, and direct links to page-images of the original source? This is doable, with no special software or tools, via wikisource (in multiple languages, with interlanguage links and crude paragraph alignment) and commons (for page images). The pages could also be stored in other repositories such as the Archive, as long as there is an easy way to link out to them or transclude thumbnails. [maybe an InstaCommons plugin for the Internet Archive?]
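For the page-image side, deep links are mostly a matter of URL construction. A rough sketch, assuming the Archive's BookReader "/page/n<leaf>" pattern (the identifier below is a placeholder, not a real item):

    def page_link(identifier: str, leaf: int) -> str:
        """Deep link to one page image inside an archive.org item."""
        return f"https://archive.org/details/{identifier}/page/n{leaf}"

    print(page_link("examplegreekedition00xyz", 42))
    # -> https://archive.org/details/examplegreekedition00xyz/page/n42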
That's quite an interesting monograph you link to. I see six main sets of features/operations described there. Each of them deserves a mention in Wikimedia's strategic planning. Aside from language analysis, each has significant value for all of the Projects, not just wikisource.
OCR TOOLS
* OCR optimization: statistical data, page layout hints
* Capturing page-layout logical structures

CROSS-REFERENCING
* Quote, source, and plagiarism identification
* Named entity identification (automatic for some entities? hints)
* Automatic linking (of URLs, abbreviated citations, &c.), markup projection

TEXT ALIGNMENT (a toy alignment sketch follows this list)
* Canonical text services (chapter/verse equivalents)
* Version analysis between versions
* Translation alignment

TRANSLATION SUPPORT
* Automated translation (seed translations, hints for humans)
* Translation dictionaries (on mouseover?)

CROSS-LANGUAGE SEARCHING
* Cross-referencing across translations
* Quote identification across translations

LANGUAGE ANALYSIS
* Word analysis: word sense discovery, morphology
* Sentence analysis: syntactic, metrical (poetry)
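The text-alignment set is the easiest to make concrete. A toy sketch of mapping chapter/verse equivalents between two versions of a work -- the mapping values are invented for illustration:

    alignment = {
        # version A citation -> version B citation (invented values)
        "1.1": "1.1",
        "1.2": "1.2a",   # e.g. a verse that version B splits differently
    }

    def equivalent(cite_a: str) -> str:
        """Map a chapter.verse reference in version A to version B."""
        return alignment.get(cite_a, cite_a)  # default: same reference

    assert equivalent("1.2") == "1.2a"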
Greg
John Vandenberg wrote:
On Tue, Aug 11, 2009 at 3:00 PM, Samuel Klein <meta.sj@gmail.com> wrote:
... Let's take a practical example. A classics professor I know (Greg Crane, copied here) has scans of primary source materials, some with approximate or hand-polished OCR, waiting to be uploaded and converted into a useful online resource for editors, translators, and classicists around the world.
Where should he and his students post that material?
I am a bit confused. Are these texts currently hosted at the Perseus Digital Library?
If so, they are already a useful online resource. ;-)
If they would like to see these primary sources pushed into the Wikimedia community, they would need to upload the images (or DjVu) onto Commons, and the text onto Wikisource where the distributed proofreading software resides.
We can work with them to import a few texts in order to demonstrate our technology and preferred methods, and then they can decide whether they are happy with this technology, the community, and the potential for translations and commentary.
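To make the mechanics concrete, here is a rough sketch of that two-step upload using the third-party mwclient library; every filename, credential, and bit of wikitext below is a placeholder:

    import mwclient

    # Page images (or a DjVu) go to Commons...
    commons = mwclient.Site("commons.wikimedia.org")
    commons.login("username", "password")          # placeholder credentials
    with open("Example_Edition.djvu", "rb") as f:
        commons.upload(f, "Example_Edition.djvu",
                       "Scan of a placeholder edition, for proofreading")

    # ...and an Index page goes to Wikisource, where the proofreading
    # extension pairs each page of text with the matching page image.
    wikisource = mwclient.Site("en.wikisource.org")
    wikisource.login("username", "password")
    index = wikisource.pages["Index:Example_Edition.djvu"]
    index.save("<!-- index page wikitext here -->",
               summary="Set up index for proofreading")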
I made a start on creating a Perseus-to-Wikisource importer about a year ago...!
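The first step of such an importer might look like the following -- fetch a TEI-XML chunk and strip it to plain text for Wikisource. The endpoint URL is a hypothetical stand-in, not a documented Perseus API:

    import urllib.request
    import xml.etree.ElementTree as ET

    def fetch_plain_text(url: str) -> str:
        """Download a TEI-XML chunk and return its text content only."""
        with urllib.request.urlopen(url) as response:
            root = ET.fromstring(response.read())
        return "".join(root.itertext())

    # Usage, with a hypothetical stand-in for a Perseus export URL:
    # print(fetch_plain_text(
    #     "http://www.perseus.tufts.edu/hopper/xmlchunk?doc=..."))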
Or they can upload the DjVu to the Internet Archive, or a similar repository, and see where it goes from there.
Wherever they end up, the primary article about each work would surely link out to the OL and WS pages for that work (where one exists).
Wikisource has been adding OCLC numbers to pages, and adding links to archive.org when the djvu files came from there (these links contain an archive.org identifier). There are also links to LibraryThing and Open Library; we have very few rules ;-)
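Those identifiers make the outbound links mechanical. A small sketch (the identifier values are placeholders; the URL patterns are the commonly used ones):

    def worldcat_link(oclc: str) -> str:
        """Link a Wikisource page to its WorldCat record by OCLC number."""
        return f"https://www.worldcat.org/oclc/{oclc}"

    def archive_link(identifier: str) -> str:
        """Link back to the archive.org item a DjVu file came from."""
        return f"https://archive.org/details/{identifier}"

    print(worldcat_link("12345678"))            # placeholder OCLC number
    print(archive_link("exampleitem00smith"))   # placeholder identifier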
-- John Vandenberg
wikisource-l@lists.wikimedia.org