Onion sourcing. That would be a nice improvement on simple cite styles.
On Tue, Aug 11, 2009 at 12:10 PM, Gregory Crane <gregory.crane@tufts.edu> wrote:
There are various layers to this onion. The key element is that books and pages are artifacts in many cases. What we really want are the logical structures that splatter across pages.
And across and around works...
First, we have added a bunch of content -- esp. editions of Greek and Latin sources -- to the Internet Archive holdings, and we are cataloguing editions that are in the overall collection, regardless of who put them there. This goes well beyond the standard book catalogue records -- we are interested in the content, not in books per se. Thus, we may add hundreds of records for a single book.
Is there a way to deep link to a specific page-image from one of these works without removing it from the Internet Archive?
We would like to have useable etexts from all of these editions -- many of which are not yet in our collections. Many of these are in Greek and need a lot of work because the OCR is not very good.
So bad OCR for them exists, but no usable etexts?
To use canonical texts, you need book/chapter/verse markup and you need FRBR-like citations ... deep annotations... syntactic analyses, word sense, co-reference...
These are nice features, but perhaps you can develop a clean etext first, and overlay this metadata in parallel or later on.
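One way to keep those layers separable is to treat book/chapter/verse citations as an overlay on the clean text rather than as inline markup. A minimal sketch in Python of that idea -- every name, identifier, and offset here is invented for illustration, not an existing scheme:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Citation:
        work: str      # FRBR-like work identifier, independent of any edition
        book: int
        chapter: int
        verse: int

    # The overlay maps citations to character offsets in the clean etext,
    # so the citation layer can live alongside the text, added later.
    overlay = {
        Citation("ExampleWork", 1, 1, 1): (0, 57),   # placeholder offsets
    }

    def passage(text: str, cite: Citation) -> str:
        """Return the span of the etext that the citation covers."""
        start, end = overlay[cite]
        return text[start:end]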
My question is what environments can support contributions at various levels. Clearly, proofreading OCR output is standard enough.
If you want to get a sense of what operations need ultimately to be supported, you could skim http://digitalhumanities.org/dhq/vol/3/1/000035.html.
That's a good question. What environments currently support OCR proofreading and translation, and direct links to page-images of the original source? This is doable, with no special software or tools, via wikisource (in multiple languages, with interlanguage links and crude paragraph alignment) and commons (for page images). The pages could also be stored in other repositories such as the Archive, as long as there is an easy way to link out to them or transclude thumbnails. [maybe an InstaCommons plugin for the Internet Archive?]
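For the page-image side, deep links are mostly a matter of URL construction. A rough sketch, assuming the Archive's BookReader "/page/n<leaf>" pattern (the identifier below is a placeholder, not a real item):

    def page_link(identifier: str, leaf: int) -> str:
        """Deep link to one page image inside an archive.org item."""
        return f"https://archive.org/details/{identifier}/page/n{leaf}"

    print(page_link("examplegreekedition00xyz", 42))
    # -> https://archive.org/details/examplegreekedition00xyz/page/n42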
That's quite an interesting monograph you link to. I see six main sets of features/operations described there. Each of them deserves a mention in Wikimedia's strategic planning. Aside from language analysis, each has significant value for all of the Projects, not just wikisource.
OCR TOOLS
* OCR optimization: statistical data, page layout hints
* Capturing page-layout logical structures

CROSS-REFERENCING
* Quote, source, and plagiarism identification
* Named entity identification (automatic for some entities? hints)
* Automatic linking (of URLs, abbreviated citations, &c.), markup projection

TEXT ALIGNMENT (a toy alignment sketch follows this list)
* Canonical text services (chapter/verse equivalents)
* Version analysis between versions
* Translation alignment

TRANSLATION SUPPORT
* Automated translation (seed translations, hints for humans)
* Translation dictionaries (on mouseover?)

CROSS-LANGUAGE SEARCHING
* Cross-referencing across translations
* Quote identification across translations

LANGUAGE ANALYSIS
* Word analysis: word sense discovery, morphology
* Sentence analysis: syntactic, metrical (poetry)
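The text-alignment set is the easiest to make concrete. A toy sketch of mapping chapter/verse equivalents between two versions of a work -- the mapping values are invented for illustration:

    alignment = {
        # version A citation -> version B citation (invented values)
        "1.1": "1.1",
        "1.2": "1.2a",   # e.g. a verse that version B splits differently
    }

    def equivalent(cite_a: str) -> str:
        """Map a chapter.verse reference in version A to version B."""
        return alignment.get(cite_a, cite_a)  # default: same reference

    assert equivalent("1.2") == "1.2a"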
Greg
John Vandenberg wrote:
On Tue, Aug 11, 2009 at 3:00 PM, Samuel Klein <meta.sj@gmail.com> wrote:
... Let's take a practical example. A classics professor I know (Greg Crane, copied here) has scans of primary source materials, some with approximate or hand-polished OCR, waiting to be uploaded and converted into a useful online resource for editors, translators, and classicists around the world.
Where should he and his students post that material?
I am a bit confused. Are these texts currently hosted at the Perseus Digital Library?
If so, they are already a useful online resource. ;-)
If they would like to see these primary sources pushed into the Wikimedia community, they would need to upload the images (or DjVu) onto Commons, and the text onto Wikisource where the distributed proofreading software resides.
We can work with them to import a few texts in order to demonstrate our technology and preferred methods, and then they can decide whether they are happy with this technology, the community, and the potential for translations and commentary.
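To make the mechanics concrete, here is a rough sketch of that two-step upload using the third-party mwclient library; every filename, credential, and bit of wikitext below is a placeholder:

    import mwclient

    # Page images (or a DjVu) go to Commons...
    commons = mwclient.Site("commons.wikimedia.org")
    commons.login("username", "password")          # placeholder credentials
    with open("Example_Edition.djvu", "rb") as f:
        commons.upload(f, "Example_Edition.djvu",
                       "Scan of a placeholder edition, for proofreading")

    # ...and an Index page goes to Wikisource, where the proofreading
    # extension pairs each page of text with the matching page image.
    wikisource = mwclient.Site("en.wikisource.org")
    wikisource.login("username", "password")
    index = wikisource.pages["Index:Example_Edition.djvu"]
    index.save("<!-- index page wikitext here -->",
               summary="Set up index for proofreading")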
I made a start on creating a Perseus-to-Wikisource importer about a year ago...!
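The first step of such an importer might look like the following -- fetch a TEI-XML chunk and strip it to plain text for Wikisource. The endpoint URL is a hypothetical stand-in, not a documented Perseus API:

    import urllib.request
    import xml.etree.ElementTree as ET

    def fetch_plain_text(url: str) -> str:
        """Download a TEI-XML chunk and return its text content only."""
        with urllib.request.urlopen(url) as response:
            root = ET.fromstring(response.read())
        return "".join(root.itertext())

    # Usage, with a hypothetical stand-in for a Perseus export URL:
    # print(fetch_plain_text(
    #     "http://www.perseus.tufts.edu/hopper/xmlchunk?doc=..."))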
Or they can upload the DjVu to the Internet Archive, or a similar repository, and see where it goes from there.
Wherever they end up, the primary article about each work would surely link out to the OL and WS pages for that work (where one exists).
Wikisource has been adding OCLC numbers to pages, and adding links to archive.org when the djvu files came from there (these links contain an archive.org identifier). There are also links to LibraryThing and Open Library; we have very few rules ;-)
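Those identifiers make the outbound links mechanical. A small sketch (the identifier values are placeholders; the URL patterns are the commonly used ones):

    def worldcat_link(oclc: str) -> str:
        """Link a Wikisource page to its WorldCat record by OCLC number."""
        return f"https://www.worldcat.org/oclc/{oclc}"

    def archive_link(identifier: str) -> str:
        """Link back to the archive.org item a DjVu file came from."""
        return f"https://archive.org/details/{identifier}"

    print(worldcat_link("12345678"))            # placeholder OCLC number
    print(archive_link("exampleitem00smith"))   # placeholder identifier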
-- John Vandenberg
wikisource-l@lists.wikimedia.org