On 24 November 2014 at 13:51, Andrea Zanni <zanni.andrea84(a)gmail.com> wrote:
Another great accomplishment could be *giving back proofread OCR* to GLAMs:
think about libraries (or the Internet Archive!) giving us ancient texts,
and us giving them back a perfect DjVu or PDF with mapped text inside...
I'm sure we could have many GLAMs coming to us then :-)
We cannot give them back much of anything right now, apart from our HTML.
This is exactly the kind of suggestion I have been looking for. Many
cultural institutions are developing their own crowdsourced transcription
projects. I think Wikisource can be a much more robust platform than these
one-off projects, with a better-developed community that aggregates the
transcription efforts for texts from many institutions in a single place,
using a proven process.
At NARA, along with our own transcription program, we are also developing a
writable API for submitting transcriptions to it, because we recognize that
third-party platforms like Wikisource might be the best place for the
actual transcribing to take place. As long as we can ingest that data back
into our own dataset, that is.
How would I do that now? Wikisource pages are not structured data (though
Wikimedia Commons image metadata will soon be!), so there is not a clear
way to use the Wikisource API to extract just the relevant transcribed text
on the page as a field. And on top of that, any text you do extract this
way will be full of templates and other code that has no meaning outside of
the context of Wikisource. I don't see a way to easily extract just the
plain text that is meaningful and relevant (along with other fielded data,
like which page or text it belongs to).
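To make the problem concrete, here is a minimal sketch of what a third party is left doing today: pull the raw wikitext of a Page: namespace page via the MediaWiki API (action=query with prop=revisions and rvprop=content) and then naively strip templates. The sample wikitext below is invented for illustration (it is not a real NARA page), and the regex-based stripping is deliberately crude, to show that template content with real meaning (like the corrected reading inside a {{SIC}} template) is simply lost:

```python
import re

# Illustrative only: a hypothetical Page: namespace wikitext, of the kind
# you would get back from e.g.
#   https://en.wikisource.org/w/api.php?action=query&prop=revisions
#     &rvprop=content&rvslots=main&titles=Page:Example.djvu/5&format=json
sample_wikitext = (
    "{{rh|4|CHAPTER I.|}}\n"
    "The quick brown fox {{SIC|jumpt|jumped}} over the lazy dog.\n"
    "{{nop}}"
)

def strip_templates(wikitext):
    """Naively drop non-nested {{...}} templates.

    This throws away everything inside the braces, including text the
    template was actually carrying (here, the word the {{SIC}} template
    marks as the intended reading).
    """
    return re.sub(r"\{\{[^{}]*\}\}", "", wikitext).strip()

print(strip_templates(sample_wikitext))
# The word handled by {{SIC|...}} disappears along with the template.
```

Nothing in this sketch recovers fielded data (which work the page belongs to, its position in the sequence), which is exactly the gap a structured output from Wikisource would fill.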