On 24 November 2014 at 13:51, Andrea Zanni <zanni.andrea84@gmail.com> wrote:

Another great accomplishment could be *giving back proofread OCR* to GLAMs: think about libraries (or the Internet Archive!) giving us ancient texts, and us giving them back a perfect DjVu or PDF with the text layer mapped inside... 
I'm sure we could have many GLAMs coming to us then :-)
Right now we can give them back almost nothing, apart from our HTML pages.

This is exactly the kind of suggestion I have been looking for. Many cultural institutions are developing their own crowdsourced transcription projects. I think Wikisource can be a much more robust platform than these one-off projects, with a better-developed community, aggregating the transcription of texts from many institutions in a single place with a proven process.

At NARA, alongside our own transcription program, we are also developing a writable API for submitting transcriptions to it, because we recognize that third-party platforms like Wikisource might be the best place for the actual transcribing to take place. As long as we can ingest that data back into our own dataset, that is.
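To make the ingestion idea concrete, here is a minimal sketch of what a submission to such a writable API might look like. Everything here is hypothetical: the endpoint URL, the field names (naId, pageNum, contribution), and the request shape are placeholders for illustration, not NARA's actual API.

```python
import json
import urllib.request

def build_submission(naid: str, page: int, text: str) -> urllib.request.Request:
    """Build a POST request submitting one page of transcribed text.

    All field names and the endpoint are assumptions for illustration.
    """
    payload = json.dumps({
        "naId": naid,          # catalog identifier of the record (assumed field)
        "pageNum": page,       # which scanned page this text transcribes (assumed field)
        "contribution": text,  # the proofread plain text itself (assumed field)
    }).encode("utf-8")
    return urllib.request.Request(
        "https://catalog.example.gov/api/transcriptions",  # placeholder URL
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_submission("12345", 1, "Transcribed text of page one.")
print(req.get_method(), req.full_url)
```

The point is that an institution only needs a few structured fields per page (record ID, page number, plain text) to ingest external transcriptions, which is exactly the fielded data that is hard to pull out of Wikisource today.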

How would I do that now? Wikisource pages are not structured data (though Wikimedia Commons image metadata soon will be!), so there is no clear way to use the Wikisource API to extract just the relevant transcribed text on a page as a field. And on top of that, any text you do extract this way will be full of templates and other markup that has no meaning outside the context of Wikisource. I don't see a way to easily extract just the plain text that is meaningful and relevant, along with other fielded data such as what page or text it belongs to.
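To illustrate the problem: the MediaWiki API will happily return the raw wikitext of a Page:-namespace page (e.g. via action=query with prop=revisions), but what you get back is the transcription interleaved with Wikisource templates. A sketch, using made-up but typical wikitext, shows why naive template stripping is lossy:

```python
import re

# Made-up but representative wikitext as returned for a Page:-namespace
# page: the transcription is interleaved with Wikisource templates.
raw_wikitext = (
    "{{RunningHeader||THE NATIONAL ARCHIVES|42}}\n"
    "The quick brown fox {{SIC|jumpd|jumped}} over the lazy dog.\n"
    "{{smaller|A note set in smaller type.}}"
)

def strip_templates(text: str) -> str:
    """Naively delete non-nested {{...}} templates.

    This drops layout-only templates like {{RunningHeader}}, but it also
    drops real content, e.g. the corrected reading inside {{SIC}}, which
    is exactly why plain-text extraction from wikitext is unreliable.
    """
    return re.sub(r"\{\{[^{}]*\}\}", "", text)

plain = strip_templates(raw_wikitext)
print(plain)
```

Here the running header is correctly discarded, but the word carried by the {{SIC}} template vanishes too: without understanding each template's semantics, you cannot tell formatting noise from transcribed text, let alone recover fielded metadata like which work and page the text belongs to.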