Interesting; it confirms my old idea that the "heart of Wikisource" is
nsIndice, and that the unit of transcription is the page in nsPagina. But
it is an isolated opinion: I have been contradicted by people (including
Wikisourcians of the highest international standing) who are convinced
that nsIndice and nsPagina are merely "proofreading tools".
Obviously the XML structuring of the content, from the little I have seen,
recalls the TEI structure (is it an evolution of it?), but living inside
Wikisource I see that the "original sin" of not valuing nsPagina risks
making things complicated, or impossible, besides having wasted an
incredible amount of energy on "transclusion".
My energy and my enthusiasm are fading....
Alex
2015-10-05 13:04 GMT+02:00 Federico Leva (Nemo) <nemowiki(a)gmail.com>:
I'm finding this document quite useful:
http://www.succeed-project.eu/sites/default/files/deliverables/Succeed_6005…
See description of ALTO pasted below, which is a followup to
https://lists.wikimedia.org/pipermail/wikisource-l/2014-September/002081.ht…
. We should find a way to convert the transcribed books' HTML to ALTO
format. :)
Some libraries are apparently using
http://www.primaresearch.org/tools/Aletheia which seems an augmented (but
unfree?!) version of ScanTailor with some different purpose.
Nemo
Principles
ALTO stores layout information and OCR recognized text of pages of any
kind of printed documents like books, journals and newspapers. ALTO can
detail technical metadata for describing the layout and content of
physical resources (text, illustrations, graphics).
ALTO describes a content page with different views:
The Description section helps to describe some general settings and
information of the ALTO file (measurement units, file name, etc.), and the
production process itself (processing steps, software used, dates and
actors, etc.)
The Layout section contains what's on the page. A page is divided into
several regions (print space; left, right, top and bottom margins). For
each region, all objects are listed which have been detected inside: text
blocks, illustrations, graphical elements, composed blocks. Each object
previously identified is defined by generic attributes: width, height,
text content (for the String element). Besides, the reading order of all
the elements can be managed.
Each ALTO file may also contain a style section where different styles (for
paragraphs and fonts) are listed.
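To make the three sections above concrete, here is a minimal sketch in
Python that builds an ALTO-like tree. It is an assumption-laden
illustration, not a schema-valid file: the XML namespace and several
required attributes are left out, the helper name build_alto_sketch is
hypothetical, and the style values are made up.

```python
import xml.etree.ElementTree as ET


def build_alto_sketch(words):
    """Build a simplified ALTO-like tree (hypothetical helper).

    words: list of (text, hpos, vpos, width, height) tuples, in pixels.
    The namespace and some mandatory attributes are omitted on purpose.
    """
    alto = ET.Element("alto")

    # Description: general settings such as the measurement unit.
    description = ET.SubElement(alto, "Description")
    ET.SubElement(description, "MeasurementUnit").text = "pixel"

    # Styles: fonts and paragraph styles (values here are invented).
    styles = ET.SubElement(alto, "Styles")
    ET.SubElement(styles, "TextStyle", ID="TXT_0", FONTSIZE="10")

    # Layout: page -> print space -> text block -> line -> strings.
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page", ID="P1")
    print_space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(print_space, "TextBlock", ID="TB1")
    line = ET.SubElement(block, "TextLine")
    for text, hpos, vpos, width, height in words:
        ET.SubElement(
            line,
            "String",
            CONTENT=text,
            HPOS=str(hpos),
            VPOS=str(vpos),
            WIDTH=str(width),
            HEIGHT=str(height),
        )
    return alto


root = build_alto_sketch([("Hello", 10, 20, 50, 12), ("world", 65, 20, 48, 12)])
print(ET.tostring(root, encoding="unicode"))
```

Each String carries both its text and its box on the page, which is the
part a plain HTML transcription does not have.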
Use cases
ALTO is one of the most common formats used by libraries for converting
text from images. It's used both to deliver digitized contents and to
preserve these contents.
In a delivery perspective, the ability of ALTO to store the text content
coordinates in a page allows the overlay of image and text (multilayer
PDF) and highlight search words in a query.
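The search-highlighting use case can be sketched roughly as follows, in
Python. This assumes a simplified ALTO-like input without the XML
namespace that real files carry; the sample data and the function name
find_word_boxes are hypothetical, and matching is a plain
case-insensitive comparison on whole words.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample: real ALTO files are namespaced.
SAMPLE = """<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="The" HPOS="10" VPOS="40" WIDTH="30" HEIGHT="14"/>
            <String CONTENT="heart" HPOS="45" VPOS="40" WIDTH="45" HEIGHT="14"/>
            <String CONTENT="of" HPOS="95" VPOS="40" WIDTH="20" HEIGHT="14"/>
            <String CONTENT="Wikisource" HPOS="120" VPOS="40" WIDTH="90" HEIGHT="14"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def find_word_boxes(alto_xml, query):
    """Return (hpos, vpos, width, height) for each String matching query."""
    root = ET.fromstring(alto_xml)
    wanted = query.lower()
    boxes = []
    for string in root.iter("String"):
        if string.get("CONTENT", "").lower() == wanted:
            box = tuple(
                float(string.get(name))
                for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
            boxes.append(box)
    return boxes


# Rectangles a viewer could draw over the page image.
print(find_word_boxes(SAMPLE, "wikisource"))
```

A viewer would then draw each returned rectangle over the scan, scaled
from the measurement unit to the displayed image size.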
_______________________________________________
Wikisource-l mailing list
Wikisource-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l