I'm finding this document quite useful: http://www.succeed-project.eu/sites/default/files/deliverables/Succeed_60055...
See description of ALTO pasted below, which is a followup to https://lists.wikimedia.org/pipermail/wikisource-l/2014-September/002081.htm... . We should find a way to convert the transcribed books' HTML to ALTO format. :)
Some libraries are apparently using http://www.primaresearch.org/tools/Aletheia which seems to be an augmented (but unfree?!) version of ScanTailor with a different purpose.
Nemo
Principles
ALTO stores layout information and OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. ALTO can detail technical metadata for describing the layout and content of physical resources (text, illustrations, graphics).
ALTO describes a content page with different views:
The Description section helps to describe some general settings and information of the ALTO file (measurement units, file name, etc.), and the production process itself (processing steps, software used, dates and actors, etc.).
The Layout section contains what's on the page. A page is divided into several regions (print space; left, right, top and bottom margins). For each region, all objects are listed which have been detected inside: text blocks, illustrations, graphical elements, composed blocks. Each object previously identified is defined by generic attributes: width, height, text content (for the String element). Besides, the reading order of all the elements can be managed.
Each ALTO file may also contain a style section where different styles (for paragraphs and fonts) are listed.
Use cases
ALTO is one of the most common formats used by libraries for converting text from images. It's used both to deliver digitized contents and to preserve these contents. In a delivery perspective, the ability of ALTO to store the text content coordinates in a page allows the overlay of image and text (multilayer PDF) and highlight search words in a query.
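To make the description above more concrete, here is a minimal sketch of reading the Layout section of an ALTO file and locating a search word by its coordinates. The sample document is invented for illustration, the namespace URI assumes ALTO version 2, and the helper names page_text and find_word_boxes are hypothetical; real files produced by OCR software will carry more elements and attributes than shown here.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the ALTO v2 namespace; other versions differ.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Invented sample page with one text block, one line and two words.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
  </Description>
  <Layout>
    <Page ID="P1" WIDTH="1000" HEIGHT="1500">
      <PrintSpace HPOS="50" VPOS="50" WIDTH="900" HEIGHT="1400">
        <TextBlock ID="TB1" HPOS="100" VPOS="100" WIDTH="400" HEIGHT="40">
          <TextLine HPOS="100" VPOS="100" WIDTH="400" HEIGHT="40">
            <String CONTENT="Hello" HPOS="100" VPOS="100" WIDTH="120" HEIGHT="40"/>
            <SP WIDTH="20"/>
            <String CONTENT="Wikisource" HPOS="240" VPOS="100" WIDTH="260" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def page_text(root):
    """Return the page text, one output line per TextLine element."""
    lines = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.findall("alto:String", ALTO_NS)]
        lines.append(" ".join(words))
    return "\n".join(lines)


def find_word_boxes(root, word):
    """Return (hpos, vpos, width, height) for every String matching word.

    The comparison is case-insensitive and matches whole String elements only.
    """
    boxes = []
    for s in root.iterfind(".//alto:String", ALTO_NS):
        if s.get("CONTENT", "").lower() == word.lower():
            boxes.append(tuple(float(s.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")))
    return boxes


root = ET.fromstring(SAMPLE_ALTO)
print(page_text(root))
print(find_word_boxes(root, "wikisource"))
```

The boxes returned by find_word_boxes are what a viewer would need in order to draw a highlight over the page image for a search hit; the unit is whatever the Description section declares (pixels in this sample).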
I don't disagree that this should be part of our long-term vision, and that we need people who can track this and advise the community on its development and implementation. That said, I don't see how we would export to this format, or expand to it, in the wiki form.
I am concerned that we have so many basic issues unresolved, and so little developer time, that the mundane tasks are not being addressed. :-/
Regards, Billinghurst
On Mon, Oct 5, 2015 at 10:04 PM Federico Leva (Nemo) nemowiki@gmail.com wrote:
Wikisource-l mailing list Wikisource-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikisource-l
Interesting; it confirms my old idea that the "heart of Wikisource" is nsIndice (the Index namespace), and that the unit of transcription is the page in nsPagina (the Page namespace). But it is an isolated opinion: I have been contradicted by those (including some Wikisource contributors of the highest international standing) who are convinced that nsIndice and nsPagina are merely "proofreading tools".
Obviously the XML structuring of the content, from the little I have seen, recalls (is it the evolution of?) the TEI structure, but living inside Wikisource I see that the "original sin" of not valuing nsPagina risks making things complex, or impossible, besides having wasted incredible amounts of energy on "transclusion".
My energy and my enthusiasm are waning....
Alex
I apologize for using Italian; my aim was to send a personal reply to Nemo.
Being a personal comment, it doesn't deserve an English translation, so please ignore it.
Alex
2015-10-05 15:05 GMT+02:00 Alex Brollo alex.brollo@gmail.com: