[Wikisource-l] Analysed Layout and Text Object (ALTO)

5 Oct 2015


I'm finding this document quite useful:
http://www.succeed-project.eu/sites/default/files/deliverables/Succeed_60055...
See the description of ALTO pasted below, which is a follow-up to
https://lists.wikimedia.org/pipermail/wikisource-l/2014-September/002081.htm...
We should find a way to convert the transcribed books' HTML to ALTO
format. :)
Some libraries are apparently using
http://www.primaresearch.org/tools/Aletheia which seems to be an augmented
(but unfree?!) version of ScanTailor with a somewhat different purpose.
Nemo
Principles
ALTO stores layout information and OCR-recognized text of pages of any
kind of printed document, such as books, journals and newspapers. ALTO
can detail technical metadata describing the layout and content of
physical resources (text, illustrations, graphics).
ALTO describes a content page with different views:
- The Description section describes general settings and information
  about the ALTO file (measurement units, file name, etc.) and the
  production process itself (processing steps, software used, dates
  and actors, etc.).
- The Layout section contains what's on the page. A page is divided
  into several regions (print space; left, right, top and bottom
  margins). For each region, all objects detected inside it are listed:
  text blocks, illustrations, graphical elements, composed blocks. Each
  object identified is defined by generic attributes: width, height,
  text content (for the String element). In addition, the reading order
  of all the elements can be managed.
- Each ALTO file may also contain a style section where different
  styles (for paragraphs and fonts) are listed.
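
To make the three sections described above concrete, here is a minimal
Python sketch that reads a tiny ALTO file with the standard library. The
sample document is made up for illustration and trimmed (it would not
validate against the schema), the ALTO v3 namespace is assumed (other
versions use a different URI), and read_alto is a hypothetical helper
name, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO v3 namespace; adjust for files written against other versions.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Made-up, trimmed sample showing the Description, Styles and Layout sections.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
    <sourceImageInformation>
      <fileName>page_0001.tif</fileName>
    </sourceImageInformation>
  </Description>
  <Styles>
    <TextStyle ID="TS1" FONTFAMILY="Times" FONTSIZE="10"/>
  </Styles>
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="2800">
        <TextBlock ID="TB1" HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="60">
          <TextLine HPOS="100" VPOS="100" WIDTH="600" HEIGHT="60">
            <String CONTENT="Hello" HPOS="100" VPOS="100" WIDTH="250" HEIGHT="60"/>
            <SP HPOS="350" VPOS="100" WIDTH="50"/>
            <String CONTENT="world" HPOS="400" VPOS="100" WIDTH="300" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_alto(xml_text):
    """Return the measurement unit and a list of words with their boxes."""
    root = ET.fromstring(xml_text)
    # Description section: general settings such as the measurement unit.
    unit = root.findtext("alto:Description/alto:MeasurementUnit", namespaces=NS)
    words = []
    # Layout section: every String element carries its text and coordinates.
    for s in root.iterfind(".//alto:String", NS):
        words.append({
            "text": s.get("CONTENT"),
            "hpos": float(s.get("HPOS")),
            "vpos": float(s.get("VPOS")),
            "width": float(s.get("WIDTH")),
            "height": float(s.get("HEIGHT")),
        })
    return unit, words


if __name__ == "__main__":
    unit, words = read_alto(SAMPLE)
    print("unit:", unit)
    for w in words:
        print(w["text"], w["hpos"], w["vpos"], w["width"], w["height"])
```

Document order of the String elements is used here as a stand-in for
reading order, which is a simplification.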
Use cases
ALTO is one of the most common formats used by libraries for converting
text from images. It's used both to deliver digitized content and to
preserve that content.
From a delivery perspective, the ability of ALTO to store the
coordinates of the text content on a page allows the overlay of image
and text (multilayer PDF) and the highlighting of search words from a
query.
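
The search-word highlighting mentioned above can be sketched in a few
lines of Python: look up the String elements matching a query and scale
their boxes to the size at which the page image is displayed. This is
only an illustration under assumptions: the page fragment is made up and
trimmed, the ALTO v3 namespace is assumed, highlight_boxes is a
hypothetical name, and matching is a naive case-insensitive whole-word
comparison.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO v3 namespace; adjust for files written against other versions.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Made-up, trimmed page fragment (not schema-valid) with word coordinates in pixels.
PAGE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="The" HPOS="100" VPOS="200" WIDTH="150" HEIGHT="60"/>
            <String CONTENT="Library," HPOS="300" VPOS="200" WIDTH="400" HEIGHT="60"/>
          </TextLine>
          <TextLine>
            <String CONTENT="library" HPOS="100" VPOS="300" WIDTH="380" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def highlight_boxes(alto_xml, query, display_width):
    """Return (x, y, width, height) boxes, in display pixels, for words matching query."""
    root = ET.fromstring(alto_xml)
    page = root.find(".//alto:Page", NS)
    # Scale from ALTO page coordinates to the width of the displayed image.
    scale = display_width / float(page.get("WIDTH"))
    needle = query.casefold()
    boxes = []
    for s in page.iterfind(".//alto:String", NS):
        # Strip surrounding punctuation so "Library," still matches "library".
        word = s.get("CONTENT", "").strip(".,;:!?\"'()").casefold()
        if word == needle:
            boxes.append(tuple(
                round(float(s.get(key)) * scale)
                for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            ))
    return boxes


if __name__ == "__main__":
    # Page shown at half size: 2000 px wide in ALTO, 1000 px on screen.
    print(highlight_boxes(PAGE, "library", 1000))
    # prints [(150, 100, 200, 30), (50, 150, 190, 30)]
```

A viewer would then draw these rectangles as a translucent layer over the
page image.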
