pdftotext - Wikisource-l

22 Nov 2012


Is the "pdftotext" program used when extracting
the OCR text layer from a PDF file?
For this book,
http://fr.wikisource.org/wiki/Livre:Liste_provisoire_des_noms_destines.pdf
it seems that "pdftotext -raw" would produce a
better result than the text currently extracted.
If you download the source PDF file and run
pdftotext both with and without the -raw option,
you will see two differences. Some very bold
words come out letter-spaced without -raw
(H e l l o) but intact with -raw (Hello).
The column separation of some pages also differs,
e.g. page 81 (De Roster--Herborn), where Dyck
is followed by E with -raw but by G without it.
The man page for pdftotext says -raw is deprecated,
but I don't understand why, as it produces the
better result here.
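
To make the comparison easy to repeat, here is a minimal Python sketch
that runs pdftotext both ways on one page and prints a unified diff.
The function names (pdftotext_cmd, extract, compare) are my own and only
for illustration. It assumes pdftotext is on the PATH, that -f/-l count
PDF pages (which need not match the printed page numbers), and that the
output is UTF-8.

```python
import difflib
import subprocess


def pdftotext_cmd(pdf_path, raw=False, page=None):
    """Build a pdftotext command line that writes the text to stdout."""
    cmd = ["pdftotext"]
    if raw:
        cmd.append("-raw")  # the option under discussion
    if page is not None:
        # -f / -l select the first and last page to convert
        cmd += ["-f", str(page), "-l", str(page)]
    cmd += [pdf_path, "-"]  # "-" as the output file name means stdout
    return cmd


def extract(pdf_path, raw=False, page=None):
    """Run pdftotext and return the extracted text (assumed UTF-8)."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, raw, page),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def compare(default_text, raw_text):
    """Return unified-diff lines between default and -raw output."""
    return list(
        difflib.unified_diff(
            default_text.splitlines(),
            raw_text.splitlines(),
            fromfile="default",
            tofile="-raw",
            lineterm="",
        )
    )
```

With the PDF saved locally under some name such as book.pdf (a
placeholder), compare(extract("book.pdf", page=81),
extract("book.pdf", raw=True, page=81)) should list the lines that
differ on that page, provided 81 is also the PDF page index.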
-- 
   Lars Aronsson (lars@aronsson.se)
   Projekt Runeberg - fri nordisk litteratur - http://runeberg.org/