Is the "pdftotext" program used when extracting the OCR text layer from a PDF file?
It seems that "pdftotext -raw" would produce a better result than the current one for this book: http://fr.wikisource.org/wiki/Livre:Liste_provisoire_des_noms_destines.pdf
If you download the source PDF file and run pdftotext with and without the -raw option, you will see two differences. First, some heavily boldfaced words come out letter-spaced without -raw (H e l l o) but intact with -raw (Hello). Second, the column separation differs on some pages: on page 81 (De Roster--Herborn), for example, Dyck is followed by E with -raw but by G without -raw.
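For anyone who wants to reproduce the comparison, here is a rough Python sketch that runs pdftotext on one page in both modes and prints a diff. It is only an illustration under a few assumptions: the local filename is whatever you saved the download as, the helper names (pdftotext_cmd, extract_page, diff_modes) are made up for this example, and I am assuming that "page 81" is the page number pdftotext sees, which may differ from the number printed in the book. The only pdftotext options used are -f, -l and -raw, with "-" as the output file to write to stdout.

```python
import difflib
import os
import shutil
import subprocess


def pdftotext_cmd(pdf_path, page, raw):
    """Build the pdftotext command for a single page, writing text to stdout."""
    cmd = ["pdftotext", "-f", str(page), "-l", str(page)]
    if raw:
        cmd.append("-raw")  # keep the text in content stream order
    cmd += [pdf_path, "-"]  # "-" as output file means stdout
    return cmd


def extract_page(pdf_path, page, raw):
    """Run pdftotext and return the extracted text of one page."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, page, raw),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def diff_modes(default_text, raw_text):
    """Return a unified diff between default-mode and raw-mode output."""
    return list(
        difflib.unified_diff(
            default_text.splitlines(),
            raw_text.splitlines(),
            fromfile="default",
            tofile="raw",
            lineterm="",
        )
    )


if __name__ == "__main__":
    pdf = "Liste_provisoire_des_noms_destines.pdf"  # assumed local filename
    page = 81  # assumed to match the PDF's own page numbering
    if shutil.which("pdftotext") and os.path.exists(pdf):
        default_text = extract_page(pdf, page, raw=False)
        raw_text = extract_page(pdf, page, raw=True)
        print("\n".join(diff_modes(default_text, raw_text)))
    else:
        print("pdftotext or the PDF file was not found; nothing to compare.")
```

Lines starting with "-" in the diff come from the default mode and lines starting with "+" from -raw, so the letter-spaced words and the reordered columns should both show up as changed lines.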
The man page for pdftotext says -raw is deprecated, but I don't understand why, since it produces the better result here.
wikisource-l@lists.wikimedia.org