On Fri, Dec 26, 2008 at 5:40 AM, Luiz Augusto lugusto@gmail.com wrote:
On Thu, Dec 25, 2008 at 3:52 PM, Ilmari Karonen nospam@vyznev.net wrote:
Luiz Augusto wrote:
I'm asking because I have approximately 30 GB of public domain scans in PDF
format to upload to Commons over the next few months (see
http://en.wikisource.org/w/index.php?oldid=928004#Royal_Society_Digital_Arch...
for further information on it) and because I fully agree with the reasons listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=11215#c3
Assuming that these are scanned documents that haven't been vectorized, have you considered converting them to DjVu format? Not only does Wikimedia currently have better support for it than PDF, but you might realize some file size savings. Apparently, there's software out there to more or less automate it.
Large batches of scans should be converted to DjVu, as it is a better format. PDF support will be useful for small tasks where the person already has a PDF (or it has already been uploaded to Commons) and doesn't want to learn lots of tools before seeing results. In other words, PDF support will make Wikisource more accessible.
Someone asked about this on en.wikisource and I replied with this: http://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&diff=p...
DjVu (or at least every conversion tool and configuration option that I've tried in the past months, including the LizardTech Document Express Enterprise pdf2djvu and png2djvu options) is a lossy format. If I convert a .pdf downloaded from Google Book Search, I get a low-quality file (70 dpi or 150 dpi per page), but if I extract the images from the same .pdf file using Adobe Acrobat Pro 8, I get a 600 dpi JPEG for each page (OCR software normally recommends using 300 dpi images).
My understanding is that the compression is optional, and the lossy compression is much better than the equivalent lossy compression of PDF.
I think it is the free PDF-to-image extraction tools that are causing your problems.
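One way to check whether resolution is being lost at the extraction step is to estimate the effective resolution of an extracted page image from its pixel width and the physical width of the page. The following is only a minimal sketch: the function names and the 8.5 inch page width are hypothetical, chosen for illustration, and the 300 dpi target is the figure mentioned above for OCR.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Estimate resolution as pixels per inch across the page width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def meets_ocr_target(pixel_width: int, page_width_inches: float,
                     target_dpi: float = 300.0) -> bool:
    """Return True if the image reaches the target resolution (assumed 300 dpi)."""
    return effective_dpi(pixel_width, page_width_inches) >= target_dpi


if __name__ == "__main__":
    # Hypothetical pixel widths for a page assumed to be 8.5 inches wide.
    for width_px in (595, 1275, 5100):
        dpi = effective_dpi(width_px, 8.5)
        print(width_px, round(dpi), meets_ocr_target(width_px, 8.5))
```

Comparing this figure for the images produced by two different extraction tools from the same PDF would show whether the tool, rather than the target format, is responsible for the loss.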
Of course, that doesn't in any way preclude or remove the need for _also_ improving our PDF support.
Surely :)
But PDF, as common and useful as it is, might not be the optimal format here.
Well, all digitized works from all the libraries that I know of (in Europe, the United States and Brazil) are available only in .pdf format. The Internet Archive is the only one to make both .pdf and .djvu available for the same book (the .djvu version from IA is also a low-quality file, but it is at least delivered with high-quality OCR embedded in the .djvu file, thanks to some closed-source, commercial OCR software [ABBYY FineReader, I believe]).
I have found the DjVu files from IA to be of appropriate quality, especially for transcription purposes. The PDFs are usually much larger, and not of much better quality.
-- John Vandenberg