I hope this is the right place to ask this question: is there a technical reason for not hosting TIFF files on Wikimedia servers? Commons forbids them, and apparently has since the beginning. However, it would be helpful for Wikisource if TIFF files of scans of public domain works could be uploaded and used for proofreading and/or archival.
Thanks!
Nathaniel / User:Spangineer
Nathaniel wrote:
> I hope this is the right place to ask this question: is there a technical reason for not hosting TIFF files on Wikimedia servers? Commons forbids them, and apparently has since the beginning. However, it would be helpful for Wikisource if TIFF files of scans of public domain works could be uploaded and used for proofreading and/or archival.
What do you need TIFF files for that PNG can't handle?
I'm genuinely interested; there are things the TIFF format can do that PNG can't, but I'd like to know which features you need and why.
On 12/1/06, Ilmari Karonen nospam@vyznev.net wrote:
> What do you need TIFF files for that PNG can't handle?
> I'm genuinely interested; there are things the TIFF format can do that PNG can't, but I'd like to know which features you need and why.
I'm not sure of all the technical details, but see the Wikisource discussion at http://en.wikisource.org/wiki/Wikisource:Scriptorium/Archives/2006/11#TIFF_f....
From what I understand, PNGs aren't all that great for photographs (or, in the Wikisource case, scans of pages of books and other documents) and don't work well with scanset, and JPEGs are ridiculously large for the same amount of information as is in a TIFF. Discussion on Commons (http://commons.wikimedia.org/wiki/Commons:Village_pump#TIFF_files.3F) seemed to suggest that the real problem was image size, but again, it looks like JPEGs would end up being bigger than the TIFFs.
At the end of the day, we'd like to be able to store large images that are as detailed as possible, and be able to use them for proofreading on Wikisource.
Nathaniel
On 12/1/06, Nathaniel spangineer@gmail.com wrote:
> http://en.wikisource.org/wiki/Wikisource:Scriptorium/Archives/2006/11#TIFF_f....
> From what I understand, PNGs aren't all that great for photographs (or, in the Wikisource case, scans of pages of books and other documents) and don't work well with scanset, and JPEGs are ridiculously large for the same amount of information as is in a TIFF. Discussion on Commons (http://commons.wikimedia.org/wiki/Commons:Village_pump#TIFF_files.3F) seemed to suggest that the real problem was image size, but again, it looks like JPEGs would end up being bigger than the TIFFs.
PNG and TIFF are in the exact same boat in this regard. JPEG isn't necessarily bigger, but rather we should expect objectionable artifacts in JPEGs.
The misconception that PNGs are somehow low quality for continuous tone images appears to be rampant in our community and it needs to be stopped. What PNGs actually are is space inefficient, and while people might argue that *space is cheap*, bandwidth is not and until we have cross-format thumbnailing, size will remain a major factor in our selection of upload formats.
We also have the worry that "TIFF" doesn't actually tell us what it is. TIFF is one of those generic wrappers which people can (and do) shove all sorts of random crap into, including proprietary/patented formats (I've seen TIFFs stuffed with MrSID for digital orthophotography), although it's not that common since most things can't read such TIFFs.
> At the end of the day, we'd like to be able to store large images that are as detailed as possible, and be able to use them for proofreading on Wikisource.
The best format for this is DjVu, which we support, but we don't have support for either the browser plugins or server-side autoconversion.
The only advantage I see for TIFF is that some of the browser plugins have nice navigation abilities, but the same is true for DjVu. If we're going to put something up that needs a browser plugin, it should be the free software solution. DjVu also happens to have MUCH better compression, and it would also be an acceptable lossy solution for photographs as well, better than JPEG.
On 12/1/06, Gregory Maxwell gmaxwell@gmail.com wrote:
[snip]
> The best format for this is DjVu, which we support, but we don't have support for either the browser plugins or server-side autoconversion.
> The only advantage I see for TIFF is that some of the browser plugins have nice navigation abilities, but the same is true for DjVu. If we're going to put something up that needs a browser plugin, it should be the free software solution. DjVu also happens to have MUCH better compression, and it would also be an acceptable lossy solution for photographs as well, better than JPEG.
Sorry, DjVu isn't yet supported in the ProofreadPage extension. This extension is used for proofreading works at Wikisource.
http://en.wikisource.org/wiki/Help:Side_by_side_image_view_for_proofreading
http://bugzilla.wikimedia.org/show_bug.cgi?id=7957
http://bugzilla.wikimedia.org/show_bug.cgi?id=7534
On 12/1/06, Luiz Augusto lugusto@gmail.com wrote:
> Sorry, DjVu isn't yet supported in the ProofreadPage extension. This extension is used for proofreading works at Wikisource.
> http://en.wikisource.org/wiki/Help:Side_by_side_image_view_for_proofreading
> http://bugzilla.wikimedia.org/show_bug.cgi?id=7957
> http://bugzilla.wikimedia.org/show_bug.cgi?id=7534
Um. ProofreadPage doesn't support any form of "multiple pages in one document", because that's just not how it works, AFAIK (it just uses transclusion). It also doesn't support DjVu files because we don't yet convert them server-side, nor can your browser display them, nor do we emit the embedded object stuff to call up the plugin. TIFF would be in exactly the same boat.
On Fri, 01 Dec 2006 18:12:50 +0100, Ilmari Karonen nospam@vyznev.net wrote:
> Nathaniel wrote:
> > I hope this is the right place to ask this question: is there a technical reason for not hosting TIFF files on Wikimedia servers? Commons forbids them, and apparently has since the beginning. However, it would be helpful for Wikisource if TIFF files of scans of public domain works could be uploaded and used for proofreading and/or archival.
> What do you need TIFF files for that PNG can't handle?
> I'm genuinely interested; there are things the TIFF format can do that PNG can't, but I'd like to know which features you need and why.
I would guess the ability to include multiple "pages" in one file. If you scan a 1000-page book, most document scanners will let you save it as a single multi-page TIFF file, and that is a heck of a lot easier to upload than 1000 PNG images would be.
Converting to PDF seems like an easy workaround for that, though. Not quite sure why TIFFs are outright banned on Commons; security concerns, maybe (the format is apparently a bit prone to buffer overrun attacks)? As far as I can tell there are no licensing or patent issues with the format.
On 12/1/06, Sherool jamydlan@online.no wrote:
> I would guess the ability to include multiple "pages" in one file. If you scan a 1000-page book, most document scanners will let you save it as a single multi-page TIFF file, and that is a heck of a lot easier to upload than 1000 PNG images would be.
DjVu supports multipage documents. We support .djvu uploads.
http://djvulibre.djvuzone.org/
Sure, you'll need a viewer plugin/app, but TIFF is in the same boat.
On Fri, 01 Dec 2006 20:47:05 +0100, Gregory Maxwell gmaxwell@gmail.com wrote:
> On 12/1/06, Sherool jamydlan@online.no wrote:
> > I would guess the ability to include multiple "pages" in one file. If you scan a 1000-page book, most document scanners will let you save it as a single multi-page TIFF file, and that is a heck of a lot easier to upload than 1000 PNG images would be.
> DjVu supports multipage documents. We support .djvu uploads.
> http://djvulibre.djvuzone.org/
> Sure, you'll need a viewer plugin/app, but TIFF is in the same boat.
Ah, I see. I'm so out of touch with all these "new" (to me) formats :-O
I still have some old IFF images from the Amiga days, is that a supported format? :P
Sherool wrote:
> On Fri, 01 Dec 2006 18:12:50 +0100, Ilmari Karonen wrote:
> > I'm genuinely interested; there are things the TIFF format can do that PNG can't, but I'd like to know which features you need and why.
> I would guess the ability to include multiple "pages" in one file.
Which, of course, is an ability that's very poorly supported.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
> Sherool wrote:
> > On Fri, 01 Dec 2006 18:12:50 +0100, Ilmari Karonen wrote:
> > > I'm genuinely interested; there are things the TIFF format can do that PNG can't, but I'd like to know which features you need and why.
> > I would guess the ability to include multiple "pages" in one file.
> Which, of course, is an ability that's very poorly supported.
Depends on which OS you use. Windows 98, 2k and XP have included the program "Imaging", which is able to produce and read them.
Regards, Marco
Ilmari Karonen wrote:
What do you need TIFF files for that PNG can't handle?
TIFF is a container format in two respects: 1) It can hold many images (pages of a document) in one file, just like a ZIP archive or a PDF file. This aspect is not supported by MediaWiki today, but could perhaps be in the future? 2) It can hold images of many different compression formats. One of these is the CCITT/ITU-T Group 4 facsimile, a.k.a. "G4", a lossless compression for black and white (bitonal, not grayscale) images, which is very space-efficient for scanned images of printed pages.
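Both container aspects can be seen directly in the file structure. The following is a minimal Python sketch, not anything MediaWiki does: it walks the chain of image file directories (IFDs) in a classic TIFF and reports the Compression tag (259) of each page. The helper names `list_tiff_pages` and `make_demo_tiff` are made up for illustration, and the demo file is a synthetic skeleton with no pixel data, so it only demonstrates the layout.

```python
import struct

# A subset of the Compression (tag 259) values defined in the TIFF 6.0 spec.
COMPRESSION_NAMES = {
    1: "uncompressed",
    2: "CCITT modified Huffman RLE",
    3: "CCITT Group 3 fax",
    4: "CCITT Group 4 fax (G4)",
    5: "LZW",
    7: "JPEG",
    32773: "PackBits",
}
TAG_COMPRESSION = 259


def list_tiff_pages(data):
    """Return one compression name per page (IFD) of a classic TIFF file."""
    order = {b"II": "<", b"MM": ">"}[data[:2]]  # little- or big-endian file
    magic, offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    pages = []
    while offset != 0:  # an offset of 0 terminates the IFD chain
        (count,) = struct.unpack_from(order + "H", data, offset)
        compression = 1  # default when the tag is absent
        for i in range(count):
            pos = offset + 2 + 12 * i  # each IFD entry is 12 bytes
            tag, _ftype, _n = struct.unpack_from(order + "HHI", data, pos)
            if tag == TAG_COMPRESSION:
                # A single SHORT value sits in the first 2 bytes of the value field.
                (compression,) = struct.unpack_from(order + "H", data, pos + 8)
        pages.append(COMPRESSION_NAMES.get(compression, f"unknown ({compression})"))
        (offset,) = struct.unpack_from(order + "I", data, offset + 2 + 12 * count)
    return pages


def make_demo_tiff(compressions):
    """Build a synthetic little-endian TIFF skeleton: one IFD per entry, no pixels."""
    body = b""
    offset = 8  # the first IFD follows the 8-byte header
    ifd_size = 2 + 12 + 4  # entry count + one entry + next-IFD offset
    for i, comp in enumerate(compressions):
        entry = struct.pack("<HHIHH", TAG_COMPRESSION, 3, 1, comp, 0)  # type 3 = SHORT
        is_last = i == len(compressions) - 1
        next_offset = 0 if is_last else offset + ifd_size
        body += struct.pack("<H", 1) + entry + struct.pack("<I", next_offset)
        offset += ifd_size
    return b"II" + struct.pack("<HI", 42, 8) + body


for number, name in enumerate(list_tiff_pages(make_demo_tiff([4, 4, 7])), start=1):
    print(f"page {number}: {name}")
```

Because the compression scheme is a per-page tag, a single .tif file may mix G4 pages with JPEG or other encodings, which is why the file extension alone says little about what is inside.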
For example, the image http://runeberg.org/img/nfad/0670.4.png is 240 kbytes in 150 dpi 16-level grayscale PNG but only 200 kbytes in 600 dpi bitonal TIFF G4. The difference between 150 dpi grayscale and 600 dpi bitonal becomes very clear when you try to print the page on a 600 or 1200 dpi laser printer.
G4 and JPEG are probably the two most commonly used image compressions for scanned images inside PDF documents. For a scanned G4 image, you can easily convert between TIFF and PDF just by modifying the file header. This is not CPU-intensive, since you don't need to uncompress the scanned image within.
DjVu and JBIG2 are more recent developments in this area, but have only very recently become available in open source code. Most scanners and scanning programs support TIFF G4, which has been around since the early 1990s. It is well supported by libtiff.
Lars Aronsson wrote:
> For example, the image http://runeberg.org/img/nfad/0670.4.png is 240 kbytes in 150 dpi 16-level grayscale PNG but only 200 kbytes in 600 dpi bitonal TIFF G4. The difference between 150 dpi grayscale and 600 dpi bitonal becomes very clear when you try to print the page on a 600 or 1200 dpi laser printer.
Have you tried a comparison against a 600 dpi bitonal PNG? The DEFLATE compression used in PNGs may or may not be as effective for bitonal images as G4, but it should still handle them pretty well.
On 12/2/06, Lars Aronsson lars@aronsson.se wrote:
> For example, the image http://runeberg.org/img/nfad/0670.4.png is 240 kbytes in 150 dpi 16-level grayscale PNG but only 200 kbytes in 600 dpi bitonal TIFF G4. The difference between 150 dpi grayscale and 600 dpi bitonal becomes very clear when you try to print the page on a 600 or 1200 dpi laser printer.
Well, you should compare apples to apples. The image you specified becomes only 58 kbytes when saved as 1-bit (bitonal) PNG. And bitmapped formats like PNG and TIFF have no concept of "dpi"; that's an external manipulation made by the printing program. You need to compare images with the same pixel sizes.
Ciao, Alfio
Alfio Puglisi wrote:
> On 12/2/06, Lars Aronsson lars@aronsson.se wrote:
> > For example, the image http://runeberg.org/img/nfad/0670.4.png is 240 kbytes in 150 dpi 16-level grayscale PNG but only 200 kbytes in 600 dpi bitonal TIFF G4. The difference between 150 dpi grayscale and 600 dpi bitonal becomes very clear when you try to print the page on a 600 or 1200 dpi laser printer.
> Well, you should compare apples to apples. The image you specified becomes only 58 kbytes when saved as 1-bit (bitonal) PNG.
Nice. :)
> And bitmapped formats like PNG and TIFF have no concept of "dpi"; that's an external manipulation made by the printing program.
On the contrary, these formats do have such a concept; it's metadata usually stored in the file header, describing the ratio between the pixel-level representation and the physical size of the scanned source or the expected printed output.
For PNG, DPI can be indicated in a pHYs data chunk: http://www.w3.org/TR/PNG/#11pHYs
For TIFF, the XResolution, YResolution, and ResolutionUnit fields indicate this data.
In both formats this is optional; not every image necessarily has a physical size that makes sense. However the resolution fields can also indicate the aspect ratio for non-square pixels, which are a reality for instance when working in various video formats.
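For the PNG side, here is a minimal Python sketch of how the pHYs chunk carries this information. It builds a tiny 1x1 bitonal PNG in memory and reads the resolution back. The helper names (`chunk`, `make_demo_png`, `read_dpi`) are hypothetical, and the conversion assumes the pHYs unit specifier is 1 (metre); with unit 0 only the pixel aspect ratio is defined.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METERS_PER_INCH = 0.0254


def chunk(kind, payload):
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def make_demo_png(dpi):
    """Build a 1x1 bitonal PNG whose pHYs chunk declares the given resolution."""
    ppm = round(dpi / METERS_PER_INCH)  # pHYs stores pixels per metre
    ihdr = struct.pack(">IIBBBBB", 1, 1, 1, 0, 0, 0, 0)  # 1x1, bit depth 1, grayscale
    phys = struct.pack(">IIB", ppm, ppm, 1)  # unit specifier 1 = metre
    idat = zlib.compress(b"\x00\x00")  # filter byte + one packed row byte
    return (PNG_SIGNATURE + chunk(b"IHDR", ihdr) + chunk(b"pHYs", phys)
            + chunk(b"IDAT", idat) + chunk(b"IEND", b""))


def read_dpi(png):
    """Return (x_dpi, y_dpi) from the pHYs chunk, or None if absent or unitless."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        length, kind = struct.unpack_from(">I4s", png, pos)
        if kind == b"pHYs":
            x_ppm, y_ppm, unit = struct.unpack_from(">IIB", png, pos + 8)
            if unit != 1:
                return None  # unit 0: only the aspect ratio is meaningful
            return (x_ppm * METERS_PER_INCH, y_ppm * METERS_PER_INCH)
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return None


print(read_dpi(make_demo_png(600)))
```

Since the chunk stores whole pixels per metre, a value round-trips to approximately, not exactly, the original dpi figure.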
> You need to compare images with the same pixel sizes.
Generally one would want to do so, yes. :)
A comparison between two files of different resolution and color depth may well be a valid one at times, but it's not relevant if you're trying to compare the relative file sizes of the same raster data in different file formats.
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
Alfio Puglisi wrote:
> On 12/2/06, Lars Aronsson lars@aronsson.se wrote:
> > For example, the image http://runeberg.org/img/nfad/0670.4.png is 240 kbytes in 150 dpi 16-level grayscale PNG but only 200 kbytes in 600 dpi bitonal TIFF G4. The difference between 150 dpi grayscale and 600 dpi bitonal becomes very clear when you try to print the page on a 600 or 1200 dpi laser printer.
> Well, you should compare apples to apples. The image you specified becomes only 58 kbytes when saved as 1-bit (bitonal) PNG. And bitmapped formats like PNG and TIFF have no concept of "dpi"; that's an external manipulation made by the printing program. You need to compare images with the same pixel sizes.
Sorry for the confusion. Allow me to clarify this.
Were you able to read the text from that 58 kbyte image?
This page was originally scanned at 600 dpi bitonal and saved as TIFF G4, a format supported by most scanner software. The image is 3616x5550 pixels and the resulting file is 196 kbytes. You can get it here:
http://runeberg.org/img/nfad-0670.tif
Since most web browsers don't support TIFF, and since 600 dpi is nowhere near the resolution of most computer screens, the web presentation uses a downscaled (by a scale factor 1:4) grayscale image instead. The conversion process is roughly:
tifftopnm | pnmscale .25 | ppmquant 16 | pnmtopng
In the resulting image, now 200 kbytes, every 150 pixels correspond to one inch of the printed page, from which it was originally scanned. Thus, I'm using the shorthand terminology that this is now a 150 dpi image. Since most computer screens have a resolution in the range 100-120 dpi, this means the image is displayed a little "larger than life".
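The arithmetic behind this shorthand can be sketched in a few lines of Python. The function names are made up for illustration, and the floor division is an assumption; the actual netpbm tools may round the scaled height differently.

```python
def physical_size_inches(width_px, height_px, dpi):
    """Size of the scanned page implied by the pixel dimensions and scan dpi."""
    return width_px / dpi, height_px / dpi


def downscale(width_px, height_px, dpi, factor):
    """Pixel dimensions and effective dpi after shrinking by an integer factor."""
    return width_px // factor, height_px // factor, dpi / factor


def screen_magnification(image_dpi, screen_dpi):
    """How much larger than the printed original the image appears on screen."""
    return image_dpi / screen_dpi


scan = (3616, 5550, 600)                  # the original 600 dpi bitonal scan
page_w, page_h = physical_size_inches(*scan)
web = downscale(*scan, 4)                 # the 1:4 web presentation image

print(f"page: {page_w:.2f} x {page_h:.2f} in")
print(f"web image: {web[0]} x {web[1]} px at {web[2]:.0f} dpi")
for screen_dpi in (100, 120):
    factor = screen_magnification(web[2], screen_dpi)
    print(f"on a {screen_dpi} dpi screen: {factor:.2f}x life size")
```

Under these assumptions the web image is shown at between 1.25 and 1.5 times the size of the printed page on a 100-120 dpi screen.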
Scanning at 600 dpi is something of a standard for printed text. Many scanned books can then be displayed on screen in 120 dpi (downscale 1:5), but this book (an encyclopedia) has small print which is more readable in 150 dpi.
You suggest the original TIFF image could be saved as a 1-bit PNG instead, and this is true. The conversion process is:
tifftopnm | pnmtopng
The resulting image is 404 kbytes, or twice as large as the TIFF G4 image. You can get this at http://runeberg.org/img/nfad-0670.png
On 12/4/06, Lars Aronsson lars@aronsson.se wrote:
> Sorry for the confusion. Allow me to clarify this.
[...]
> In the resulting image, now 200 kbytes, every 150 pixels correspond to one inch of the printed page, from which it was originally scanned. Thus, I'm using the shorthand terminology that this is now a 150 dpi image. Since most computer screens have a resolution in the range 100-120 dpi, this means the image is displayed a little "larger than life".
Ah ok, now I understand what you are trying to do.
TIFF G4 actually seems to be the best format for this kind of image, regardless of the dpi setting. PNG can store 1-bit images, but even at the maximum compression level (9) the full 3616x5550 image is about 380 kbytes instead of 200.
Ciao, Alfio
On 12/4/06, Alfio Puglisi alfio.puglisi@gmail.com wrote:
> > In the resulting image, now 200 kbytes, every 150 pixels correspond to one inch of the printed page, from which it was originally scanned. Thus, I'm using the shorthand terminology that this is now a 150 dpi image. Since most computer screens have a resolution in the range 100-120 dpi, this means the image is displayed a little "larger than life".
> Ah ok, now I understand what you are trying to do.
> TIFF G4 actually seems to be the best format for this kind of image, regardless of the dpi setting. PNG can store 1-bit images, but even at the maximum compression level (9) the full 3616x5550 image is about 380 kbytes instead of 200.
No. DjVu is, by a mile.
[gmaxwell@floodlamp ~]$ cjb2 -losslevel 100 nfad-0670.tif nfad-0670.djvu
[gmaxwell@floodlamp ~]$ ls -l nfad-0670.*
-rw-rw-r-- 1 gmaxwell gmaxwell 72075 Dec 4 21:55 nfad-0670.djvu
-rw-rw-r-- 1 gmaxwell gmaxwell 200865 Dec 4 08:01 nfad-0670.tif
[gmaxwell@floodlamp ~]$ cjb2 -lossless nfad-0670.tif nfad-0670.djvu
[gmaxwell@floodlamp ~]$ ls -l nfad-0670.*
-rw-rw-r-- 1 gmaxwell gmaxwell 128772 Dec 4 21:56 nfad-0670.djvu
-rw-rw-r-- 1 gmaxwell gmaxwell 200865 Dec 4 08:01 nfad-0670.tif
[gmaxwell@floodlamp ~]$ cjb2 -losslevel 200 nfad-0670.tif nfad-0670.djvu
[gmaxwell@floodlamp ~]$ ls -l nfad-0670.*
-rw-rw-r-- 1 gmaxwell gmaxwell 35209 Dec 4 21:56 nfad-0670.djvu
-rw-rw-r-- 1 gmaxwell gmaxwell 200865 Dec 4 08:01 nfad-0670.tif
Even at losslevel 200, zoomed way in, I'd be hard pressed to tell the original from the DjVu. DjVu lossy works by reusing whole characters, so it should not adversely impact OCR, although it does not work so well on noisy documents. Even in lossless mode the DjVu is about two-thirds the size of the TIFF.
The free software DjVu viewer is very good, easily better than most of the PS and TIFF viewers I've used.
DjVu can also produce mixed pages of images and text using the optimal compression method for each type. You can't do that with TIFF in G4 mode. DjVu's lossy compression of photographic material is vastly superior to JPEG (more quality per bit in terms of PSNR, and fewer objectionable artifacts at low quality levels). DjVu files can also be loaded incrementally, loading more data as you zoom in. Great for high-resolution images like satellite photography or other ridiculously high-resolution photographs.
On 12/4/06, Gregory Maxwell gmaxwell@gmail.com wrote:
> No. DjVu is, by a mile.
[snip]
> Even at losslevel 200, zoomed way in, I'd be hard pressed to tell the original from the DjVu.
[snip]
Pardon my overactive send button.. I guess I should be fair and give examples:
http://72.165.205.81/djvuex.gif
And while the properly compressed PNG (http://72.165.205.81/p3.png) weighs in at 70% more (335KiB) than the TIFF, the TIFF is 470% larger than the lossy DjVu you might use if you really cared about size.
Or, in perhaps more useful terms: assuming this page is typical, with a 28.8K modem PNG is 34 pages *per hour*, TIFF is 51 pages per hour, lossless DjVu 92 pages per hour, and high lossy DjVu is 360 pages per hour. So a typical reader on dialup could outread the transmission of both PNG and TIFF, a fast reader (383 WPM) would just keep up with lossless DjVu, and a freak like Kat Walsh would need the super lossy in order to not sit waiting. :) At DSL/cable or T1 speeds even the fastest human reader couldn't read as fast as PNGs would transfer.
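For anyone who wants to redo this estimate with their own numbers, here is a rough Python sketch. It assumes the full nominal 28.8 kbit/s with no protocol overhead or modem compression, and 250 words per page (a figure inferred from the 383 WPM at 92 pages per hour quoted above). Since the effective throughput behind the quoted figures is not stated, the results will not match them exactly.

```python
MODEM_BITS_PER_SECOND = 28_800  # nominal line rate; real throughput is lower

# TIFF and DjVu sizes are taken from the ls listings earlier in the thread;
# the PNG size is the 335 KiB quoted for the recompressed PNG.
PAGE_BYTES = {
    "PNG": 335 * 1024,
    "TIFF G4": 200_865,
    "DjVu lossless": 128_772,
    "DjVu losslevel 200": 35_209,
}


def pages_per_hour(page_bytes, bits_per_second=MODEM_BITS_PER_SECOND):
    """Pages transferred per hour, ignoring overhead and modem compression."""
    bytes_per_hour = bits_per_second / 8 * 3600
    return bytes_per_hour / page_bytes


def reading_speed_needed(pages_hour, words_per_page=250):
    """Words per minute a reader needs to keep pace with the transfer."""
    return pages_hour * words_per_page / 60


for name, size in PAGE_BYTES.items():
    rate = pages_per_hour(size)
    wpm = reading_speed_needed(rate)
    print(f"{name:20s} {rate:6.1f} pages/h  {wpm:6.0f} WPM to keep up")
```

Passing a different `bits_per_second` or `words_per_page` shows how sensitive the comparison is to those two assumptions; the ordering of the formats stays the same.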
On Sun, Dec 03, 2006 at 08:52:31PM +0100, Alfio Puglisi wrote:
> On 12/2/06, Lars Aronsson lars@aronsson.se wrote:
> > For example, the image http://runeberg.org/img/nfad/0670.4.png is 240 kbytes in 150 dpi 16-level grayscale PNG but only 200 kbytes in 600 dpi bitonal TIFF G4. The difference between 150 dpi grayscale and 600 dpi bitonal becomes very clear when you try to print the page on a 600 or 1200 dpi laser printer.
> Well, you should compare apples to apples. The image you specified becomes only 58 kbytes when saved as 1-bit (bitonal) PNG. And bitmapped formats like PNG and TIFF have no concept of "dpi"; that's an external manipulation made by the printing program. You need to compare images with the same pixel sizes.
Well, an inference could be made that when one describes two images with differing DPI renderings that the *final size in inches* is expected to be the same, but you're right: that's expecting too much of the speakers... :-)
Cheers, -- jra
wikitech-l@lists.wikimedia.org