Hello,
We are trying to write a press release for the impending one millionth file to be uploaded at Commons. http://commons.wikimedia.org/wiki/Commons:Press_releases/1M
I thought it would be cool if we included a paragraph or two about how Commons helps Wikisource in particular. I know DjVu files have that cool 'next/previous page' thing if nothing else. Something mentioning open standards/open file formats and how this is useful for projects in multiple languages would be cool (the first DjVu I came across, I noticed, had Russian and English, for example...).
I recall Danny doing a bit of a writeup on Wikisource recently; if no one else is inspired, I'm sure we could crib something from there, but if anyone feels like it, it would be great. :)
thanks, Brianna user:pfctdayelise
I think DjVu sounds wonderful, but I don't think that many people understand how to use it. I certainly don't. Is this the format Commons would like us to be using for scanned books?
As for how WS uses Commons, it is not consistent across all subdomains. I think de.WS uses it most heavily, but they do not participate on this list. en.WS has decided to put scans on Commons as well, but we are still stumbling about it. I would be very interested in working the kinks out of this with people who are familiar with Commons. I have a book I have begun to scan but haven't uploaded yet because I am uncertain of the best way to go about it. I do feel that in the long term we need to see all scans on Commons so that WS can really grow into its full potential as a workspace for creating free translations.
Besides hosting scans needed for proofreading, WS also uses Commons, as other WMF projects do, to host illustrations. Since our illustrations are usually cut out of the scanned pages, we tend to use Commons less as a resource for finding things than as a place to host them. In the case of reference works, many of these illustrations can also be added to Wikipedia projects. CommonsTicker has been a great thing for us because many of our images need to remain purposefully out-of-date in order to be appropriate to the text.
BirgitteSB
_______________________________________________ Wikisource-l mailing list Wikisource-l@mail.wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikisource-l
Hello,
on how to upload scanned texts:
it would be great if the MediaWiki DjVu inline renderer and the ProofreadPage extension could be made to work together. Then one could upload texts as DjVu with all its benefits (plain text/image mixing, efficient storage, only one single file upload), but one would still be able to extract single pages into Wikisource's Page: namespace.
In the state things are now, I'd opt for uploading all pages individually as PNG, because ProofreadPage is too great a convenience to lose.
Best regards, Alexander
Alexander Klauer wrote:
on how to upload scanned texts:
it would be great if the MediaWiki DjVu inline renderer and the ProofreadPage extension could be made to work together. Then one could upload texts as DjVu with all its benefits (plain text/image mixing, efficient storage, only one single file upload), but one would still be able to extract single pages into Wikisource's Page: namespace.
Ultimately, upload and download should be possible in DjVu, PDF, TIFF, and ZIP archive. All of those formats are capable of storing many pages in one file. As far as I know, DjVu and PDF are capable of mixing image and (OCR) text in one file, including the mapping of individual words to positions in the image. In a ZIP archive, you could store the scanned image in 0001.jpg (or .png or .tif) together with OCR text in 0001.txt, etc.
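To make the ZIP layout concrete, here is a minimal sketch in Python of packing each scanned page next to its OCR text under a shared four-digit stem (0001.jpg with 0001.txt, and so on). The function names `build_volume_zip` and `read_page` are made up for illustration; nothing like them exists in MediaWiki as far as I know.

```python
import io
import zipfile


def build_volume_zip(pages):
    """Pack (image_bytes, ocr_text) pairs into an in-memory ZIP archive.

    Page N is stored as NNNN.jpg plus NNNN.txt, numbered from 0001.
    Both function names here are hypothetical, for illustration only.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for number, (image_bytes, ocr_text) in enumerate(pages, start=1):
            stem = f"{number:04d}"
            archive.writestr(f"{stem}.jpg", image_bytes)
            archive.writestr(f"{stem}.txt", ocr_text.encode("utf-8"))
    return buffer.getvalue()


def read_page(archive_bytes, number):
    """Return (image_bytes, ocr_text) for one page of the archive."""
    stem = f"{number:04d}"
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        image_bytes = archive.read(f"{stem}.jpg")
        ocr_text = archive.read(f"{stem}.txt").decode("utf-8")
    return image_bytes, ocr_text
```

Keeping image and text under the same stem means a single page can be pulled out without unpacking the whole volume.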
A download (e.g. in PDF format, for facsimile printing) should be possible for all pages in a volume or for all pages belonging to a chapter.
Currently, pages in fr.wikisource have names such as [[Page:Fermat - Livre 1-000008.jpg]] so "Fermat - Livre 1" could be the ZIP filename, and 000008.jpg would be the image contained within the ZIP archive. Instead of the dash, one might consider "/" for subpages here.
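As a rough sketch of that naming idea in Python: split a page title at its last dash into the volume name and the image name, then rebuild it with "/" as a subpage separator. The pattern and the function names are my own assumptions, based only on the one fr.wikisource example above.

```python
import re

# Assumed title shape: "Page:<volume>-<digits>.<ext>", as in the example above.
PAGE_TITLE = re.compile(r"^Page:(?P<volume>.+)-(?P<image>\d+\.(?:jpg|png|tif))$")


def split_page_title(title):
    """Split a page title into (volume name, image file name)."""
    match = PAGE_TITLE.match(title)
    if match is None:
        raise ValueError(f"unexpected page title: {title!r}")
    return match.group("volume"), match.group("image")


def subpage_title(title):
    """Rewrite the title so the image becomes a subpage of the volume."""
    volume, image = split_page_title(title)
    return f"Page:{volume}/{image}"


print(split_page_title("Page:Fermat - Livre 1-000008.jpg"))
print(subpage_title("Page:Fermat - Livre 1-000008.jpg"))
```

The greedy match keeps the earlier dash in "Fermat - Livre 1" inside the volume name and splits only at the dash directly before the page number.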
Next challenge: If the OCR text holds the position of each word in the image, can you combine this with JavaScript (AJAX?) to highlight (in yellow) in the image the word you are currently wiki-editing? And how do you update that position when you move text around?
How does commercial PDF/DjVu proofreading software handle this?
There is still a lot of programming to be done for this.
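To give an idea of the word-position problem, here is a rough sketch in Python (the real thing would live in the browser). It assumes the OCR output is a list of words with one bounding box per word, which is my assumption and not something DjVu or PDF tooling is known to deliver in this shape. `word_at_offset` finds the word under the editing cursor, and `carry_boxes` uses the standard `difflib` module to carry boxes over to an edited text.

```python
from difflib import SequenceMatcher


def word_at_offset(text, offset):
    """Return the index of the whitespace-separated word containing
    the character offset, or None if the offset is not inside a word."""
    position = 0
    for index, word in enumerate(text.split()):
        start = text.index(word, position)
        end = start + len(word)
        if start <= offset < end:
            return index
        position = end
    return None


def carry_boxes(ocr_words, boxes, edited_words):
    """Give each edited word the box of the OCR word it aligns with.

    Boxes are hypothetical (left, top, right, bottom) tuples.
    Words that were inserted or changed get None.
    """
    result = [None] * len(edited_words)
    matcher = SequenceMatcher(a=ocr_words, b=edited_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            result[block.b + k] = boxes[block.a + k]
    return result
```

This alignment only follows words that stay in the same relative order. A paragraph moved to a different place would lose its boxes, so the question of moved text remains open in this sketch.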
On 16/11/06, Birgitte SB birgitte_sb@yahoo.com wrote:
I think DjVu sounds wonderful, but I don't think that many people understand how to use it. I certainly don't. Is this the format Commons would like us to be using for scanned books?
Commons has no preference about this! Whatever suits you. I was just under the impression that Wikisource used DjVu. From a comment by brion ( http://en.wikisource.org/wiki/Wikisource:Scriptorium/Archives/2006/07#DjVu_u... ) it looks like it's de.WS that uses it, as you said.
en.WS has decided to put scans on Commons as well, but we are still stumbling about it. I would be very interested in working the kinks out of this with people who are familiar with Commons. I have a book I have begun to scan but haven't uploaded yet because I am uncertain of the best way to go about it. I do feel that in the long term we need to see all scans on Commons so that WS can really grow into its full potential as a workspace for creating free translations.
Well, we are certainly interested in working with WS too. Danny is running as a sysop on Commons and this is very likely to succeed. So maybe this will be helpful too.
Spangineer made a comment recently about Wikisourcians possibly being worried about Commons due to images being deleted or "fixed" erroneously. ( http://commons.wikimedia.org/wiki/User_talk:Pfctdayelise/Principles#Wikisour... ) As I said to him, Wikisourcians should feel confident about (politely! :)) reverting such changes and requesting that changes be made to a new copy of a file.
regards, Brianna
Hi,
Is there a page about the ProofreadPage extension somewhere? I can't find anything on meta or mediawiki... or wikisource for that matter...
Alexander, combining DjVu and ProofreadPage sounds like something that should definitely be listed in Bugzilla...
cheers, Brianna
On 11/16/06, Brianna Laugher brianna.laugher@gmail.com wrote:
Hi,
Is there a page about the ProofreadPage extension somewhere? I can't find anything on meta or mediawiki... or wikisource for that matter...
(...)
Hi, I found this http://en.wikisource.org/wiki/Help:Side_by_side_image_view_for_proofreading.
I've never really been happy with the DjVu software I used. (I used the free online service at http://any2djvu.djvuzone.org/ to test out the file format.) I realize it's a lossy compression, but for black & white scanned text images, converting from PNG to DjVu butchered the display. Sure, the file size was just a few kilobytes, but it was terrible to look at. Colored pictures fared much better--the quality was reduced, but not nearly as badly as it was for B&W pictures.
I don't know what the software from LizardTech is like, but I imagine it wouldn't kill the display quite so much. Unless we can find a way of keeping a relatively high quality for the DjVu pictures, I'd support using an alternative (like TIFF, although that might produce MASSIVE file sizes if we link 600 pages together into one file).
Z
Alexander Klauer wrote:
it would be great if the MediaWiki DjVu inline renderer and the ProofreadPage extension could be made to work together. Then one could upload texts as DjVu with all its benefits (plain text/image mixing, efficient storage, only one single file upload), but one would still be able to extract single pages into Wikisource's Page: namespace.
Sorry, I did not check this ML recently. I agree that it would be great to handle djvu format. Unfortunately, I currently do not have enough time to start such a project... If a programmer is willing to do it I will be glad to give him/her the feedback that I can.
Thomas
Ryan Dabler wrote:
I realize it's a lossy compression,
Just some basics here: When you digitize old books, the greatest loss is in the paper being torn, yellowed with age, or stained by coffee, and in the ink having faded. In the next step, the scanning or photography always loses part of what the printed page contains. After this, it doesn't matter quite so much if the compression of the computer file is "lossy" or not. Don't get religious about "lossless" compression. A good result can still be achieved by paying adequate attention to every step of the process. But perfectionism just doesn't pay. If you get unreadable results, you need to go back and redo the steps that failed.
Formats like GIF, JPEG and TIFF existed in 1990. DjVu and JBIG2 are more advanced formats that have appeared later than that and are still not very common in free software applications.
If you failed in using DjVu, it's not the format's fault.
(like TIFF, although that might produce MASSIVE file sizes if we link 600 pages together into one file).
TIFF is a container format that can hold images in a variety of compressions and formats. Perhaps you are referring to raw tiff.
Anyway, what's considered huge changes with time. When people download government reports in PDF today, they download an entire book (500 pages) instead of individual chapters (25 pages).
On 11/17/06, Lars Aronsson lars@aronsson.se wrote:
Just some basics here: When you digitize old books, the greatest loss is in the paper being torn, yellowed with age, or stained by coffee, and in the ink having faded. In the next step, the scanning or photography always loses part of what the printed page contains. After this, it doesn't matter quite so much if the compression of the computer file is "lossy" or not. Don't get religious about "lossless" compression. A good result can still be achieved by paying adequate attention to every step of the process. But perfectionism just doesn't pay. If you get unreadable results, you need to go back and redo the steps that failed.
Formats like GIF, JPEG and TIFF existed in 1990. DjVu and JBIG2 are more advanced formats that have appeared later than that and are still not very common in free software applications.
If you failed in using DjVu, it's not the format's fault.
I'm not saying it is. What I'm saying is that while I realize there will be a loss in quality at every step (including converting to different formats), the image quality *up to* converting to DjVu (from, say, GIF or PNG) is relatively good. So, if the quality of the DjVu image, using only the freely available DjVu software I can find, is not at all appealing, then it would be better for aesthetic reasons to stick with the non-DjVu images and do our best to crunch down the file size (PNGs can get some good compression). That way we can still keep high-quality scans around that people will actually want to look at.
Maybe the software you can buy is much better (and I wouldn't doubt it), but I'm not going to drop $400 to get a copy; I'd rather just stick with a more "conventional" format.
Z
wikisource-l@lists.wikimedia.org