Re: [Wikisource-l] Does really wikisource need djvu/pdf files?

12 Jul 2019


I don't fully understand your statement "Right now, I'm going to convert
them to DjVu and upload them, *without any text information*." Don't you
feel any need for an excellent OCR layer when proofreading on
Wikisource? Are you fully satisfied with MediaWiki's OCR of images?
Unfortunately, I find MediaWiki's OCR very uncomfortable when dealing with
non-English books, and I don't know how to get XML data that maps words
to their positions in the page image. For sure, if MediaWiki (1) could serve
the best possible OCR of images with no text layer, after automatically
recognizing the languages of the text, (2) would encourage uploading images
at the best possible quality, and (3) could optionally serve hOCR or XML of
the mapped text layer, there would be no need for a good third-party OCR layer.
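
To make the "mapping of words into page image" concrete, here is a minimal sketch of reading word boxes out of an hOCR file. It assumes the common hOCR convention of `ocrx_word` spans whose `title` attribute carries `bbox x0 y0 x1 y1`, and that word spans contain no nested spans. The names `HocrWords` and `hocr_words` are made up for this illustration and are not part of MediaWiki or any OCR tool.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # Assumed title format: "bbox x0 y0 x1 y1; x_wconf NN"
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title") or "")
            if m:
                self._bbox = tuple(int(g) for g in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: no spans are nested inside a word span.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr):
    """Return a list of (word, bbox) pairs found in an hOCR string."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Each pair gives a word and its pixel rectangle on the page image, which is the information a proofreading viewer would need to highlight a word on the scan.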
On Fri, 12 Jul 2019 at 06:01, David Starner prosfilaes@gmail.com
wrote:
...
On Sat, Jul 6, 2019 at 3:24 PM Alex Brollo alex.brollo@gmail.com wrote:
...
Nevertheless, consider the file structure at archive.org, which
collects images into zip files and text into _djvu.xml files, making
its brilliant viewer possible.
...
The DjVu format really can be used as a compact images+XML container, but it
seems to be an obsolete file format, as archive.org's recent discontinuation
of DjVu output suggests. PDF is IMHO too complex and can't be considered an
open format.
Let's look at one of the files I'm going to upload.
https://archive.org/details/Weird_Tales_v02n02_1923-09 was originally
uploaded as a zip file of JPEG files. If I could upload it as that, or
as the zip of JP2 files, I would. Right now, I'm going to convert them
to DjVu and upload them, without any text information. However,
there are a lot of cases where we just have PDF files, and I don't want
to force some of our less technically skilled users to
figure out file conversion, especially where, in the case of PDF
files, there's no point; Wikimedia can convert them losslessly to any
number of page image formats without much problem.
...
PDF is IMHO too complex and can't be considered an open format.
It's got an ISO standard and royalty-free patent licensing. An open
format doesn't have to be a simple or good one; it just has to have an
agreed-upon standard without licensing problems.
--
Kie ekzistas vivo, ekzistas espero.

Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
