I like, and study as deeply as I can, the DjVu file structure and the DjVuLibre routines; dealing with Wikisource needs I appreciate PDF files too, but I like them less because of their complexity. The proofreading procedure is presently based on DjVu or PDF files, but I think another approach could be used, relying only on simpler routines.
The proofreading procedure needs two inputs: 1. a set of good images of the page scans; 2. a good "mapped" file of text content matched to the images.
About "mapped text", there are two alternatives, hOCR and xml; both can be used to extract "unmapped raw text" when needed at server level, but at local level too by jQuery. If hOCR/xml of page text could be fastly and simply accessed from nsPage, I see interesting opportunities - i.e. generalized highlighting of selected text on nsPage image both in view and in edit mode; formatting suggestions from heuristic analysis of word coordinates; different organization of high level text structures, as wrong column layout).
Alex brollo (it.wikisource)
From my perspective, a DjVu or PDF file is just an archive format for images. Any text that comes along with them is ancillary; if it's missing, we can always generate it from OCR. I could just as well use CBR/CBZ files, though they're not as reliable for having a sensible format. I want to avoid, as much as possible, dealing with a bunch of disconnected page images, because that maximizes the possibility for human error.
Nevertheless, consider the file structure inside archive.org, which collects images into zip files and text into _djvu.xml files, and uses them to drive its brilliant viewer. The DjVu format really can be used as a compact images+XML container, but it seems an obsolete file format, as the recent discontinuation of DjVu output by archive.org suggests. PDF is IMHO too complex and can't be considered an open format.
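For anyone who has never opened one of those _djvu.xml files: the text layer maps every word to its coordinates on the page image, with a structure roughly like this (an invented, trimmed fragment, just to show the shape; coords are left,bottom,right,top in image pixels):

  <OBJECT data="file://localhost//page_0012.djvu" height="4678" width="3012">
    <HIDDENTEXT>
      <PAGECOLUMN>
        <REGION>
          <PARAGRAPH>
            <LINE>
              <WORD coords="310,542,455,498">Nevertheless</WORD>
              <WORD coords="470,542,610,498">consider</WORD>
            </LINE>
          </PARAGRAPH>
        </REGION>
      </PAGECOLUMN>
    </HIDDENTEXT>
  </OBJECT>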
Alex brollo
On Sat, Jul 6, 2019 at 3:24 PM Alex Brollo alex.brollo@gmail.com wrote:
Nevertheless, consider the file structure inside archive.org, which collects images into zip files and text into _djvu.xml files, and uses them to drive its brilliant viewer. The DjVu format really can be used as a compact images+XML container, but it seems an obsolete file format, as the recent discontinuation of DjVu output by archive.org suggests. PDF is IMHO too complex and can't be considered an open format.
Let's look at one of the files I'm going to upload. https://archive.org/details/Weird_Tales_v02n02_1923-09 was originally uploaded as a zip file of JPEG files. If I could upload it as that, or as the zip of JP2 files, I would. Right now, I'm going to convert them to DjVu and upload them, without any text information. However, there are a lot of cases where we just have PDF files, and I don't want to force some of our more technically unskilled users to have to figure out file conversion, especially where, in the case of PDF files, there's no point; Wikimedia can convert it losslessly to any number of page-image formats without much problem.
PDF is IMHO too complex and can't be considered an open format.
It's got an ISO standard and royalty-free patent licensing. An open format doesn't have to be a simple or good one; it just has to have an agreed-upon standard without licensing problems.
I don't fully understand your statement "Right now, I'm going to convert them to DjVu and upload them, *without any text information*." Don't you feel any need for an excellent OCR layer when proofreading it on Wikisource? Do you feel fully satisfied by the MediaWiki OCR of images? Unfortunately, I find the MediaWiki OCR very uncomfortable when dealing with non-English books, and I don't know how to get XML data about the mapping of words onto the page image. For sure, if MediaWiki 1. could serve the best possible OCR of images that have no text layer, after automatically recognizing the language of the text, 2. encouraged uploading images at the best possible quality, and 3. could optionally serve hOCR or XML of the mapped text layer, there would be no need for a good third-party OCR layer.
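To illustrate point 3: an hOCR text layer is just HTML in which every line and word carries its bounding box on the page image, roughly like this (an invented fragment, not the output of a real tool run):

  <span class="ocr_line" title="bbox 508 311 2312 373">
    <span class="ocrx_word" title="bbox 508 311 744 369; x_wconf 96">Queste</span>
    <span class="ocrx_word" title="bbox 772 311 942 369; x_wconf 93">parole</span>
  </span>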
On Thu, Jul 11, 2019 at 11:22 PM Alex Brollo alex.brollo@gmail.com wrote:
I don't fully understand your statement "Right now, I'm going to convert them to DjVu and upload them, without any text information." Don't you feel any need for an excellent OCR layer when proofreading it on Wikisource?
I reuploaded the first issue of Weird Tales in DjVu because the PDF was significantly fuzzier than the DjVu, and looking at the PDF OCR, it's slightly better than what I can get from the interface. Given the choice between better images and better OCR, I go with the first one.
Do you feel fully satisfied by the MediaWiki OCR of images?
I can't even get the MediaWiki OCR to work. I use the Google OCR gadget.
I don't know how to get XML data about the mapping of words onto the page image.
It's a pretty distant concern for me, somewhat tangential to producing transcriptions of the works.
Thank you for mentioning the Google OCR gadget, I didn't know about it; I'll test it for sure, even if I'm far from happy to become dependent on Google services.
Alex