Re: [Wikisource-l] Systems for proofreading scanned books - Wikisource-l

27 Dec 2020


      "There is also a difference in how we view copyright,
as my own website can cut corners and scan some books
that are "most likely" out of copyright, which is
something Wikimedia's user communities never accept."
Some parts of the community do accept this. The Polish Wikisource project
uploaded a translation of one of Montgomery's books as a "pseudonymous" work,
without any proof that the name is a pseudonym (and even if it is, the upload
is against COM:PRP). It's still on Commons and, AFAIK, the deletion request
was either rejected by admins or has not been decided yet.
Mateusz Malinowski
On Sun, 27 Dec 2020, 13:02, <
wikisource-l-request@lists.wikimedia.org> wrote:
...
Send Wikisource-l mailing list submissions to
        wikisource-l@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.wikimedia.org/mailman/listinfo/wikisource-l
or, via email, send a message with subject or body 'help' to
        wikisource-l-request@lists.wikimedia.org
You can reach the person managing the list at
        wikisource-l-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Wikisource-l digest..."
Today's Topics:

Systems for proofreading scanned books (Lars Aronsson)
Re: Systems for proofreading scanned books (J Hayes)


Message: 1
Date: Sat, 26 Dec 2020 19:23:02 +0100
From: Lars Aronsson lars@aronsson.se
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Cc: Wikisource wikisource-l@lists.wikimedia.org
Subject: [Wikisource-l] Systems for proofreading scanned books
Message-ID: e04dc83b-0da2-0c89-fc39-c5f28e0b5443@aronsson.se
Content-Type: text/plain; charset=utf-8; format=flowed
In 2005, at the first Wikimania in Frankfurt, Germany,
Magnus Manske asked me if I could open up my Scandinavian
book scanning website Project Runeberg to German and
other languages, or release the software as open source.
I refused, as my software is just a rapid prototype that
would need to be rewritten from scratch anyway. But I
said that Wikisource could be used for this purpose. At
the time, Wikisource was only a wiki for e-text. As a
proof of concept, I put up "Meyers Blitz-Lexikon" as
the first book with scanned page images in Wikisource,
https://de.wikisource.org/wiki/Seite:LA2-Blitz-0005.jpg
and soon after the "New Student's Reference Work",
https://en.wikisource.org/wiki/Page:LA2-NSRW-1-0013.jpg
This was the basic inspiration for the "Proofread Page"
extension, now used in Wikisource.
In 2010-2011 I tried to use Wikisource, but I thought
this extension was too hard to work with. From scanner
to finished presentation, Wikisource was so much slower
to work with than my own system. My primary gripes are:
It is too hard to upload PDF files to Commons, it's too
hard to create the Index page, each page is not created
immediately (making the raw OCR text searchable), and
pages hidden in the Page: namespace are not always
indexed by search engines. Unfortunately, the system
hasn't improved much in the last decade.
(My criticism of my own website's system is a lot
harsher, but hits different targets.)
There is also a difference in how we view copyright,
as my own website can cut corners and scan some books
that are "most likely" out of copyright, which is
something Wikimedia's user communities never accept.
In 2012, I thought the time had finally come to rewrite
my software, but I failed to organize a project around
this, and instead I continued to use the existing system,
just adding volume. Indeed, Project Runeberg has grown
from 0.75 million book pages in 2012 to 3.1 million
pages today.
Now in 2020, I'm finally tired of my existing system's
limitations. What should I do? It's not 2005 or 2012
anymore. What has changed in that time?
I can't move everything over to Wikisource, because of
the copyright differences.
Should I start to use Mediawiki + ProofreadPage and
convert my collection to that format?
Should I develop my own modification of Mediawiki?
Is that a stable ground to work from?
It seems to me that PHP, MariaDB and the architecture
of Mediawiki with extensions have now been the same for
a long time. Will this last for the next 20 years?
Or are there today some other existing systems that
solve the same problem, that weren't available in 2005?
(And that Wikisource would have picked up, if it were
started today, instead of developing its own extension.)
--
   Lars Aronsson (lars@aronsson.se)
   Project Runeberg - free Nordic literature - http://runeberg.org/

Message: 2
Date: Sat, 26 Dec 2020 18:20:06 -0500
From: J Hayes slowking4@gmail.com
To: "discussion list for Wikisource, the free library"
        wikisource-l@lists.wikimedia.org
Subject: Re: [Wikisource-l] Systems for proofreading scanned books
Message-ID: <CAN38RzKojj9K=nZ50Lbbvv5ZUND9WcA5kCRGeh++33ohfHB5Gg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
My suggestions:
A simplified UX for uploading works is on the wishlist.
But a tool that led the user to interact with multiple projects to produce a
“rough draft” work from a scan would be a great step forward.
Copyright might be eased for a local copy at Wikisource, not on Commons.
But you would need some community consensus. If you were bringing tools,
they might work with you; you should reach out to them. You could also
transfer the easy copyright works over to Wikisource, and retain the loose
ones at your site. (The value of using Wikisource is the increased
visibility from being integrated with Wikipedia, and the community-building
potential.)
So I would brainstorm some goals, and begin a conversation / partnership
with your Wikisource language community toward an action plan.
If I can be of help, let me know.
Cheers
Jim hayes
