New subject: [Wikisource-l] Systems for proofreading scanned books

26 Dec 2020


In 2005, at the first Wikimania in Frankfurt, Germany,
Magnus Manske asked me if I could open up my Scandinavian
book scanning website Project Runeberg to German and
other languages, or release the software as open source.
I refused, as my software is just a rapid prototype that
would need to be rewritten from scratch anyway. But I
said that Wikisource could be used for this purpose. At
the time, Wikisource was only a wiki for e-text. As a
proof of concept, I put up "Meyers Blitz-Lexikon" as
the first book with scanned page images in Wikisource,
https://de.wikisource.org/wiki/Seite:LA2-Blitz-0005.jpg
and soon after the "New Student's Reference Work",
https://en.wikisource.org/wiki/Page:LA2-NSRW-1-0013.jpg
This was the basic inspiration for the "Proofread Page"
extension, now used in Wikisource.
In 2010-2011 I tried to use Wikisource, but I thought
this extension was too hard to work with. From scanner
to finished presentation, Wikisource was so much slower
to work with than my own system. My primary gripes are:
It is too hard to upload PDF files to Commons, it's too
hard to create the Index page, each page is not created
immediately (making the raw OCR text searchable), and
pages hidden in the Page: namespace are not always
indexed by search engines. Unfortunately, the system
hasn't improved much in the last decade.
(My criticism of my own website's system is a lot
harsher, but hits different targets.)
There is also a difference in how we view copyright,
as my own website can cut corners and scan some books
that are "most likely" out of copyright, which is
something Wikimedia's user communities would never accept.
In 2012, I thought the time had finally come to rewrite
my software, but I failed to organize a project around
this, and instead I continued to use the existing system,
just adding volume. Indeed, Project Runeberg has grown
from 0.75 million book pages in 2012 to 3.1 million
pages today.
Now in 2020, I'm finally tired of my existing system's
limitations. What should I do? It's not 2005 or 2012
anymore. What has changed in that time?
I can't move everything over to Wikisource, because of
the copyright differences.
Should I start to use Mediawiki + ProofreadPage and
convert my collection to that format?
Should I develop my own modification of Mediawiki?
Is that a stable ground to work from?
It seems to me that PHP, MariaDB and the architecture
of Mediawiki with extensions have now been the same for
a long time. Will this last for the next 20 years?
Or are there today other existing systems that
solve the same problem, that weren't available in 2005?
(And that Wikisource would have picked up, if it were
started today, instead of developing its own extension.)
-- 
   Lars Aronsson (lars@aronsson.se)
   Project Runeberg - free Nordic literature - http://runeberg.org/