Thanks, Lars, for starting this very important conversation! It raises
a fundamental question: what is Wikisource (and its software) actually
for? See also some older iterations of this discussion:
https://meta.wikimedia.org/wiki/Role_of_Wikisource
I think that some limitations of Wikisource are a natural consequence
of its flexibility, which in turn is probably necessary to allow an
unlimited number of people to work on the same works. Individual wikis,
on Wikisource or elsewhere, can be more opinionated about certain
things and enforce a single standard way of working, and can therefore
focus on making that one way very easy and fast (while other ways
become harder or even impossible).
For instance, the various attempts at a "book manager" extension to
solve bug T17071 (MediaWiki knows about single pages but not about
books as such, except via Wikidata hacks) have failed in part because
there is too much variance out there and no easy or good way to impose
consistency (let alone conduct large data migrations such as "move all
metadata to Wikidata").
https://www.mediawiki.org/wiki/Book_management
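To make the "Wikidata hacks" part concrete: the closest thing we
currently have to a book-level record is an edition item on Wikidata,
which can already be queried programmatically. A minimal sketch in
Python (assuming the requests library; the endpoint and properties are
real, the selection is just illustrative):

  import requests

  # List a few edition items with their title and author.
  # P31 = instance of, Q3331189 = version/edition/translation,
  # P1476 = title, P50 = author.
  QUERY = """
  SELECT ?edition ?title ?authorLabel WHERE {
    ?edition wdt:P31 wd:Q3331189 ;
             wdt:P1476 ?title ;
             wdt:P50 ?author .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  LIMIT 5
  """

  r = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "book-metadata-sketch/0.1"},
  )
  for row in r.json()["results"]["bindings"]:
      print(row["title"]["value"], "--", row["authorLabel"]["value"])

A "book manager" would essentially have to standardise where such
statements live and how local templates consume them, which is exactly
where the variance bites.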
Similarly, things like automated OCR and automated page creation (and
maybe even image fixing à la ScanTailor and unpaper) could be offloaded
to a custom infrastructure like the one the Internet Archive recently
built: instead of gadgets and bots, you could have server-side
processes handling everything in one place, provided the input has less
variance and you don't have to worry about stepping on other users'
toes.
https://blog.archive.org/2020/11/23/foss-wins-again-free-and-open-source-co…
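To sketch what such a server-side process could look like (this is
only an illustration, not the Internet Archive's actual stack: it
assumes the unpaper and tesseract command-line tools are installed,
and the directory layout is made up):

  import subprocess
  from pathlib import Path

  def ocr_page(scan: Path, lang: str = "swe") -> str:
      """Clean up one scanned page image and return its raw OCR text."""
      cleaned = scan.with_suffix(".cleaned.pgm")
      # unpaper deskews and removes borders/noise (expects PNM input).
      subprocess.run(["unpaper", str(scan), str(cleaned)], check=True)
      # tesseract writes <base>.txt for the given output base name.
      base = scan.with_suffix("")
      subprocess.run(
          ["tesseract", str(cleaned), str(base), "-l", lang, "txt"],
          check=True,
      )
      return base.with_suffix(".txt").read_text(encoding="utf-8")

  # Instead of a bot or gadget, a server-side job could walk every
  # incoming scan and immediately create the corresponding Page:
  # via the MediaWiki API; here we just collect the text locally.
  for scan in sorted(Path("scans").glob("*.pgm")):
      text = ocr_page(scan)
      print(scan.name, len(text), "characters of raw OCR")

The point is that, with a single known input format, all of this can
run unattended the moment a scan arrives.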
My strong preference would be for a project like Runeberg to try
MediaWiki + ProofreadPage, perhaps with some custom extensions. I think
it would be a chance to move our software stack to the next level,
making it more reusable for third parties so that others could join its
development in the future. I'd expect such a software migration to
easily get a Wikimedia Foundation grant, judging from some past
examples.
However, there is other software being built. Projects based on TEI or
METS/ALTO tend to look very impressive on paper but prove less
effective in practice, perhaps because they target academic use of a
small number of works. The Australian or New Zealand project for the
collaborative transcription of newspapers was very effective, but I
can't recall its name, and they never made the software available. The
Smithsonian is working on something custom, but it's just a set of
plugins on top of Drupal, so it won't necessarily be cleaner than our
own MediaWiki stack; they said at Wikimania 2019 that the software
would be shared.
https://transcription.si.edu/
https://en.wikisource.org/?oldid=9543219#Smithsonian_Transcription_Center
Federico
Lars Aronsson, 26/12/20 20:23:
In 2005, at the first Wikimania in Frankfurt,
Germany,
Magnus Manske asked me if I could open up my Scandinavian
book scanning website Project Runeberg to German and
other languages, or release the software as open source.
I refused, as my software is just a rapid prototype that
would need to be rewritten from scratch anyway. But I
said that Wikisource could be used for this purpose. At
the time, Wikisource was only a wiki for e-text. As a
proof of concept, I put up "Meyers Blitz-Lexikon" as
the first book with scanned page images in Wikisource,
https://de.wikisource.org/wiki/Seite:LA2-Blitz-0005.jpg
and soon after the "New Student's Reference Work",
https://en.wikisource.org/wiki/Page:LA2-NSRW-1-0013.jpg
This was the basic inspiration for the "Proofread Page"
extension, now used in Wikisource.
In 2010-2011 I tried to use Wikisource, but I found this extension too
hard to work with. From scanner to finished presentation, Wikisource
was much slower to work with than my own system. My primary gripes are:
It is too hard to upload PDF files to Commons; it is too hard to
create the Index page; each page is not created immediately (which
would make the raw OCR text searchable); and pages hidden in the Page:
namespace are not always indexed by search engines. Unfortunately, the
system hasn't improved much in the last decade.
(My criticism of my own website's system is a lot
harsher, but hits different targets.)
There is also a difference in how we view copyright,
as my own website can cut corners and scan some books
that are "most likely" out of copyright, which is
something Wikimedia's user communities never accept.
In 2012, I thought the time had finally come to rewrite
my software, but I failed to organize a project around
this, and instead I continued to use the existing system,
just adding volume. Indeed, Project Runeberg has grown
from 0.75 million book pages in 2012 to 3.1 million
pages today.
Now in 2020, I'm finally tired of my existing system's
limitations. What should I do? It's not 2005 or 2012
anymore. What has changed in that time?
I can't move everything over to Wikisource, because of
the copyright differences.
Should I start to use MediaWiki + ProofreadPage and
convert my collection to that format?
Should I develop my own modification of MediaWiki?
Is that stable ground to build on?
It seems to me that PHP, MariaDB and the architecture of MediaWiki
with extensions have now been the same for a long time. Will this last
for the next 20 years?
Or are there today other existing systems that solve the same problem,
systems that weren't available in 2005? (And that Wikisource would
have picked up, had it been started today, instead of developing its
own extension.)