In 2005, at the first Wikimania in Frankfurt, Germany, Magnus Manske asked me if I could open up my Scandinavian book scanning website Project Runeberg to German and other languages, or release the software as open source.
I refused, as my software is just a rapid prototype that would need to be rewritten from scratch anyway. But I said that Wikisource could be used for this purpose. At the time, Wikisource was only a wiki for e-text. As a proof of concept, I put up "Meyers Blitz-Lexikon" as the first book with scanned page images in Wikisource, https://de.wikisource.org/wiki/Seite:LA2-Blitz-0005.jpg and soon after the "New Student's Reference Work", https://en.wikisource.org/wiki/Page:LA2-NSRW-1-0013.jpg
This was the basic inspiration for the "Proofread Page" extension, now used in Wikisource.
In 2010-2011 I tried to use Wikisource, but I thought this extension was too hard to work with. From scanner to finished presentation, Wikisource was so much slower to work with than my own system. My primary gripes are: it is too hard to upload PDF files to Commons, it's too hard to create the Index page, each page is not created immediately (which would make the raw OCR text searchable), and pages hidden in the Page: namespace are not always indexed by search engines. Unfortunately, the system hasn't improved much in the last decade.
(My criticism of my own website's system is a lot harsher, but hits different targets.)
There is also a difference in how we view copyright, as my own website can cut corners and scan some books that are "most likely" out of copyright, which is something Wikimedia's user communities would never accept.
In 2012, I thought the time had finally come to rewrite my software, but I failed to organize a project around this, and instead I continued to use the existing system, just adding volume. Indeed, Project Runeberg has grown from 0.75 million book pages in 2012 to 3.1 million pages today.
Now in 2020, I'm finally tired of my existing system's limitations. What should I do? It's not 2005 or 2012 anymore. What has changed in that time?
I can't move everything over to Wikisource, because of the copyright differences.
Should I start to use Mediawiki + ProofreadPage and convert my collection to that format?
Should I develop my own modification of Mediawiki? Is that a stable ground to work from?
It seems to me that PHP, MariaDB and the architecture of Mediawiki with extensions have now been the same for a long time. Will this last for the next 20 years?
Or are there today other existing systems that solve the same problem, that weren't available in 2005? (And that Wikisource would have picked up, if it were started today, instead of developing its own extension.)
Thanks Lars for launching this very important conversation! I think it raises an important question: what is Wikisource (and its software) actually for? See also some old iterations: https://meta.wikimedia.org/wiki/Role_of_Wikisource
I think that some limitations of Wikisource are a natural consequence of its flexibility, which in turn is probably necessary to allow an unlimited number of people to work on the same works. Individual wikis, on Wikisource or elsewhere, can be more opinionated about certain things, enforce a single standard way of working and therefore focus on making that one very easy/fast (while other ways will become harder or even impossible).
For instance, the various attempts at a "book manager" extension to solve bug T17071 (MediaWiki knows about single pages but not about books as such, except via Wikidata hacks) have failed in part because there is too much variance out there and no easy or good way to impose some consistency (let alone conduct large data migrations such as "move all metadata to Wikidata"). https://www.mediawiki.org/wiki/Book_management
Similarly, things like automated OCR and automated page creation (and maybe even image fixing à la ScanTailor and unpaper) could be offloaded to a custom infrastructure such as the one the Internet Archive has recently built. Instead of gadgets and bots, you could have server-side processes handling everything in the same place, provided you have less variance in the input and don't have to worry about stepping on other users' toes. https://blog.archive.org/2020/11/23/foss-wins-again-free-and-open-source-com...
My strong preference would be for a project like Runeberg to try and use MediaWiki+ProofreadPage and maybe some custom extensions. I think it would be a chance to move our software stack to the next level, making it more reusable for third parties so that others can possibly join its development in the future. I'd expect such a software migration to easily get a Wikimedia Foundation grant, given some past examples.
However, there is some other software being built. Projects based on TEI or METS/ALTO tend to look very impressive on paper but prove not very effective in practice, perhaps because they tend to aim at academic use of very few works. The Australian or New Zealand project for the collaborative transcription of newspapers was very effective, but I can't recall its name and they never made the software available. The Smithsonian is working on something custom, but it's just a set of plugins on top of Drupal, so it won't necessarily be cleaner than our own MediaWiki stack; they said at Wikimania 2019 that the software would be shared. https://transcription.si.edu/ https://en.wikisource.org/?oldid=9543219#Smithsonian_Transcription_Center
Federico
In reply to Lars Aronsson, 26/12/20 20:23.
wikitech-l@lists.wikimedia.org