Please keep up this good discussion :-)
We have the Wikisource contest on it.source right now,
so this mail is not going to be as long and detailed as I hoped.
I agree with Vigneron that the Survey report is a good start:
having written it myself, I'm well aware that it's not perfect, and that
neither the questions nor the methodology were bulletproof.
Nonetheless, we tried hard to make it and many results are as good and
I personally agree that a VE integration with the Proofread extension would
be much needed:
if you think about it, Wikisource is the right place for the VE.
We could enormously simplify life for new proofreaders, and formatting
on Wikisource is ten times more difficult than on Wikipedia.
I'm sure it's one of the best things to do right now.
At the same time, I agree with Lars (who always has great insights)
that we still need to make the big leap in digital libraries.
For me, one of the things Wikisource offers that nobody else does is
and connections and integration with other projects such as Wikidata (hopefully)
I agree with him that algorithmic learning of Wikisource is an amazing
idea: just think about having a Tesseract instance for every Wikisource,
and that Tesseract learns from every page the community proofreads... In a few
years, we could even think about telling our Tesseract to distinguish between
XII century Italian vs XIX century... We could have amazing open source
OCRs to give to the world.
Another great accomplishment could be *giving back proofread OCR* to GLAMs:
think about libraries (or Internet Archive!) giving us ancient texts, and us
giving them back a perfect DjVu or PDF with mapped text inside...
I'm sure we could have many GLAMs coming to us then :-)
We can give them back almost nothing right now, apart from our HTML
On Sun, Nov 23, 2014 at 6:16 PM, Lars Aronsson <lars(a)aronsson.se> wrote:
On 11/23/2014 02:55 AM, Wiki Billinghurst wrote:
What do we see as the next components for
What are our major hurdles for system development?
If we were offered development help where do people think that we
should be making use of that help? Is it incremental fixes,
transactional changes, or are we wanting transformational changes,
completely new features, and new opportunities?
Ten years ago, Wikipedia was already a given success, and
we started to branch out into projects like Wikisource,
Wikinews and what not. That was also when Google Book
Search started, and when the Internet Archive got its
current practices for book scanning (with the "Scribe"
scanning stations) in place. Ten years earlier, in the mid
90s, the first large-scale book scanning projects appeared.
In the two decades 1990-2010, several books were published
on the future of digital libraries. But what has happened
in the last decade? What is new, really? Has anything
changed in Google Book Search or the Internet Archive
in the five years 2010-2014? Yes, more books have been
digitized, but are they presented or used differently?
I think a lot more can be done, e.g. algorithmic improvement
of OCR engines. Wikisource hasn't looked into that, neither
has the Internet Archive, and nobody knows much about
what Google does internally. This isn't necessarily "wiki",
so it's not clear that it's a task for WMF and its projects.
Another thing could be "gamification" of proofreading or
mark-up / categorization / analysis of scanned books.
As for new kinds of content, the digitization of entire
newspapers is still a new area, where the Australian
national library was a pioneer some years ago, but what
has happened since then? Potentially, it could become
a cross-over between Wikisource and Wikinews, where
each event can be found on the same day in many
different newspapers. How to link them together?
The problem: If we get scanned images + OCR text
of 10 different newspapers, 10 years, 10 pages each
day, that is 365 × 10 × 10 × 10 = 365,000 large pages
to proofread, before we can do any serious analysis.
How do we proofread so many pages in any reasonable
time? We don't have enough volunteers for that.
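
As a rough illustration of the scale above, here is a minimal Python sketch
of the same arithmetic. The page count follows the figures in the paragraph;
the number of volunteers and their daily throughput are purely hypothetical
values chosen for the example, not measured data.

```python
def pages_to_proofread(newspapers, years, pages_per_day, days_per_year=365):
    """Total pages, as in the estimate above: 365 x 10 x 10 x 10."""
    return days_per_year * newspapers * years * pages_per_day


def years_needed(total_pages, volunteers, pages_per_volunteer_per_day):
    """Years to finish, assuming every volunteer works every day of the year."""
    pages_per_year = volunteers * pages_per_volunteer_per_day * 365
    return total_pages / pages_per_year


total = pages_to_proofread(newspapers=10, years=10, pages_per_day=10)
print(total)  # 365000

# Hypothetical assumption: 50 volunteers, each proofreading 2 pages a day.
print(years_needed(total, volunteers=50, pages_per_volunteer_per_day=2))  # 10.0
```

Under those assumed numbers, the backlog would take as long to proofread as
the period the newspapers cover, which is why throughput is the bottleneck.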
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Wikisource-l mailing list