[Wikisource-l] Partnership with the French National Library

Lars Aronsson lars at aronsson.se
Wed Jul 21 04:01:18 UTC 2010


On 07/15/2010 06:47 PM, Andrea Zanni wrote:
> I would really like to ask to all what they think about this (big) issue.

Wikisource is a project that could be of great importance
in the future, if it succeeds in growing. Scanned books should
be useful references in Wikipedia. But it's far too early to
say if it will be successful, because it is still too small.

Proofreading from scanned images is still very small but
growing fast, because it really only started very recently.
Only 3 languages have proofread more than 50,000 pages
and 8 more languages have between 1,000 and 10,000 pages.
http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics

While 50,000 pages would be big for a Wikipedia, it only means 250
books of 200 pages each. Such a book is 1 cm thick (0.4 inches),
so the entire bookshelf of 50,000 pages fits in 2.50 metres
of shelving, less than a normal bookcase. Swedish Wikisource
now has 5,500 proofread pages, but these belong to 237
different volumes, many being small pamphlets or individual
issues of a newspaper. Only 15 volumes are fully proofread
books with more than 100 pages. Many of the volumes are
books that someone uploaded and started to proofread, but
didn't quite finish. Many were uploaded because they were
available (scanned by a library), not because they would be
really useful as source documents. Of our 15 fully proofread
books, I can name 5 that should be really useful as sources.
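
The shelf arithmetic above can be checked with a few lines of
Python. This is only a rough sketch: the function name and its
parameters are made up for illustration, and the defaults are
the figures assumed in this message (200 pages per book, 1 cm
of shelf per book).

```python
def shelf_estimate(pages, pages_per_book=200, book_thickness_cm=1.0):
    """Return (number of books, shelf length in metres) for a page count.

    The defaults are the rough figures assumed in this message,
    not measured values.
    """
    books = pages / pages_per_book
    shelf_m = books * book_thickness_cm / 100.0  # centimetres to metres
    return books, shelf_m


# 50,000 proofread pages, the threshold reached by the largest languages
books, shelf_m = shelf_estimate(50_000)
print(f"{books:.0f} books, {shelf_m:.2f} m of shelving")
```

With these assumptions, 50,000 pages come out as 250 books and
2.50 metres of shelving, the same figures as above.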

The Wikisource communities are very small, counting 12
contributors (making at least 5 edits per month) in Italian,
25 in Polish, 60 in German, 100 in French and 100 in English.
For most languages, every single new contributor can
make a huge difference. This might attract some users who
have plenty of time and want to make a big difference,
such as those who write hundreds of Wikipedia articles.
But as time goes by and the projects scale up, you will
find very few such contributors.

Even if we don't pay for their work, volunteers are a
finite resource and we shouldn't waste their time.
It's easy to waste time manually proofreading text
that has very poor OCR quality. Running a new OCR
might save hours of work. It's even easier to waste
time by proofreading a book that nobody finds useful.

That's where I think we should start: Which books are
really useful? How do we determine that? Unfortunately,
link search in Wikipedia does not count links to
Wikisource, so it's hard to get good statistics.

The first two books that I put on Wikisource in the fall
of 2005 were small encyclopedias in German and English.
This was an experiment and it worked well as an
experiment. But these small encyclopedias were next to
useless, because they contained so much less information
than was already in Wikipedia. The 1000 bytes on the Elbe,
http://en.wikisource.org/wiki/The_New_Student's_Reference_Work/Elbe
were useless because Wikipedia in October 2005 already
had 5 times as much text (today, 20 times as much).

For Arabic and other languages where Wikipedia
is still rather small, finding and scanning such a
small 5-volume encyclopedia could be a great help,
but for the languages where Wikipedia is bigger
(and this is true for all languages where Wikisource
is now active, except maybe Armenian), we need to
look for more specialized reference works to be
really useful as sources.

I'm running an experiment now with proofreading a
newspaper. Not just articles, but entire issues.
I have done three full weeks from January 1836.
Fortunately, each daily issue is only 4 pages in
3 columns, or 75 kilobytes in total, e.g.
http://sv.wikisource.org/wiki/Post-_och_Inrikes_Tidningar_1836-01-05

But it still takes a lot of work. On my own, I'm not
able to proofread one issue per day, so I'm already
lagging behind. Newspapers are useful as sources.
Wikipedia often references articles in current
newspapers, and it would be great to have complete
year runs going back in time. Still, I doubt that we
can find enough proofreading volunteers to cover any
substantial timespan.


-- 
   Lars Aronsson (lars at aronsson.se)
   Aronsson Datateknik - http://aronsson.se




