[Wikipedia-l] Wikisource

Sun Oct 19 16:37:42 UTC 2008

On wikipedia-l Florence Devouard wrote:

> During that event, I mentioned that the French chapter has 
> several ongoing discussions with various museums to set up 
> content partnerships.

Wikisource is really a much larger project than Wikipedia.  
Consider any public library: The encyclopedia shelf or quick 
reference section (Wikipedia) is less than one percent of the 
whole library (Wikisource).  After seven years of writing 
Wikipedia, we are now getting useful results in many languages.  
Wikisource might take 70 years.

What we can expect during 2009 is some small step forward on this 
longer path.  Taking a single step might sound easy, but it's hard 
enough to know which direction is forward.

If you can achieve real, practical, pragmatic cooperation that 
actually results in more free content, even if it is not very much, 
that is probably the best step forward.  But you must be prepared 
for the infighting and prestige battles among public institutions, 
which can be tough, especially when it comes to competing for 
funding.

> In Europe, at least in some countries, we meet several problems
> * many scholars have a rather bad image of Wikipedia (because 
>   written by amateurs, anonymous members, plagued by vandals 
>   etc...)

There is a clear risk that this bad image is reinforced.  Our 
message that "anybody can contribute" is hard to combine with the 
prestige-oriented thinking among the institutions where you seek 
cooperation.

----

I'd like to recommend an article in the October 2008 issue of the 
open access journal "First Monday", "Mass book digitization: The 
deeper story of Google Books and the Open Content Alliance" by 
Kalev Leetaru, 
http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2101/2037

This article is just one in a ton of literature on how to scan (or 
microfilm) books that has appeared in the last 20 years.  But it 
is interesting because it evaluates two large-scale projects of 
the last few years, and compares them to each other.  Even though 
"digital libraries" is a new science, it is already full of 
established truths.  Perhaps this is due to the high involvement 
of public institutions.  One such truth is that image compression 
(with JPEG artifacts) must be avoided at all cost.

Both Google Books and the Open Content Alliance (Internet Archive) 
break this rule, by using consumer-grade digital cameras and JPEG 
compression, and should thus be considered a waste of time, 
according to conventional wisdom (or "best current practices").  
Still, nobody can avoid being impressed with their results, and so 
the scientific world needs to revise its understanding of the 
current state of the art.  The author of this article goes to 
great lengths (in the "Discussion" section) to explain that what 
these projects do is "access digitization", which is described as 
something completely different from traditional book scanning:

  "Before one can compare the two projects, it is important to 
   first realize that both projects are really only access 
   digitization projects, despite the common assertion of OCA 
   captures as preservation digitization. Neither initiative uses 
   an imaging pipeline or capture environment suitable for true 
   preservation scanning. The OCA project outputs 
   variable–resolution JPEG2000 files built from lossy 
   camera–generated JPEG files. A consumer area array digital 
   camera is used to produce images ..."

Needless to say, neither Project Gutenberg nor Wikisource is 
mentioned in this article.  Their goals are just too different 
(what? free content?), their achievements not impressive enough.  
They are not potential future employers of "digital library" 
scholars.  If you help them or cooperate with them, you will only 
help mankind in an altruistic fashion (what fools!), you will not 
help your own professional or academic career.

In the article, the Open Content Alliance already plays the role 
of the fools.  They have only (!) digitized 100,000 books, while 
Google Books has millions.  They do not provide the same search 
capability.  And so it goes on. The next time the Internet Archive 
(OCA) applies for funding or tries to establish cooperation with 
more institutions, such arguments might be used against them.

----

What Wikisource really needs to do is to provide an explanation 
of what it does, and how this goes beyond Google Books' "access 
digitization".  In Europe, this must be set in the perspective of 
ongoing French, German and EU initiatives (Gallica, Theseus, 
Quaero, Europeana, ...).  When one of those projects applies for 
funding, it will need to show that it is successful in attracting 
cooperation partners and that it is a leader among similar 
projects.  We should be prepared for them to take any opportunity to 
define Wikisource as a loser, amateurish, clueless project.  This 
is not because they are evil, only because they do what they can 
to get the funding they need.

Why should museum X or library Y or archive Z cooperate with 
Wikisource, when it risks being associated with such descriptions 
of failure?  The alternative for that institution might be to 
cooperate with the successful Google or Gallica.  So why is 
Wikisource superior? This is what we need to explain.

> * develop arguments for museums etc...

Exactly.

-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se