> De: "John Vandenberg" <jayvdb(a)gmail.com>
> A: "discussion list for Wikisource, the free library" <wikisource-l(a)lists.wikimedia.org>
> Objet: Re: [Wikisource-l] Changing the Wikisource main page
> Date: Sun, 14 Sep 2008 06:03:36 +1000
> A Chinese "word" has more meaning than a Spanish "word". I don't have
> the numbers, but the word "word" does not mean the same thing in all
> languages. This makes words a very complex statistic.
> Wikisource-l mailing list
I may have found a very simple solution: if we agree that a Chinese sign is a word as we understand "word", then we have to find how many signs there are. I made a test, and found that a Chinese sign is 3 octets (in UTF-8). The very same statistics tell us that the average size of an article on the Chinese Wikisource is 1957 octets. So, an average article has 1957/3 = 652.3 words. The statistics count (on May 31, 2008) 29084 articles for the Chinese Wikisource, and 652.3 × 29084 gives 18.9M words in total.
The only question remaining is: why does the statistics page present 29.3M as the number of words for the Chinese Wikisource? Is that the number of "groups of letters"?
Anyway, if we accept the figures, we would have:
1. English: 211M words
2. French: 125M
3. Spanish: 41.8M
4. Russian: 22.2M
5. Chinese: 18.9M
6. Polish: 18.2M
7. Portuguese: 15.5M
8. German: 14.4M
9. Italian: 12.0M
10. Arabic: 10.6M
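As a sanity check on the arithmetic above, here is a short sketch in Python. The figures are the May 31, 2008 statistics quoted in the message; the one-character-equals-one-word equivalence is the assumption under discussion, not an established fact:

```python
# Derive a word count for the Chinese Wikisource from byte-based
# statistics, assuming one Chinese character ("sign") = one word.

# A Chinese character takes 3 octets (bytes) when encoded as UTF-8:
assert len("中".encode("utf-8")) == 3

BYTES_PER_CHAR = 3
AVG_ARTICLE_BYTES = 1957   # average article size, Chinese Wikisource
ARTICLE_COUNT = 29084      # article count on May 31, 2008

words_per_article = AVG_ARTICLE_BYTES / BYTES_PER_CHAR   # ~652.3
total_words = words_per_article * ARTICLE_COUNT          # ~18.97 million

print(f"{words_per_article:.1f} words/article, "
      f"{total_words / 1e6:.2f}M words in total")
```

The small gap between ~18.97M here and the 18.9M in the message comes from the author truncating 652.33… to 652.3 before multiplying.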
Sent to you by pfctdayelise via Google Reader: PALINET launches
digitization initiative via Open Access News by Gavin Baker on 30/10/08
PALINET, a regional library network in the U.S., launched a
digitization project on October 21, 2008. See the press release or this
story in Library Journal:
... [T]he PALINET regional library network recently announced its Mass
Digitization Collaborative, supported in part by a grant from the
Alfred P. Sloan Foundation. Through the project, PALINET member
libraries will be able to scan and digitize selected texts as the
result of an ongoing partnership with the Internet Archive and its
regional network of digitization centers.
Participants will receive high-quality versions of the digital
editions, which will also be made freely available through Archive.org.
The goal, according to Catherine C. Wilt, PALINET’s executive director,
is to make available more than “20 million pages of text from PALINET
members,” equivalent to approximately 60,000 books.
Only works free of copyright restrictions and with existing metadata
are eligible for the project. Moreover, member institutions are
“strongly encouraged” to select unique texts and items of local and
regional significance. ...
---------- Forwarded message ----------
From: River Tarnell <river(a)loreley.flyingparchment.org.uk>
Date: Tue, Oct 28, 2008 at 9:18 AM
Subject: [Wikitech-l] new extension for embedded music scores
i have written a new extension to embed music scores in MediaWiki pages:
unlike the Lilypond extension, this uses a simple input language (ABC) that is
much easier to validate for security. ABC is mostly used to transcribe
trad and other simple tunes, but it recently gained support for more
features, e.g. multiple staves and lyrics. this is supported in the
extension using the 'abcm2ps' tool.
unlike the existing ABC extension (AbcMusic), it doesn't support opening
arbitrary files as ABC input (which is a potential security issue), and has
several additional features:
- The original ABC can be downloaded easily
- The score can be downloaded as PDF, PostScript, MIDI or Ogg Vorbis
- A media player can be embedded in the page to play the media file
i believe the ABC format is suitable for transcribing the majority of scores
currently on Wikimedia projects. although it can't handle all of them, it is
better than the current situation. plus, as ABC is simple, and existing ABC
scores are easily available, it's easier for novice users to contribute.
i would be interested to hear people's thoughts on enabling this extension.
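For readers who have not seen the notation, a complete ABC transcription of a simple tune fits in a few lines. The following is an invented illustrative example, not taken from the extension or its documentation: the X, T, M, L and K header fields give the tune index, title, metre, default note length and key, followed by the melody with bar lines and repeat marks.

```abc
X:1
T:Example Jig
M:6/8
L:1/8
K:D
|: dAF DFA | dfa afd | gfe fed | cdB A3 :|
```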
Wikitech-l mailing list
Florence Devouard wrote:
> A couple of weeks ago, I went to an event organized in Paris by the
> French Government about "economics of culture".
> During that event, I mentioned that the French chapter has several
> ongoing discussions with various museums to set up content partnerships.
> Here are two examples of such potential partnerships:
> * a small museum with very old and precious documents. The museum has
> limited room for access and documents are fragile, so only a few
> visitors are allowed to look at them. The museum wants to digitize these
> docs, but has limited technical infrastructure.
> Opportunity: we host their documents on wikisource and provide them
> additional visibility through an article on Wikipedia, featuring their
> best manuscripts.
> * a large museum already has a digitization procedure for the documents,
> as well as a hosting service. However, the digitized version contains
> mistakes (errors generated in the process) and the museum simply does
> not have the human power to provide the corrections of the numerous
> documents digitized by their services. Our members can take care of this
This is probably more about archives than about museums, but the
problems for museums regarding three-dimensional objects are just as
severe, as Raoul Weiller has been keen to point out at the last two
Wikimanias. Lars makes a good point about distinguishing between
preservation and access digitization, but I think that capital-intensive
preservation strategies are well beyond our capabilities. Wikimedia
works best when it can marshal large quantities of free labour to
altruistic purposes; the kind of people whom we attract are properly
annoyed with capitalist profiteering that depends on our free labour.
It comes as no surprise that they intuitively support non-commercial
clauses in free licenses.
Your two examples present startlingly different circumstances. In the
first example it is up to the archives to provide the leadership, while
acknowledging that providing the needed manpower exclusively through
professional personnel is well beyond their limited budgets. At the
same time they are repeatedly the beneficiaries of acquisitions which
they can neither properly process nor store. The recent story of the
Royal Ontario Museum rediscovering a Tyrannosaurus skeleton that had been
misplaced for decades gives us pause to wonder. The type of artifacts
that concern us are much smaller, and consequently easier to misplace.
Museums need to engage in volunteer training programmes so that
volunteers can better take on more specialized and more responsible
tasks. If they believe that they will some day receive budgets adequate
to the task, they have been breathing too many fumes from evaporating
artifacts. They also need to make collections accessible to qualified
volunteers for longer hours than the regular opening hours of the museums.
The second example is more within our grasp. Proofreading is a tedious
process, and we should never deceive ourselves into believing that the
task can be handled by spell-checkers or other software based
techniques. In addition, since Wikisource likes to host whole books or
multi-volume reference works there is a tendency to upload this material
from other sources without any thought of checking the material for
accuracy. This means that material which clearly falls within the scope
of Wikisource grows at a phenomenal rate when compared to its
verification rate. Image files are only one part of the solution. They
provide the basis for the verification, but not the verification itself.
> Wikisource's members know all that very well and much better than I. I
> just summarize that very quickly for reference.
> In Europe, at least in some countries, we meet several problems
> * many scholars have a rather bad image of Wikipedia (because written by
> amateurs, anonymous members, plagued by vandals etc...)
> * the other wikimedia projects have rather poor popularity and would
> benefit from more "light"
> * journalists are bored and need new information (otherwise, they focus
> on all the bad stories)
> * some projects are more difficult to advertise than others, because
> they are full competitors with other commercial projects of very good
> quality (eg, wiktionary, wikinews...)
I don't see the problem as one of publicity at all. It's a matter of
recruiting people who are satisfied doing humble tasks. Such people do
not want to participate in complex decision making processes; they are
completely confused if they need to deal with anything but the most
elementary of wiki markup; when faced with any conflict they just go
away. They are often older, and sensitive to disrespectful behaviour.
> Besides, my feeling is that contributors and in particular members from
> chapters need a project on which they can team up.
That's worth considering. In theory at least chapter leadership is in a
better position to understand the priorities of national governments.
Chapters that host Wikisource sites themselves can better adapt to laws
that restrict the export of charitable donations.
> I would like to propose that next year be Wikisource year.
How can this be best co-ordinated with the Wikimania programme?
> And since the planet is very large, if this is done in large part
> through chapters, that it be an opportunity for some european chapters
> to work together.
An EU super-chapter? :-\
Appealing to nationalism could be more fruitful.
> I am not necessarily thinking of anything very complicated. Examples of
> efforts we could make together:
> * leaflets about wikisource updated and available in a large number of
> languages;
> * webbuttons to advertise the project on the web;
> * each time someone gives a conference about Wikipedia, take the
> opportunity to spend a couple of minutes on Wikisource as well;
> distribute leaflets;
> * summarize our best cases on Wikisource;
> * develop stories about these best cases. Illustrate. Feature these
> stories on chapter websites;
> * develop initiatives on projects for cross project challenges (eg, best
> article with content improved in at least 3 projects);
> * chapters may write and distribute a couple of press releases about
> Wikisource;
> * chapters may propose conferences about wikisource (and speakers
> available to talk about it);
> * develop arguments for museums etc...
> Measures of success are numerous, from improvements of Wikisource
> (number of docs), number of mentions in the press, partnerships
> established with museums etc...
> What do you think ?
Who is the target audience for all this? How much of this will appeal
to that audience?
On wikipedia-l Florence Devouard wrote:
> During that event, I mentioned that the French chapter has
> several ongoing discussions with various museums to set up
> content partnerships.
Wikisource is really a much larger project than Wikipedia.
Consider any public library: The encyclopedia shelf or quick
reference section (Wikipedia) is less than one percent of the
whole library (Wikisource). After seven years of writing
Wikipedia, we are now getting useful results in many languages.
Wikisource might take 70 years.
What we can expect during 2009 is some small step forward on this
longer path. Taking a single step might sound easy, but it's hard
enough to know which direction is forward.
If you can achieve real, practical, pragmatic cooperations that
actually result in more free content, even if it is not very much,
that is probably the best step forward. But you must be prepared
that infighting and prestige among public institutions can be
tough, especially when it comes to competing for funding.
> In Europe, at least in some countries, we meet several problems
> * many scholars have a rather bad image of Wikipedia (because
> written by amateurs, anonymous members, plagued by vandals
There is a clear risk that this bad image is reinforced. Our
message that "anybody can contribute" is hard to combine with the
prestigious thinking among the institutions where you seek
cooperation.
I'd like to recommend an article in the October 2008 issue of the
open access journal "First Monday", "Mass book digitization: The
deeper story of Google Books and the Open Content Alliance" by
This article is just one in a ton of literature on how to scan (or
microfilm) books that has appeared in the last 20 years. But it
is interesting because it evaluates two large-scale projects of
the last few years, and compares them to each other. Even though
"digital libraries" is a new science, it is already full of
established truths. Perhaps this is due to the high involvement
of public institutions. One such truth is that image compression
(with JPEG artifacts) must be avoided at all costs.
Both Google Books and the Open Content Alliance (Internet Archive)
break this rule, by using consumer-grade digital cameras and JPEG
compression, and should thus be considered a waste of time,
according to conventional wisdom (or "best current practices").
Still, nobody can avoid being impressed with their results, and so
the scientific world needs to revise its understanding of the
current state of the art. The author of this article goes to
great lengths (in the "Discussion" section) to explain that what
these projects do is "access digitization", which is described as
something completely different than traditional book scanning:
"Before one can compare the two projects, it is important to
first realize that both projects are really only access
digitization projects, despite the common assertion of OCA
captures as preservation digitization. Neither initiative uses
an imaging pipeline or capture environment suitable for true
preservation scanning. The OCA project outputs
variable–resolution JPEG2000 files built from lossy
camera–generated JPEG files. A consumer area array digital
camera is used to produce images ..."
Needless to say, neither Project Gutenberg nor Wikisource is
mentioned in this article. Their goals are just too different
(what? free content?), their achievements not impressive enough.
They are not potential future employers of "digital library"
scholars. If you help them or cooperate with them, you will only
help mankind in an altruistic fashion (what fools!), you will not
help your own professional or academic career.
In the article, the Open Content Alliance already plays the role
of the fools. They have only (!) digitized 100,000 books, while
Google Books has millions. They do not provide the same search
capability. And so it goes on. The next time the Internet Archive
(OCA) applies for funding or tries to establish cooperations with
more institutions, such arguments might be used against them.
What Wikisource really needs to do, is to provide an explanation
of what it does, and how this goes beyond Google Books' "access
digitization". In Europe, this must be set in the perspective of
ongoing French, German and EU initiatives (Gallica, Theseus,
Quaero, Europeana, ...). When one of those projects applies for
funding, it will need to show that it is successful in attracting
cooperation partners and that it is a leader among similar
projects. We should be prepared that they take any opportunity to
define Wikisource as a loser, amateurish, clueless project. This
is not because they are evil, only because they do what they can
to get the funding they need.
Why should museum X or library Y or archive Z cooperate with
Wikisource, when it risks being associated with such descriptions
of failure? The alternative for that institution might be to
cooperate with the successful Google or Gallica. So why is
Wikisource superior? This is what we need to explain.
> * develop arguments for museums etc...
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se