Gregory Maxwell wrote:
> For digitizing what?
Exactly, that's the first question.
> Archive.org digitizes books using a pair of Canon 1Ds (? perhaps
> it was a 5D? In any case the 5DII would be sufficient now) on a
> custom stand with a hacked up copy of gphoto2 to actuate the
> cameras.
That's Brewster Kahle doing things many years ago (2002? 2003?).
Today, a much cheaper low-end digital SLR, or even compact cameras
will give you the needed 10 or so megapixels. But again, if you
need to pay your staff, a ten times more expensive camera might
easily pay its own cost in increased speed, or increased shutter
lifespan.
> I'm not sure how they're dealing with curvature (I think they
> just may lay a glass plate on the pages), but it would be easy
> enough to solve using a laser pointer with a pattern generating
> holographic grating and a second exposure to capture the page
> distortion and some fairly simple software processing after the
> fact.
The Internet Archive apparently uses a fixed glass and lowers the
book cradle to turn pages, http://aipengineering.com/scribe/
Other designs have a fixed book cradle and lift the glass, e.g.
the Atiz DIY, http://diy.atiz.com/
I thought the Internet Archive design was very clever, since it
keeps a fixed distance from lens to book surface (beneath the
glass), until I saw bkrpr.org, where you just lift everything.
That's a design for 2009! I haven't tried to build one myself yet.
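Incidentally, actuating a pair of cameras the way the Archive's rig does no longer needs a hacked-up gphoto2; the stock command line can trigger and download shots. A minimal sketch, with hypothetical USB port strings (list your own with `gphoto2 --auto-detect`):

```python
import subprocess

def capture_command(port, filename):
    """gphoto2 invocation that fires one camera and downloads the shot."""
    return ["gphoto2", "--port", port,
            "--capture-image-and-download", "--filename", filename]

# Hypothetical ports; check `gphoto2 --auto-detect` on your own setup.
left_cmd = capture_command("usb:001,004", "left-%n.jpg")
right_cmd = capture_command("usb:001,005", "right-%n.jpg")

print(" ".join(left_cmd))
# Uncomment to actually fire both cameras, one after the other:
# subprocess.run(left_cmd, check=True)
# subprocess.run(right_cmd, check=True)
```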
----
However, you can capture lots of books (that can be opened fully)
with a single camera, laying the book flat on a table with a glass
on top. That's just like a flatbed scanner (but much faster)
turned upside down.
In January 2008, I used a 10 megapixel Canon EOS 400D (Digital
Rebel XTi) with a 50 mm lens to shoot this book, lying flat on a
table under a glass, http://runeberg.org/stridfin/0226.html
On that webpage, the image is reduced to 120 dpi (1.2 megapixel),
but the original is 300 dpi (7.5 megapixel). The map shown is
reused in http://en.wikipedia.org/wiki/Battle_of_Alavus
That's an example of how one specialized book can be very useful
for a limited Wikiproject. This book was published in 1909 for the
100th anniversary of the Finnish War (1808-1809), and digitized in
2008 for the 200th anniversary.
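Those resolution figures check out: pixel count scales with the square of the dpi ratio, so reducing a 300 dpi scan to 120 dpi keeps (120/300)^2 = 16% of the pixels.

```python
original_mp = 7.5            # megapixels at 300 dpi
scale = (120 / 300) ** 2     # area scales with the square of the dpi ratio
reduced_mp = round(original_mp * scale, 2)
print(reduced_mp)  # → 1.2
```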
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/
Lars,
I think we agree on what needs to happen. The only thing I am not
sure of is where you would like to see the work take place. I have
raised versions of this issue with the Open Library list, which I copy
again here (along with the people I know who work on that fine project
- hello, Peter and Rebecca). This is why I listed it below as a good
group to collaborate with.
However, the project I have in mind for OCR cleaning and translation needs to
- accept public comments and annotation about the substance or use of
a work (the wiki covering their millions of metadata entries is very
low traffic and used mainly to address metadata issues in their
records)
- handle OCR as editable content, or translations of same
- provide a universal ID for a work, with which comments and
translations can be associated (see
https://blueprints.launchpad.net/openlibrary/+spec/global-work-ids)
- handle citations, with the possibility of developing something like WikiCite
Let's take a practical example. A classics professor I know (Greg
Crane, copied here) has scans of primary source materials, some with
approximate or hand-polished OCR, waiting to be uploaded and converted
into a useful online resource for editors, translators, and
classicists around the world.
Where should he and his students post that material?
Wherever they end up, the primary article about each article would
surely link out to the OL and WS pages for each work (where one
exists).
> (Plus you would have to motivate why a copy of OpenLibrary should
> go into the English Wikisource and not the German or French one.)
I don't understand what you mean -- English source materials and
metadata go on en:ws, German on de:ws, &c. How is this different from
what happens today?
SJ
On Mon, Aug 3, 2009 at 1:18 PM, Lars Aronsson<lars(a)aronsson.se> wrote:
> Samuel Klein wrote (in two messages):
>
>> >> *A wiki for book metadata, with an entry for every published
>> >> work, statistics about its use and siblings, and discussion
>> >> about its usefulness as a citation (a collaboration with
>> >> OpenLibrary, merging WikiCite ideas)
>
>> I could see this happening on Wikisource.
>
> Why could you not see this happening within the existing
> OpenLibrary? Is there anything wrong with that project? It sounds
> to me as if you would just copy (fork) all their book data, but for
> what gain?
>
> (Plus you would have to motivate why a copy of OpenLibrary should
> go into the English Wikisource and not the German or French one.)
>
>
> --
> Lars Aronsson (lars(a)aronsson.se)
> Aronsson Datateknik - http://aronsson.se
>
> _______________________________________________
> foundation-l mailing list
> foundation-l(a)lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
>
Andrew Turvey wrote:
> We had a discussion at a recent Wikimedia UK board meeting about
> potentially buying some digitisation equipment which could be
> used to generate content for the Wikimedia projects. This recent
> email to the EN-WP list sparked my interest.
>
> Does anyone have any experience with equipment like this, and
> could you recommend anything? Any idea what the price range and
> quality typically is?
>
> Also, is anyone else in the Wikimedia community currently doing
> this?
I'm on the board of Wikimedia Sverige (Sweden), and also the
founder (in 1992) of Project Runeberg, the Scandinavian offspring
of Project Gutenberg. The Swedish-language Wikisource isn't doing
much, because Project Runeberg still does a lot of book scanning.
Its archive now contains images of 550,000 book pages,
corresponding to nearly 30 linear metres of shelving.
Book digitization is a matter of using the right tools for each
job. Much depends on the kind of book and the kind of labour. If
you use unpaid volunteers, you can afford slower equipment. If you
need to pay your staff, any equipment that speeds up the work will
quickly pay its own cost, including those very expensive
"professional" book scanners. On the scale of Google Book Search,
aiming to digitize millions of books, it pays off to let Google
engineers work on developing even faster equipment, just like
Google develops its own Linux-based storage architecture.
It's hard to measure the usefulness of a digitized book, since
Wikisource (and Project Runeberg) doesn't have any income. (And
neither does Google Book Search, I believe.) If your success is
measured in how much money you spend (as some charities have it),
it is very easy to invest a lot of money, without much result.
The worst you can do is to spend a lot to digitize something that
is already available for free download. Look around first.
You should start by asking what you want to achieve. Is
there some book, or genre of books, that would be really useful
for Wikipedia to have on Wikisource? Anything British that all
those American projects haven't already covered? For us from
non-English speaking countries, it's far easier. Very little has
been digitized, so there is a lot to do. The most useful thing is
to digitize an old encyclopedia, just like that 11th edition of
Encyclopaedia Britannica (from 1911).
Now, encyclopedias are common items in used bookstores or online
auctions. You can buy 20 volumes for 200 euro, or even cheaper.
At that price, the best investment is a paper cutter (or ask a
print shop to help you) and a two-sided (duplex) sheet-feeding
scanner, such as the Canon DR-2050C or Fujitsu Scansnap s510.
http://www.youtube.com/watch?v=1oH3mQZLpL8
OCR software might be included with the scanner. Or you can buy
ABBYY FineReader (www.finereader.com) for 160 euro.
The total investment would be less than 1000 euro (scanner + OCR
software + 20 volume encyclopedia + ask a print shop to cut the
spines). After this, you only need hours of volunteer work.
That's how I digitized the "New Student's Reference Work" (from
1914, 5 volumes, some 2500 pages) for Wikisource in 2005, only to
show that Wikisource could be used that way.
Here are some old pictures (with Swedish text, from 2001),
http://runeberg.org/admin/snuff.html
These scanners and that OCR software are not open source products,
but neither is my digital camera, and I use that to produce free
pictures for Wikimedia Commons. I know there have been many
attempts to make free OCR software, but is it any good?
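The best-known free candidate is Tesseract (an old HP engine, open-sourced in 2005); whether its quality is good enough is exactly the open question. Driving it is a one-line command, sketched here from Python with placeholder file names:

```python
def ocr_command(image_path, output_base, lang="eng"):
    """Tesseract writes its recognized text to <output_base>.txt."""
    return ["tesseract", image_path, output_base, "-l", lang]

cmd = ocr_command("page-0226.png", "page-0226")
print(" ".join(cmd))
# To actually run it (needs tesseract on the PATH and a real image):
# import subprocess; subprocess.run(cmd, check=True)
```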
In fact, if you have books where you can't afford to cut the
spine, maybe some rare thing that you only find in a library, a 10
megapixel digital camera is very useful. You need to experiment a
little with tripod stands and good lights. You only need to open
the book at 90 degrees, to get a good view of a page, which is
much friendlier to the book than an old flatbed scanner (and
faster). If you have two cameras, you can shoot left pages with
one, and right pages with the other. That's in fact how the
fastest modern "book scanners" work. Google builds their own, and
so can you. Some radical ideas are found on http://bkrpr.org/
Again, a pair of digital cameras is a total investment of less
than 1000 euro. That's a good starting range. You can achieve a
lot, and learn even more, in very little time.
What can you do with 2000 euro? Buy one sheet-feeding set, and
one pair of digital cameras. Let two teams compete against each
other. Write a report for next year's Wikimania. Have great fun!
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/
Wikimedia Sverige - stöd fri kunskap - http://wikimedia.se/
Joshua Gay wrote:
> David Strauss did a quick implementation (basically a demo) of an
> OpenLibrary extension for MediaWiki. With very little code, he was
> able to search the OL (via AJAX), and when the user selected a given
> result, it populated a Citation template. What was nice is that when no
> results came up for a given search, there was an "add to open library"
> button that brought you to the OL site to add your bibliographic
> information.
Interesting, I didn't know that. Is this demo available somewhere?
Yann
--
http://www.non-violence.org/ | Site collaboratif sur la non-violence
http://www.forget-me.net/ | Alternatives sur le Net
http://fr.wikisource.org/ | Bibliothèque libre
http://wikilivres.info | Documents libres
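The search-and-cite flow Joshua describes is easy to picture even without the demo. A minimal sketch, assuming Open Library's public JSON search endpoint and its usual result fields (`title`, `author_name`, `first_publish_year`); the record below is canned, not a live response:

```python
from urllib.parse import urlencode

def search_url(query):
    """Open Library's JSON search endpoint (field names per the public API)."""
    return "https://openlibrary.org/search.json?" + urlencode({"q": query})

def cite_template(doc):
    """Fill a MediaWiki {{cite book}} call from one search result."""
    return ("{{cite book |title=%s |author=%s |year=%s}}"
            % (doc["title"], doc["author_name"][0], doc["first_publish_year"]))

# Canned result standing in for a live API response:
doc = {"title": "The Finnish War", "author_name": ["J. Doe"],
       "first_publish_year": 1909}
print(search_url("finnish war"))
print(cite_template(doc))
```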
This seems like an amazing chance for WikiProjects in almost any area.
(especially incubator projects ;)
You need to describe how your work supports open education, set a
project with milestones and metrics for success, and submit a grant
request:
http://blogs.talis.com/education/incubator/guidelines/
We do many of the things they ask for - licensing, educational focus,
making things visible and findable online - reflexively. It would be
great to see them get a whole spectrum of Wikimedia proposals. [If
you /do/ submit one, consider posting a version of it on
strategy.wikimedia.org as well.]
SJ
---------- Forwarded message ----------
From: Brianna Laugher <brianna.laugher(a)gmail.com>
Date: Wed, Aug 19, 2009 at 10:08 PM
Subject: [Internal-l] Talis Incubator for Open Education funding available
To: "Local Chapters, board and officers coordination (closed
subscription)" <internal-l(a)lists.wikimedia.org>, chapters(a)wikimedia.ch
Via the Creative Commons blog - http://creativecommons.org/weblog/entry/17005
Talis Incubator for Open Education
For the latest news follow us on twitter: @talisincubator
Talis understands the growing importance of the Open Education
movement and its potential impact on how education is accessed,
assessed and certified.
Aimed at individuals or small groups, the Talis Incubator for Open
Education provides angel funding and other forms of assistance for
ideas and projects that have the potential to further the cause of
Open Education through the use of technology. All we ask in return is
that you donate or ‘open source’ the intellectual property generated
back to the communities that could benefit most from your work.
The brief
1. Write a proposal outlining your Open Education related project
or idea, making a bid for funding of between £1,000 and £15,000.
2. After reviewing and making sure your proposal meets the
guidelines, submit it to incubator(a)talis.com.
3. A proposal review board made up of independent thought leaders
and Talis representatives decide which projects get funding.
4. For successful bids, Talis awards you the funds and organises
any other help you have asked for.
5. Complete the project according to the schedule outlined in your proposal.
6. Talis helps you to make sure your work is disseminated amongst
the community.
from http://blogs.talis.com/education/incubator/
they also note "We also welcome applications from outside the UK,
however we regret that we can only consider and award amounts in GBP
(£), so if you are from outside the UK please account for exchange
rate fluctuation, and make sure you can receive funds paid in GBP."
Looks a bit interesting!
cheers
Brianna
--
They've just been waiting in a mountain for the right moment:
http://modernthings.org/
_______________________________________________
Internal-l mailing list
Internal-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/internal-l
Scary Transclusion is not going to be enabled anytime soon at WMF, if ever.
However, it is possible to fake cross-domain transclusion using
javascript.
I wrote a set of javascript functions that do this, using the new API.
You can see it working on la.ws.
It uses a template that transcludes a single page from another domain.
http://la.wikisource.org/wiki/Usor:ThomasV/test
http://la.wikisource.org/wiki/Usor:ThomasV/test3
As you can see, the template inserts blank lines between pages, so it
is not appropriate if you want to transclude many pages. Here is
another template that transcludes many pages using the 'pages' syntax:
http://la.wikisource.org/wiki/Usor:ThomasV/test2
These templates should work on all subdomains that call the OCR.js
page at ws.org.
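(ThomasV's actual javascript is not reproduced here, but the mechanism underneath is a plain MediaWiki API request; the JSONP `callback` parameter is what lets a browser script read the result across domains. A Python sketch of the request such a script would build; the callback name is illustrative:

```python
from urllib.parse import urlencode

def transclusion_url(domain, page):
    """MediaWiki API request for the parsed HTML of one page on another
    wiki; 'callback' asks for a JSONP response a browser script can read."""
    params = {"action": "parse", "page": page, "format": "json",
              "callback": "insertTranscludedText"}  # illustrative name
    return "http://%s/w/api.php?%s" % (domain, urlencode(params))

url = transclusion_url("en.wikisource.org", "Page:Example.djvu/5")
print(url)
```
)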
ThomasV
Keeping a copy to wikisource-l. Yann
-------- Original Message --------
Subject: Re: [Foundation-l] Open Library, Wikisource, and cleaning and
translating OCR of Classics
Date: Thu, 13 Aug 2009 01:48:37 -0400
DGG, I appreciate your points. Would we be so motivated by this
thread if it weren't a complex problem?
The fact that all of this is quite new, and that there are so many
unknowns and gray areas, actually makes me consider it more likely
that a body of wikimedians, experienced with their own form of
large-scale authority file coordination, is in a position to say
something meaningful about how to achieve something similar for tens
of millions of metadata records.
> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.
In some areas that is certainly so. In others, Wikimedia communities
have useful recent experience. I hope that those who understand these
problems on both sides recognize the importance of sharing what they
know openly -- and showing others how to understand them as well. We
will not succeed as a global community if we say that this class of
problems can only be solved by the limited group of people with an MLS
and a few years of focused training. (how would you name the sort of
training you mean here, btw?)
SJ
On Thu, Aug 13, 2009 at 12:57 AM, David Goodman<dgoodmanny(a)gmail.com> wrote:
> Yann & Sam
>
> The problem is extraordinarily complex. A database of all "books"
> (and other media) ever published is beyond the joint capabilities of
> everyone interested. There are intermediate entities between "books"
> and "works", and important subordinate entities, such as "article",
> "chapter", and those like "poem", which could be at any of several
> levels. This is not a job for amateurs, unless they are prepared to
> first learn the actual standards of bibliographic description for
> different types of material, and to at least recognize the
> inter-relationships, and the many undefined areas. At research
> libraries, one allows a few years of training for a newcomer with just
> an MLS degree to work with a small subset of this. I have thirty years
> of experience in related areas of librarianship, and I know only
> enough to be aware of the problems.
> For an introduction to the current state of this, see
> http://www.rdaonline.org/constituencyreview/Phase1Chp17_11_2_08.pdf.
>
> The difficulty of merging the many thousands of partially correct
> and incorrect sources of available data typically requires the manual
> resolution of each of the tens of millions of instances.
>
> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.
>
> David Goodman, Ph.D, M.L.S.
> http://en.wikipedia.org/wiki/User_talk:DGG
--
http://www.non-violence.org/ | Site collaboratif sur la non-violence
http://www.forget-me.net/ | Alternatives sur le Net
http://fr.wikisource.org/ | Bibliothèque libre
http://wikilivres.info | Documents libres
Keeping the wikisource list in cc: . SJ
On Wed, Aug 12, 2009 at 11:05 AM, Samuel Klein<meta.sj(a)gmail.com> wrote:
> On Wed, Aug 12, 2009 at 10:14 AM, Karen Coyle<kcoyle(a)kcoyle.net> wrote:
>> Just a few comments on OL plans....
>
> Thank you!
>
>
>>> * version history for manifestations (latest cleaned up version of a
>>> file) and expressions (latest cleaned up translation of a work)
>>> ** links to manifestations archived elsewhere, if they are not
>>> mirrored by the OL/IA for some reason
>>>
>>
>> Is this referring to the metadata or the full text?
>
> Both. (both can be edited by people, or updated/cleaned by
> context-aware or cross-language info-retrieval scripts)
>
>>> * providing a namespace and format for collections and lists of
>>> works; as a normalized way of identifying collections in which a given
>>> work has been included. This is slightly different in use, intent,
>>> and visualization than classification categories. There might be a
>>> couple dozen subject categories for a complex work, but it could have
>>> hundreds of associations with collections, awards, designations, &c.
>>>
>>
>> Yes, this is part of the "lists" function, in development, although the
>> details have not been fully worked out. We've noted that the NY Times
>> has made its best seller lists available, so that makes sense as a
>> collection; Pulitzer prizes, Booker prize, etc. All of these should form
>> lists or collections within OL. Plus users should be able to create any
>> lists, bibliographies, etc.
>
> Excellent - do you have a link to recent discussion?
>
>
>> Adding discussion pages has been discussed. There are two things here:
>> discussion on OL about the books, and discussion about the OL project.
>> As for the latter, more than discussion perhaps we need a place where
>> people share uses of OL, changes they've made to OL (all of the
>> templates are editable by anyone, although you need to share those
>> edits... I don't think we've explained this well, and definitely haven't
>> done enough to foster a community of users.). A kind of community space.
>> Yes, this is really needed.
>
> So, why not make this one use of the OL wiki? discussion about a work
> and about the project will regularly overlap as style guidelines and
> community dynamics play out.
>
>
>> OL would like to show metadata in
>> the preferred language of the user. That presents lots of issues,
>> starting with the one of: what if there isn't any metadata in the
>> language of the user? But also how you do this AND give the user an idea
>> of the origins of the work (first publication date and place and
>> language). Wikipedia is able to do this because its data is created by
>> people. OL is working with metadata created for individual editions that
>> doesn't link easily to the work. Where there is a wikipedia entry for
>> the work OL may be able to use that to determine the origins, but in
>> many cases no such entry will be available.
>
> This seems solvable - define the style you'd recommend people create
> by hand where they have the time; and write scripts that can
> approximate this where there is limited data. Script-assisted people
> can do tremendous amounts of work, category by category.
>
>> In any case, all of this is being discussed and considered. Since email
>> is so non-sticky, would the OL blog be a good place to provide more of
>> this information and discussion?
>
> A blog isn't sufficiently sticky for my tastes -- limited permalinks,
> no version history or diffs, limited capacity for collaboration
> directly on ideas, texts, and overviews; poor namespace control for
> naming and classifying discussions; and limited interlinks between
> different posts/comments/contributors.
>
> Let's please use something at least as sticky as a wiki. [NTS: we
> need a term for collaboration environments that parallels
> "Turing-complete" to describe anything that can mimic a set of basic
> wiki services.]
>
> SJ
>
Onion sourcing. That would be a nice improvement on simple cite styles.
On Tue, Aug 11, 2009 at 12:10 PM, Gregory Crane<gregory.crane(a)tufts.edu> wrote:
> There are various layers to this onion. The key element is that books and
> pages are artifacts in many cases. What we really want are the logical
> structures that splatter across pages.
And across and around works...
> First, we have added a bunch of content -- esp. editions of Greek and Latin
> sources -- to the Internet Archive holdings and we are cataloguing editions
> that are the overall collection, regardless of who put them there. This goes
> well beyond the standard book catalogue records -- we are interested in the
> content not in books per se. Thus, we may add hundreds of records for a
Is there a way to deep link to a specific page-image from one of these
works without removing it from the Internet Archive?
> We would like to have useable etexts from all of these editions -- many of
> which are not yet in our collections. Many of these are in Greek and need a
> lot of work because the OCR is not very good.
So bad OCR for them exists, but no usable etexts?
> To use canonical texts, you need book/chapter/verse markup and you need
> FRBR-like citations ... deep annotations... syntactic analyses, word sense,
> co-reference...
These are nice features, but perhaps you can develop a clean etext
first, and overlay this metadata in parallel or later on.
> My question is what environments can support contributions at various
> levels. Clearly, proofreading OCR output is standard enough.
>
> If you want to get a sense of what operations need ultimately to be
> supported, you could skim
> http://digitalhumanities.org/dhq/vol/3/1/000035.html.
That's a good question. What environments currently support OCR
proofreading and translation, and direct links to page-images of the
original source? This is doable, with no special software or tools,
via wikisource (in multiple languages, with interlanguage links and
crude paragraph alignment) and commons (for page images). The pages
could also be stored in other repositories such as the Archive, as
long as there is an easy way to link out to them or transclude
thumbnails. [maybe an InstaCommons plugin for the Internet Archive?]
That's quite an interesting monograph you link to. I see six main
sets of features/operations described there. Each of them deserves a
mention in Wikimedia's strategic planning. Aside from language
analysis, each has significant value for all of the Projects, not just
wikisource.
OCR TOOLS
* OCR optimization: statistical data, page layout hints
* Capturing page layout logical structures
CROSS-REFERENCING
* Quote, source, plagiarism identification.
* Named entity identification (automatic for some entities? hints)
* Automatic linking (of urls, abbrv. citations, &c), markup projection
TEXT ALIGNMENT
* Canonical text services (chapter/verse equivalents)
* Version analysis between versions.
* Translation alignment
TRANSLATION SUPPORT
* Automated translation (seed translations, hints for humans)
* Translation dictionaries (on mouseover?)
CROSS-LANGUAGE SEARCHING
* Cross-referencing across translations
* Quote identification across translations
LANGUAGE ANALYSIS
* Word analysis: word sense discovery, morphology.
* Sentence analysis: syntactic, metrical (poetry)
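Of those operations, version analysis is the simplest to sketch: Python's standard difflib can already align two versions of a line and report where they differ, which is the core of comparing OCR passes or editions.

```python
import difflib

# Two versions of the same verse, differing in one word.
v1 = "Arma virumque cano, Troiae qui primus ab oris"
v2 = "Arma virumque cano, Troiae qui primis ab oris"

matcher = difflib.SequenceMatcher(None, v1.split(), v2.split())
similarity = round(matcher.ratio(), 2)
# Every opcode that is not 'equal' marks a divergence between versions.
changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
print(similarity, changes)
```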
> Greg
>
> John Vandenberg wrote:
>>
>> On Tue, Aug 11, 2009 at 3:00 PM, Samuel Klein<meta.sj(a)gmail.com> wrote:
>>
>>>
>>> ...
>>> Let's take a practical example. A classics professor I know (Greg
>>> Crane, copied here) has scans of primary source materials, some with
>>> approximate or hand-polished OCR, waiting to be uploaded and converted
>>> into a useful online resource for editors, translators, and
>>> classicists around the world.
>>>
>>> Where should he and his students post that material?
>>>
>>
>> I am a bit confused. Are these texts currently hosted at the Perseus
>> Digital Library?
>>
>> If so, they are already a useful online resource. ;-)
>>
>> If they would like to see these primary sources pushed into the
>> Wikimedia community, they would need to upload the images (or DjVu)
>> onto Commons, and the text onto Wikisource where the distributed
>> proofreading software resides.
>>
>> We can work with them to import a few texts in order to demonstrate
>> our technology and preferred methods, and then they can decide whether
>> they are happy with this technology, the community, and the potential
>> for translations and commentary.
>>
>> I made a start on creating a Perseus-to-Wikisource importer about a year
>> ago...!
>>
>> Or they can upload the djvu to the Internet Archive, or a similar
>> depository, and see where it goes from there.
>>
>>
>>>
>>> Wherever they end up, the primary article about each article would
>>> surely link out to the OL and WS pages for each work (where one
>>> exists).
>>>
>>
>> Wikisource has been adding OCLC numbers to pages, and adding links to
>> archive.org when the djvu files came from there (these links contain
>> an archive.org identifier). There are also links to LibraryThing and
>> Open Library; we have very few rules ;-)
>>
>> --
>> John Vandenberg
>>
>
>