[Foundation-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

Wed Aug 12 17:15:04 UTC 2009

Hello,

This discussion is very interesting. I would like to make a summary, so
that we can go further.

1. A database of all books ever published is one of the things still missing.
2. This needs massive collaboration by thousands of volunteers, so a
wiki might be appropriate, however...
3. The data needs a structured web site, not a plain wiki like MediaWiki.
4. A big part of this data is already available, but scattered across
various databases, in various languages, with various protocols, etc. So
a big part of the work needs as much database management knowledge as
librarian knowledge.
5. What is most missing in these existing databases (IMO) is information
about translations: nowhere is there a general database of translated
works, at least not in English or French. It is very difficult to find
out whether a translation exists for a given work. Wikisource has some of
this information through interwiki links between work and author pages,
but only for a (very) small number of works and authors.
6. It would be best not to duplicate work in several places.
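
To make points 3 to 5 more concrete, here is a rough sketch of what a
structured record with translation links and wiki-like merges could look
like. It is only an illustration under my own assumptions: the names
Work, WorkRegistry, translation_of and the identifiers W1, W2, W3 are
hypothetical, and this is not how Open Library or Wikisource store their
data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Work:
    """One entry in a hypothetical authority file of works."""
    work_id: str
    title: str
    language: str
    translation_of: Optional[str] = None  # work_id of the original, if any


class WorkRegistry:
    """In-memory registry supporting merges with a simple change log."""

    def __init__(self) -> None:
        self.works: Dict[str, Work] = {}
        self.redirects: Dict[str, str] = {}  # merged id -> surviving id
        self.history: List[str] = []

    def add(self, work: Work) -> None:
        self.works[work.work_id] = work
        self.history.append(f"add {work.work_id}")

    def resolve(self, work_id: str) -> str:
        """Follow redirects left behind by merges."""
        while work_id in self.redirects:
            work_id = self.redirects[work_id]
        return work_id

    def merge(self, duplicate_id: str, target_id: str) -> None:
        """Fold a duplicate entry into the target, keeping a redirect."""
        duplicate_id = self.resolve(duplicate_id)
        target_id = self.resolve(target_id)
        if duplicate_id == target_id:
            return
        del self.works[duplicate_id]
        self.redirects[duplicate_id] = target_id
        self.history.append(f"merge {duplicate_id} -> {target_id}")

    def translations_of(self, work_id: str) -> List[Work]:
        """List entries whose original resolves to the given work."""
        target = self.resolve(work_id)
        return [
            w for w in self.works.values()
            if w.translation_of is not None
            and self.resolve(w.translation_of) == target
        ]


registry = WorkRegistry()
registry.add(Work("W1", "Adventures of Huckleberry Finn", "en"))
# A duplicate entry for the same work, created by another contributor.
registry.add(Work("W2", "The Adventures of Huckleberry Finn", "en"))
registry.add(Work("W3", "Huckleberry Finns äventyr", "sv", translation_of="W2"))
registry.merge("W2", "W1")
titles = [w.title for w in registry.translations_of("W1")]
```

After the merge, the Swedish entry is still found from W1 because the old
identifier W2 redirects to it, which is the kind of behaviour a wiki gives
for renames and merges.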

Personally I don't find OL very practical. Maybe I am too used to
MediaWiki. ;oD

We still need to create something attractive to contributors and
readers alike.

Yann

Samuel Klein wrote:
>> This thread started out with a discussion of why it is so hard to
>> start new projects within the Wikimedia Foundation.  My stance is
>> that projects like OpenStreetMap.org and OpenLibrary.org are doing
>> fine as they are, and there is no need to duplicate their effort
>> within the WMF.  The example you gave was this:
> 
> I agree that there's no point in duplicating existing functionality.
> The best solution is probably for OL to include this explicitly in
> their scope and add the necessary functionality.   I suggested this on
> the OL mailing list in March.
>    http://mail.archive.org/pipermail/ol-discuss/2009-March/000391.html
> 
>>>>>>> *A wiki for book metadata, with an entry for every published
>>>>>>> work, statistics about its use and siblings, and discussion
>>>>>>> about its usefulness as a citation (a collaboration with
>>>>>>> OpenLibrary, merging WikiCite ideas)
>> To me, that sounds exactly as what OpenLibrary already does (or
>> could be doing in the near time), so why even set up a new project
>> that would collaborate with it?  Later you added:
> 
> However, this is not what OL or its wiki do now.  And OL is not run by
> its community, the community helps support the work of a centrally
> directed group.  So there is only so much I feel I can contribute to
> the project by making suggestions.  The wiki built into the fiber of
> OL is intentionally not used for general discussion.
> 
>> I was talking about the metadata for all books ever published,
>> including the Swedish translations of Mark Twain's works, which
>> are part of Mark Twain's bibliography, of the translator's
>> bibliography, of American literature, and of Swedish language
>> literature.  In OpenLibrary all of these are contained in one
>> project.  In Wikisource, they are split in one section for English
>> and another section for Swedish.  That division makes sense for
>> the contents of the book, but not for the book metadata.
> 
> This is a problem that Wikisource needs to address, regardless of
> where the OpenLibrary metadata goes.  It is similar to the Wiktionary
> problem of wanting some content - the array of translations of a
> single definition - to exist in one place and be transcluded in each
> language.
> 
>> Now you write:
>>
>>> However, the project I have in mind for OCR cleaning and
>>> translation needs to
>> That is a change of subject. That sounds just like what Wikisource
>> (or PGDP.net) is about.  OCR cleaning is one thing, but it is an
>> entirely different thing to set up "a wiki for book metadata, with
>> an entry for every published work".  So which of these two project
>> ideas are we talking about?
> 
> They are closely related.
> 
> There needs to be a global authority file for works -- a [set of]
> universal identifier[s] for a given work in order for wikisource (as
> it currently stands) to link the German translation of the English
> transcription of OCR of the 1998 photos of the 1572 Rotterdam Codex...
> to its metadata entry [or entries].
> 
> I would prefer for this authority file to be wiki-like, as the
> Wikipedia authority file is, so that it supports renames, merges, and
> splits with version history and minimal overhead; hence I wish to see
> a wiki for this sort of metadata.
> 
> Currently OL does not quite provide this authority file, but it could.
>  I do not know how easily.
> 
>> Every book ever published means more than 10 million records.
>> (It probably means more than 100 million records.) OCR cleaning
>> attracts hundreds or a few thousand volunteers, which is
>> sufficient to take on thousands of books, but not millions.
> 
> Focusing efforts on notable works with verifiable OCR, and using the
> sorts of helper tools that Greg's paper describes, I do not doubt that
> we could effectively clean and publish OCR for all primary sources
> that are actively used and referenced in scholarship today (and more
> besides).  Though 'we' here is the world - certainly more than a few
> thousand volunteers have at least one book they would like to polish.
> Most of them are not currently Wikimedia contributors, that much is
> certain -- we don't provide any tools to make this work convenient or
> rewarding.
> 
>> Google scanned millions of books already, but I haven't heard of
>> any plans for cleaning all that OCR text.
> 
> Well, Google does not believe in distributed human effort.  (This came
> up in a recent Knol thread as well.)  I'm not sure that is the best
> comparison.
> 
> SJ

-- 
http://www.non-violence.org/ | Collaborative site on non-violence
http://www.forget-me.net/ | Alternatives on the Net
http://fr.wikisource.org/ | Free library
http://wikilivres.info | Free documents