[Wikisource-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

12 Aug 2009

Hello,

This discussion is very interesting. I would like to make a summary, so
that we can go further.

1. A database of all books ever published is one of the things still missing.
2. This needs massive collaboration by thousands of volunteers, so a
wiki might be appropriate, however...
3. The data needs a structured web site, not a plain wiki like MediaWiki.
4. A big part of this data is already available, but scattered across
various databases, in various languages, with various protocols, etc. So
a big part of the work needs as much database management knowledge as
librarian knowledge.
5. What is most missing in these existing databases (IMO) is information
about translations: nowhere is there a general database of translated
works, at least not in English or French. It is very difficult to find
out whether a translation exists for a given work. Wikisource has some of
this information through interwiki links between work and author pages,
but only for a (very) small number of works and authors.
6. It would be best not to duplicate work in several places.

Personally I don't find OL very practical. Maybe I am too used to
MediaWiki. ;oD

We still need to create something that is attractive to contributors and
readers alike.

Yann

Samuel Klein wrote:
...
 This thread started out with a discussion of why it is so hard to
 start new projects within the Wikimedia Foundation.  My stance is
 that projects like OpenStreetMap.org and OpenLibrary.org are doing
 fine as they are, and there is no need to duplicate their effort
 within the WMF.  The example you gave was this:

 I agree that there's no point in duplicating existing functionality.
 The best solution is probably for OL to include this explicitly in
 their scope and add the necessary functionality.   I suggested this on
 the OL mailing list in March.
    http://mail.archive.org/pipermail/ol-discuss/2009-March/000391.html

>>>> *A wiki for book metadata, with an entry for every published
>>>> work, statistics about its use and siblings, and discussion
>>>> about its usefulness as a citation (a collaboration with
>>>> OpenLibrary, merging WikiCite ideas)

 To me, that sounds exactly like what OpenLibrary already does (or
 could be doing in the near future), so why even set up a new project
 that would collaborate with it?  Later you added:

 However, this is not what OL or its wiki do now.  And OL is not run by
 its community; the community helps support the work of a centrally
 directed group.  So there is only so much I feel I can contribute to
 the project by making suggestions.  The wiki built into the fiber of
 OL is intentionally not used for general discussion.

 I was talking about the metadata for all books ever published,
 including the Swedish translations of Mark Twain's works, which
 are part of Mark Twain's bibliography, of the translator's
 bibliography, of American literature, and of Swedish language
 literature.  In OpenLibrary all of these are contained in one
 project.  In Wikisource, they are split in one section for English
 and another section for Swedish.  That division makes sense for
 the contents of the book, but not for the book metadata.

 This is a problem that Wikisource needs to address, regardless of
 where the OpenLibrary metadata goes.  It is similar to the Wiktionary
 problem of wanting some content - the array of translations of a
 single definition - to exist in one place and be transcluded in each
 language.

 Now you write:

 However, the project I have in mind for OCR cleaning and
 translation needs to

 That is a change of subject. That sounds just like what Wikisource
 (or PGDP.net) is about.  OCR cleaning is one thing, but it is an
 entirely different thing to set up "a wiki for book metadata, with
 an entry for every published work".  So which of these two project
 ideas are we talking about?

 They are closely related.

 There needs to be a global authority file for works -- a [set of]
 universal identifier[s] for a given work in order for wikisource (as
 it currently stands) to link the German translation of the English
 transcription of OCR of the 1998 photos of the 1572 Rotterdam Codex...
 to its metadata entry [or entries].

 I would prefer for this authority file to be wiki-like, as the
 Wikipedia authority file is, so that it supports renames, merges, and
 splits with version history and minimal overhead; hence I wish to see
 a wiki for this sort of metadata.

 Currently OL does not quite provide this authority file, but it could.
  I do not know how easily.
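
 To make the "wiki-like authority file" idea concrete, here is a
 minimal sketch in Python. It is only an illustration under stated
 assumptions: the names WorkRecord and AuthorityFile, the identifiers
 "w1" and "w2", and the in-memory storage are all hypothetical and do
 not describe how Open Library or Wikisource actually work. It shows
 renames and merges recorded in an append-only history, with merged
 identifiers kept alive as redirects. Splits are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkRecord:
    """One entry in a hypothetical authority file for works."""
    work_id: str
    title: str
    merged_into: Optional[str] = None  # set when this record is merged away


class AuthorityFile:
    """Toy in-memory authority file with a version history (illustrative only)."""

    def __init__(self):
        self.records = {}
        self.history = []  # append-only log of every change

    def create(self, work_id, title):
        self.records[work_id] = WorkRecord(work_id, title)
        self.history.append(("create", work_id, title))

    def resolve(self, work_id):
        """Follow merge redirects until reaching the current identifier."""
        while self.records[work_id].merged_into is not None:
            work_id = self.records[work_id].merged_into
        return work_id

    def rename(self, work_id, new_title):
        record = self.records[self.resolve(work_id)]
        self.history.append(("rename", record.work_id, record.title, new_title))
        record.title = new_title

    def merge(self, duplicate_id, target_id):
        """Keep the duplicate's identifier as a redirect to the target."""
        duplicate = self.resolve(duplicate_id)
        target = self.resolve(target_id)
        if duplicate == target:
            return  # already the same record, nothing to do
        self.records[duplicate].merged_into = target
        self.history.append(("merge", duplicate, target))


af = AuthorityFile()
af.create("w1", "Adventures of Huckleberry Finn")
af.create("w2", "Huckleberry Finn")
af.merge("w2", "w1")  # "w2" now redirects to "w1"
af.rename("w2", "The Adventures of Huckleberry Finn")  # applies to "w1"
print(af.resolve("w2"))
```

 Keeping old identifiers as redirects means that existing links to a
 merged entry keep working, which is the property a universal
 identifier for a work would need.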

 Every book ever published means more than 10 million records.
 (It probably means more than 100 million records.) OCR cleaning
 attracts hundreds or a few thousand volunteers, which is
 sufficient to take on thousands of books, but not millions.

 Focusing efforts on notable works with verifiable OCR, and using the
 sorts of helper tools that Greg's paper describes, I do not doubt that
 we could effectively clean and publish OCR for all primary sources
 that are actively used and referenced in scholarship today (and more
 besides).  Though 'we' here is the world - certainly more than a few
 thousand volunteers have at least one book they would like to polish.
 Most of them are not currently Wikimedia contributors, that much is
 certain -- we don't provide any tools to make this work convenient or
 rewarding.

 Google scanned millions of books already, but I haven't heard of
 any plans for cleaning all that OCR text.

 Well, Google does not believe in distributed human effort.  (This came
 up in a recent Knol thread as well.)  I'm not sure that is the best
 comparison.

 SJ 
-- 
http://www.non-violence.org/ | Collaborative site on non-violence
http://www.forget-me.net/ | Alternatives on the Net
http://fr.wikisource.org/ | Free library
http://wikilivres.info | Free documents
