DGG, I appreciate your points. Would we be so motivated by this
thread if it weren't a complex problem?
The fact that all of this is quite new, and that there are so many
unknowns and gray areas, actually makes me consider it more likely
that a body of wikimedians, experienced with their own form of
large-scale authority file coordination, is in a position to say
something meaningful about how to achieve something similar for tens
of millions of metadata records.
OL rather than Wikimedia has the advantage that more of the people there understand the problems.
In some areas that is certainly so. In others, Wikimedia communities
have useful recent experience. I hope that those who understand these
problems on both sides recognize the importance of sharing what they
know openly -- and showing others how to understand them as well. We
will not succeed as a global community if we say that this class of
problems can only be solved by the limited group of people with an MLS
and a few years of focused training. (How would you name the sort of
training you mean here, btw?)
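To make the authority-file idea concrete: here is a minimal, hypothetical sketch (in Python; none of these names reflect OL's or Wikipedia's actual data models) of an authority file whose identifiers survive renames and merges, with merged identifiers left behind as redirects so that old links keep resolving, and every change kept in an append-only history.

```python
import uuid

class AuthorityFile:
    """A toy authority file: stable IDs, merge redirects, full history."""

    def __init__(self):
        self.records = {}    # id -> current record dict
        self.redirects = {}  # merged-away id -> surviving id
        self.history = []    # append-only log of changes

    def create(self, title):
        rid = uuid.uuid4().hex
        self.records[rid] = {"title": title}
        self.history.append(("create", rid, title))
        return rid

    def resolve(self, rid):
        # Follow any redirect chain left behind by earlier merges,
        # so links made before a merge still reach the live record.
        while rid in self.redirects:
            rid = self.redirects[rid]
        return rid

    def rename(self, rid, new_title):
        rid = self.resolve(rid)
        self.history.append(("rename", rid, self.records[rid]["title"]))
        self.records[rid]["title"] = new_title

    def merge(self, loser, winner):
        # The losing ID becomes a redirect; nothing is ever deleted
        # from the history, so the merge can be audited or undone.
        loser, winner = self.resolve(loser), self.resolve(winner)
        self.redirects[loser] = winner
        del self.records[loser]
        self.history.append(("merge", loser, winner))
        return winner
```

The point of the sketch is only that "wiki-like" properties (renames, merges, history) cost little machinery; the hard part, as this thread discusses, is the human judgment about which records describe the same work.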
SJ
On Thu, Aug 13, 2009 at 12:57 AM, David Goodman<dgoodmanny(a)gmail.com> wrote:
> Yann & Sam
>
> The problem is extraordinarily complex. A database of all "books"
> (and other media) ever published is beyond the joint capabilities of
> everyone interested. There are intermediate entities between "books"
> and "works", and important subordinate entities, such as "article",
> "chapter", and those like "poem" which could be at any of several
> levels. This is not a job for amateurs, unless they are prepared to
> first learn the actual standards of bibliographic description for
> different types of material, and to at least recognize the
> inter-relationships, and the many undefined areas. At research
> libraries, one allows a few years of training for a newcomer with just
> an MLS degree to work with a small subset of this. I have thirty years
> of experience in related areas of librarianship, and I know only
> enough to be aware of the problems.
> For an introduction to the current state of this, see
> http://www.rdaonline.org/constituencyreview/Phase1Chp17_11_2_08.pdf
>
> The difficulty of merging the many thousands of partially correct and
> incorrect sources of available data typically requires the manual
> resolution of each of the tens of millions of instances.
>
> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.
>
> David Goodman, Ph.D, M.L.S.
> http://en.wikipedia.org/wiki/User_talk:DGG
>
>
>
> On Wed, Aug 12, 2009 at 1:15 PM, Yann Forget <yann(a)forget-me.net> wrote:
>> Hello,
>>
>> This discussion is very interesting. I would like to make a summary, so
>> that we can go further.
>>
>> 1. A database of all books ever published is one of the things still missing.
>> 2. This needs massive collaboration by thousands of volunteers, so a
>> wiki might be appropriate, however...
>> 3. The data needs a structured web site, not a plain wiki like MediaWiki.
>> 4. A big part of this data is already available, but scattered on
>> various databases, in various languages, with various protocols, etc. So
>> a big part of work needs as much database management knowledge as
>> librarian knowledge.
>> 5. What is most missing in these existing databases (IMO) is information
>> about translations: nowhere is there a general database of translated
>> works, at least not in English and French. It is very difficult to find
>> if a translation exists for a given work. Wikisource has some of this
>> information with interwiki links between work and author pages, but for
>> a (very) small number of works and authors.
>> 6. It would be best not to duplicate work on several places.
>>
>> Personally I don't find OL very practical. Maybe I am too used to
>> MediaWiki. ;oD
>>
>> We still need to create something, attractive to contributors and
>> readers alike.
>>
>> Yann
>>
>> Samuel Klein wrote:
>>>> This thread started out with a discussion of why it is so hard to
>>>> start new projects within the Wikimedia Foundation. My stance is
>>>> that projects like OpenStreetMap.org and OpenLibrary.org are doing
>>>> fine as they are, and there is no need to duplicate their effort
>>>> within the WMF. The example you gave was this:
>>>
>>> I agree that there's no point in duplicating existing functionality.
>>> The best solution is probably for OL to include this explicitly in
>>> their scope and add the necessary functionality. I suggested this on
>>> the OL mailing list in March.
>>>
>>> http://mail.archive.org/pipermail/ol-discuss/2009-March/000391.html
>>>
>>>>>>>>> *A wiki for book metadata, with an entry for every published
>>>>>>>>> work, statistics about its use and siblings, and discussion
>>>>>>>>> about its usefulness as a citation (a collaboration with
>>>>>>>>> OpenLibrary, merging WikiCite ideas)
>>>> To me, that sounds exactly like what OpenLibrary already does (or
>>>> could be doing in the near term), so why even set up a new project
>>>> that would collaborate with it? Later you added:
>>>
>>> However, this is not what OL or its wiki do now. And OL is not run by
>>> its community, the community helps support the work of a centrally
>>> directed group. So there is only so much I feel I can contribute to
>>> the project by making suggestions. The wiki built into the fiber of
>>> OL is intentionally not used for general discussion.
>>>
>>>> I was talking about the metadata for all books ever published,
>>>> including the Swedish translations of Mark Twain's works, which
>>>> are part of Mark Twain's bibliography, of the translator's
>>>> bibliography, of American literature, and of Swedish language
>>>> literature. In OpenLibrary all of these are contained in one
>>>> project. In Wikisource, they are split in one section for English
>>>> and another section for Swedish. That division makes sense for
>>>> the contents of the book, but not for the book metadata.
>>>
>>> This is a problem that Wikisource needs to address, regardless of
>>> where the OpenLibrary metadata goes. It is similar to the Wiktionary
>>> problem of wanting some content - the array of translations of a
>>> single definition - to exist in one place and be transcluded in each
>>> language.
>>>
>>>> Now you write:
>>>>
>>>>> However, the project I have in mind for OCR cleaning and
>>>>> translation needs to
>>>> That is a change of subject. That sounds just like what Wikisource
>>>> (or PGDP.net) is about. OCR cleaning is one thing, but it is an
>>>> entirely different thing to set up "a wiki for book metadata, with
>>>> an entry for every published work". So which of these two project
>>>> ideas are we talking about?
>>>
>>> They are closely related.
>>>
>>> There needs to be a global authority file for works -- a [set of]
>>> universal identifier[s] for a given work in order for Wikisource (as
>>> it currently stands) to link the German translation of the English
>>> transcription of OCR of the 1998 photos of the 1572 Rotterdam Codex...
>>> to its metadata entry [or entries].
>>>
>>> I would prefer for this authority file to be wiki-like, as the
>>> Wikipedia authority file is, so that it supports renames, merges, and
>>> splits with version history and minimal overhead; hence I wish to see
>>> a wiki for this sort of metadata.
>>>
>>> Currently OL does not quite provide this authority file, but it could.
>>> I do not know how easily.
>>>
>>>> Every book ever published means more than 10 million records.
>>>> (It probably means more than 100 million records.) OCR cleaning
>>>> attracts hundreds or a few thousand volunteers, which is
>>>> sufficient to take on thousands of books, but not millions.
>>>
>>> Focusing efforts on notable works with verifiable OCR, and using the
>>> sorts of helper tools that Greg's paper describes, I do not doubt that
>>> we could effectively clean and publish OCR for all primary sources
>>> that are actively used and referenced in scholarship today (and more
>>> besides). Though 'we' here is the world - certainly more than a few
>>> thousand volunteers have at least one book they would like to polish.
>>> Most of them are not currently Wikimedia contributors, that much is
>>> certain -- we don't provide any tools to make this work convenient or
>>> rewarding.
>>>
>>>> Google scanned millions of books already, but I haven't heard of
>>>> any plans for cleaning all that OCR text.
>>>
>>> Well, Google does not believe in distributed human effort. (This came
>>> up in a recent Knol thread as well.) I'm not sure that is the best
>>> comparison.
>>>
>>> SJ
>>
>> --
>> http://www.non-violence.org/ | Collaborative site on non-violence
>> http://www.forget-me.net/ | Alternatives on the Net
>> http://fr.wikisource.org/ | Free library
>> http://wikilivres.info | Free documents
>>
>> _______________________________________________
>> foundation-l mailing list
>> foundation-l(a)lists.wikimedia.org
>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
>>
>