[Foundation-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

David Goodman dgoodmanny at gmail.com
Fri Aug 14 21:23:53 UTC 2009


The training is typically an apprenticeship under the senior
cataloging librarians.

David Goodman, Ph.D, M.L.S.
http://en.wikipedia.org/wiki/User_talk:DGG



On Thu, Aug 13, 2009 at 1:48 AM, Samuel Klein<meta.sj at gmail.com> wrote:
> DGG, I appreciate your points.  Would we be so motivated by this
> thread if it weren't a complex problem?
>
> The fact that all of this is quite new, and that there are so many
> unknowns and gray areas, actually makes me consider it more likely
> that a body of wikimedians, experienced with their own form of
> large-scale authority file coordination, is in a position to say
> something meaningful about how to achieve something similar for tens
> of millions of metadata records.
>
>> OL rather than Wikimedia has the advantage that more of the people
>> there understand the problems.
>
> In some areas that is certainly so.  In others, Wikimedia communities
> have useful recent experience.  I hope that those who understand these
> problems on both sides recognize the importance of sharing what they
> know openly -- and showing others how to understand them as well.  We
> will not succeed as a global community if we say that this class of
> problems can only be solved by the limited group of people with an MLS
> and a few years of focused training.  (How would you name the sort of
> training you mean here, btw?)
>
> SJ
>
>
> On Thu, Aug 13, 2009 at 12:57 AM, David Goodman<dgoodmanny at gmail.com> wrote:
>> Yann & Sam
>>
>> The problem is extraordinarily complex. A database of all "books"
>> (and other media) ever published is beyond the joint capabilities of
>> everyone interested. There are intermediate entities between "books"
>> and "works", and important subordinate entities, such as "article",
>> "chapter", and those like "poem" which could be at any of several
>> levels.  This is not a job for amateurs, unless they are prepared to
>> first learn the actual standards of bibliographic description for
>> different types of material, and to at least recognize the
>> inter-relationships, and the many undefined areas. At research
>> libraries, one allows a few years of training for a newcomer with just
>> an MLS degree to work with a small subset of this. I have thirty years
>> of experience in related areas of librarianship, and I know only
>> enough to be aware of the problems.
>> For an introduction to the current state of this, see
>> http://www.rdaonline.org/constituencyreview/Phase1Chp17_11_2_08.pdf.
>>
>> The difficulty of merging the many thousands of partially correct and
>> incorrect sources of available data typically requires the manual
>> resolution of each of the tens of millions of instances.
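To make the shape of that merging problem concrete, here is a toy sketch of pairwise record matching. The field names, weights, and thresholds are illustrative assumptions of mine, not any real Open Library or MARC workflow; the point is the ambiguous middle band of scores, where each candidate pair needs a human decision.

```python
# Toy sketch: fuzzy matching of bibliographic records, illustrating why
# merging many partial sources ends in manual review.  Field names,
# weights, and thresholds are illustrative assumptions only.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_records(rec_a: dict, rec_b: dict, threshold: float = 0.9) -> str:
    """Classify a pair of records as 'merge', 'review', or 'distinct'.

    Pairs in the middle band are exactly the instances that require
    manual resolution.
    """
    title_sim = similarity(rec_a["title"], rec_b["title"])
    author_sim = similarity(rec_a["author"], rec_b["author"])
    score = 0.7 * title_sim + 0.3 * author_sim
    if score >= threshold:
        return "merge"
    if score >= 0.6:
        return "review"   # ambiguous: human judgment needed
    return "distinct"

a = {"title": "Adventures of Huckleberry Finn", "author": "Twain, Mark"}
b = {"title": "The Adventures of Huckleberry Finn", "author": "Mark Twain"}
print(match_records(a, b))  # 'review' -- near-identical titles, but the
                            # inverted author form drags the score down
```

Scale the "review" band up by tens of millions of record pairs, across inconsistent cataloging conventions, and the size of the task becomes clear.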
>>
>> OL rather than Wikimedia has the advantage that more of the people
>> there understand the problems.
>>
>> David Goodman, Ph.D, M.L.S.
>> http://en.wikipedia.org/wiki/User_talk:DGG
>>
>>
>>
>> On Wed, Aug 12, 2009 at 1:15 PM, Yann <yann at forget-me.net> wrote:
>>> Hello,
>>>
>>> This discussion is very interesting. I would like to make a summary, so
>>> that we can go further.
>>>
>>> 1. A database of all books ever published is one of the things still missing.
>>> 2. This needs massive collaboration by thousands of volunteers, so a
>>> wiki might be appropriate, however...
>>> 3. The data needs a structured web site, not a plain wiki like MediaWiki.
>>> 4. A big part of this data is already available, but scattered on
>>> various databases, in various languages, with various protocols, etc. So
>>> a big part of work needs as much database management knowledge as
>>> librarian knowledge.
>>> 5. What is most missing from these existing databases (IMO) is
>>> information about translations: nowhere is there a general database of
>>> translated works, at least not in English or French. It is very
>>> difficult to find out whether a translation exists for a given work.
>>> Wikisource has some of this information in the interwiki links between
>>> work and author pages, but only for a (very) small number of works and
>>> authors.
>>> 6. It would be best not to duplicate work on several places.
>>>
>>> Personally I don't find OL very practical. Maybe I am too used to
>>> MediaWiki. ;oD
>>>
>>> We still need to create something attractive to contributors and
>>> readers alike.
>>>
>>> Yann
>>>
>>> Samuel Klein wrote:
>>>>> This thread started out with a discussion of why it is so hard to
>>>>> start new projects within the Wikimedia Foundation.  My stance is
>>>>> that projects like OpenStreetMap.org and OpenLibrary.org are doing
>>>>> fine as they are, and there is no need to duplicate their effort
>>>>> within the WMF.  The example you gave was this:
>>>>
>>>> I agree that there's no point in duplicating existing functionality.
>>>> The best solution is probably for OL to include this explicitly in
>>>> their scope and add the necessary functionality.   I suggested this on
>>>> the OL mailing list in March.
>>>>    http://mail.archive.org/pipermail/ol-discuss/2009-March/000391.html
>>>>
>>>>>>>>>> *A wiki for book metadata, with an entry for every published
>>>>>>>>>> work, statistics about its use and siblings, and discussion
>>>>>>>>>> about its usefulness as a citation (a collaboration with
>>>>>>>>>> OpenLibrary, merging WikiCite ideas)
>>>>> To me, that sounds exactly like what OpenLibrary already does (or
>>>>> could be doing in the near term), so why even set up a new project
>>>>> that would collaborate with it?  Later you added:
>>>>
>>>> However, this is not what OL or its wiki do now.  And OL is not run by
>>>> its community; the community helps support the work of a centrally
>>>> directed group.  So there is only so much I feel I can contribute to
>>>> the project by making suggestions.  The wiki built into the fiber of
>>>> OL is intentionally not used for general discussion.
>>>>
>>>>> I was talking about the metadata for all books ever published,
>>>>> including the Swedish translations of Mark Twain's works, which
>>>>> are part of Mark Twain's bibliography, of the translator's
>>>>> bibliography, of American literature, and of Swedish language
>>>>> literature.  In OpenLibrary all of these are contained in one
>>>>> project.  In Wikisource, they are split in one section for English
>>>>> and another section for Swedish.  That division makes sense for
>>>>> the contents of the book, but not for the book metadata.
>>>>
>>>> This is a problem that Wikisource needs to address, regardless of
>>>> where the OpenLibrary metadata goes.  It is similar to the Wiktionary
>>>> problem of wanting some content - the array of translations of a
>>>> single definition - to exist in one place and be transcluded in each
>>>> language.
>>>>
>>>>> Now you write:
>>>>>
>>>>>> However, the project I have in mind for OCR cleaning and
>>>>>> translation needs to
>>>>> That is a change of subject. That sounds just like what Wikisource
>>>>> (or PGDP.net) is about.  OCR cleaning is one thing, but it is an
>>>>> entirely different thing to set up "a wiki for book metadata, with
>>>>> an entry for every published work".  So which of these two project
>>>>> ideas are we talking about?
>>>>
>>>> They are closely related.
>>>>
>>>> There needs to be a global authority file for works -- a [set of]
>>>> universal identifier[s] for a given work in order for wikisource (as
>>>> it currently stands) to link the German translation of the English
>>>> transcription of OCR of the 1998 photos of the 1572 Rotterdam Codex...
>>>> to its metadata entry [or entries].
>>>>
>>>> I would prefer for this authority file to be wiki-like, as the
>>>> Wikipedia authority file is, so that it supports renames, merges, and
>>>> splits with version history and minimal overhead; hence I wish to see
>>>> a wiki for this sort of metadata.
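A minimal sketch of what such a wiki-like authority file could look like, assuming an append-only event log; the identifiers, event names, and methods here are my own illustration, not OL's or Wikipedia's actual mechanism. The key property is that merges leave redirects behind, so stale identifiers keep resolving, and every rename and merge remains recoverable from history.

```python
# Toy sketch of a wiki-like authority file for works: identifiers survive
# renames and merges because every change is appended to a history and
# merged entries leave redirects.  All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class AuthorityFile:
    labels: dict = field(default_factory=dict)     # work id -> current label
    redirects: dict = field(default_factory=dict)  # retired id -> surviving id
    history: list = field(default_factory=list)    # append-only event log

    def create(self, work_id: str, label: str) -> None:
        self.labels[work_id] = label
        self.history.append(("create", work_id, label))

    def rename(self, work_id: str, new_label: str) -> None:
        self.history.append(("rename", work_id, self.labels[work_id], new_label))
        self.labels[work_id] = new_label

    def merge(self, dup_id: str, into_id: str) -> None:
        """Retire a duplicate entry; old links keep resolving via redirect."""
        self.redirects[dup_id] = into_id
        self.labels.pop(dup_id, None)
        self.history.append(("merge", dup_id, into_id))

    def resolve(self, work_id: str) -> str:
        """Follow redirects so a stale identifier still reaches its record."""
        while work_id in self.redirects:
            work_id = self.redirects[work_id]
        return work_id

af = AuthorityFile()
af.create("W1", "1572 Rotterdam Codex")
af.create("W2", "Rotterdam Codex (duplicate entry)")
af.merge("W2", "W1")
print(af.resolve("W2"))  # prints "W1": the old identifier still resolves
```

A split would be the inverse event: two new identifiers created, with the old one redirecting to a disambiguation record. The version history is what makes all three operations low-overhead, which is exactly what a static authority file lacks.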
>>>>
>>>> Currently OL does not quite provide this authority file, but it could.
>>>> I do not know how easily.
>>>>
>>>>> Every book ever published means more than 10 million records.
>>>>> (It probably means more than 100 million records.) OCR cleaning
>>>>> attracts hundreds or a few thousand volunteers, which is
>>>>> sufficient to take on thousands of books, but not millions.
>>>>
>>>> Focusing efforts on notable works with verifiable OCR, and using the
>>>> sorts of helper tools that Greg's paper describes, I do not doubt that
>>>> we could effectively clean and publish OCR for all primary sources
>>>> that are actively used and referenced in scholarship today (and more
>>>> besides).  Though 'we' here is the world - certainly more than a few
>>>> thousand volunteers have at least one book they would like to polish.
>>>> Most of them are not currently Wikimedia contributors, that much is
>>>> certain -- we don't provide any tools to make this work convenient or
>>>> rewarding.
>>>>
>>>>> Google scanned millions of books already, but I haven't heard of
>>>>> any plans for cleaning all that OCR text.
>>>>
>>>> Well, Google does not believe in distributed human effort.  (This came
>>>> up in a recent Knol thread as well.)  I'm not sure that is the best
>>>> comparison.
>>>>
>>>> SJ
>>>
>>> --
>>> http://www.non-violence.org/ | Collaborative site on non-violence
>>> http://www.forget-me.net/ | Alternatives on the Net
>>> http://fr.wikisource.org/ | Free library
>>> http://wikilivres.info | Free documents
>>>
>>> _______________________________________________
>>> foundation-l mailing list
>>> foundation-l at lists.wikimedia.org
>>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
>>>
>>
>


