[Foundation-l] Open Library, Wikisource, and cleaning and translating OCR of Classics

Samuel Klein meta.sj at gmail.com
Thu Aug 13 05:48:37 UTC 2009


DGG, I appreciate your points.  Would we be so motivated by this
thread if it weren't a complex problem?

The fact that all of this is quite new, and that there are so many
unknowns and gray areas, actually makes me consider it more likely
that a body of Wikimedians, experienced with their own form of
large-scale authority file coordination, is in a position to say
something meaningful about how to achieve something similar for tens
of millions of metadata records.

> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.

In some areas that is certainly so.  In others, Wikimedia communities
have useful recent experience.  I hope that those who understand these
problems on both sides recognize the importance of sharing what they
know openly -- and showing others how to understand them as well.  We
will not succeed as a global community if we say that this class of
problems can only be solved by the limited group of people with an MLS
and a few years of focused training.  (How would you name the sort of
training you mean here, btw?)

SJ


On Thu, Aug 13, 2009 at 12:57 AM, David Goodman<dgoodmanny at gmail.com> wrote:
> Yann & Sam
>
> The problem is extraordinarily complex. A database of all "books"
> (and other media) ever published is beyond the joint capabilities of
> everyone interested. There are intermediate entities between "books"
> and "works", and important subordinate entities, such as "article",
> "chapter", and those like "poem" which could be at any of several
> levels.  This is not a job for amateurs, unless they are prepared to
> first learn the actual standards of bibliographic description for
> different types of material, and to at least recognize the
> inter-relationships, and the many undefined areas. At research
> libraries, one allows a few years of training for a newcomer with just
> an MLS degree to work with a small subset of this. I have thirty years
> of experience in related areas of librarianship, and I know only
> enough to be aware of the problems.
> For an introduction to the current state of this, see
> http://www.rdaonline.org/constituencyreview/Phase1Chp17_11_2_08.pdf.
>
> Merging the many thousands of partially correct and incorrect
> sources of available data typically requires the manual resolution
> of each of the tens of millions of instances.
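
(To make that last point vivid, here is a toy sketch in Python --
made-up records and arbitrary thresholds, not anyone's actual matching
pipeline -- of the kind of pairwise comparison any merge involves.
Even the two entries that plainly describe the same book only land in
the "flag for human review" band, which is why so much of this work
ends up being manual.)

    import difflib

    def sim(a, b):
        """Crude string similarity between two field values."""
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Three made-up catalogue entries that may or may not be the same book.
    sources = [
        {"title": "The Adventures of Tom Sawyer",  "author": "Twain, Mark"},
        {"title": "Adventures of Tom Sawyer, The", "author": "Mark Twain"},
        {"title": "Tom Sawyer",                    "author": "Twain, M."},
    ]

    for i in range(len(sources)):
        for j in range(i + 1, len(sources)):
            a, b = sources[i], sources[j]
            score = (sim(a["title"], b["title"]) +
                     sim(a["author"], b["author"])) / 2
            if score > 0.95:
                verdict = "merge automatically"
            elif score > 0.6:
                verdict = "flag for human review"   # the enormous bucket
            else:
                verdict = "keep separate"
            print(a["title"], "/", b["title"], "->",
                  round(score, 2), verdict)
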
>
> OL rather than Wikimedia has the advantage that more of the people
> there understand the problems.
>
> David Goodman, Ph.D, M.L.S.
> http://en.wikipedia.org/wiki/User_talk:DGG
>
>
>
> On Wed, Aug 12, 2009 at 1:15 PM, c<yann at forget-me.net> wrote:
>> Hello,
>>
>> This discussion is very interesting. I would like to make a summary, so
>> that we can go further.
>>
>> 1. A database of all books ever published is one of the things still missing.
>> 2. This needs massive collaboration by thousands of volunteers, so a
>> wiki might be appropriate, however...
>> 3. The data needs a structured web site, not a plain wiki like MediaWiki.
>> 4. A big part of this data is already available, but scattered across
>> various databases, in various languages, with various protocols, etc. So
>> a big part of the work needs as much database management knowledge as
>> librarian knowledge.
>> 5. What is most missing from these existing databases (IMO) is
>> information about translations: there is no general database of
>> translated works anywhere, at least not in English or French. It is very
>> difficult to find out whether a translation exists for a given work.
>> Wikisource has some of this information as interwiki links between work
>> and author pages, but only for a (very) small number of works and authors.
>> 6. It would be best not to duplicate work on several places.
>>
>> Personally I don't find OL very practical. Maybe I am too used to
>> MediaWiki. ;oD
>>
>> We still need to create something attractive to contributors and
>> readers alike.
>>
>> Yann
>>
>> Samuel Klein wrote:
>>>> This thread started out with a discussion of why it is so hard to
>>>> start new projects within the Wikimedia Foundation.  My stance is
>>>> that projects like OpenStreetMap.org and OpenLibrary.org are doing
>>>> fine as they are, and there is no need to duplicate their effort
>>>> within the WMF.  The example you gave was this:
>>>
>>> I agree that there's no point in duplicating existing functionality.
>>> The best solution is probably for OL to include this explicitly in
>>> their scope and add the necessary functionality.   I suggested this on
>>> the OL mailing list in March.
>>>    http://mail.archive.org/pipermail/ol-discuss/2009-March/000391.html
>>>
>>>>>>>>> *A wiki for book metadata, with an entry for every published
>>>>>>>>> work, statistics about its use and siblings, and discussion
>>>>>>>>> about its usefulness as a citation (a collaboration with
>>>>>>>>> OpenLibrary, merging WikiCite ideas)
>>>> To me, that sounds exactly like what OpenLibrary already does (or
>>>> could be doing in the near future), so why even set up a new project
>>>> that would collaborate with it?  Later you added:
>>>
>>> However, this is not what OL or its wiki do now.  And OL is not run by
>>> its community; the community helps support the work of a centrally
>>> directed group.  So there is only so much I feel I can contribute to
>>> the project by making suggestions.  The wiki built into the fiber of
>>> OL is intentionally not used for general discussion.
>>>
>>>> I was talking about the metadata for all books ever published,
>>>> including the Swedish translations of Mark Twain's works, which
>>>> are part of Mark Twain's bibliography, of the translator's
>>>> bibliography, of American literature, and of Swedish language
>>>> literature.  In OpenLibrary all of these are contained in one
>>>> project.  In Wikisource, they are split into one section for English
>>>> and another section for Swedish.  That division makes sense for
>>>> the contents of the book, but not for the book metadata.
>>>
>>> This is a problem that Wikisource needs to address, regardless of
>>> where the OpenLibrary metadata goes.  It is similar to the Wiktionary
>>> problem of wanting some content - the array of translations of a
>>> single definition - to exist in one place and be transcluded in each
>>> language.
>>>
>>>> Now you write:
>>>>
>>>>> However, the project I have in mind for OCR cleaning and
>>>>> translation needs to
>>>> That is a change of subject. That sounds just like what Wikisource
>>>> (or PGDP.net) is about.  OCR cleaning is one thing, but it is an
>>>> entirely different thing to set up "a wiki for book metadata, with
>>>> an entry for every published work".  So which of these two project
>>>> ideas are we talking about?
>>>
>>> They are closely related.
>>>
>>> There needs to be a global authority file for works -- a [set of]
>>> universal identifier[s] for a given work -- in order for Wikisource
>>> (as it currently stands) to link the German translation of the English
>>> transcription of OCR of the 1998 photos of the 1572 Rotterdam Codex...
>>> to its metadata entry [or entries].
>>>
>>> I would prefer for this authority file to be wiki-like, as the
>>> Wikipedia authority file is, so that it supports renames, merges, and
>>> splits with version history and minimal overhead; hence I wish to see
>>> a wiki for this sort of metadata.
>>>
>>> Currently OL does not quite provide this authority file, but it could.
>>>  I do not know how easily.
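
(To make the above concrete: a rough sketch, in Python, of the kind of
work-level record such an authority file might hold.  The field names
are purely illustrative -- this is not an existing OL or Wikisource
schema -- but it shows how a stable identifier can survive renames and
merges while editions and translations keep pointing at one entry.)

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class WorkRecord:
        """One entry in a hypothetical work-level authority file."""
        work_id: str                  # stable identifier; survives renames
        title: str
        creators: List[str] = field(default_factory=list)
        edition_ids: List[str] = field(default_factory=list)      # scans, printings, OCR texts
        translation_ids: List[str] = field(default_factory=list)  # work_ids of translations
        merged_into: Optional[str] = None   # set when a duplicate is merged away

    def resolve(records: Dict[str, WorkRecord], work_id: str) -> WorkRecord:
        """Follow merge redirects until the canonical record is reached."""
        record = records[work_id]
        while record.merged_into is not None:
            record = records[record.merged_into]
        return record

    # A duplicate entry for the 1572 Rotterdam Codex mentioned above is
    # merged into the canonical one; old links still resolve correctly.
    canonical = WorkRecord(work_id="W1", title="Rotterdam Codex (1572)")
    duplicate = WorkRecord(work_id="W2", title="Codex, Rotterdam", merged_into="W1")
    records = {"W1": canonical, "W2": duplicate}
    assert resolve(records, "W2") is canonical
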
>>>
>>>> Every book ever published means more than 10 million records.
>>>> (It probably means more than 100 million records.) OCR cleaning
>>>> attracts hundreds or a few thousand volunteers, which is
>>>> sufficient to take on thousands of books, but not millions.
>>>
>>> Focusing efforts on notable works with verifiable OCR, and using the
>>> sorts of helper tools that Greg's paper describes, I do not doubt that
>>> we could effectively clean and publish OCR for all primary sources
>>> that are actively used and referenced in scholarship today (and more
>>> besides).  Though 'we' here is the world -- certainly more than a few
>>> thousand volunteers have at least one book they would like to polish.
>>> Most of them are not currently Wikimedia contributors, that much is
>>> certain -- we don't provide any tools to make this work convenient or
>>> rewarding.
>>>
>>>> Google scanned millions of books already, but I haven't heard of
>>>> any plans for cleaning all that OCR text.
>>>
>>> Well, Google does not believe in distributed human effort.  (This came
>>> up in a recent Knol thread as well.)  I'm not sure that is the best
>>> comparison.
>>>
>>> SJ
>>
>> --
>> http://www.non-violence.org/ | Collaborative site on non-violence
>> http://www.forget-me.net/ | Alternatives on the Net
>> http://fr.wikisource.org/ | Free library
>> http://wikilivres.info | Free documents
>>


