Lars,
I think we agree on what needs to happen. The only thing I am not sure of is where you would like to see the work take place. I have raised versions of this issue on the Open Library list, which I copy again here (along with the people I know who work on that fine project - hello, Peter and Rebecca). This is why I listed it below as a good group to collaborate with.
However, the project I have in mind for OCR cleaning and translation needs to:
- accept public comments and annotation about the substance or use of a work (the wiki covering their millions of metadata entries is very low traffic and is used mainly to address metadata issues in their records)
- handle OCR as editable content, or translations of same
- provide a universal ID for a work, with which comments and translations can be associated (see https://blueprints.launchpad.net/openlibrary/+spec/global-work-ids)
- handle citations, with the possibility of developing something like WikiCite
Let's take a practical example. A classics professor I know (Greg Crane, copied here) has scans of primary source materials, some with approximate or hand-polished OCR, waiting to be uploaded and converted into a useful online resource for editors, translators, and classicists around the world.
Where should he and his students post that material?
Wherever they end up, the primary article about each work would surely link out to the OL and WS pages for each work (where one exists).
(Plus you would have to motivate why a copy of OpenLibrary should go into the English Wikisource and not the German or French one.)
I don't understand what you mean -- English source materials and metadata go on en:ws, German on de:ws, &c. How is this different from what happens today?
SJ
On Mon, Aug 3, 2009 at 1:18 PM, Lars Aronsson <lars@aronsson.se> wrote:
Samuel Klein wrote (in two messages):
*A wiki for book metadata, with an entry for every published work, statistics about its use and siblings, and discussion about its usefulness as a citation (a collaboration with OpenLibrary, merging WikiCite ideas)
I could see this happening on Wikisource.
Why could you not see this happening within the existing OpenLibrary? Is there anything wrong with that project? It sounds to me as if you would just copy (fork) all their book data, but for what gain?
(Plus you would have to motivate why a copy of OpenLibrary should go into the English Wikisource and not the German or French one.)
-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
Samuel Klein, 11/08/2009 07:00:
Let's take a practical example. A classics professor I know (Greg Crane, copied here) has scans of primary source materials, some with approximate or hand-polished OCR, waiting to be uploaded and converted into a useful online resource for editors, translators, and classicists around the world.
Where should he and his students post that material?
Slovene Wikisource did something similar: http://meta.wikimedia.org/wiki/Slovene_student_projects_in_Wikipedia_and_Wik...
Nemo
On Tue, Aug 11, 2009 at 3:00 PM, Samuel Klein <meta.sj@gmail.com> wrote:
... Let's take a practical example. A classics professor I know (Greg Crane, copied here) has scans of primary source materials, some with approximate or hand-polished OCR, waiting to be uploaded and converted into a useful online resource for editors, translators, and classicists around the world.
Where should he and his students post that material?
I am a bit confused. Are these texts currently hosted at the Perseus Digital Library?
If so, they are already a useful online resource. ;-)
If they would like to see these primary sources pushed into the Wikimedia community, they would need to upload the images (or DjVu) onto Commons, and the text onto Wikisource where the distributed proofreading software resides.
We can work with them to import a few texts in order to demonstrate our technology and preferred methods, and then they can decide whether they are happy with this technology, the community, and the potential for translations and commentary.
I made a start on creating a Perseus-to-Wikisource importer about a year ago...!
Or they can upload the DjVu to the Internet Archive... or a similar depository... and see where it goes from there.
Wherever they end up, the primary article about each work would surely link out to the OL and WS pages for each work (where one exists).
Wikisource has been adding OCLC numbers to pages, and adding links to archive.org when the djvu files came from there (these links contain an archive.org identifier). There are also links to LibraryThing and Open Library; we have very few rules ;-)
-- John Vandenberg
Samuel Klein wrote:
I think we agree on what needs to happen. The only thing I am not sure of is where you would like to see the work take place.
I'm not so sure we agree. I think we're talking about two different things.
This thread started out with a discussion of why it is so hard to start new projects within the Wikimedia Foundation. My stance is that projects like OpenStreetMap.org and OpenLibrary.org are doing fine as they are, and there is no need to duplicate their effort within the WMF. The example you gave was this:
*A wiki for book metadata, with an entry for every published work, statistics about its use and siblings, and discussion about its usefulness as a citation (a collaboration with OpenLibrary, merging WikiCite ideas)
To me, that sounds exactly like what OpenLibrary already does (or could be doing in the near term), so why even set up a new project that would collaborate with it? Later you added:
I could see this happening on Wikisource.
That's when I asked why this couldn't be done inside OpenLibrary.
I added:
(Plus you would have to motivate why a copy of OpenLibrary should go into the English Wikisource and not the German or French one.)
You replied:
I don't understand what you mean -- English source materials and metadata go on en:ws, German on de:ws, &c. How is this different from what happens today?
I was talking about the metadata for all books ever published, including the Swedish translations of Mark Twain's works, which are part of Mark Twain's bibliography, of the translator's bibliography, of American literature, and of Swedish language literature. In OpenLibrary all of these are contained in one project. In Wikisource, they are split in one section for English and another section for Swedish. That division makes sense for the contents of the book, but not for the book metadata.
Now you write:
However, the project I have in mind for OCR cleaning and translation needs to
That is a change of subject. That sounds just like what Wikisource (or PGDP.net) is about. OCR cleaning is one thing, but it is an entirely different thing to set up "a wiki for book metadata, with an entry for every published work". So which of these two project ideas are we talking about?
Every book ever published means more than 10 million records. (It probably means more than 100 million records.) OCR cleaning attracts hundreds or a few thousand volunteers, which is sufficient to take on thousands of books, but not millions.
Google scanned millions of books already, but I haven't heard of any plans for cleaning all that OCR text.
Let's take a practical example. A classics professor I know (Greg Crane, copied here) has scans of primary source materials, some with approximate or hand-polished OCR, waiting to be uploaded and converted into a useful online resource for editors, translators, and classicists around the world.
Where should he and his students post that material?
On Wikisource. What's stopping them?
On Tue, Aug 11, 2009 at 9:16 PM, Lars Aronsson <lars@aronsson.se> wrote:
Let's take a practical example. A classics professor I know (Greg Crane, copied here) has scans of primary source materials, some with approximate or hand-polished OCR, waiting to be uploaded and converted into a useful online resource for editors, translators, and classicists around the world.
Where should he and his students post that material?
On Wikisource. What's stopping them?
Greg: does Wikisource seem like the right place to post and revise OCR to you? If not, where? If so, what's stopping you?
I'm not so sure we agree. I think we're talking about two different things.
This thread started out with a discussion of why it is so hard to start new projects within the Wikimedia Foundation. My stance is that projects like OpenStreetMap.org and OpenLibrary.org are doing fine as they are, and there is no need to duplicate their effort within the WMF. The example you gave was this:
I agree that there's no point in duplicating existing functionality. The best solution is probably for OL to include this explicitly in their scope and add the necessary functionality. I suggested this on the OL mailing list in March. http://mail.archive.org/pipermail/ol-discuss/2009-March/000391.html
*A wiki for book metadata, with an entry for every published work, statistics about its use and siblings, and discussion about its usefulness as a citation (a collaboration with OpenLibrary, merging WikiCite ideas)
To me, that sounds exactly like what OpenLibrary already does (or could be doing in the near term), so why even set up a new project that would collaborate with it? Later you added:
However, this is not what OL or its wiki do now. And OL is not run by its community; the community helps support the work of a centrally directed group. So there is only so much I feel I can contribute to the project by making suggestions. The wiki built into the fiber of OL is intentionally not used for general discussion.
I was talking about the metadata for all books ever published, including the Swedish translations of Mark Twain's works, which are part of Mark Twain's bibliography, of the translator's bibliography, of American literature, and of Swedish language literature. In OpenLibrary all of these are contained in one project. In Wikisource, they are split in one section for English and another section for Swedish. That division makes sense for the contents of the book, but not for the book metadata.
This is a problem that Wikisource needs to address, regardless of where the OpenLibrary metadata goes. It is similar to the Wiktionary problem of wanting some content - the array of translations of a single definition - to exist in one place and be transcluded in each language.
Now you write:
However, the project I have in mind for OCR cleaning and translation needs to
That is a change of subject. That sounds just like what Wikisource (or PGDP.net) is about. OCR cleaning is one thing, but it is an entirely different thing to set up "a wiki for book metadata, with an entry for every published work". So which of these two project ideas are we talking about?
They are closely related.
There needs to be a global authority file for works -- a [set of] universal identifier[s] for a given work in order for Wikisource (as it currently stands) to link the German translation of the English transcription of OCR of the 1998 photos of the 1572 Rotterdam Codex... to its metadata entry [or entries].
I would prefer for this authority file to be wiki-like, as the Wikipedia authority file is, so that it supports renames, merges, and splits with version history and minimal overhead; hence I wish to see a wiki for this sort of metadata.
Currently OL does not quite provide this authority file, but it could. I do not know how easily.
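To make the idea concrete, here is a minimal sketch of such a wiki-like authority file: records with stable IDs and version history, where a merge leaves a redirect behind so links to the old ID keep resolving. All names and structures here are my own illustration, not OpenLibrary's or Wikisource's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WorkRecord:
    work_id: str
    metadata: dict
    history: list = field(default_factory=list)  # prior metadata versions
    redirect_to: Optional[str] = None            # set when merged away

class AuthorityFile:
    def __init__(self):
        self.records = {}

    def create(self, work_id, metadata):
        self.records[work_id] = WorkRecord(work_id, dict(metadata))

    def edit(self, work_id, metadata):
        rec = self.resolve(work_id)              # edits follow redirects
        rec.history.append(dict(rec.metadata))   # keep version history
        rec.metadata.update(metadata)

    def merge(self, dup_id, target_id):
        # The duplicate becomes a redirect; its metadata is folded into
        # the target without overwriting what the target already has.
        dup, target = self.records[dup_id], self.resolve(target_id)
        target.history.append(dict(target.metadata))
        for key, value in dup.metadata.items():
            target.metadata.setdefault(key, value)
        dup.redirect_to, dup.metadata = target.work_id, {}

    def resolve(self, work_id):
        rec = self.records[work_id]
        while rec.redirect_to is not None:       # follow merge redirects
            rec = self.records[rec.redirect_to]
        return rec
```

A split would be the reverse operation: create a new record and move part of the metadata over, with both histories preserved. The point of the sketch is that merges, splits, and renames with history are cheap in a wiki-like store.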
Every book ever published means more than 10 million records. (It probably means more than 100 million records.) OCR cleaning attracts hundreds or a few thousand volunteers, which is sufficient to take on thousands of books, but not millions.
Focusing efforts on notable works with verifiable OCR, and using the sorts of helper tools that Greg's paper describes, I do not doubt that we could effectively clean and publish OCR for all primary sources that are actively used and referenced in scholarship today (and more besides). Though 'we' here is the world - certainly more than a few thousand volunteers have at least one book they would like to polish. Most of them are not currently Wikimedia contributors, that much is certain -- we don't provide any tools to make this work convenient or rewarding.
Google scanned millions of books already, but I haven't heard of any plans for cleaning all that OCR text.
Well, Google does not believe in distributed human effort. (This came up in a recent Knol thread as well.) I'm not sure that is the best comparison.
SJ
Hello,
This discussion is very interesting. I would like to make a summary, so that we can go further.
1. A database of all books ever published is one of the things still missing.
2. This needs massive collaboration by thousands of volunteers, so a wiki might be appropriate; however...
3. The data needs a structured web site, not a plain wiki like MediaWiki.
4. A big part of this data is already available, but scattered across various databases, in various languages, with various protocols, etc. So a big part of the work needs as much database-management knowledge as librarian knowledge.
5. What is most missing in these existing databases (IMO) is information about translations: there is no general database of translated works anywhere, at least not in English or French. It is very difficult to find out whether a translation exists for a given work. Wikisource has some of this information in interwiki links between work and author pages, but only for a (very) small number of works and authors.
6. It would be best not to duplicate work in several places.
Personally I don't find OL very practical. Maybe I am too used to MediaWiki. ;oD
We still need to create something, attractive to contributors and readers alike.
Yann
Yann Forget wrote:
This discussion is very interesting. I would like to make a summary, so that we can go further.
- A database of all books ever published is one of the thing still missing.
No, no, no, this is *not* missing. This is exactly the scope of OpenLibrary. Just as Wikipedia is not yet a complete encyclopedia, or OpenStreetMap is not yet a complete map of the world, some books are still missing from OpenLibrary's database, but it is a project aiming to compile a database of every book ever published.
Personally I don't find OL very practical. Maybe I am too used to MediaWiki. ;oD
And therefore, you would not try to improve OpenLibrary, but rather start an entirely new project based on MediaWiki? I'm afraid that this ("not invented here") is a common sentiment, and a major reason that we will get nowhere.
Hello,
Lars Aronsson wrote:
Yann Forget wrote:
This discussion is very interesting. I would like to make a summary, so that we can go further.
- A database of all books ever published is one of the things still missing.
No, no, no, this is *not* missing. This is exactly the scope of OpenLibrary. Just as Wikipedia is not yet a complete encyclopedia, or OpenStreetMap is not yet a complete map of the world, some books are still missing from OpenLibrary's database, but it is a project aiming to compile a database of every book ever published.
At least Wikipedia can say that it has the most complete encyclopedia, and OpenStreetMap the most complete free maps, that have ever existed. AFAIK OpenLibrary is very, very far from having anything comprehensive, though I am curious to see the figures. As I already said, the first steps would be to import existing databases, and Wikimedians are very good at this job.
Personally I don't find OL very practical. Maybe I am too used to MediaWiki. ;oD
And therefore, you would not try to improve OpenLibrary, but rather start an entirely new project based on MediaWiki? I'm afraid that this ("not invented here") is a common sentiment, and a major reason that we will get nowhere.
You are wrong here. I was delighted to see a project like OL, and I inserted a few books and authors, but I have not been convinced. On books and authors, Wikimedia projects already have much more data than OL, and a lot of basic functionalities are not available: tagging two entries as identical (redirect), multilingualism, links between related entries (interwiki), etc.
I don't really care who would host this "Universal Library", as long as it is freely available, with a powerful search engine and no restriction on reuse. What I say is that MediaWiki is really much better than anything else for any massive online cooperative work. The most important point for such a project is building a community. OpenLibrary has certainly done a good job, but I don't see _a community_. The tools and the social environment available on Wikimedia projects are missing. I believe the social environment is a consequence of both the software and the leadership. Once the community exists, it may be self-sustaining if other conditions are met. OL lacks software as good as MediaWiki and a leader like Jimbo.
Yann
Yann Forget wrote:
As I already said, the first steps would be to import existing databases, and Wikimedians are very good at this job.
Do you have a bibliographic database (library catalog) of French literature that you can upload? How many records? Convincing libraries to donate copies of their catalogs has been a bottleneck for OpenLibrary.
Lars Aronsson wrote:
Yann Forget wrote:
As I already said, the first steps would be to import existing databases, and Wikimedians are very good at this job.
Do you have a bibliographic database (library catalog) of French literature that you can upload? How many records? Convincing libraries to donate copies of their catalogs has been a bottleneck for OpenLibrary.
No, I don't have such a database. In Europe, databases are covered by their own copyright-like protection, which makes things complicated.
Probably we need to start with libraries which are already collaborating with open content projects. There was a GLAM-wiki meeting in Australia recently: there might be a possibility with an Australian library?
But even before that, if we could extract the data from Wikimedia projects, we could create a basic working frame. I have been collecting such data on Wikisource and Wikibooks, but the lack of a structured system is a bottleneck.
Examples:
1. Comprehensive bibliography of Gandhi in French: http://fr.wikibooks.org/wiki/Bibliographie_de_Gandhi
2. French translations of Russian authors: http://fr.wikisource.org/wiki/Discussion_Auteur:L%C3%A9on_Tolsto%C3%AF http://fr.wikisource.org/wiki/Discussion_Auteur:F%C3%A9dor_Mikha%C3%AFlovitc...
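The pages linked above are freeform wikitext, which is the structural bottleneck described here. A sketch of the kind of extraction that would be needed, assuming entries formatted roughly like `* ''Title'', Publisher, 1924.` (the pattern and the sample entries are my assumption, not necessarily what those pages actually use):

```python
import re

# Match a bulleted bibliography entry: an italicized title in ''...''
# followed somewhere on the line by a four-digit year.
ENTRY = re.compile(r"^\*\s*''(?P<title>[^']+)''.*?(?P<year>\d{4})")

def parse_bibliography(wikitext):
    """Pull (title, year) records out of freeform bibliography wikitext;
    lines that don't fit the assumed pattern are silently skipped."""
    records = []
    for line in wikitext.splitlines():
        m = ENTRY.match(line)
        if m:
            records.append({"title": m.group("title"),
                            "year": int(m.group("year"))})
    return records
```

Even this toy parser shows the problem: every page formats entries a little differently, so each import needs hand-tuned patterns. A structured system would make the data extractable once and for all.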
Regards,
Yann
Hello,
I started a proposal on the Strategy Wiki: http://strategy.wikimedia.org/wiki/Proposal:Building_a_database_of_all_books...
IMO this should be a joint project between OpenLibrary and Wikimedia. Both have an interest in it and the capacity to work on it.
Regards,
Yann
Yann Forget wrote:
I started a proposal on the Strategy Wiki: http://strategy.wikimedia.org/wiki/Proposal:Building_a_database_of_all_books...
IMO this should be a joint project between OpenLibrary and Wikimedia.
Again, I don't understand why. What exactly is missing in OpenLibrary? Why does it need to be a new, joint project?
The page says "There is currently no database of all books ever published freely available." But OpenLibrary is a project already working towards exactly that goal. It's not done yet, and its methods are not yet fully developed. But neither would your new "joint" project be, for a very long time.
Wikipedia is also far from complete, far from containing "the sum of all human knowledge". But that doesn't create a need to start entirely new encyclopedia projects. It only means more contributors are needed in the existing Wikipedia.
Lars Aronsson wrote:
Yann Forget wrote:
I started a proposal on the Strategy Wiki: http://strategy.wikimedia.org/wiki/Proposal:Building_a_database_of_all_books...
IMO this should be a joint project between OpenLibrary and Wikimedia.
Again, I don't understand why. What exactly is missing in OpenLibrary? Why does it need to be a new, joint project?
The page says "There is currently no database of all books ever published freely available." But OpenLibrary is a project already working towards exactly that goal. It's not done yet, and its methods are not yet fully developed. But neither would your new "joint" project be, for a very long time.
Wikipedia is also far from complete, far from containing "the sum of all human knowledge". But that doesn't create a need to start entirely new encyclopedia projects. It only means more contributors are needed in the existing Wikipedia.
You just give again the same arguments, to which I have answered. Did you read my answer?
Regards,
Yann