Greetings.
Off and on for many months I have been working on a project to import a large
collection of public domain historic scientific documents into
Wikimedia's collection.
My standing plan has been to pre-organize and catalog the collection,
then upload the document images as DjVu files (which are utterly tiny
compared to TIFFs or PDFs) to Commons, including an OCRed text layer
(for search and copy and paste).
I would then begin importing documents into Wikisource, starting with
the OCR but eventually having fully marked-up output. From there the
documents could be extensively linked and referenced from the other
Wikimedia projects.
Most of the delays in my work have been waiting for free software OCR
technology to be able to handle documents from the 18th century. With
the recent beta releases of Ocropus and Tesseract from Google I feel
the results are finally good enough to move forward.
I do have some open questions though.
I'd really like it if the corrected text in Wikisource could be
imported back into the DjVu document images. What I'd like to do is
leave invisible markup generated by the OCR software in the page text,
like this:
<span class='ocr_line' title='bbox 551 4202 2666 4278 1'>The first
experiments were made on the absorption of carbonic</span> <span
class='ocr_line' title='bbox 474 4281 2668 4355 1'>acid gas by water:
and here a singular disagreement was observed</span> <span
class='ocr_line' title='bbox 471 4360 2668 4433 1'>in the first trials
made under exactly the same circumstances. It</span>
From this the OCRed text could be corrected, and markup could be
added, but I could still take the output and apply it back to the
original document. If people feel this would frustrate editing too
much we could make some JavaScript hacks to the edit box to reduce the
span tags to nothing more than an immutable <S marker.
Would this be acceptable?
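
To illustrate the round trip, here is a minimal sketch in Python (not part of any existing tool) of how the bounding boxes and the corrected text could be pulled back out of a page. It assumes the spans keep exactly the form shown above and are never nested, and that the first four numbers after "bbox" are the box coordinates. The trailing fifth number is ignored because its meaning is not stated here. The function name extract_lines is made up for this example.

```python
import re
from html import unescape

# Matches one ocr_line span in the form used in the sample above.
# Assumption: spans are not nested and the title always starts with "bbox".
LINE_RE = re.compile(
    r"<span\s+class='ocr_line'\s+title='bbox\s+([\d\s]+)'>(.*?)</span>",
    re.DOTALL,
)


def extract_lines(page_text):
    """Return a list of (bbox, text) pairs, one per ocr_line span."""
    lines = []
    for match in LINE_RE.finditer(page_text):
        numbers = [int(n) for n in match.group(1).split()]
        # Assumption: the first four numbers are the box coordinates;
        # anything after them (the trailing "1" in the sample) is dropped.
        bbox = tuple(numbers[:4])
        # Drop any HTML tags editors added inside the span, decode
        # entities, then collapse line breaks and runs of spaces.
        text = re.sub(r"<[^>]+>", "", match.group(2))
        text = " ".join(unescape(text).split())
        lines.append((bbox, text))
    return lines
```

Each pair could then be handed to whatever tool rewrites the hidden text layer of the DjVu file; that step is not shown, since it depends on the tool chosen.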
Hello,
This is now off topic for foundation-l.
Better to continue this on wikisource-l.
Klaus Graf wrote:
>> Hello,
>>
>> I agree with Ray here, and I think that Klaus' mail does not
>> accurately reflect reality. The French Wikisource has the greatest number of
>> scanned texts so far,
>
> Is there a proof for this claim?
http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics
lists 40,043 pages for fr.ws and 16,939 for de.ws.
http://fr.wikisource.org/wiki/Wikisource:Livres_disponibles_en_mode_page
lists 62,326 scanned pages (not yet all OCRed and proofread, and I am
not sure that this page is up to date).
>> but does not make it mandatory to have them in order to
>> publish a text there. It is only a suggestion, which many contributors
>> follow.
>>
>> I think that the important point is not scanned texts, but notation on
>> whether and how the texts are proofread by editors, whatever means the
>> editors use to proofread the texts.
>
> I have been monitoring discussions on digitization projects as an archival
> professional for years. It is standard to provide not only e-texts but
> scans. Wikisource demands no scans when a permanent web address (e.g.
> library project) for the scans outside Commons is given.
>
> I think the average quality of other Wikisource branches is very poor.
> In most cases there is no source given: one cannot know which source
> is used, and for scholarly purposes the e-text is worthless.
I think we already have had this discussion earlier, but the
misunderstanding continues.
There are two different issues here. It is important not to mix them:
1. Scans provided alongside texts.
2. Notation of quality.
Quality is not an absolute value. It is relative to the sources
available for a given text. Quality does not have the same meaning for a
text from 1920 and a text from the 15th century. So one should not talk
about quality, but about notation of quality.
I agree that giving the source is important and should be part of a
quality notation. The most important thing is to have a clear notation so that
readers know how and by whom the texts have been proofread. Scans alone
are not proof of quality, but they help achieve better quality. They
are not the only way to get good-quality texts. Some texts may be
proofread by several contributors, and so be of very good quality, but
Wikisource might not be able to have scanned images if a public domain
edition is not easily available.
> Klaus Graf
Regards,
Yann
--
http://www.non-violence.org/ | Site collaboratif sur la non-violence
http://www.forget-me.net/ | Alternatives sur le Net
http://fr.wikisource.org/ | Bibliothèque libre
http://wikilivres.info | Documents libres
Hello,
I agree with Ray here, and I think that Klaus' mail does not
accurately reflect reality. The French Wikisource has the greatest number of
scanned texts so far, but does not make it mandatory to have them in order to
publish a text there. It is only a suggestion, which many contributors
follow.
I think that the important point is not scanned texts, but notation on
whether and how the texts are proofread by editors, whatever means the
editors use to proofread the texts.
Regards,
Yann
Ray Saintonge wrote:
> Klaus Graf wrote:
>> One can add de.Wikisource which is a project making historical Public
>> Domain texts in German available with high quality standards. These
>> standards are NOT (yet) shared by the other Wikisource projects, see
>> also
>>
>> http://wikisource.org/wiki/Wikisource:Scriptorium#The_huge_leap
>>
>> Only de.Wikisource demands scanned texts (or digital photos) for
>> contributions, most other Wikisource branches have a lot of texts
>> which are unsourced. De.Wikisource has notes commenting the texts for
>> lots of texts.
> Much of what you suggest is not about to happen any time soon. The fact
> is that splitting up the Wikisource communities created circumstances
> where each Wikisource develops its own standards and criteria. The
> discussions which may have taken place leading up to these policies on
> de:Wikisource either did not take place elsewhere or did not have the
> same results. At best, there have been few determined contributors
> willing to lead by example. Simply telling people to do these things gets nowhere.
>
> There is a clear benefit to having our texts supported by
> scanned texts, but many of us who may work well with textual material,
> may not have the same technical ease when working with images of any
> kind. Even adding a small number of illustrations that may otherwise
> accompany a text can be a problematic chore. I am quite prepared to
> identify where I found my material, but I am quite content to have
> others do the work of digitization.
>
> Commenting on texts is a great idea that could stand to be encouraged more.
>
> I agree with the premise that we cannot hope to keep up with the massive
> digitization projects undertaken by well-funded institutions, but a lot
> of restrictive requirements is self-defeating. The need is really for a
> balance somewhere between the minutiae of quality and the feeling that
> contributors are seeing a lot of growth. Wikisource will not become
> great by trying to beat the big institutions at their own game. Thus we
> need to ask ourselves what we can do to add value that no other similar
> project can do. In doing so we cannot afford to get bogged down in
> standardized headings that do not allow for easy expansion without a
> complete understanding of transclusion technology. We need to allow our
> imaginations the freedom to find new ways of connecting data without
> being tied to formal structures that are so strict as to close off these
> paths.
>
> Ec
On Jan 3, 2008 11:28 AM, geni <geniice(a)gmail.com> wrote:
> On 02/01/2008, Erik Moeller <erik(a)wikimedia.org> wrote:
> > FYI
> >
> > http://www.zotero.org/blog/zotero-and-the-internet-archive-join-forces/
> >
>
> Nothing new there. There are a number of sites that accept text dumps of
> copyvios already.
And there are many projects that don't permit copyvios.
Internet Archive's digital repository is at least as clean as
Commons/Wikisource, probably more so.
Zotero is backed by a university, develops open source software, and
is receiving grants from a notable funding body: I doubt that they
have neglected to consider copyright.
--
John
FYI
http://www.zotero.org/blog/zotero-and-the-internet-archive-join-forces/
Recently the Andrew W. Mellon Foundation awarded the Center for
History and New Media and the Internet Archive $1.2 million to
develop new services that will aid scholarly sharing, collaboration,
citation, and annotation.
In 2008, users will be able to drag and drop items into the "Zotero
Commons"—a dedicated part of the Internet Archive's servers—through
an icon in the left column.
Items donated to the Commons will be stored in subdirectories of the
Commons named for the donors. In addition to encouraging donations to
the Commons (since those donating will receive credit for their
contributions), this feature will also enable users to identify others
who are working with and/or annotating the same content, fostering new
collaboration opportunities. The benefits to the scholarly community
of the Commons are thus threefold:
1) The availability of permanent, persistent archival, off-site
storage for long-term management and use of digital content.
2) The ability to share resources publicly for easy access by other scholars.
3) The simplified discovery of new, related resources and potential
collaboration opportunities.
As an added incentive to donate to the Commons, the Internet Archive
will provide free OCR for your contributions and send you the
transcribed text to help you search your personal library.
In addition, modifications will be made to Zotero to make it easier
for researchers to select already archived files and web pages from
the Internet Archive's existing collections rather than saving local
copies. This will enable better referencing of "born digital" items
and allow for the collaborative annotation of web documents.
Zotero Commons and Zotero 2.0
Zotero 2.0 will allow you to sync your library's metadata to the Zotero Server.
With Zotero Commons you will be able to contribute public domain
images, texts, audio and other files.
In turn, the Internet Archive will send you any text extracted from
donated documents.
--
Erik Möller