Wikisource-l January 2008

wikisource-l@lists.wikimedia.org

11 participants
6 discussions

by Gregory Maxwell

Greetings.

Off and on for many months I have been working on a project to import a large collection of public domain historic scientific documents into Wikimedia's collection.

My standing plan has been to pre-organize and catalog the collection, then upload the document images as DjVu files (which are utterly tiny compared to TIFFs or PDFs) to Commons, including an OCRed text layer (for search and copy and paste). I would then begin importing documents into Wikisource, starting with the OCR but eventually having full marked-up output. From there the documents could be extensively linked and referenced from the other Wikimedia projects.

Most of the delays in my work have come from waiting for free software OCR technology to be able to handle documents from the 18th century. With the recent beta releases of Ocropus and Tesseract from Google I feel the results are finally good enough to move forward.

I do have some open questions though.

I'd really like it if the corrected text in Wikisource could be imported back into the DjVu document images. What I'd like to do is leave invisible markup generated by the OCR software in the page text, like this:

The first experiments were made on the absorption of carbonic acid gas by water: and here a singular disagreement was observed in the first trials made under exactly the same circumstances. It

From this the OCRed text could be corrected, and markup could be added, but I could still take the output and apply it back to the original document.

If people feel this would frustrate editing too much we could make some JavaScript hacks to the edit box to reduce the span tags to nothing more than an immutable <S marker. Would this be acceptable?
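The span tags in the example above are not visible in this copy of the message, so the exact markup is unknown. As a rough sketch of the round trip, assume hOCR-style spans that carry a bounding box in a `title="bbox x0 y0 x1 y1"` attribute. That attribute format, the class name `BBoxSpanParser`, and the function `extract_corrected_lines` are all assumptions made for illustration, not something the message specifies. The sketch pulls the corrected text for each box back out of an edited page, which is the information one would need to rebuild a text layer.

```python
from html.parser import HTMLParser


class BBoxSpanParser(HTMLParser):
    """Collect (bbox, text) pairs from <span title="bbox x0 y0 x1 y1"> elements.

    Assumption: spans are not nested and the bbox values are integers.
    """

    def __init__(self):
        super().__init__()
        self.records = []    # finished (bbox, text) pairs, in page order
        self._bbox = None    # bbox of the span currently being read, if any
        self._chunks = []    # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        title = dict(attrs).get("title") or ""
        parts = title.split()
        if len(parts) == 5 and parts[0] == "bbox":
            self._bbox = tuple(int(p) for p in parts[1:])
            self._chunks = []

    def handle_data(self, data):
        # Only text inside a bbox span is kept; everything else is ignored.
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            # Collapse runs of whitespace so line breaks added by editors vanish.
            text = " ".join("".join(self._chunks).split())
            self.records.append((self._bbox, text))
            self._bbox = None


def extract_corrected_lines(page_text):
    """Return a list of ((x0, y0, x1, y1), corrected_text) for one page."""
    parser = BBoxSpanParser()
    parser.feed(page_text)
    parser.close()
    return parser.records
```

Text outside any bbox span, such as annotations added by editors, is dropped by this sketch. Writing the extracted pairs back into a DjVu file would be a separate step that is not shown here.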

16 years, 3 months

by Yann Forget

Hello,

I think the information was not spread here.
http://wikisource.org/wiki/Wikisource:Scriptorium#Copyright_clarifications_…

It seems that there are still a lot of unclear situations though.

Yann

-------- Original Message --------
Subject: [Foundation-l] Thank you for copyright clarifications
Date: Fri, 18 Jan 2008 13:25:27 -0800 (PST)
From: Birgitte SB <birgitte_sb(a)yahoo.com>
Reply-To: Wikimedia Foundation Mailing List <foundation-l(a)lists.wikimedia.org>
To: Wikimedia Foundation Mailing List <foundation-l(a)lists.wikimedia.org>

I would like to publicly thank Anthere and Mike Godwin for the clarifications on public domain determinations.[1] I know that WMF cannot make a habit of reviewing every disputed interpretation of copyright law, but these general clarifications will really help a number of wikis make better informed decisions about their policies. Thank you very much for making these opinions public.

Birgitte SB

[1] http://meta.wikimedia.org/w/index.php?title=User_talk:Anthere&diff=677612&o…
http://meta.wikimedia.org/w/index.php?title=User_talk%3AAnthere&diff=767102…
http://meta.wikimedia.org/w/index.php?title=User_talk%3AAnthere&diff=839006…

--
http://www.non-violence.org/ | Site collaboratif sur la non-violence
http://www.forget-me.net/ | Alternatives sur le Net
http://fr.wikisource.org/ | Bibliothèque libre
http://wikilivres.info | Documents libres

16 years, 3 months

Scanned texts

by Yann Forget

Hello,

This is now off topic for foundation-l. Better to continue this on wikisource-l.

Klaus Graf wrote:
>> Hello,
>>
>> I agree with Ray here, and I think that Klaus' mail does not report
>> exactly the reality. The French Wikisource has the greatest number of
>> scanned texts so far,
>
> Is there a proof for this claim?

http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics lists 40,043 pages for fr.ws and 16,939 for de.ws.

http://fr.wikisource.org/wiki/Wikisource:Livres_disponibles_en_mode_page lists 62,326 scanned pages (not yet all OCRed and proofread, and I am not sure that this page is up to date).

>> but does not make them mandatory for publishing a text there. It is
>> only a suggestion, which many contributors follow.
>>
>> I think that the important point is not scanned texts, but notation on
>> whether and how the texts are proofread by editors, whatever means the
>> editors use to proofread the texts.
>
> I have been monitoring discussions on digitization projects as an
> archival professional for years. It's standard to give not only e-texts
> but scans. Wikisource demands no scans when a permanent web address
> (e.g. library project) for the scans outside Commons is given.
>
> I think the average quality of other Wikisource branches is very poor.
> In most cases there is no source given: one cannot know which source
> is used, and for scholarly purposes the e-text is worthless.

I think we have already had this discussion earlier, but the misunderstanding continues. There are two different issues here. It is important not to mix them:

1. Scans provided alongside texts.
2. Notation of quality.

Quality is not an absolute value. It is relative to the sources available for a given text. Quality does not have the same meaning for a text from 1920 and a text from the 15th century. So one should not talk about quality, but about notation of quality. I agree that giving the source is important and should be part of a quality notation.

The most important thing is to have a clear notation so that readers know how and by whom the texts have been proofread. Scans alone are not a proof of quality, but they help in getting better quality. They are not the only way to get good quality texts. Some texts may be proofread by several contributors, and so be of very good quality, but Wikisource might not be able to have scanned images if a public domain edition is not easily available.

> Klaus Graf

Regards,

Yann

16 years, 3 months

Re: [Wikisource-l] Promotion of lesser known projects

by Yann Forget

Hello,

I agree with Ray here, and I think that Klaus' mail does not report exactly the reality. The French Wikisource has the greatest number of scanned texts so far, but does not make them mandatory for publishing a text there. It is only a suggestion, which many contributors follow.

I think that the important point is not scanned texts, but notation on whether and how the texts are proofread by editors, whatever means the editors use to proofread the texts.

Regards,

Yann

Ray Saintonge wrote:
> Klaus Graf wrote:
>> One can add de.Wikisource which is a project making historical Public
>> Domain texts in German available with high quality standards. These
>> standards are NOT (yet) shared by the other Wikisource projects, see
>> also
>>
>> http://wikisource.org/wiki/Wikisource:Scriptorium#The_huge_leap
>>
>> Only de.Wikisource demands scanned texts (or digital photos) for
>> contributions, most other Wikisource branches have a lot of texts
>> which are unsourced. De.Wikisource has notes commenting the texts for
>> lots of texts.
> Much of what you suggest is not about to happen any time soon. The fact
> is that splitting up the Wikisource communities created circumstances
> where each Wikisource develops its own standards and criteria. The
> discussions which may have taken place leading up to these policies on
> de:Wikisource either did not take place elsewhere or did not have the
> same results. At best, there have been few determined contributors
> willing to lead by example. Simply telling people to do these gets nowhere.
>
> There is a clear benefit to having our texts supported by
> scanned texts, but many of us who may work well with textual material
> may not have the same technical ease when working with images of any
> kind. Even adding a small number of illustrations that may otherwise
> accompany a text can be a problematic chore. I am quite prepared to
> identify where I found my material, but I am quite content to have
> others do the work of digitization.
>
> Commenting on texts is a great idea that could stand to be encouraged more.
>
> I agree with the premise that we cannot hope to keep up with the massive
> digitization projects undertaken by well-funded institutions, but a lot
> of restrictive requirements is self-defeating. The need is really for a
> balance somewhere between the minutiae of quality and the feeling that
> contributors are seeing a lot of growth. Wikisource will not become
> great by trying to beat the big institutions at their own game. Thus we
> need to ask ourselves what we can do to add value that no other similar
> project can do. In doing so we cannot afford to get bogged down in
> standardized headings that do not allow for easy expansion without a
> complete understanding of transclusion technology. We need to allow our
> imaginations the freedom to find new ways of connecting data without
> being tied to formal structures that are so strict as to close off these
> paths.
>
> Ec

16 years, 3 months

Re: [Wikisource-l] [Commons-l] Zotero / Archive Cooperation

by John Vandenberg

On Jan 3, 2008 11:28 AM, geni <geniice(a)gmail.com> wrote:
> On 02/01/2008, Erik Moeller <erik(a)wikimedia.org> wrote:
> > FYI
> >
> > http://www.zotero.org/blog/zotero-and-the-internet-archive-join-forces/
> >
>
> Nothing new there are a number of sites that accept text dumps of
> copyvios already.

and there are many projects that don't permit copyvios.

Internet Archive's digital repository is at least as clean as Commons/Wikisource, probably more so. Zotero is backed by a university, develops open source software, and is receiving grants from a notable funding body: I doubt that they have neglected to consider copyright.

--
John

16 years, 3 months

Zotero / Archive Cooperation

by Erik Moeller

FYI

http://www.zotero.org/blog/zotero-and-the-internet-archive-join-forces/

Recently the Andrew W. Mellon Foundation awarded the Center for History and New Media and the Internet Archive $1.2 million to develop new services that will aid scholarly sharing, collaboration, citation, and annotation.

In 2008, users will be able to drag and drop items into the "Zotero Commons" (a dedicated part of the Internet Archive's servers) through an icon in the left column.

Items donated to the Commons will be stored in subdirectories of the Commons named for the donors. In addition to encouraging donations to the Commons (since those donating will receive credit for their contributions), this feature will also enable users to identify others who are working with and/or annotating the same content, fostering new collaboration opportunities.

The benefits of the Commons to the scholarly community are thus threefold:

1) The availability of permanent, persistent, archival, off-site storage for long-term management and use of digital content.
2) The ability to share resources publicly for easy access by other scholars.
3) The simplified discovery of new, related resources and potential collaboration opportunities.

As an added incentive to donate to the Commons, the Internet Archive will provide free OCR for your contributions and send you the transcribed text to help you search your personal library.

In addition, modifications will be made to Zotero to make it easier for researchers to select already archived files and web pages from the Internet Archive's existing collections rather than saving local copies. This will enable better referencing of "born digital" items and allow for the collaborative annotation of web documents.

Zotero Commons and Zotero 2.0

Zotero 2.0 will allow you to sync your library's metadata to the Zotero Server. With Zotero Commons you will be able to contribute public domain images, texts, audio and other files. In turn, the Internet Archive will send you any text extracted from donated documents.

--
Erik Möller

16 years, 3 months
