Wikisource-l August 2012

wikisource-l@lists.wikimedia.org

11 participants
21 discussions

Scripto, free software for transcribing documents

by Lars Aronsson

Scripto is an alternative to the ProofreadPage extension used by Wikisource. It is based on MediaWiki but also on OpenLayers, the software used to zoom and pan in OpenStreetMap. The only website I have seen that uses Scripto is the U.S. War Department papers, and in many ways it is clumsier than ProofreadPage. But there might be a few ideas worth picking up. Take a look.

The software is described at http://scripto.org/

As a reference installation, they mention http://wardepartmentpapers.org/transcribe.php

--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se

10 years, 7 months

Which books to scan to support Wikipedia

by Lars Aronsson

After Wikimania, I took a trip to Toronto, where the Internet Archive has a large book scanning center. If you have worked on Wikisource, I think you know that many of the books there come from the Internet Archive, which provides scanned images and OCR text in the form of a DjVu file, which can be proofread and presented in Wikisource.

I went there to learn more about how to scan books, but another possibility is to use their existing facilities, which currently are not used at full capacity. The way the collaboration is set up, the room is provided by the University of Toronto, but the equipment and staff belong to the Internet Archive. The participating libraries around Canada decide which books to digitize and pay a small fee to the Internet Archive for scanning them.

This opens up an interesting opportunity. If we want a particular book to be digitized, and we can find it in the library catalogs of the University of Toronto, it might be possible (in theory, at least) for us to present this wish and perhaps provide the money needed, either directly from the Wikimedia Foundation or from its chapters. Canada is a country with many immigrants and the library system has books in many languages.

This brings me to the question: Which books would be the most important to scan, to help Wikipedia?

For the Swedish and Danish Wikipedia, I know these are the out-of-copyright encyclopedias from the early 20th century, which I already digitized in 2003 and 2008, respectively.

The Czech "Ottuv slovnik" has already been digitized by Google, as the Wikipedia article points out, http://cs.wikipedia.org/wiki/Ott%C5%AFv_slovn%C3%ADk_nau%C4%8Dn%C3%BD

But are the black-and-white scans good enough, or would the Czech community be interested in Internet Archive quality scans in color? If so, we have to figure out whether the Internet Archive will scan it for free, whether the UofT library will pay, whether the Czech chapter can pay (some chapters have money, but are not allowed to send donated money abroad, because of tax deductions), or perhaps the WMF. I saw the Ottuv slovnik on the shelves, in the same building as the scanning stations. It's just waiting for somebody to piece the puzzle together.

But let's begin with a wish list of which books to scan. Of course they need to be out of copyright to fit in the Internet Archive, Wikimedia Commons, and Wikisource. Perhaps illustrated works are more interesting than text? This would be a GLAM + wiki cooperation that cuts across national borders.

--
Lars Aronsson (lars(a)aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/

11 years, 6 months

Norway's digital public library

by Lars Aronsson

In 2009, Norway's national library (www.nb.no) started a large-scale pilot project in book digitization, covering all Norwegian books from the 90s of each century, i.e. 1690-1699, 1790-1799, 1890-1899 and 1990-1999. Those in the last range are of course still covered by copyright, and a contract was signed with Kopinor, an association that represents authors' and publishers' interests. The books under copyright can only be read from within Norway. Today, some 50,000 books are available on http://bokhylla.no/

The 2009 contract with Kopinor is available in English at http://www.nb.no/pressebilder/Contract_NationalLibraryandKopinor.pdf

Many of these books are out of copyright, some are in languages other than Norwegian, and User:V85 has been very active in copying them to Wikimedia Commons and the corresponding language edition of Wikisource. Swedish Wikisource alone has 68 Index: pages originating from nb.no, http://sv.wikisource.org/wiki/Kategori:Nasjonalbiblioteket

We often find the scanned images excellent, but the OCR not so good, so sometimes we add our own, improved OCR. One good example is Nansen's "Eskimo Life", which NB.no has scanned in Norwegian (1891), Swedish (1891), and English (1893), and all three are on Wikisource, http://en.wikisource.org/wiki/Eskimo_Life

A Google search reveals that NB.no is indeed indexed, but with a less perfect OCR text, and Wikisource comes out on top, http://www.google.com/search?q=%22nansen+very+carefully+revised+the+text%22

---Now for the news---

The other day, NB.no announced that they have signed a new agreement with Kopinor to continue this project and cover the whole of Norwegian literature until the year 2000. Some 250,000 titles will be available before 2017. The text of the new contract is not (yet) available.

I haven't seen this announced in English yet. This is NB's own announcement in Norwegian, http://www.nb.no/aktuelt/norsk-litteratur-fra-hele-det-20.-aarhundre-paa-ne…

The newspaper Aftenposten also wrote about it, http://www.aftenposten.no/6976583.html

--
Lars Aronsson (lars(a)aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/

11 years, 7 months

AJAX sharing of Commons data/metadata

by Alex Brollo

I sent this message to wikitech-l; I presume it's better to share the content with Wikisource people too.

> As you know, Wikisource needs robust, well-defined data, and there's a strict, deep relationship between Wikisource and Commons, since Commons hosts the images of books, in .djvu or .pdf files. Commons shares both the images and the contents of each image's information page, so that any wiki project can display a view-only "pseudo-page" by accessing a local page that has the same name as the file on Commons.
>
> While working on self-made data semantization in it.wikisource, using a lot of creative tricks, we discovered that it's hard, almost impossible, to read the contents of pages of other projects through AJAX calls, because of the well-known same-origin policy, but that local File: pages are considered as coming from the "same origin", so that they can be read like any other local page. This AJAX call, asking for the content of, e.g., File:Die_Judenfrage_in_Deutchland_1936.djvu:
>
>     html=$.ajax({url:"http://wikisource.org/wiki/File:Die_Judenfrage_in_Deutchland_1936.djvu",async:false}).responseText;
>
> gives back the HTML text of the local view-only File: page, and this means that any data stored in the information page on Commons is freely accessible to a JavaScript script and can easily be used locally. In particular, data stored in the Information and/or (much better) Book and Creator templates can be retrieved and parsed.
>
> Has this been described/used before? It seems a plain, simple way to share and disseminate good, consistent metadata to any project; and it works today, without any change to the current wiki software.
>
> If you like, I'm sharing a practical test of this trick on wikisource.org too: you can import User:Alex brollo/Library.js and a lot of small, original scripts will be loaded; click the "metadata" button on any page connected to a File: page (namespaces Index, Page) and you'll see a result coming from such an AJAX call.
>
> Alex brollo, from it.wikisource
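
The same-origin read that the message describes could be sketched roughly as below. This is only an illustration, not the contents of Library.js: it swaps the synchronous `$.ajax` call for the browser's asynchronous `fetch`, and the helper names (`filePagePath`, `fetchFilePageHtml`, `extractTableFields`) are invented here. It also assumes that the metadata appears in the page HTML as two-cell table rows (label, then value), which may not match the real markup produced by the Information, Book or Creator templates.

```javascript
// Hypothetical helper: path of the local, view-only File: page for a file
// hosted on Commons. A relative path keeps the request on the current wiki,
// so the same-origin policy is satisfied.
function filePagePath(fileName) {
  return "/wiki/File:" + encodeURIComponent(fileName.replace(/ /g, "_"));
}

// Fetch the HTML of the local File: page. Intended to run in a browser on the
// wiki itself; uses fetch instead of the synchronous $.ajax call in the message.
async function fetchFilePageHtml(fileName) {
  const response = await fetch(filePagePath(fileName));
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return response.text();
}

// Assumption: each metadata field is rendered as a table row with a label
// cell (th or td) followed by a value cell (td). Returns {label: value}.
function extractTableFields(html) {
  const fields = {};
  const rowPattern =
    /<tr\b[^>]*>\s*<t[hd]\b[^>]*>([\s\S]*?)<\/t[hd]>\s*<td\b[^>]*>([\s\S]*?)<\/td>\s*<\/tr>/gi;
  // Drop any nested tags (links, spans) and surrounding whitespace.
  const stripTags = (text) => text.replace(/<[^>]*>/g, "").trim();
  let match;
  while ((match = rowPattern.exec(html)) !== null) {
    const label = stripTags(match[1]);
    if (label) {
      fields[label] = stripTags(match[2]);
    }
  }
  return fields;
}
```

In a browser console on a wiki page, something like `fetchFilePageHtml("Die_Judenfrage_in_Deutchland_1936.djvu").then(extractTableFields).then(console.log)` would print whatever label/value pairs the pattern finds. A real script would probably parse the page with DOM methods rather than a regular expression; the regular expression is used here only to keep the sketch free of dependencies.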

11 years, 7 months

Re: [Wikisource-l] @deWS Scans with image missing

by billinghurst

> What's the problem? If in a whole book single scans are lacking, you can see them in the category easily with this red marker.
> Feel free to provide the lacking page scan instead of posing silly questions like the Commons administrator idiots. Thank you.

Hi Klaus,

The issue is that, at this point in time, you have a construct of images that are all duplicates of the same file. You are about to see the images deleted as duplicates <https://commons.wikimedia.org/wiki/Commons:Deletion_policy>, and when someone tried to illustrate an issue, you were insulting. Excellent.

Regards, Andrew

11 years, 7 months

@deWS Scans with image missing

by billinghurst

Would someone at deWS be so kind as to have a look at File:Der Sagenschatz des Königreichs Sachsen (Grässe) 360.gif https://commons.wikimedia.org/wiki/File:Der_Sagenschatz_des_K%C3%B6nigreich… and work out what is happening with it and what is likely to be the appropriate solution? At the moment, the only real choice is to delete it unless there is better guidance.

Thanks.

Regards, Andrew

11 years, 7 months

Re: [Wikisource-l] Roadmap Wikisource

by Dovi Jacobs

Ronnie, thanks for your comments. It's interesting to learn that you've chosen to publish modern versions of older texts on Wikisource. It would be nice to hear from a variety of our many dozens of languages as to how they handle these issues, so that our various projects can be aware of what the others are doing and learn from each other, and also so that the "Wikisource Roadmap" can better reflect the needs of projects in diverse languages and cultures.

Lars wrote:

> What you describe here might apply to any language, primarily for texts from the time before the language got a stable and modern orthography.
> Russian Wikisource has a method for modernizing the pre-1917 orthography, using the /ДО page name suffix.
> Swedish Wikisource has many texts in pre-1906 orthography, but hasn't implemented any method for modernizing it; readers are expected to be able to read the old spelling.

Here too I am grateful for the examples from other languages. Though what I described might in principle apply to any language, I think it is clear that it applies much more to the literatures of some languages than to others. There is such a huge corpus of literature in English and other large European languages that it doesn't apply to, but less so in others. (Though even in English I can think of examples.)

> Back in 2005, when I proposed to use scanned images in Wikisource, I added two works in German and English as examples, http://meta.wikimedia.org/wiki/User:LA2/Digitizing_books_with_MediaWiki
> I think you need to do something similar. Most of us can't read Hebrew (or Swedish), and won't fully understand any example given in such a small language.

If I've understood you correctly, you mean that I should create a page that shows examples of other methods of editing? I'll have to think about how to make that understandable for examples in a non-Latin alphabet...

By the way, I've been aware of those pages and examples since you first published them, and I've always admired your work :-)

Dovi

11 years, 7 months

Re: [Wikisource-l] Roadmap Wikisource

by Dovi Jacobs

Birgitte wrote:

> "despite our experimentation the only WP type edit wars I can remember were a few over stylistic issues and one translation"

Birgitte, thanks for the information. In other words, at Hebrew Wikisource, which is a smaller (but active) wiki, there has never once been an edit war over this in the 8 years of our history (at least to the best recollection of my admittedly faulty memory). At English Wikisource, which is the largest and most active wiki of all, there have been perhaps two over the many years, and they were satisfactorily resolved.

However, even saying that, it is clear that Marc Galli is certainly quite correct in principle: The bottom line is that any activity that requires any amount of creativity could possibly result in an edit war. But not everything that is correct in theory is always borne out in practice. The Wikisources in English and Hebrew both indicate that the editing process involved in producing corrected/styled/annotated editions does not in practice produce a lot of edit-warring.

As I suggested earlier, it may be both because of the people and processes involved:

* In terms of the process, editing a text is far less likely to be a source of passionate controversy than are the many highly controversial topics covered in Wikipedia.
* In terms of the people, those who enjoy editing such texts, even if they are passionate about what they do (in a positive sense), seem to be less argumentative on the whole than do the people who collaborate on Wikipedia articles that deal with highly controversial topics.

Sébastien wrote:

> "On the French WS there are some minor corrections I personally don’t consider "too major" to qualify as a critical edition:
> - modernized (but not too much) version, e.g. the replacement of long S (ſ) by a modern S (s) (see e.g. [1]; there is a gadget in the left column to change that: Options d’affichage > Texte modernisé) -- I have more concerns about rewritings of Ancient French to modern French and I even have concerns about rewritings of old spellings to modern spellings (e.g. in [1] a modern version could replace "toy" by "toi"); I don’t know the opinion/policies of the French community about that
> - very very obvious spelling mistakes (mostly typography errors I guess); there is a template on fr.ws for that
> [1] https://fr.wikisource.org/wiki/Le_Loup_et_l’Agneau"

Just so people can get a better idea of what we are dealing with at Hebrew Wikisource, I would like to radically build upon Sébastien's example. Imagine a literature which until a century ago was mostly published in a fashion that lacked not just some updated spelling, but far more: Zero vowelization, zero punctuation (periods, commas, etc.), zero division into paragraphs of reasonable size, zero precise citation of exact sources (when an average work cites many thousands of sources by quoting them verbatim or as paraphrase but rarely provides the exact reference). Regarding the latter, the ability to easily put in wikilinks to sources is the ultimate tool for revolutionizing the entire body of literature as a whole, and not just the specific book at hand.

Now on the one hand, any modern published edition of these same source texts adds all of these features to the great benefit of readers, but they are all of course copyrighted! On the other hand, to simply post the plain text on Wikisource without vowelization, punctuation, division into paragraphs and citation of sources provides little benefit to users. And I once again emphasize that this is the case for the *majority* of public domain literature in the language!

(By the way, nothing I've written about here is a critical edition; that is something that goes beyond this. Rather, this is what is involved just to present an average text in a usable fashion.)

There is no question that even adding this basic level of styling to a text involves creative effort, and therefore it is possible, even likely, that two different editors might differ sometimes on details. But in practice, we've found that cooperation and collaboration in wiki style, far from creating problems, is actually a congenial and enjoyable way to provide classic texts to the public in a useful way.

Dovi

11 years, 8 months

Proper Namespace

by Dovi Jacobs

> I think that Wikisource communities could decide to *try* implementing "critical editions" of texts. I would think it is better to have a proper namespace for that, or at least a clear template which warns users about the collaborative nature of the edition.

Thanks Aubrey, this point is crucial. I think it is important enough for a separate thread title.

When at the very beginning we considered annotated editions at Hebrew Wikisource, the solution we found was quite simple: An "Annotation:" namespace. A separate namespace is a way to effectively address all of the legitimate concerns about creativity, neutrality, professionalism, etc. It immediately lets the user know that he is using a text that has been produced collaboratively and in a different way than basic typing and proofreading. It provides a clear separate space for people to produce valuable editions without allowing any confusion between the products of two different kinds of processes. For editions with serious editing guidelines (such as a critical edition), we include a link in the title template to a Wikisource page "about this edition".

I suggested creating an "Annotation:" namespace some time ago at English Wikisource, but didn't get a response. And I am loath to push the issue there myself, since I am hardly active in English anymore and so I don't see it as my place to deal with policy. But I do think, for those who are currently active in English doing things beyond proofreading, that it is highly important to create a safe and defined area for their activity through the creation of a namespace. It is wrong to leave them and their activity in limbo over the course of years, as has been the case so far.

So I highly urge those to whom this is important in English and other languages to seriously consider doing what we have done in Hebrew, namely to allocate a namespace for the purpose. It is a simple, painless, and highly intuitive solution to an inherent problem. And in our experience it works just fine.

> I would also think that these critical editions would be for just a few texts, compared to the thousands of printed texts Wikisource provides. And, if you think about it, "neutrality" does not exist in our proofreading work either; there is always interpretation...

Correct. Basic proofreading will always provide many times more texts, and that is perfectly fine of course. And your point about neutrality is also correct in principle (though not everyone is fully aware of how true it is).

> I'm interested in Wikisource critical editions (as I am in annotation and hyperlinks), and as I explained before I think a layer system should be better, but we are technologically far from that.

At Hebrew Wikisource we have created various templates for this purpose. They can do a lot, and actually make creating good new editions both possible and relatively easy, but they still cannot do everything that your proposed layer system would allow for. I truly hope that the technology will move in this direction (especially regarding TEI support), and would be happy to take part in that process.

Dovi

11 years, 8 months

Re: [Wikisource-l] Roadmap Wikisource

by Marc Galli

On 17/08/2012 07:54, Dovi Jacobs wrote:

> Ah, now I understand what you meant! But why do you think the editing guidelines will be "without reference"? Just like a Wikipedia article can and should be based on sources, the Wikisource guidelines for editing a text should be written by people familiar with the scholarship on that text, while referencing both that scholarship and the relevant editions and manuscripts. Which is exactly what we try to do. And it works quite well.

I understand that in fact you do not strictly produce critical editions (I have tried to see what you do on he:, but I am absolutely not familiar with this language), but compilations of sources, which is quite different. In that sense, I have nothing to say.

Have a good weekend.

11 years, 8 months
