Wikisource-l August 2014

wikisource-l@lists.wikimedia.org

21 participants
20 discussions

by David Cuenca

https://commons.wikimedia.org/wiki/Category:Wikimania_2014_Wikisourcemeetup

Thanks to Vol'ha for taking them :)

Cheers,
Micru

9 years, 8 months

Re: [Wikisource-l] [tesseract-ocr] Outreach from the Wikisource community

by David Cuenca

PS: I forwarded Jim's message to one of the Belarusian Wikisourcers.

On Tue, Aug 12, 2014 at 11:12 PM, Jim O'Regan <joregan(a)gmail.com> wrote:

> On 12 August 2014 17:25, Nick White <nick.white(a)durham.ac.uk> wrote:
> > Dear Wikisourcerers,
> >
> > It's good to hear from you. Wikisource is awesome, as far as I am concerned.
> >
> >> One of the most serious issues was raised by the Belarusian community which uses 2 different scripts with no commercial OCR support. This means that the volunteers have to type each word manually. We wondered if it would be possible to train Tesseract to recognize these old texts using the text that has been already typed.
> >
> > Actually, Tesseract should already have support for Russian and Belarussian "out of the box"; see the 'rus' and 'bel' training data.
>
> 'bel' contains Cyrillic; there is also a Latin script ('Łacinka') for Belarusian. (Russian is widely spoken in Belarus, but Russian texts would be added to the Russian Wikisource).
>
> The question I'd have for the Belarusian Wikisourcers is: can they be treated as having an exact mapping? (It doesn't need to be 1:1, I'm aware that, e.g., 'нь' maps to 'ń'). I ask because, as I remember it, there's very little text in Łacinka, and adapting Cyrillic material could be useful.
>
> > One thing that wikisource could potentially do for us would be provide loads of proofread, freely reusable "ground truth" data to test Tesseract with. Are there programmatic ways of getting at the data, for example downloading all page images and corresponding text that is marked as green, for a specific language / script?
>
> They're all added to a category, so that part should be pretty easy.
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you

--
Etiamsi omnes, ego non
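Jim's remark that the Cyrillic-to-Łacinka mapping need not be 1:1 can be illustrated with a small longest-match converter. This is only a sketch under stated assumptions: the pair 'нь' -> 'ń' comes from the message above, while the single-letter entries and the function name `to_lacinka` are added here purely to make the example runnable. A real converter would need the complete orthographic rules, which the Belarusian Wikisourcers are best placed to supply.

```python
# Illustrative subset only. 'нь' -> 'ń' is taken from the thread; the
# single-letter entries are assumptions added so the example can run.
CYR_TO_LAT = {
    "нь": "ń",
    "н": "n",
    "к": "k",
    "о": "o",
}


def to_lacinka(text, table=CYR_TO_LAT):
    """Convert text using the longest matching key at each position."""
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so 'нь' wins over plain 'н'.
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # Characters missing from the table are passed through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_lacinka("конь"))  # prints: koń
```

Trying the longest key first is what keeps a multi-character sequence from being split into its single-letter parts.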

9 years, 8 months

Fwd: [tesseract-ocr] Outreach from the Wikisource community

by Thomas Tanon

Thanks a lot for this nice answer,

A technical answer to the question:

> Are there programmatic ways of getting at the data, for example downloading all page images and corresponding text that is marked as green, for a specific language / script?

Yes, you can get the list of Page: pages (the pages that contain the wikitext for a given scan image) using this API request:
https://en.wikisource.org/w/api.php?action=query&generator=allpages&gapname…
This is for en.wikisource, where the Page: namespace id is 104 (this id is not the same in all Wikisources) (doc: https://www.mediawiki.org/wiki/API:Allpages ).

Then you can retrieve the content of a "green" page (the ones with "quality": 4) using this API request:
https://en.wikisource.org/w/api.php?action=query&prop=revisions&titles=Page…
(doc: https://www.mediawiki.org/wiki/API:Properties#revisions_/_rv ).

To get the image of a given Page: page, use this API request:
https://en.wikisource.org/w/api.php?action=query&titles=Image:Albert%20Eins…
It retrieves the URL of a file from its title. Page: pages are titled "Page:NAME_OF_THE_FILE", sometimes followed by "/PAGE_NUMBER_IN_A_MULTIPAGE_FILE", so NAME_OF_THE_FILE gives you the name of the image to use.

Thanks again,
Thomas

> From: Nick White <nick.white(a)durham.ac.uk>
> Date: Tue, Aug 12, 2014 at 6:25 PM
> Subject: Re: [tesseract-ocr] Outreach from the Wikisource community
> To: tesseract-ocr(a)googlegroups.com
> Cc: "discussion list for Wikisource, the free library" <wikisource-l(a)lists.wikimedia.org>, David Cuenca <dacuetu(a)gmail.com>
>
> Dear Wikisourcerers,
>
> It's good to hear from you. Wikisource is awesome, as far as I am concerned.
>
> > One of the most serious issues was raised by the Belarusian community which uses 2 different scripts with no commercial OCR support. This means that the volunteers have to type each word manually. We wondered if it would be possible to train Tesseract to recognize these old texts using the text that has been already typed.
>
> Actually, Tesseract should already have support for Russian and Belarussian "out of the box"; see the 'rus' and 'bel' training data.
>
> > We would like to know if you would be interested in exploring collaboration possibilities. I imagine that with your guidance we could prepare training data
>
> The first thing to do would be to take a look at the results you get from Tesseract with the rus and bel training sets already available, and let us know if they aren't appropriate.
>
> > not only in different languages, but also from different time periods, scripts, etc.
>
> As to training for specific scripts, time periods, etc., in theory that is super cool, in practice probably one training set should be able to cover more-or-less everything (except very different scripts, like fraktur). That has been my experience with training Ancient Greek (for which I have been interested in recognising printing from a variety of time periods).
>
> So give Tesseract a whirl, and if it isn't appropriate, or doesn't work for specific scripts, let us know and we can try to figure out a plan.
>
> > At the moment it is not very clear how to achieve this.
>
> My plan is to rewrite the training documentation very soon, so things should hopefully become clearer on that front.
>
> One thing that wikisource could potentially do for us would be provide loads of proofread, freely reusable "ground truth" data to test Tesseract with. Are there programmatic ways of getting at the data, for example downloading all page images and corresponding text that is marked as green, for a specific language / script?
>
> Thanks for getting in touch!
>
> Nick

--
Etiamsi omnes, ego non
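The steps Thomas describes could be sketched in Python roughly as follows. This is a minimal sketch, not the exact requests from the message (those URLs are truncated in the archive): the helper names are hypothetical, `gapnamespace` and `gaplimit` are the standard generator-prefixed `allpages` parameters, and the use of `prop=proofread` with a `"proofread": {"quality": ...}` entry per page is an assumption about the ProofreadPage API that should be checked against the live wiki. The sketch only builds URLs and parses titles; it makes no network calls.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikisource.org/w/api.php"
PAGE_NAMESPACE_ID = 104  # en.wikisource value from the message; differs per wiki


def page_list_url(namespace_id=PAGE_NAMESPACE_ID, limit=50):
    """Build a query URL listing pages in the Page: namespace."""
    params = {
        "action": "query",
        "generator": "allpages",
        "gapnamespace": namespace_id,
        "gaplimit": limit,
        "prop": "proofread",  # assumption: quality level is exposed here
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def is_validated(page):
    """True for a "green" page, i.e. quality 4 (assumed response shape)."""
    return page.get("proofread", {}).get("quality") == 4


def split_page_title(title):
    """Split "Page:NAME_OF_THE_FILE/PAGE_NUMBER" into (file name, page number).

    The page number is None when the title has no numeric suffix.
    """
    prefix = "Page:"
    if not title.startswith(prefix):
        raise ValueError("not a Page: title: " + title)
    rest = title[len(prefix):]
    name, sep, suffix = rest.rpartition("/")
    if sep and suffix.isascii() and suffix.isdigit():
        return name, int(suffix)
    return rest, None
```

A caller would fetch `page_list_url()`, keep the entries for which `is_validated` is true, and pass each title through `split_page_title` to find the file whose image URL should be requested. The namespace id has to be looked up for each language edition, since 104 applies to en.wikisource only.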

9 years, 8 months

Connecting Index pages to Wikidata items

by David Cuenca

Hi,

I have started a new thread on WikiProject Books about a possible way to connect the Index pages to Wikidata items:
https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Books#Wikisource_in…

Your input would be appreciated!

Thanks!
Micru

9 years, 8 months

Wikisource community at Wikimania 2014

by David Cuenca

Thanks to everyone who came to the Wikisource meetup at Wikimania 2014. We had a very nice talk and it was wonderful to get to know each other in person!

Some of the topics that were discussed at the first Wikisource Community User Group meetup:

- lack of OCR for Belarusian and other languages -> I just sent an email to the Tesseract list to see what we could do about it
- what to do with "wikisource.org" (oldwikisource) -> it is probably worth creating a mul.* domain so we can connect it with Wikidata. It needs broader discussion / promotion
- Wikidata -> not clear what to do with "Index:" pages; it needs some thought about how to move that metadata to Wikidata. Consider that the relationship between the "index" (supporting/containing media, closer to Commons) and the edition item (version of a work) is not always 1:1. Perhaps "part of" or "published in" could be used here, but as said it needs thinking and work on some use cases (let's be brave creating properties and examples on test.wikidata.org)
- tabular data (csv) -> Wikisource could host such sources if there is enough interest and a technical solution to manage this kind of information. It needs more discussion and tech support.
- user engagement -> it was proposed to gamify proofreading. Potentially interested partners were identified; if we don't manage to get support, it can be considered as a GSoC project next year.
- advocacy -> wikisourcerors are encouraged to contact their nearest Wikimedia chapter to look for support and raise awareness about the issues that need attention. This kind of trust relationship is necessary to organize the next proofreading contest. Chapters should be aware that with just 100 EUR in prizes and help reaching out (for instance banners on Wikipedia) a lot of things can be accomplished by volunteers like us.
- outreach -> tell GLAM members about Wikisource so that when they contact libraries they can promote Wikisource as well (and not only Wikipedia/Commons)
- open access -> identified as a high-potential project for Wikisource. It needs some automatic text verification so we don't waste volunteers' time proofreading born-digital documents.
- social -> more boldness needed! Let's not be afraid of changing things to make the project more attractive for readers (for instance cover galleries on the main page, or trying collaborative or experimental tools). More contact with the German community is wished for. Promote Wikisource on social platforms and in GLAM newsletters.
- adminship -> it was proposed to promote Tpt to global steward so he can better help new communities. It was also proposed that experienced members could mentor wikisourcerors from new communities so they get to know all the tools/hidden gems that we have.

Besides, this year there were several Wikisource-related submissions:

- Wikisource technical infrastructure -> Tpt presented his advances on the visual editor integration and on the new beta feature "sister projects", to be deployed in the next weeks [1]. Xelgen presented the advances of his grant project for new Wikisource tools [2].
- Crowdsourcing the Digitization of Ben-Yehuda's Dictionary -> Asaf presented his Hebrew crowd-sourced dictionary project [3]
- Wikisource panel -> Charles Matthews, Magnus Manske et al. explained what Wikisource is about, possible options to gamify it, and its possible role as a host of tabular data [4]
- Reform citation structure -> several panelists explained the new landscape of citations that is coming with Wikidata. I presented Wikisource very briefly, to raise awareness of our project and the role it plays supporting Wikipedia [5]
- From paper to digital book on Wikisource -> Xelgen described the best techniques to digitize a book and convert it into a digital one [6]

I hope I didn't forget anything... there was *a lot* of Wikisource this year at Wikimania :)

Thanks,
Micru

[1] https://www.mediawiki.org/wiki/Beta_Features/Other_projects_sidebar
[2] https://meta.wikimedia.org/wiki/Grants:IEG/Tools_for_Armenian_Wikisource_an…
[3] https://wikimania2014.wikimedia.org/wiki/File:The_Old_New_Thing.pdf
[4] https://wikimania2014.wikimedia.org/wiki/Submissions/Panel:_Wikisource,_fro…
[5] https://wikimania2014.wikimedia.org/wiki/Submissions/Reform_of_citation_str…
[6] https://wikimania2014.wikimedia.org/wiki/Submissions/From_paper_book_to_a_d…

9 years, 9 months

Outreach from the Wikisource community

by David Cuenca

Dear Tesseracters,

At Wikisource, the free digital library and sister project of Wikipedia, we have founded a user group [1] to promote international coordination and partnerships with fellow organizations. We have thousands of high-quality, volunteer-proofread pages [2] matched by scans in ca. 50 different languages [3].

Our editing interface for a single page looks like this [4]; the same work can also be viewed as an "index" [5] or as text with all pages together [6]. There are several verification levels. The most important are "yellow", which means that one contributor proofread the page, and "green", which means that a second person verified the proofread text.

This past weekend at Wikimania '14 in London we had a meeting where we discussed technical and social issues from several Wikisource language communities. One of the most serious issues was raised by the Belarusian community, which uses 2 different scripts with no commercial OCR support. This means that the volunteers have to type each word manually. We wondered if it would be possible to train Tesseract to recognize these old texts using the text that has been already typed.

We would like to know if you would be interested in exploring collaboration possibilities. I imagine that with your guidance we could prepare training data not only in different languages, but also from different time periods, scripts, etc. At the moment it is not very clear how to achieve this.

Please let us know if you would like to have a hangout/skype conversation any day next week.

Cheers,
Micru

[1] https://meta.wikimedia.org/wiki/Wikisource_Community_User_Group
[2] https://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics
[3] http://stats.wikimedia.org/wikisource/EN/Sitemap.htm
[4] https://en.wikisource.org/wiki/Page%3ATyrannosaurus_and_Other_Cretaceous_Ca…
[5] https://en.wikisource.org/wiki/Index:Tyrannosaurus_and_Other_Cretaceous_Car…
[6] https://en.wikisource.org/wiki/Tyrannosaurus_and_Other_Cretaceous_Carnivoro…

9 years, 9 months

Wikilivres.ca

by Nahum Wengrov

Hello all, does anyone know what happened to wikilivres.ca and why it disappeared?

9 years, 9 months

Wikisource meet-up on Saturday at Wikimania

by David Cuenca

Just a quick reminder for all wikisourcerors in London. The meet-up is tomorrow, Saturday, at 18:30:
https://wikimania2014.wikimedia.org/wiki/Wikisource_Meetup

See you there!
Micru

9 years, 9 months

Add Wikidata for old Wikisource

by Jayanta Nath

Hi,

How can we add Wikidata links for old Wikisource? As an example, take Bankim Chandra Chattopadhyay ( https://www.wikidata.org/wiki/Q377881 ): there is no Hindi Wikisource, but we have an author page on old Wikisource at
https://wikisource.org/wiki/Author:%E0%A4%AC%E0%A4%82%E0%A4%95%E0%A4%BF%E0%…

So how would we add this to Wikidata?

Regards,
Jayanta

9 years, 9 months

Custom progress level for Index pages

by Luiz Augusto

I've tried to create a new progress level for Index pages on pt.wikisource with no success (the new level isn't displayed in the selection menu when I try to edit an Index page). Do I need to do more than these edits to get it working properly?

https://pt.wikisource.org/w/index.php?diff=prev&oldid=278685
https://pt.wikisource.org/w/index.php?diff=prev&oldid=278686
https://pt.wikisource.org/w/index.php?diff=prev&oldid=278702

9 years, 9 months
