Hi,
For a long time, Indic-language Wikisource projects have depended entirely
on manual proofreading, which wastes a great deal of time and energy.
Recently Google released OCR software for more than 20 Indic languages,
along with other Asian languages. This software is far more accurate than
the previous OCRs, but it has many limitations. Uploading the same large
file twice (once for Google OCR and again to Commons) is not a workable
solution for most contributors, as Internet connections in India are very
slow. If we develop a tool that can feed the PDF or DjVu files already
uploaded to Commons directly to Google's OCR, this double upload can be
avoided.
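To make the idea concrete, here is a minimal sketch in Python of what such
a tool could do: ask the Commons API for one page of a multipage file as a
JPEG thumbnail, then submit that image to an OCR endpoint. The Cloud
Vision API used below is only an assumption about which Google OCR
endpoint such a tool would call, and the file name, page number and API
key are placeholders.

import base64
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
VISION_API = "https://vision.googleapis.com/v1/images:annotate"  # assumed endpoint

def commons_page_image_url(filename, page, width=2048):
    # Ask the Commons API for a thumbnail URL of one page of a
    # multipage DjVu/PDF ("page<N>-<W>px" is the thumbnail parameter).
    params = {
        "action": "query", "format": "json",
        "titles": "File:" + filename,
        "prop": "imageinfo", "iiprop": "url",
        "iiurlwidth": width,
        "iiurlparam": "page%d-%dpx" % (page, width),
    }
    data = requests.get(COMMONS_API, params=params).json()
    page_info = next(iter(data["query"]["pages"].values()))
    return page_info["imageinfo"][0]["thumburl"]

def google_ocr(image_url, api_key, lang="bn"):
    # Download the page image and submit it for OCR with a language hint.
    image = requests.get(image_url).content
    body = {"requests": [{
        "image": {"content": base64.b64encode(image).decode("ascii")},
        "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
        "imageContext": {"languageHints": [lang]},
    }]}
    resp = requests.post(VISION_API, params={"key": api_key}, json=body)
    resp.raise_for_status()
    result = resp.json()["responses"][0]
    return result.get("fullTextAnnotation", {}).get("text", "")

# Hypothetical usage: OCR page 5 of a Bengali DjVu already on Commons.
# text = google_ocr(commons_page_image_url("Example.djvu", 5), "MY_API_KEY")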
This was proposed in the 2015 Community Wishlist Survey. Now that voting
on the wishlist has started, the proposal needs your support. Please
follow the link:
https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Wikisource#T…
FYI, this proposal was also identified as a highest-priority need at the
2015 Wikisource Conference in Vienna.
(https://etherpad.wikimedia.org/p/wscon2015needs)
Regards
--
Bodhisattwa Mandal
Administrator, Bengali Wikipedia
''Imagine a world in which every single person on the planet is given
free access to the sum of all human knowledge.''
Hi All,
I have checked it with Bengali images; it works fine, with 100% accuracy.
Anyhow, how can it be implemented in the ProofreadPage extension?
Regards,
Jayanta
---------- Forwarded message ----------
From: Subhashish Panigrahi <subhashish(a)cis-india.org>
Date: Sat, Aug 29, 2015 at 3:22 PM
Subject: [Wikimediaindia-l] Google's Optical Character Recognition software
now works with all South Asian languages
To: wikimediaindia-l(a)lists.wikimedia.org
Google's OCR, which is apparently the most accurate OCR we have seen so
far, works really well for all the major South Asian scripts:
http://globalvoicesonline.org/2015/08/29/googles-optical-character-recognition-software-now-works-with-all-south-asian-languages
Here are test cases for many Indian scripts: https://goo.gl/3X75iR.
Except for Gurmukhi, most scripts work really well.
This could be really useful for Indian-language Wikimedians and will come
in handy for the digitization of printed and scanned text. Here is an
animated tutorial for Wikimedians on using this tool for
Wikisource/Wikipedia:
https://commons.wikimedia.org/wiki/File:Tutorial_to_use_Google_Optical_Character_Recognition.gif
Please write to me if you want to localize this tutorial into your
language.
--
Best!
Subhashish Panigrahi
Programme Officer, Access To Knowledge
Centre for Internet and Society
@subhapa / https://cis-india.org
Hi,
As of now we don't have a separate Wikisource project for Punjabi. We have
two Wikipedias, as Punjabi is written in two scripts: one LTR (Gurmukhi)
and the other RTL (Shahmukhi). I was wondering if it is possible to have
one Wikisource for both scripts. Many texts, especially those written
before 1947, are available in both scripts, and we could work on making
newer ones available in the other script by using a transliteration tool
and some proofreading. Is it possible to create an interface for these two
different scripts?
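For what it's worth, a transliteration tool could start from a simple
rule-based core like the Python sketch below. The character table is
illustrative and very incomplete; real Gurmukhi-to-Shahmukhi conversion
also needs vowel signs, nasalization and dictionary lookups, which is
exactly why the proofreading step would still matter.

# Illustrative, incomplete mapping from Gurmukhi to Shahmukhi letters.
GURMUKHI_TO_SHAHMUKHI = {
    "ਕ": "ک", "ਸ": "س", "ਲ": "ل", "ਮ": "م",
    "ਨ": "ن", "ਪ": "پ", "ਬ": "ب", "ਰ": "ر",
}

def transliterate(text):
    # Characters without a mapping pass through unchanged, which
    # conveniently flags them for manual proofreading.
    return "".join(GURMUKHI_TO_SHAHMUKHI.get(ch, ch) for ch in text)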
Regards
Satdeep Gill
FYI: Danny said good things about us in a Wikimedia-l thread, so I replied
as you see below.
Good job, guys.
Aubrey
On Wed, Dec 16, 2015 at 9:55 PM, Danny Horn <dhorn(a)wikimedia.org> wrote:
> The Wikisource community did a tremendous job in showing up and giving
> support to the Wikisource proposals. The top wishes in that category got 41
> and 39 votes, which is really impressive considering the relative size of
> the projects.
>
> The discussion on using Google's OCR in Indic language Wikisource is
> especially interesting -- a lively debate about finding the right solution
> to what is clearly a deeply-felt need from a community that's working
> really hard to add their languages' knowledge to the movement. I hope that
> having that debate here is a step towards a larger discussion about how we
> can support Wikisource projects.
>
Thanks, Danny.
At the Wikisource Conference, held in Vienna from 20 to 22 November, we
discussed at length what Wikisource needs to reach its full potential as a
project. We decided to agree on a priority list (here:
https://etherpad.wikimedia.org/p/wscon2015needs) and also to participate in
the Survey.
But, if you feel brave enough, there is the whole 665-line Etherpad here:
https://etherpad.wikimedia.org/p/wscon2015weekend [1]
Wikisource, as a project, is completely dependent on the ProofreadPage
extension [2].
Unfortunately, the extension is maintained by volunteers only (I think
just one: Tpt).
Also, the extension doesn't support RTL languages, so Wikisources in
Arabic, Hebrew, Farsi and Indic languages don't really work like the
others. Add to this the fact that there is no good embedded OCR for Indic
languages right now.
And, finally, we'd love to have the VisualEditor *within* the ProofreadPage
extension: Wikisource uses a *lot* of formatting, and that could enable
many, many more users to proofread and validate pages.
Of course, we are a small community, but we're trying really hard to make
our case.
At the moment, to the best of my knowledge, there is not, and never has
been, any software development from the WMF dedicated to Wikisource.
Aubrey
(also a member of the Wikisource Community User Group)
[1] I hereby claim this as the longest Etherpad written by a group of
wikimedians (~40). I hope there is a prize for it. You can even read the
Wikisource mission forged and translated in real time in 21 languages (line
564).
[2] https://www.mediawiki.org/wiki/Extension:Proofread_Page
After I had another hissy fit at Commons administrators about WS works
being deleted with no means to track them, they created
https://commons.wikimedia.org/wiki/User:SteinsplitterBot/DR/enwikisource
I would encourage other wikis to talk to Steinsplitter to get a
similar page set up for their Wikisource.
I am going to need a better means of monitoring it than manually checking
a watchlist; however, it is a start.
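For anyone in the same position, one rough way to watch that page without
a watchlist is to poll its latest revision through the Commons API. A
sketch in Python; the ten-minute interval and the print are placeholders
for whatever alerting you prefer:

import time
import requests

API = "https://commons.wikimedia.org/w/api.php"
PAGE = "User:SteinsplitterBot/DR/enwikisource"

def latest_revision():
    # Fetch the newest revision id and timestamp of the tracking page.
    params = {
        "action": "query", "format": "json", "titles": PAGE,
        "prop": "revisions", "rvprop": "ids|timestamp", "rvlimit": 1,
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]

last_seen = latest_revision()["revid"]
while True:
    time.sleep(600)  # poll every ten minutes
    rev = latest_revision()
    if rev["revid"] != last_seen:
        last_seen = rev["revid"]
        print("DR tracking page updated at", rev["timestamp"])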
Regards, Billinghurst
I think the below is also interesting to the Wikisource community.
---------- Forwarded message ----------
From: Dario Taraborelli <dtaraborelli(a)wikimedia.org>
Date: 2015-12-04 16:43 GMT+01:00
Subject: [Wikidata] Fwd: "Wikipedia as the front matter to all research": A
brown bag on scholarly citations in Wikipedia this Friday 12/4 @ 12 PT
To: "Discussion list for the Wikidata project." <
wikidata(a)lists.wikimedia.org>
A reminder that this will be streamed today at 9pm CET / 12pm PST
We’ll be talking
<https://meta.wikimedia.org/wiki/Wikipedia_as_the_front_matter_to_all_resear…>
about unique identifiers and bibliographic/citation data in general, as
well as
https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_MetaData
You can join the conversation via IRC on #wikimedia-office
Dario
Begin forwarded message:
*From: *Dario Taraborelli <dtaraborelli(a)wikimedia.org>
*Date: *December 2, 2015 at 11:01:51 AM PST
*To: *wikimedia-l(a)lists.wikimedia.org, Research into Wikimedia content and
communities <wiki-research-l(a)lists.wikimedia.org>
*Subject: **"Wikipedia as the front matter to all research": A brown bag on
scholarly citations in Wikipedia this Friday 12/4 @ 12 PT *
Come and join us for a brown bag this *Friday*, *December 4*, at 12 PT to
learn about *unique identifiers* and *scholarly citations in Wikipedia*,
why they matter and how we can bridge the gap between the Wikimedia,
research and librarian communities.
*Wikipedia as the front matter to all research*
YouTube stream: http://www.youtube.com/watch?v=mB_oexqz8pA
Event information on Meta:
https://meta.wikimedia.org/wiki/Wikipedia_as_the_front_matter_to_all_resear…
*Measuring citizen engagement with the scholarly literature through
Wikipedia citations.*
Geoffrey Bilder, CrossRef
Wikipedia (in toto) is probably the 5th largest referrer of citations to
the scholarly literature. That is, more Wikipedia users click on and follow
citations to the scholarly literature *from* Wikipedia domains than from
any single scholarly publisher in the world. What does this tell us about
general interest in the scholarly literature? What does this tell us about
scholarly engagement with editing Wikipedia articles? The short answer is
“we don’t know.” But we are actively working with Wikimedia to find out.
*Building the sum of all human citations*
Dario Taraborelli, Wikimedia Foundation
As sourcing and verifiability of online information are threatened
<http://www.slideshare.net/dartar/citing-as-a-public-service-building-the-su…>
by the explosion of answer engines and the changing habits of web users,
Wikimedia has an outstanding opportunity to extract and store source data
for every conceivable statement and make it transparently verifiable by its
users. In this talk, I’ll present a grassroots effort
<https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_MetaData> to
create a human-curated, comprehensive repository of all human citations in
Wikidata.
–––––––––––––
Bonus read: a real-time tracker of scholarly citations added to Wikipedia,
built with Raspberry Pi
http://blog.crossref.org/2015/12/crossref-labs-plays-with-the-raspberry-pi-…
*Dario Taraborelli *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
I'm deeply convinced that splitting Wikisource projects into various
languages was a mistake.
Is anyone bold enough to imagine that it is possible to revert that mistake?
Or are we forced to travel along the *diabolicum* trail?
Alex
I'm playing with DjVu files. Luckily, I found a "simple" way to build a
GUI, so I'm using a self-built DjVu text editor with some features that
open the way to many developments (selecting and saving cropped images
from page images; saving text to the ws Page namespace via pywikibot;
aligning the djvu text layer with the ws Page-namespace text).
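For illustration, the pywikibot step might look like this minimal sketch
(the page title and text are placeholders, not lines from the actual
scripts):

import pywikibot

site = pywikibot.Site("it", "wikisource")
# "Pagina:" is the Page namespace on it.wikisource; the title is made up.
page = pywikibot.Page(site, "Pagina:Esempio.djvu/12")
page.text = "corrected text produced by the editor..."  # placeholder
page.save(summary="Djvu Editor: update Page text from the djvu text layer")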
Please don't ask me to put the scripts into Git or GitHub, since I simply
can't do this... it's a shame, but I can't.
Here is a draft document about what I'm doing:
https://it.wikisource.org/wiki/Utente:Alex_brollo/Djvu_Editor_-_Documentazi…
It's in Italian, but I found that Chrome's Italian-to-English translator
does a surprisingly good job.
Obviously I'll be happy to share the code with any of you; but consider
that I'm a DIY "programmer", and that reading my code would therefore be a
terrible pain for a good programmer.
Alex