Rajesh,
Proof readers will have to use a word processor or browser as other tools are not very
good at displaying Indic languages.
Google Docs is no good at spell checking.
OpenOffice (or LibreOffice) has a number of dictionaries, including one for Gujarati. I
suspect it doesn't work well & we have work to do. I have no desire or time to
work on OpenOffice -- it is massive -- but there may be a way....
[The rest is a bit too technical. Feel free to skip]
There are a number of open source standalone spell checking programs such as ispell,
aspell, hunspell etc. Most were derived from or influenced by the original Unix spell
program written by S. C. Johnson. For the curious, here's a paper by Doug McIlroy about
it:
ispell was pre-Unicode and only worked with Western languages, but it made some major
advances that seem to have been carried over to aspell. I dug into aspell some and it seems to
support Gujarati.
Anyway, aspell can be used from other programs (has an API), can handle multiple
languages etc. Its documentation is not sufficient (IMHO) to understand affix rules.
ispell documentation has more details. I used to know ispell fairly well but that was 20+
years ago!
The dict-gu.oxt extension (used in OpenOffice) contains a file called gu_IN.dic that
contains a word list and gu_IN.aff that should have affix rules for Gujarati, but it is
very small (compared to English) and seems to need a bunch more work. I see that this
extension is maintained by Kartik Mistry (did I see an email from him in this thread?) so
maybe he and I can figure out how to add more affix rules?
The basic idea, with some examples: given a dictionary entry like
BOTH/R
the R flag expands it into BOTH and BOTHER, as -ER is a common English suffix (cart, carter
and so on). Another example: we may have
ACIDIFY/NR
This can expand into ACIDIFY ACIDIFICATION ACIDIFIER (a final Y plus -ER becomes IER). These rules make
the spellchecking dictionary quite compact and also indicate how a word should be taken
apart for efficient matching.
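To make the expansion concrete, here is a minimal Python sketch. The rule table is my own simplified stand-in for ispell-style suffix rules (the real rules carry more conditions, such as requiring a consonant before the Y), and the names RULES and expand are made up for illustration.

```python
# Simplified, hypothetical suffix rules keyed by flag letter.
# Each rule is (required ending, characters to strip, characters to add);
# the first rule whose ending matches the stem is applied.
RULES = {
    "R": [("E", "", "R"), ("Y", "Y", "IER"), ("", "", "ER")],
    "N": [("E", "E", "ION"), ("Y", "Y", "ICATION"), ("", "", "EN")],
}


def expand(entry):
    """Expand a 'STEM/FLAGS' dictionary entry into the words it stands for."""
    stem, _, flags = entry.partition("/")
    words = [stem]
    for flag in flags:
        for ending, strip, add in RULES[flag]:
            if stem.endswith(ending):
                base = stem[: len(stem) - len(strip)]
                words.append(base + add)
                break  # first matching rule wins
    return words


print(expand("BOTH/R"))
print(expand("ACIDIFY/NR"))
```

Each entry costs one line in the dictionary however many flags it carries, which is where the compactness comes from.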
aspell is capable of deriving such rules for English but I suspect it will need help with
Indian languages. This is where the .aff file comes in. So for example in Gujarati we
would like to cover the following words with a single rule
ગધેડો ગધેડી ગધેડું ગધેડા ગધેડાનું ગધેડાની ગધેડાનો ગધેડાના
etc. For this we can write something like
ગધેડ/XYZABC
where each letter denotes a particular suffix. And yet, there is no such word as ગધેડ,
and I am not sure these programs can handle a stem that is not itself a word. We also have compound words such as
ઘોડાગાડી, which will require more complex rules. In fact Indic languages should have a
much larger set of affix rules than English! We should also check out what is being done
for Hindi.
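A rough Python sketch of the Gujarati case follows. The flag letters A to H and the suffix table are invented for illustration, and the needs_affix switch is my own way of modelling a stem that is not a word by itself. As far as I know, hunspell's affix format has a NEEDAFFIX flag for this situation, but I have not checked how the Gujarati dictionary uses it.

```python
# Hypothetical flag letters mapped to Gujarati suffixes (illustration only).
# Suffixes are written as escapes because bare vowel signs are hard to read.
SUFFIXES = {
    "A": "\u0acb",                    # -o
    "B": "\u0ac0",                    # -i
    "C": "\u0ac1\u0a82",              # -un
    "D": "\u0abe",                    # -a
    "E": "\u0abe\u0aa8\u0ac1\u0a82",  # -anun
    "F": "\u0abe\u0aa8\u0ac0",        # -ani
    "G": "\u0abe\u0aa8\u0acb",        # -ano
    "H": "\u0abe\u0aa8\u0abe",        # -ana
}


def expand_entry(entry, needs_affix=True):
    """Expand 'STEM/FLAGS'; omit the bare stem when it is not a real word."""
    stem, _, flags = entry.partition("/")
    words = [] if needs_affix else [stem]
    words.extend(stem + SUFFIXES[flag] for flag in flags)
    return words


for word in expand_entry("ગધેડ/ABCDEFGH"):
    print(word)
```

With needs_affix left at True the bare stem ગધેડ is never offered as a valid word, only the eight suffixed forms are.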
Next, we need rules for `similar' letters (or letters near each other on a keyboard)
so that if there is not an exact match, we first try such similar or neighbor letters.
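The similar-letter idea could be sketched like this in Python. The SIMILAR pairs are only guesses at confusable Gujarati consonants, not a researched table, and suggest is a made-up name. It tries one substitution at a time and keeps the candidates found in the word list.

```python
# Hypothetical pairs of easily confused letters (assumption, not a real table).
SIMILAR = [("ડ", "દ"), ("ટ", "ત"), ("ણ", "ન"), ("શ", "સ")]


def neighbours(ch):
    """Return the letters considered similar to ch."""
    out = []
    for a, b in SIMILAR:
        if ch == a:
            out.append(b)
        elif ch == b:
            out.append(a)
    return out


def suggest(word, dictionary):
    """Exact match first; otherwise try swapping in one similar letter."""
    if word in dictionary:
        return [word]
    found = []
    for i, ch in enumerate(word):
        for alt in neighbours(ch):
            candidate = word[:i] + alt + word[i + 1:]
            if candidate in dictionary and candidate not in found:
                found.append(candidate)
    return found


correct = "ગધેડો"
typo = correct.replace("ડ", "દ")  # simulate a similar-letter mistake
print(suggest(typo, {correct}))
```

A keyboard-neighbour table would plug into the same SIMILAR list; only the pairs change.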
Anyway, once we fix up the dictionary, very likely the same dictionary can be used with
word processors such as OpenOffice. An easier idea may be to do a web-based frontend.
These programs do a lot of work (create dictionaries, read various file formats, update
the screen, etc.) that makes them complicated and hard to modify. Ideally I would want a
single function for checking:
check(Speller, String)
that returns a quad: (correctly spelled prefix, misspelled word, list of suggestions,
remaining string). A separate program can generate the dictionaries. The Speller object
will read whatever dictionaries it needs. But I don't have time to implement this.
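For what it's worth, the interface could look roughly like this in Python. Everything here is an assumption for illustration: the Speller class just holds a set of words, the tokenizer splits on whitespace and a few punctuation marks (Python's \w does not match Indic vowel signs, so it is avoided), and difflib's closeness matching stands in for a real suggestion algorithm.

```python
import difflib
import re

# Crude tokenizer: a word is any run of characters that is not whitespace
# or common punctuation. Good enough for a sketch, not for production.
TOKEN = re.compile(r"[^\s.,;:!?()\"']+")


class Speller:
    """Hypothetical speller backed by a plain set of known words."""

    def __init__(self, words):
        self.words = set(words)

    def knows(self, word):
        return word in self.words

    def suggest(self, word):
        # difflib is a stand-in for affix-aware suggestion logic.
        return difflib.get_close_matches(word, sorted(self.words), n=3)


def check(speller, text):
    """Return (correct prefix, misspelled word, suggestions, remaining text)."""
    for match in TOKEN.finditer(text):
        word = match.group()
        if not speller.knows(word):
            return text[: match.start()], word, speller.suggest(word), text[match.end():]
    return text, "", [], ""


speller = Speller(["the", "cat", "sat"])
print(check(speller, "the cat szt down"))
```

A caller would apply the user's choice and call check again on the remaining string until it comes back empty.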
Bakul
On Sep 19, 2013, at 7:43 PM, Rajesh Mashruwala <mashru(a)gmail.com> wrote:
Has anyone tried the Microsoft Office Gujarati spell
checker? It is available with Office 2010.
Sent from the old new iPad!
On Sep 18, 2013, at 11:43 AM, Bakul Shah <bakul(a)bitblocks.com> wrote:
> Googling "hindi spell checker algorithm" found a number of papers. The
basic idea is to compare how "similar" a word being checked is to a word known
to be correct, where similarity is computed using some algorithm. You don't store all
the ways people can misspell a word. Plus, logic is used to derive related words from a
root word, which depend on plurality, gender, tense, etc. These rules are more complex in
Indic languages than in Western ones. And I think we may need to look at "clusters"
instead of individual Unicode code points. But all this must have been worked out years ago. Maybe
not for Gujarati but for Hindi, Marathi, Bengali. You should check with the usual suspects
(Google, Microsoft, SIL, language researchers etc.).
>
> For OCR you may need something slightly different from spellcheckers that deal with
human errors. Here a more common problem will be mistaking similar looking letters and
joining or splitting of words due to too little or too much white space.
>
> Ultimately there should be support for language variations too (surati, kathiawadi,
amdavadi etc)!
>
> On Sep 18, 2013, at 4:12 AM, Rajesh Mashruwala <mashru(a)gmail.com> wrote:
>
>> Dhavalbhai,
>>
>> As we get text that is generated using OCR, I see the need for a good Gujarati
dictionary. I tried to use the GL dictionary. It was not effective because it is just a corpus of
words. It cannot recognize any variation on a word. In that model, we need possibly
over ten times the corpus the GL dictionary has for it to be useful. Otherwise, it flags
too many correct words as errors.
>>
>> The same dictionary could be used for Gujarati proof readers.
>>
>> One way is to generate a larger corpus by scraping words from Gujarati Internet
pages (those in Unicode); a better way is to think about building better dictionary logic.
I may be able to interest exceptionally good volunteer developers if we can think of
a smarter way of creating a dictionary. For example, we could codify grammar rules to form
derivative words.
>>
>> Should we pursue this course?
>>
>>
>>
>> Sent from the old new iPad!
>>
>> On Sep 18, 2013, at 2:48 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com>
wrote:
>>
>>> Dear Roopalben,
>>>
>>> I second your concern regarding the correct language. I often say that
newspapers are the only LITERATURE most of us end up reading and have access to. The
language and the (increasingly common Hindi) words used in them shape the language of
society in the present day, and hence it is great that you are introducing this course.
>>>
>>> Unfortunately, on wiki we don't have a spelling correction tool or
dictionary lookup facility. But Vishal Monpara has been developing one. Gujarati Lexicon
has recently developed a pop-up dictionary as well, which could be adapted for this
purpose.
>>>
>>> On gu.wikipedia, there is a lot of content translated from either English or
Hindi, and most of it lacks the feel of original Gujarati. When read, these
translations look so artificial. For the course, it could be a good idea to show such
examples and have the course attendees correct them, maybe offline if they are not computer
savvy or are hesitant to use Wikipedia.
>>>
>>> Please let me and community here know if you have any suggestions on how we
can help with the task you are carrying out.
>>>
>>> Kind Regards,
>>> Dhaval
>>>
>>> On 18 Sep 2013 06:39, "Roopal Mehta" <roopal.mehta(a)gmail.com>
wrote:
>>>> Basically there are not many good proofreaders available in the
publishing industry - and the demand is high. That was the main reason for starting this
course.
>>>>
>>>> Wikipedia is an important source for information. However, the concern
here is about correct use of language too. Today we see a lot many errors in Gujarati
newspapers, publishing, media and almost everywhere. That is a high concern for us.
>>>>
>>>> If Wiki is going to be an important tool for the next generation, we have
to make sure that it conveys correct language to the society.
>>>>
>>>> I would like to know, whether any auto-correction of spelling etc. are
available while editing an article in Wiki ?
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> Roopal
>>>>
>>>>
>>>> On Tue, Sep 17, 2013 at 4:38 PM, Kartik Mistry
<kartik.mistry(a)gmail.com> wrote:
>>>>> On Tue, Sep 17, 2013 at 3:42 PM, Roopal Mehta
<roopal.mehta(a)gmail.com> wrote:
>>>>> > At Gujarati Sahitya Parishad, we are running proof reading
course and we are including a session of modern methods of proof reading, which includes
editing on (Guj) Wiki articles.
>>>>> >
>>>>> > Please send suggestions if you have. This is the first batch of
students from various fields.
>>>>>
>>>>> Few suggestions (some may be offtopic, sorry for that!)
>>>>> 1. Please follow Wikipedia's guideline for article.
>>>>> 2. Make sure person is logged in before making changes.
>>>>> 3. Please do not change anything other than spelling/grammar etc.
>>>>> 4. If you're that already, donating pictures of
'સાહિત્યકાર' in
>>>>> various articles from GSP, is good idea. Isn't it? :)
>>>>>
>>>>> Thanks for good work!
>>>>>
>>>>> --
>>>>> Kartik Mistry | IRC: kart_
>>>>> {0x1f1f, kartikm}.wordpress.com
>>>>>
>>>>> _______________________________________________
>>>>> Wikipedia-gu mailing list
>>>>> Wikipedia-gu(a)lists.wikimedia.org
>>>>>
https://lists.wikimedia.org/mailman/listinfo/wikipedia-gu