Wiki articles can be input via a web browser and they have an integrated spell checker.
Existing articles can be edited the same way. So no need for anything special.
You know the saying બાર ગાંઉએ બોલી બદલાય (the dialect changes every twelve gau)! Using the fact that the same word means
different things in Surati, Amdavadi and Kathiawadi, I was once able to make a pun on the
spot when I saw a woman sleeping in a bullock cart her husband was driving: હુતી હુતી
હુતી!
The same is true for pronunciation. But an online Gujarati dictionary with pronunciations in
your chosen regional accent would be very nice. Maybe Gala Publishers should be
involved?
On Sep 20, 2013, at 9:05 AM, Roopal Mehta <roopal.mehta(a)gmail.com> wrote:
It will be great if we all join hands in achieving
something. I do not know how this dictionary or spell checker can be incorporated in Wiki
articles, but I am hopeful.
There is a need for an audio dictionary too - ઉચ્ચારણ શુધ્ધિ માટે (for correct pronunciation).
Now now am I asking for too much? :-)
Sent from my iPad
On Sep 20, 2013, at 1:41 PM, Bakul Shah <bakul(a)bitblocks.com> wrote:
> Rajesh,
>
> Proof readers will have to use a word processor or browser as other tools are not
very good at displaying Indic languages.
>
> Google Docs is no good at spell checking.
>
> OpenOffice (or LibreOffice) has a number of dictionaries, including one for Gujarati. I
suspect it doesn't work well & we have work to do. I have no desire or time to
work on OpenOffice -- it is massive -- but there may be a way....
>
> [The rest is a bit too technical. Feel free to skip]
>
> There are a number of open source standalone spell checking programs such as ispell,
aspell, hunspell etc. Most were derived from or influenced by the original Unix spell
program written by S. C. Johnson. For the curious, here's a paper by Doug McIlroy about
it:
>
http://unix-spell.googlecode.com/svn/trunk/McIlroy_spell_1982.pdf
>
> ispell was pre-Unicode and only worked with western languages but it made some major
advances that seem to have been carried over to aspell. I dug into aspell some and it seems to
support Gujarati.
>
> Anyway, aspell can be used from other programs (has an API), can handle multiple
languages etc. Its documentation is not sufficient (IMHO) to understand affix rules.
ispell documentation has more details. I used to know ispell fairly well but that was 20+
years ago!
>
> The dict-gu.oxt extension (used in OpenOffice) contains a file called gu_IN.dic that
contains a word list and gu_IN.aff that should have affix rules for Gujarati, but it is
very small (compared to English) and seems to need a bunch more work. I see that this
extension is maintained by Kartik Mistry (did I see an email from him in this thread?) so
maybe he and I can figure out how to add more affix rules?
>
> The basic idea with some example: given a rule like
>
> BOTH/R
>
> This can expand into BOTH and BOTHER, as -ER is a common English suffix (cart,
carter and so on). Another example: we may have
>
> ACIDIFY/NR
>
> This can expand into ACIDIFY ACIDIFICATION ACIDIFIER (Y-ER maps to IER). These rules
make the spellchecking dictionary quite compact as well as indicate how a word should be
taken apart for efficient matching.
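
The expansion described above can be sketched in a few lines of Python. This is only an illustration: the rule table below is a simplified assumption modeled on ispell-style flags (a condition on the word ending, characters to strip, characters to add), not a copy of any real affix file, and the names RULES and expand are hypothetical.

```python
import re

# Hypothetical, simplified rule table: flag -> list of
# (condition on the word ending, characters to strip, characters to add).
# The first rule whose condition matches is applied.
RULES = {
    "R": [
        (r"E$", "", "R"),             # BAKE   -> BAKER
        (r"[^AEIOU]Y$", "Y", "IER"),  # ACIDIFY -> ACIDIFIER
        (r"[AEIOU]Y$", "", "ER"),     # PLAY   -> PLAYER
        (r"[^EY]$", "", "ER"),        # BOTH   -> BOTHER
    ],
    "N": [
        (r"E$", "E", "ION"),          # CREATE -> CREATION
        (r"Y$", "Y", "ICATION"),      # ACIDIFY -> ACIDIFICATION
        (r"[^EY]$", "", "EN"),        # FALL   -> FALLEN
    ],
}


def expand(entry):
    """Expand a 'WORD/FLAGS' dictionary entry into the word and its derived forms."""
    word, _, flags = entry.partition("/")
    forms = [word]
    for flag in flags:
        for condition, strip, add in RULES[flag]:
            if re.search(condition, word):
                stem = word[: len(word) - len(strip)] if strip else word
                forms.append(stem + add)
                break
    return forms


print(expand("BOTH/R"))       # ['BOTH', 'BOTHER']
print(expand("ACIDIFY/NR"))   # ['ACIDIFY', 'ACIDIFICATION', 'ACIDIFIER']
```

A checker can run the same table in reverse: strip a known suffix from the word being checked, then look up the remaining root and confirm it carries the matching flag.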
>
> aspell is capable of deriving such rules for English but I suspect it will need help
in Indian languages. This is where the .aff file comes in. So for example in Gujarati we
would like to render the following words in a single rule
>
> ગધેડો ગધેડી ગધેડું ગધેડા ગધેડાનું ગધેડાની ગધેડાનો ગધેડાના
>
> etc. For this we can write something like
>
> ગધેડ/XYZABC
>
> where each letter denotes a particular suffix. And yet, there is no such word as ગધેડ
-- and I am not sure these programs can handle this. And we have compound words such as
ઘોડાગાડી (horse cart) -- which will require more complex rules. In fact Indic languages should have a
much larger set of affix rules than English! We should also check out what is being done
for Hindi.
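
A minimal sketch of the ગધેડ/... idea, assuming one flag letter per suffix. The flag letters, the "!" marker and the function name are hypothetical choices for illustration. The marker stands for "the bare stem is not itself a word"; Hunspell's affix format has a NEEDAFFIX directive for that case, though whether it fits here would need to be checked against its documentation.

```python
# Hypothetical flag table: each flag letter stands for one Gujarati suffix.
# Suffixes are written as escapes because they begin with combining vowel signs.
SUFFIXES = {
    "X": "\u0acb",                    # -> ગધેડો
    "Y": "\u0ac0",                    # -> ગધેડી
    "Z": "\u0ac1\u0a82",              # -> ગધેડું
    "A": "\u0abe",                    # -> ગધેડા
    "B": "\u0abe\u0aa8\u0ac1\u0a82",  # -> ગધેડાનું
    "C": "\u0abe\u0aa8\u0ac0",        # -> ગધેડાની
    "D": "\u0abe\u0aa8\u0acb",        # -> ગધેડાનો
    "E": "\u0abe\u0aa8\u0abe",        # -> ગધેડાના
}
NEEDS_AFFIX = "!"  # hypothetical marker: the stem alone is not a valid word


def expand(entry):
    """Expand a 'stem/FLAGS' entry into the list of valid surface forms."""
    stem, _, flags = entry.partition("/")
    forms = [] if NEEDS_AFFIX in flags else [stem]
    forms += [stem + SUFFIXES[f] for f in flags if f != NEEDS_AFFIX]
    return forms


for form in expand("ગધેડ/!XYZABCDE"):
    print(form)
```

Compound words such as ઘોડાગાડી are not covered by this sketch; they would need a separate rule for joining two dictionary entries.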
>
> Next, we need rules for `similar' letters (or letters near each other on a
keyboard) so that if there is not an exact match, we first try such similar or neighbor
letters.
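
The "similar letters" lookup could be sketched as below. The confusable groups are an assumption picked for illustration (short and long vowel signs, and the three sibilants); a real table would come from people who know which Gujarati letters are actually confused or sit next to each other on a keyboard.

```python
# Hypothetical groups of letters that are easily confused with one another.
CONFUSABLE_GROUPS = [
    "\u0abf\u0ac0",  # short and long i vowel signs
    "\u0ac1\u0ac2",  # short and long u vowel signs
    "શષસ",           # the three sibilants
]


def neighbours(ch):
    """Return the letters that are considered similar to ch."""
    for group in CONFUSABLE_GROUPS:
        if ch in group:
            return [c for c in group if c != ch]
    return []


def suggest(word, dictionary):
    """Return word if it is known, else known words one similar-letter swap away."""
    if word in dictionary:
        return [word]
    found = []
    for i, ch in enumerate(word):
        for alt in neighbours(ch):
            candidate = word[:i] + alt + word[i + 1:]
            if candidate in dictionary and candidate not in found:
                found.append(candidate)
    return found


known = {"નદી"}               # river, spelled with the long vowel sign
print(suggest("નદિ", known))  # the short-vowel misspelling maps back to નદી
```

Trying these swaps first keeps the candidate list short before falling back to a more general similarity measure.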
>
> Anyway, once we fix up the dictionary, very likely the same dictionary can be used
with word processors such as OpenOffice etc. An easier idea may be to do a web-based
frontend.
>
> These programs do a lot of work: create dictionaries, read various file formats,
update the screen, etc. That makes them complicated and hard to modify. Ideally I would
want a single function for checking:
>
> check(Speller, String)
>
> That returns a quad: (correctly spelled prefix, misspelled word, list of suggestions,
remaining string). A separate program can generate the dictionaries. The Speller object
will read whatever dictionaries it needs. But I don't have time to implement this.
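
One way the check(Speller, String) interface might look in Python. Everything here is a sketch under assumptions: the Speller holds a plain set of words, tokens are split on whitespace only, and suggest uses a deliberately naive "one letter differs" rule as a stand-in for real suggestion logic.

```python
import re
from dataclasses import dataclass


@dataclass
class Speller:
    """Hypothetical speller backed by a plain set of known words."""
    words: set

    def suggest(self, word):
        # Placeholder logic: known words of the same length with one differing letter.
        return sorted(
            w for w in self.words
            if len(w) == len(word) and sum(a != b for a, b in zip(w, word)) == 1
        )


def check(speller, text):
    """Return (correct prefix, misspelled word, suggestions, remaining string)."""
    # Whitespace-delimited tokens; \S+ is used rather than \w+ because Python's \w
    # does not match Indic vowel signs, which would split Gujarati words apart.
    for match in re.finditer(r"\S+", text):
        word = match.group()
        if word not in speller.words:
            return text[:match.start()], word, speller.suggest(word), text[match.end():]
    return text, "", [], ""


speller = Speller({"the", "cat", "sat"})
print(check(speller, "the cot sat"))  # ('the ', 'cot', ['cat'], ' sat')
```

A caller can loop by calling check again on the returned remainder, which keeps screen updates and file formats out of the speller entirely.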
>
> Bakul
>
> On Sep 19, 2013, at 7:43 PM, Rajesh Mashruwala <mashru(a)gmail.com> wrote:
>
>> Has anyone tried the Microsoft Office Gujarati spell checker? It is available with
Office 2010.
>>
>> Sent from the old new iPad!
>>
>> On Sep 18, 2013, at 11:43 AM, Bakul Shah <bakul(a)bitblocks.com> wrote:
>>
>>> Googling "hindi spell checker algorithm" found a number of papers.
The basic idea is to compare how "similar" a word being checked is to a word
known to be correct, where similarity is computed using some algorithm. You don't
store all the ways people can misspell a word. Plus, logic is used to derive related words
from a root word, which depend on plurality, gender, tense, etc. These rules are more
complex in Indic languages than in western ones. And I think we may need to look at
"clusters" instead of individual Unicode code points. But all this must have been
worked out years ago. Maybe not for Gujarati but for Hindi, Marathi, Bengali. You should
check with the usual suspects (Google, Microsoft, SIL, language researchers etc.).
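
To make the "similarity over clusters" idea concrete, here is a small sketch. It uses plain Levenshtein edit distance as the similarity measure, which is a swapped-in assumption since the message does not name an algorithm. The cluster splitting is a simplification: combining marks attach to the preceding letter and a letter following a virama joins the conjunct, which falls short of full Unicode grapheme segmentation.

```python
import unicodedata

VIRAMA = "\u0acd"  # Gujarati sign virama, which joins consonants into a conjunct


def clusters(word):
    """Split a word into rough clusters: a base letter plus its marks and conjuncts."""
    out = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if out and (is_mark or out[-1].endswith(VIRAMA)):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def distance(a, b):
    """Levenshtein edit distance counted in clusters rather than code points."""
    a, b = clusters(a), clusters(b)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


print(clusters("ગધેડો"))           # three clusters from five code points
print(distance("ગધેડો", "ગધેડી"))  # 1: only the last cluster differs
```

Counting in clusters means a wrong vowel sign costs one edit on the syllable it belongs to, rather than being treated as an unrelated stray character.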
>>>
>>> For OCR you may need something slightly different than spellcheckers that
deal with human errors. Here a more common problem will be mistaking similar-looking
letters and joining or splitting of words due to too little or too much white space.
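
The spacing side of the OCR problem could be handled roughly as follows. This is a sketch under simple assumptions: the dictionary is a plain set, an unknown token is first joined with its right-hand neighbour and otherwise split into two known words, and the function name is hypothetical. English words are used only to keep the example easy to read.

```python
def repair_spacing(tokens, dictionary):
    """Undo OCR spacing errors by joining or splitting unknown tokens."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in dictionary:
            out.append(tok)
            i += 1
            continue
        # Too much white space: try gluing this token to the next one.
        if i + 1 < len(tokens) and tok + tokens[i + 1] in dictionary:
            out.append(tok + tokens[i + 1])
            i += 2
            continue
        # Too little white space: try every split point for two known words.
        for k in range(1, len(tok)):
            if tok[:k] in dictionary and tok[k:] in dictionary:
                out += [tok[:k], tok[k:]]
                break
        else:
            out.append(tok)  # leave unknown tokens for the ordinary spell check
        i += 1
    return out


words = {"the", "bullock", "cart"}
print(repair_spacing(["the", "bull", "ock", "cart"], words))  # ['the', 'bullock', 'cart']
print(repair_spacing(["thecart"], words))                     # ['the', 'cart']
```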
>>>
>>> Ultimately there should be support for language variations too (Surati,
Kathiawadi, Amdavadi etc.)!
>>>
>>> On Sep 18, 2013, at 4:12 AM, Rajesh Mashruwala <mashru(a)gmail.com> wrote:
>>>
>>>> Dhavalbhai,
>>>>
>>>> As we get text that is generated using OCR, I see a need for a good
Gujarati dictionary. I tried to use the GL dictionary. It was not effective because it is
just a corpus of words and cannot recognize any variation on a word. In that model, the GL
dictionary would need possibly over ten times its current corpus to be useful. Otherwise, it
flags too many correct words as errors.
>>>>
>>>> The same dictionary could be used for Gujarati proof readers.
>>>>
>>>> One way is to generate a larger corpus by scraping words from Gujarati
Internet pages (those in Unicode); a better way is to think about building better
dictionary logic. I may be able to interest exceptionally good volunteer developers if we
can think of a smarter way of creating a dictionary. For example, we could codify grammar
rules to form derivative words.
>>>>
>>>> Should we pursue this course?
>>>>
>>>>
>>>>
>>>> Sent from the old new iPad!
>>>>
>>>> On Sep 18, 2013, at 2:48 AM, "Dhaval S. Vyas" <dsvyas(a)gmail.com> wrote:
>>>>
>>>>> Dear Roopalben,
>>>>>
>>>>> I second your concern regarding correct language. I often say
that newspapers are the only LITERATURE most of us end up reading and have access to. The
language and (increasingly common Hindi) words used in them shape the language of
society in the present day, and hence it is great that you are introducing this course.
>>>>>
>>>>> Unfortunately, on wiki we don't have a spelling correction tool or
dictionary lookup facility. But Vishal Monpara has been developing one. Gujarati Lexicon
has recently developed a pop-up dictionary as well, which could be adapted for this
purpose.
>>>>>
>>>>> On gu.wikipedia, there is a lot of content translated from either
English or Hindi, and most of it lacks the feel of original Gujarati writing. When read, these
translations look so artificial. For the course, it could be a good idea to show such
examples and have the course attendees correct them, maybe offline if they are not computer
savvy or are hesitant to use Wikipedia.
>>>>>
>>>>> Please let me and community here know if you have any suggestions on
how we can help with the task you are carrying out.
>>>>>
>>>>> Kind Regards,
>>>>> Dhaval
>>>>>
>>>>> On 18 Sep 2013 06:39, "Roopal Mehta" <roopal.mehta(a)gmail.com> wrote:
>>>>> Basically there are not many good proofreaders available in the
publishing industry - and the demand is high. That was the main reason for starting this
course.
>>>>>
>>>>> Wikipedia is an important source of information. However, the
concern here is about correct use of language too. Today we see a great many errors in
Gujarati newspapers, publishing, media and almost everywhere. That is a major concern for
us.
>>>>>
>>>>> If Wiki is going to be an important tool for the next generation, we
have to make sure that it conveys correct language to society.
>>>>>
>>>>> I would like to know whether any auto-correction of spelling etc.
is available while editing an article in Wiki?
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>> Roopal
>>>>>
>>>>>
>>>>> On Tue, Sep 17, 2013 at 4:38 PM, Kartik Mistry <kartik.mistry(a)gmail.com> wrote:
>>>>> On Tue, Sep 17, 2013 at 3:42 PM, Roopal Mehta <roopal.mehta(a)gmail.com> wrote:
>>>>> > At Gujarati Sahitya Parishad, we are running a proofreading
course and we are including a session on modern methods of proofreading, which includes
editing (Guj) Wiki articles.
>>>>> >
>>>>> > Please send suggestions if you have. This is the first batch of
students from various fields.
>>>>>
>>>>> A few suggestions (some may be off-topic, sorry for that!):
>>>>> 1. Please follow Wikipedia's guidelines for articles.
>>>>> 2. Make sure each person is logged in before making changes.
>>>>> 3. Please do not change anything other than spelling, grammar, etc.
>>>>> 4. If you're at it already, donating pictures of 'સાહિત્યકાર' (literary figures) from GSP
>>>>> for use in various articles is a good idea, isn't it? :)
>>>>>
>>>>> Thanks for good work!
>>>>>
>>>>> --
>>>>> Kartik Mistry | IRC: kart_
>>>>> {0x1f1f, kartikm}.wordpress.com
>>>>>
>>>>> _______________________________________________
>>>>> Wikipedia-gu mailing list
>>>>> Wikipedia-gu(a)lists.wikimedia.org
>>>>>
https://lists.wikimedia.org/mailman/listinfo/wikipedia-gu
>>>>>