True, we do need one, certainly for teachers. I see that school teachers here put a lot of emphasis on the pronunciation of V and W, F and P/Ph, and so on, while we Gujaratis don't care about hrasv and dirgh (short and long vowels).

The language in which most Gujarati literature is written needs to be digitised in audio. Whether we call it pure or Tal Gujaratni bhasha, that written language could be the model for spoken words. Regional dialects or styles could adapt from that.

It is a huge project, though, but the issue needs addressing.

Regards,
Dhaval

On 20 Sep 2013 17:17, "Bakul Shah" <bakul@bitblocks.com> wrote:
Wiki articles can be input via a web browser, and browsers have an integrated spell checker. Existing articles can be edited the same way. So no need for anything special.

You know the saying બાર ગાંઉએ બોલી બદલાય (the dialect changes every twelve gau)! Using the fact that the same word means different things in Surati, Amdavadi and Kathiawadi, I was once able to make a pun on the spot when I saw a woman sleeping in a bullock cart her husband was driving: હુતી હુતી હુતી!

The same is true for pronunciation. But an online Gujarati dictionary with pronunciations in your chosen regional accent would be very nice. Maybe Gala Publishers should be involved?

On Sep 20, 2013, at 9:05 AM, Roopal Mehta <roopal.mehta@gmail.com> wrote:

It will be great if we all join hands in achieving something. I do not know how this dictionary or spell checker can be incorporated in Wiki articles, but I am hopeful.

There is a need for an audio dictionary too, for correct pronunciation (ઉચ્ચારણ શુધ્ધિ માટે).

Now, am I asking for too much? :-)



Sent from my iPad

On Sep 20, 2013, at 1:41 PM, Bakul Shah <bakul@bitblocks.com> wrote:

Rajesh,

Proof readers will have to use a word processor or browser as other tools are not very good at displaying Indic languages.

Google Docs is no good at spell checking.

OpenOffice (or LibreOffice) has a number of dictionaries, including one for Gujarati. I suspect it doesn't work well & we have work to do. I have no desire or time to work on OpenOffice -- it is massive -- but there may be a way....

[The rest is a bit too technical. Feel free to skip]

There are a number of open source standalone spell checking programs such as ispell, aspell, hunspell etc. Most were derived from or influenced by the original Unix spell program written by S. C. Johnson. For the curious, here's a paper by Doug McIlroy about it:

ispell was pre-Unicode and only worked with Western languages, but it made some major advances that seem to have been carried over to aspell. I dug into aspell some and it seems to support Gujarati.

Anyway, aspell can be used from other programs (has an API), can handle multiple languages etc. Its documentation is not sufficient (IMHO) to understand affix rules. ispell documentation has more details. I used to know ispell fairly well but that was 20+ years ago!

The dict-gu.oxt extension (used in OpenOffice) contains a file called gu_IN.dic that contains a word list and gu_IN.aff that should have affix rules for Gujarati, but it is very small (compared to English) and seems to need a bunch more work. I see that this extension is maintained by Kartik Mistry (did I see an email from him in this thread?) so maybe he and I can figure out how to add more affix rules?

The basic idea with some example: given a rule like

BOTH/R

This can expand into BOTH and BOTHER, as -ER is a common English suffix (cart, carter and so on). Another example: we may have

ACIDIFY/NR

This can expand into ACIDIFY ACIDIFICATION ACIDIFIER (Y-ER maps to IER). These rules make the spellchecking dictionary quite compact as well as indicate how a word should be taken apart for efficient matching.
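The expansion described above can be sketched in a few lines of Python. This is only an illustration: the rule table is a simplified stand-in for what the R and N flags do in an English ispell-style affix file, and the names RULES and expand are made up for this example.

```python
# Simplified, illustrative suffix rules keyed by flag letter (an assumption,
# not a copy of any real affix file).
# Each rule is (required word ending, characters to strip, characters to add).
RULES = {
    "R": [("E", "", "R"), ("Y", "Y", "IER"), ("", "", "ER")],
    "N": [("E", "E", "ION"), ("Y", "Y", "ICATION"), ("", "", "EN")],
}


def expand(entry):
    """Expand a 'ROOT/FLAGS' dictionary entry into the words it stands for."""
    root, _, flags = entry.partition("/")
    words = [root]
    for flag in flags:
        for ending, strip, add in RULES[flag]:
            if root.endswith(ending):
                stem = root[: len(root) - len(strip)]
                words.append(stem + add)
                break  # the first matching rule for a flag wins
    return words


print(expand("BOTH/R"))      # ['BOTH', 'BOTHER']
print(expand("ACIDIFY/NR"))  # ['ACIDIFY', 'ACIDIFICATION', 'ACIDIFIER']
```

The same table works in reverse for checking: strip a known suffix, restore the stripped characters, and look up the root with the matching flag.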

aspell is capable of deriving such rules for English but I suspect it will need help with Indian languages. This is where the .aff file comes in. So, for example, in Gujarati we would like to cover the following words with a single rule

ગધેડો ગધેડી ગધેડું ગધેડા ગધેડાનું ગધેડાની ગધેડાનો  ગધેડાના

etc. For this we can write something like

ગધેડ/XYZABC

where each letter denotes a particular suffix. And yet, there is no such word as ગધેડ, and I am not sure these programs can handle this. And we have compound words such as ઘોડાગાડી, which will require more complex rules. In fact, Indic languages should have a much larger set of affix rules than English! We should also check out what is being done for Hindi.
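A rough Python sketch of the Gujarati case, including the problem that the bare stem is not a word. Everything here is hypothetical: the flag letters, the "!" marker for a stem that is only valid with a suffix, and the name expand_gujarati are invented for illustration and are not taken from gu_IN.aff. (Hunspell's affix format documents a NEEDAFFIX flag that appears to be meant for exactly this situation; that is worth checking against its manual.)

```python
# Hypothetical flag letters mapped to suffixes. Escapes are used because the
# suffixes are combining vowel signs; the comment shows the resulting word.
SUFFIXES = {
    "X": "\u0acb",                    # ગધેડો
    "Y": "\u0ac0",                    # ગધેડી
    "Z": "\u0ac1\u0a82",              # ગધેડું
    "A": "\u0abe",                    # ગધેડા
    "B": "\u0abe\u0aa8\u0ac1\u0a82",  # ગધેડાનું
    "C": "\u0abe\u0aa8\u0ac0",        # ગધેડાની
    "D": "\u0abe\u0aa8\u0acb",        # ગધેડાનો
    "E": "\u0abe\u0aa8\u0abe",        # ગધેડાના
}
STEM_ONLY = "!"  # hypothetical marker: the bare stem is not itself a word


def expand_gujarati(entry):
    """Expand 'STEM/FLAGS' into full words, skipping the stem if marked."""
    stem, _, flags = entry.partition("/")
    words = [] if STEM_ONLY in flags else [stem]
    words += [stem + SUFFIXES[f] for f in flags if f in SUFFIXES]
    return words


for word in expand_gujarati("ગધેડ/!XYZABCDE"):
    print(word)
```

Compound words such as ઘોડાગાડી are not covered by this sketch; they would need a separate rule for joining two entries.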

Next, we need rules for 'similar' letters (or letters near each other on a keyboard) so that if there is no exact match, we first try such similar or neighboring letters.
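One way to picture the 'similar letters' rule, as a small Python sketch. The groups in SIMILAR are illustrative guesses (short/long vowel signs and the three sibilants), not an established confusion table, and suggest is a made-up name. Hunspell's affix format also documents MAP and KEY directives that look intended for related characters and keyboard neighbors; that should be checked against its manual.

```python
# Groups of letters treated as easily confused (illustrative assumption).
SIMILAR = [
    {"\u0abf", "\u0ac0"},            # short / long i vowel signs
    {"\u0ac1", "\u0ac2"},            # short / long u vowel signs
    {"\u0a87", "\u0a88"},            # ઇ / ઈ
    {"\u0a89", "\u0a8a"},            # ઉ / ઊ
    {"\u0ab6", "\u0ab7", "\u0ab8"},  # શ / ષ / સ
]


def suggest(word, dictionary):
    """Return dictionary words reachable by swapping one similar letter."""
    if word in dictionary:
        return []  # exact match, nothing to suggest
    found = []
    for i, ch in enumerate(word):
        for group in SIMILAR:
            if ch not in group:
                continue
            for other in sorted(group - {ch}):
                candidate = word[:i] + other + word[i + 1:]
                if candidate in dictionary and candidate not in found:
                    found.append(candidate)
    return found


# નદિ (short i) is not in the word list, નદી (long i) is.
print(suggest("\u0aa8\u0aa6\u0abf", {"\u0aa8\u0aa6\u0ac0"}))
```

Only single substitutions are tried here, which keeps the candidate list short; a fuller checker would fall back to a general similarity measure when this finds nothing.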
 
Anyway, once we fix up the dictionary, very likely the same dictionary can be used with word processors such as OpenOffice etc. An easier idea may be to do a web-based frontend.

These programs do a lot of work (creating dictionaries, reading various file formats, updating the screen, etc.) that makes them complicated and hard to modify. Ideally I would want a single function for checking:
check(Speller, String)

It would return a 4-tuple: (correctly spelled prefix, misspelled word, list of suggestions, remaining string). A separate program can generate the dictionaries. The Speller object will read whatever dictionaries it needs. But I don't have time to implement this.
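A minimal Python sketch of what such a function could look like. The Speller class here is a placeholder that holds a plain set of words and borrows difflib from the standard library for suggestions; a real one would load affix-aware dictionaries. The explicit character range is used because Python's \w does not match Indic vowel signs.

```python
import difflib
import re

# Latin letters plus the Gujarati Unicode block; vowel signs are combining
# marks, so a plain \w+ would split Gujarati words apart.
WORD = re.compile(r"[A-Za-z\u0A80-\u0AFF]+")


class Speller:
    """Placeholder speller backed by a plain set of known words."""

    def __init__(self, words):
        self.words = set(words)

    def knows(self, word):
        return word in self.words

    def suggest(self, word):
        # Stand-in for real suggestion logic: closest known words by ratio.
        return difflib.get_close_matches(word, sorted(self.words), n=5)


def check(speller, text):
    """Return (good prefix, misspelled word, suggestions, remaining text)."""
    for match in WORD.finditer(text):
        word = match.group()
        if not speller.knows(word):
            return (text[:match.start()], word,
                    speller.suggest(word), text[match.end():])
    return (text, "", [], "")  # no misspelling found


speller = Speller(["the", "cat", "sat"])
print(check(speller, "the cta sat"))
```

A caller can keep going by passing the fourth element back into check until the misspelled word comes back empty.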

Bakul

On Sep 19, 2013, at 7:43 PM, Rajesh Mashruwala <mashru@gmail.com> wrote:

Has anyone tried the Microsoft Office Gujarati spell checker? It is available with Office 2010.

Sent from the old new iPad!

On Sep 18, 2013, at 11:43 AM, Bakul Shah <bakul@bitblocks.com> wrote:

Googling "hindi spell checker algorithm" found a number of papers. The basic idea is to compare how "similar" a word being checked is to a word known to be correct, where similarity is computed using some algorithm. You don't store all the ways people can misspell a word. In addition, logic is used to derive related words from a root word, depending on number, gender, tense, etc. These rules are more complex in Indic languages than in Western ones. And I think we may need to look at "clusters" instead of individual Unicode code points. But all this must have been worked out years ago. Maybe not for Gujarati, but for Hindi, Marathi, Bengali. You should check with the usual suspects (Google, Microsoft, SIL, language researchers etc.).
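To make the "similarity" and "clusters" ideas concrete, here is a small Python sketch. It uses plain edit distance as the similarity measure, which is one common choice and not necessarily what those papers use. The cluster splitting is a rough approximation (a base letter plus its marks, with a virama pulling in the next consonant), not the full Unicode segmentation rules. The names clusters and distance are made up for this example.

```python
import unicodedata

VIRAMA = "\u0acd"  # Gujarati sign virama, joins consonants into conjuncts


def clusters(word):
    """Split a word into approximate clusters (base letter + attached marks)."""
    out = []
    for ch in word:
        joins_previous = out and (
            unicodedata.category(ch).startswith("M")  # vowel sign, anusvara...
            or out[-1].endswith(VIRAMA)               # consonant after virama
        )
        if joins_previous:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def distance(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]


print(clusters("ગધેડો"))
print(distance(clusters("ગધેડો"), clusters("ગધેડી")))
```

Comparing clusters rather than code points means a wrong conjunct counts as one mistake instead of several.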

For OCR you may need something slightly different from spell checkers that deal with human errors. Here a more common problem will be mistaking similar-looking letters, and joining or splitting of words due to too little or too much white space.
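The joining and splitting problem can be sketched like this in Python. It is a deliberately naive pass, assuming a plain set of known words: it tries to join an unknown token with its neighbor, then tries to split it in two. The name fix_spacing is made up for this example, and English words are used only to keep it easy to read.

```python
def fix_spacing(tokens, dictionary):
    """Repair spurious or missing spaces using a set of known words."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in dictionary:
            out.append(tok)
            i += 1
            continue
        # Too much white space: this token and the next form one word.
        if i + 1 < len(tokens) and tok + tokens[i + 1] in dictionary:
            out.append(tok + tokens[i + 1])
            i += 2
            continue
        # Too little white space: the token is two known words run together.
        for k in range(1, len(tok)):
            if tok[:k] in dictionary and tok[k:] in dictionary:
                out.extend([tok[:k], tok[k:]])
                break
        else:
            out.append(tok)  # no repair found, leave it for the speller
        i += 1
    return out


words = {"the", "spelling", "of", "words"}
print(fix_spacing(["the", "spel", "ling", "ofwords"], words))
# ['the', 'spelling', 'of', 'words']
```

For Gujarati text the split positions should fall on cluster boundaries rather than between a letter and its vowel sign.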

Ultimately there should be support for language variations too (Surati, Kathiawadi, Amdavadi etc.)!

On Sep 18, 2013, at 4:12 AM, Rajesh Mashruwala <mashru@gmail.com> wrote:

Dhavalbhai,

As we get text that is generated using OCR, I see a need for a good Gujarati dictionary. I tried to use the GL dictionary. It was not effective because it is only a fixed corpus of words and cannot recognize any variation of a word. With that model, the corpus would possibly need to be over ten times larger than what the GL dictionary has in order to be useful. Otherwise, it flags too many correct words as errors.

The same dictionary could be used for Gujarati proof readers.

One way is to generate a larger corpus by scraping words from Gujarati Internet pages (those in Unicode). A better way is to think about building better dictionary logic. I may be able to interest exceptionally good volunteer developers if we can think of a smarter way of creating a dictionary. For example, we could codify grammar rules to form derivative words.
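For the scraping route, pulling Gujarati words out of already-downloaded Unicode text is the easy part. A minimal Python sketch, assuming the page text is available as a string; the character range below is an approximation that keeps letters, vowel signs and the virama while leaving out Gujarati digits, and the name count_gujarati_words is made up for this example.

```python
import re
from collections import Counter

# Gujarati signs, letters, vowel signs and virama (U+0A81..U+0AD0) plus the
# extra vocalic letters and signs (U+0AE0..U+0AE3); digits are excluded.
GUJARATI_WORD = re.compile(r"[\u0A81-\u0AD0\u0AE0-\u0AE3]+")


def count_gujarati_words(text):
    """Return a Counter of Gujarati words found in the text."""
    return Counter(GUJARATI_WORD.findall(text))


sample = "ગધેડો abc ગધેડો, ગધેડી 123"
for word, n in count_gujarati_words(sample).most_common():
    print(word, n)
```

Frequency counts help here: words seen only once across many pages are more likely to be misspellings than real vocabulary, so a threshold can filter the scraped list before it goes into a dictionary.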

Should we pursue this course?



Sent from the old new iPad!

On Sep 18, 2013, at 2:48 AM, "Dhaval S. Vyas" <dsvyas@gmail.com> wrote:

Dear Roopalben,

I second your concern regarding correct language. I often say that newspapers are the only LITERATURE most of us end up reading and have access to. The language and the (increasingly common Hindi) words used in them shape the language of present-day society, and hence it is great that you are introducing this course.

Unfortunately, on the wiki we don't have a spelling correction tool or a dictionary lookup facility. But Vishal Monpara has been developing one. Gujarati Lexicon has recently developed a pop-up dictionary as well, which could be adapted for this purpose.

On gu.wikipedia, there is a lot of content translated from either English or Hindi, and most of it does not read like original Gujarati. When read, these translations look so artificial. For the course, it could be a good idea to show such examples and have the attendees correct them, maybe offline if they are not computer savvy or are hesitant to use Wikipedia.

Please let me and the community here know if you have any suggestions on how we can help with the task you are carrying out.

Kind Regards,
Dhaval

On 18 Sep 2013 06:39, "Roopal Mehta" <roopal.mehta@gmail.com> wrote:
Basically there are not many good proofreaders available in the publishing industry - and the demand is high. That was the main reason for starting this course.

Wikipedia is an important source of information. However, the concern here is about correct use of language too. Today we see a great many errors in Gujarati newspapers, publishing, media and almost everywhere. That is a major concern for us.

If Wiki is going to be an important tool for the next generation, we have to make sure that it conveys correct language to society.

I would like to know whether any auto-correction of spelling etc. is available while editing an article in Wiki?

Thank you.


Roopal


On Tue, Sep 17, 2013 at 4:38 PM, Kartik Mistry <kartik.mistry@gmail.com> wrote:
On Tue, Sep 17, 2013 at 3:42 PM, Roopal Mehta <roopal.mehta@gmail.com> wrote:
> At Gujarati Sahitya Parishad, we are running a proofreading course and are including a session on modern methods of proofreading, which includes editing (Guj) Wiki articles.
>
> Please send suggestions if you have any. This is the first batch of students from various fields.

A few suggestions (some may be off-topic, sorry for that!)
1. Please follow Wikipedia's guidelines for articles.
2. Make sure the person is logged in before making changes.
3. Please do not change anything other than spelling/grammar etc.
4. If you're at it already, donating pictures of 'સાહિત્યકાર' (writers)
from GSP for various articles is a good idea, isn't it? :)

Thanks for the good work!

--
Kartik Mistry | IRC: kart_
{0x1f1f, kartikm}.wordpress.com

_______________________________________________
Wikipedia-gu mailing list
Wikipedia-gu@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikipedia-gu

