Cool :)
I've added those per your suggestion, although I've just mapped YEH to IE for now... I really have to look further into digraphs etc.
Do you have any bitext? (e.g. aligned translations -- one side Tajik, the other side Farsi?) That would make development a hell of a lot easier...
Thanks,
Fran
PS. I now have a ~8,000 entry wordlist and growing... I might play around with that in the future.
On Sun, 2006-05-28 at 18:04 -0700, Mark Williamson wrote:
Also, you're missing a few letters.
I'm not 100% certain about the *proper* conversion of the following passage, but:
оаضое доئме шварое омнета созмон млл мтаҳд дрборҳ арضҳ мшвақ ҳо бҳ ҷмҳваре осломе бҳ таваофқ нздеки шднд: ҳоваеҳ свалоно [ соата پيш ۴:۵۴ ] дрҳолекиҳ ҳоваеҳ свалоно мсвал сеоста خорҷе отаҳодеҳ орваپо др ншста вазерон омвар خорҷҳ отаҳодеҳ орваپо дрборҳ нздеки шдн оаضое доئме шварое омнета созмон млл мтаҳд бҳ таваофқе дрборҳ арضҳ мшвақ ҳо бҳ оерон др озое дста брдоштан таҳрон оз брномҳ ҳстаҳ ое خбр ме дҳд، мқомҳое оژонс бен олмлле онрژе отаме оз оҳтамол тавақф غне созе овароневам др оерон сخн вофтанд ва озҳор доштанд бшрط оенкиҳ омрекио таضмен кинд др صдд сқваط ҳкивамта осломе брнеоед، таҳрон брномҳ ҳстаҳ ое мтавақф ме созд. бо онкиҳ ҷрҷ бваш рئес ҷмҳваре омрекио оз оҳтамол бррсе мшвақ ҳое орваپо бҳ оерон др صварта дста брдоштан ҷмҳваре осломе оз брномҳ ҳстаҳ ое сخн вофта، киондвалезо роес вазер омвар خорҷҳ омрекио вофта оз омрекио خваостаҳ ншдҳ омнета рژем осломе ро таضмен кинд. др ҳмен ҳол рҳбр ҷмҳваре осломе вофта донш ҳстаҳ ое оендҳ блндмдта онрژе оерон ро таضмен ме кинд ва оен ро нбоед бҳ ҳеچ бҳоее оз дста дод. оз свае девор рвасеҳ др сфр оевовароеваонф дбер шварое омнета рвасеҳ бҳ таҳрон бор девор پешнҳод غне созе овароневам др خоки рвасеҳ ро бҳ оерон ороئҳ кирд.
As you can see, there are quite a few letters missing. One thing worth noting is that if you have the sequence "ое", it's actually much more likely to be "оя".
But while I'm at it, I'll give you the proper transliterations for the missed letters:
ض > д
ئ > it means there is another vowel, as in اسرائیل (Esraail; Israel)
پ > п
خ > х as in سخن > сухан (speech)
ژ > ж as in نژادی > нажоди (ethnic)
غ > ғ as in غول > ғӯл (gigantic)
ص > с
ط > т as in شرط > шарт (wager)
چ > ч as in کوچ > кӯч (migration)
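As a rough sketch of how the letter additions above might be folded into a character table (Python; the names EXTRA_LETTERS and transliterate_missing are hypothetical and not taken from the tajik.php script, and the hamza-on-yeh letter is deliberately left out because it signals a following vowel rather than mapping to a single Cyrillic letter):

```python
# Hypothetical table covering only the letters listed above.
# Keys are Perso-Arabic letters, values are Tajik Cyrillic letters.
EXTRA_LETTERS = {
    "\u0636": "д",  # DAD
    "\u067e": "п",  # PEH
    "\u062e": "х",  # KHAH
    "\u0698": "ж",  # JEH
    "\u063a": "ғ",  # GHAIN
    "\u0635": "с",  # SAD
    "\u0637": "т",  # TAH
    "\u0686": "ч",  # TCHEH
}


def transliterate_missing(text: str) -> str:
    """Replace the listed letters and leave every other character untouched."""
    return "".join(EXTRA_LETTERS.get(ch, ch) for ch in text)


# Two half-converted tokens of the kind seen in the passage above.
print(transliterate_missing("с\u062eн"))
print(transliterate_missing("шр\u0637"))
```

This only fills in consonants; the unwritten short vowels still have to come from a dictionary or a corpus.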
Best,
Mark
On 28/05/06, Francis Tyers spectre@ivixor.net wrote:
Actually the opposite ;)
The italics were the ones that seemed to me to be discernible.
I'm working on messing around with the vowels. It would be helpful if there were an "official" transliteration standard, but I can't seem to find one.
Thanks for the link to RFE. I've actually already been trawling it to download as much Tajik text as I can; I then intend to produce a wordlist (possibly ranked by frequency) and experiment with matching "transliterated" Farsi against the wordlist by edit distance.
It would be helpful to have a bilingual dictionary, but I don't think one currently exists in machine-tractable form.
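To make the wordlist idea concrete, here is a minimal sketch of the edit-distance comparison (Python; edit_distance and closest_words are hypothetical helper names, the toy wordlist is made up for illustration, and plain Levenshtein distance is an assumption, since the thread does not say which distance would be used):

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def closest_words(candidate, freq, max_dist=3):
    """Wordlist entries within max_dist, nearest first, ties broken by frequency."""
    scored = sorted((edit_distance(candidate, w), -n, w) for w, n in freq.items())
    return [w for d, _, w in scored if d <= max_dist]


# Toy frequency list standing in for the real wordlist.
freq = Counter(["ҳуқуқ", "ҳуқуқ", "нисбат", "бо", "ҳам"])
print(closest_words("ҳқвақ", freq))
```

Because the raw transliteration mostly differs from real Tajik by missing vowels, it may be worth making vowel insertions cheaper than other edits; that weighting is not shown here.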
Regards,
Fran
On Sun, 2006-05-28 at 14:37 -0700, Mark Williamson wrote:
Also, I noticed that you seemed to use italics for what you took to be incorrect transliterations.
In fact:
"бднео" is an incorrect transliteration of "ба дунё", same with "лҳоз" and "лиҳози"; "ҳқвақ" and "ҳуқуқ"; "боҳм" and "бо ҳам", "бробрнд" and "баробаранд"; "ҳмҳ" and "Ҳама"; "ваҷдон" and "виҷдонанд" (except for the -and suffix); "нсбта" and "нисбат"; "бекидевор" and "ба якдигар"; and possibly even "бо рваҳ бробре" and "бародарвор".
This is because not all vowels are always explicitly indicated in Farsi. The only way to know is to be a native speaker or to use a dictionary.
Mark
On 28/05/06, Mark Williamson node.ue@gmail.com wrote:
...having said that, that won't fix the fact that Tajik uses more Russian loanwords than Farsi.
Mark
On 28/05/06, Mark Williamson node.ue@gmail.com wrote:
Hi Francis,
I have a suggestion to improve this software: build a corpus from the Tajik RFE website http://www.ozodi.org/
As you can pretty obviously tell, not all vowels are indicated in Farsi, so in some cases there *should* be multiple candidates for transliteration. For example, Farsi "yeh" can be transliterated in a number of different ways.
In these cases, a simple search of the corpus should reveal which alternative is an actual word, or which is most frequent, and select it.
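The lookup described above could be sketched roughly as follows (Python; the function names are hypothetical, the list of renderings for yeh is an illustrative guess rather than an authoritative table, and the corpus counts are invented toy values):

```python
from collections import Counter
from itertools import product

# Illustrative assumption: a few Cyrillic renderings that yeh might take.
YEH_OPTIONS = ["и", "е", "й", "ӣ"]
ALTERNATIVES = {
    "\u06cc": YEH_OPTIONS,  # FARSI YEH
    "\u064a": YEH_OPTIONS,  # ARABIC YEH
}


def candidates(word, base_map):
    """Yield every transliteration produced by trying each alternative in turn."""
    options = [ALTERNATIVES.get(ch, [base_map.get(ch, ch)]) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def pick(word, base_map, corpus_freq):
    """Return the most frequent attested candidate, or None if none is attested."""
    attested = [c for c in candidates(word, base_map) if corpus_freq.get(c, 0) > 0]
    if not attested:
        return None
    return max(attested, key=lambda c: corpus_freq[c])


# Toy data: one-to-one letters for SHEEN and REH, and made-up corpus counts.
BASE = {"\u0634": "ш", "\u0631": "р"}
corpus = Counter({"шир": 5, "шер": 3})
print(pick("\u0634\u06cc\u0631", BASE, corpus))
```

The number of candidates grows multiplicatively with each ambiguous letter, so for long words it may be better to check candidates against the wordlist as they are generated rather than collecting them all first.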
Mark
On 28/05/06, Francis Tyers spectre@ivixor.net wrote:
Are there actually any Tajik native speakers working on the Tajik Wikipedia at the moment?
I'd like to discuss some software I'm making with them...
http://82.133.33.43/~spectre/tajik/tajik.php
I've had a look over at tg. but it seems to be very inactive.
Regards,
Fran
Wikipedia-l mailing list Wikipedia-l@Wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikipedia-l
-- Refije dirije lanmè yo paske nou posede pwòp bato.
Aha, here is some:
http://www.bbc.co.uk/persian/tajikistan/story/2006/05/060528_ivanov_larichan...
http://www.bbc.co.uk/persian/iran/story/2006/05/060528_la-ivanov-larijani.sh...
I'll play around a bit with that...
Fran
On Mon, 2006-05-29 at 04:31 +0100, Francis Tyers wrote: