There are more free and openly accessible parallel texts besides the
EU ones. One larger project is OPUS [1], which contains, for example,
free software translations and subtitles.
Another kind of text that is suitable for statistical machine
translation is comparable texts. These are texts written about the same
thing, but not necessarily translations of each other. This kind of text
is harder to align into a translation dictionary model, but it might be
easier to find. From one point of view, the whole of Wikipedia with its
language links can be seen as a huge corpus of comparable texts. There
are free tools for aligning comparable texts; one that comes to mind
right now is Yalign [2], [3]. Another source of comparable texts is
news articles about the same event.
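
To make the Wikipedia idea a bit more concrete, here is a minimal sketch
of how the language links for one article could be looked up through the
MediaWiki API (action=query, prop=langlinks). The helper names and the
sample response are made up for illustration, and the parser assumes the
default JSON layout in which the linked title sits under the "*" key.
It only builds the request URL and reads a response; fetching is left out.

```python
from urllib.parse import urlencode

# Endpoint of the English Wikipedia; other language editions use their own host.
API = "https://en.wikipedia.org/w/api.php"


def langlinks_url(title, api=API):
    """Build a query URL asking for all language links of one article."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    }
    return api + "?" + urlencode(params)


def parse_langlinks(response):
    """Map language code -> article title from a decoded JSON response.

    Assumes the default JSON layout, where each link looks like
    {"lang": "fi", "*": "Konekäännös"}.
    """
    links = {}
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            links[link["lang"]] = link["*"]
    return links


# Hand-written sample in the assumed layout, NOT a real API response.
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "1": {
                "title": "Machine translation",
                "langlinks": [
                    {"lang": "fi", "*": "Konekäännös"},
                    {"lang": "sv", "*": "Maskinöversättning"},
                ],
            }
        }
    }
}

if __name__ == "__main__":
    print(langlinks_url("Machine translation"))
    print(parse_langlinks(SAMPLE_RESPONSE))
```

Each pair of linked articles would then be one candidate pair of
comparable texts to hand to an aligner.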
Best wishes!
Kristian
[1] http://opus.lingfil.uu.se/
[2] http://yalign.machinalis.com/
[3] https://github.com/machinalis/yalign
22.05.2014 19:03, Lars Aronsson wrote:
On 05/22/2014 05:41 PM, Petr Bena wrote:
I was looking for a free (possibly open source) provider of automatic
translations for an open source application I am working on, and I had
quite some trouble finding one. Then I realized we have a project called
"wiktionary" which could possibly (I was assuming it's an open
dictionary) help me here, but I was quite disappointed as I couldn't
find any simple way to perform simple queries like:
There are several open-source machine translation projects.
They are either rule-based or statistics-based. One of the
rule-based projects is Apertium.
When you start from zero, building a rule-based system
gives you a useful system quite fast, especially if the
two languages are similar. A statistics-based system (such
as Google Translate) requires enormous amounts of
data to become useful.
It's not something that you can start as a subproject
within Wiktionary, not even as a separate WMF project.
It's a very large task.
One naive approach is to base a statistics-based
machine translator (SMT) on the European Union's
freely available parallel text corpus. When you try
to translate Finnish "terve" (which means: hello!)
into English in such a system, it will say "health",
since the same word also means health, and EU
texts only talk about healthcare, never "hello".
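
The "terve" effect above can be reproduced in miniature. The sketch
below trains a tiny word-translation model (IBM Model 1 style EM, chosen
here only as a simple stand-in for a real SMT system) on three invented
sentence pairs. The pairs are made up for illustration and are not taken
from the EU corpus, and here "terve" appears only in its "healthy"
sense. All function names are hypothetical.

```python
from collections import defaultdict

# Invented toy "parallel corpus" (Finnish, English); NOT real EU text.
TOY_PAIRS = [
    ("terve väestö", "healthy population"),
    ("terve talous", "healthy economy"),
    ("terve ympäristö", "healthy environment"),
]


def train_ibm1(pairs, iterations=10):
    """Estimate t[(e, f)] ~ P(English word e | Finnish word f) with EM."""
    corpus = [(f.split(), e.split()) for f, e in pairs]
    t = defaultdict(lambda: 1.0)  # start with the same weight for every pair
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for f_words, e_words in corpus:
            for e in e_words:
                # Spread each English word over the Finnish words of the
                # same sentence, in proportion to the current estimates.
                norm = sum(t[(e, f)] for f in f_words)
                for f in f_words:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(
            float, {(e, f): c / total[f] for (e, f), c in count.items()}
        )
    return t


def candidates(t, f_word):
    """English words the model can propose for f_word, most likely first."""
    found = [(e, p) for (e, f), p in t.items() if f == f_word and p > 0]
    return sorted(found, key=lambda item: -item[1])


def best_translation(t, f_word):
    """Most likely English word for f_word, or None if it was never seen."""
    ranked = candidates(t, f_word)
    return ranked[0][0] if ranked else None


if __name__ == "__main__":
    model = train_ibm1(TOY_PAIRS)
    print(best_translation(model, "terve"))
    print(candidates(model, "terve"))
```

Since "hello" never occurs on the English side of the training data,
the model cannot propose it, no matter how many iterations are run.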