There are more free and openly accessible parallel texts besides the EU ones. One larger project is OPUS [1], which contains, for example, translations of free software and subtitles.
Another kind of text that is suitable for statistical machine translation is comparable texts. These are texts written about the same thing, but not necessarily translations of each other. Such texts are harder to align into a translation dictionary model, but they might be easier to find. From one point of view, the whole of Wikipedia with its language links can be seen as a huge corpus of comparable texts. There are free tools for aligning comparable texts; one that comes to mind right now is Yalign [2], [3]. Another source of comparable texts is news articles about the same event.
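To make the Wikipedia idea a bit more concrete, here is a rough sketch of how one could list the language links of an article through the MediaWiki API (action=query with prop=langlinks). The function names, the User-Agent string and the example title are just my assumptions for illustration; this only finds candidate article pairs, and the actual sentence alignment would still be left to a tool like Yalign.

```python
import json
import urllib.parse
import urllib.request

# English Wikipedia endpoint; other language editions use their own host.
API = "https://en.wikipedia.org/w/api.php"


def langlinks_url(title, limit="max"):
    """Build a MediaWiki API query URL for the language links of one page."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": limit,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)


def parse_langlinks(response):
    """Map language code -> article title from a decoded JSON response."""
    links = {}
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            # In the default JSON format the linked title is stored under "*".
            links[link["lang"]] = link["*"]
    return links


def fetch_langlinks(title):
    """Fetch and parse the language links of `title` (needs network access)."""
    request = urllib.request.Request(
        langlinks_url(title),
        # Hypothetical User-Agent; replace it with one identifying your tool.
        headers={"User-Agent": "comparable-corpus-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_langlinks(json.load(resp))
```

Calling fetch_langlinks("Health") should give a dictionary from language codes to article titles, and each pair of articles is then one candidate pair of comparable texts. Long lists of links may need the API's continuation mechanism, which this sketch ignores.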
Best wishes! Kristian
[1] http://opus.lingfil.uu.se/ [2] http://yalign.machinalis.com/ [3] https://github.com/machinalis/yalign
22.05.2014 19:03, Lars Aronsson wrote:
On 05/22/2014 05:41 PM, Petr Bena wrote:
I was looking for a free (possibly open source) provider of automatic translations for an open source application I am working on, and I had quite some trouble finding one. Then I realized we have a project called "wiktionary" which could possibly help me here (I was assuming it's an open dictionary), but I was quite disappointed, as I couldn't find any simple way to perform simple queries like:
There are several open-source machine translation projects. They are either rule-based or statistics-based. One of the rule-based projects is Apertium.
When you start from zero, building a rule-based system gives you a useful system quite fast, especially if the two languages are similar. A statistics-based system (such as Google Translate) requires enormous amounts of data to become useful.
It's not something that you can start as a subproject within Wiktionary, not even as a separate WMF project. It's a very large task.
One naive approach is to base a statistics-based machine translator (SMT) on the European Union's freely available parallel text corpus. When you try to translate Finnish "terve" (which means: hello!) into English in such a system, it will say "health", since the same word also means health, and EU texts only talk about healthcare, never "hello".
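The domain effect can be shown with a very small sketch. It is not a real SMT system: it only scores word pairs by how often they co-occur in aligned sentence pairs (a Dice coefficient), and the sentence pairs below are invented for illustration, not taken from any EU corpus. The names train, best_translation, eu_like and chat_like are hypothetical.

```python
from collections import Counter


def train(pairs):
    """Count, per aligned sentence pair, word occurrences and co-occurrences."""
    src_count, tgt_count, cooc = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        cooc.update((s, t) for s in src_words for t in tgt_words)
    return src_count, tgt_count, cooc


def best_translation(pairs, word):
    """Return the target word with the highest Dice score for `word`."""
    src_count, tgt_count, cooc = train(pairs)
    scores = {
        t: 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in cooc.items()
        if s == word
    }
    return max(scores, key=scores.get) if scores else None


# Invented toy data: a "healthcare only" corpus and a conversational one.
eu_like = [
    ("terve lapsi", "healthy child"),
    ("terve väestö", "healthy population"),
    ("lapsi nukkuu", "child sleeps"),
]
chat_like = [
    ("terve", "hello"),
    ("terve ystävä", "hello friend"),
    ("terve taas", "hello again"),
]

print(best_translation(eu_like, "terve"))
print(best_translation(eu_like + chat_like, "terve"))
```

With only the healthcare-style pairs, the greeting sense can never be produced, because "hello" does not occur anywhere in the training data. Once conversational pairs are added, the counts shift and the greeting wins. Real systems use far better models, but they depend on the training domain in the same way.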