Re: [Wikitech-l] Do we have any data in wikidata / wiktionary that could be used for mechanic translations?

23 May 2014

There are more free and openly accessible parallel texts besides the 
EU ones. One of the bigger projects is OPUS[1], which contains, for 
example, free software translations and subtitles.

Another kind of text that is suitable for statistical machine 
translation is comparable texts. These are texts written about the same 
thing, but not necessarily translations of each other. This kind of text 
is harder to align into a translation dictionary model, but it might be 
easier to find. From one point of view, the whole Wikipedia with its 
language links can be seen as a huge corpus of comparable texts. There 
are free tools for aligning comparable texts; one that comes to mind 
right now is Yalign[2], [3]. Another source of comparable texts is news 
articles about the same event.
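
As a rough illustration of what aligning comparable texts involves, here 
is a minimal Python sketch. It is not Yalign's method or API; it simply 
pairs each source sentence with the target sentence that shares the most 
dictionary translations. The function names, the tiny dictionary, the 
threshold and the example sentences are all made up for illustration.

```python
# Toy sketch: pair sentences from two comparable texts using a small
# bilingual dictionary. All data and names here are hypothetical.

def overlap_score(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens that have a known translation in the target."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if dictionary.get(tok, set()) & tgt)
    return hits / len(src_tokens)


def align_comparable(src_sentences, tgt_sentences, dictionary, threshold=0.5):
    """Return (source, target, score) for each source sentence whose best
    target candidate scores at least `threshold`; others are left unpaired."""
    pairs = []
    for src in src_sentences:
        src_tokens = src.lower().split()
        best, best_score = None, 0.0
        for tgt in tgt_sentences:
            score = overlap_score(src_tokens, tgt.lower().split(), dictionary)
            if score > best_score:
                best, best_score = tgt, score
        if best is not None and best_score >= threshold:
            pairs.append((src, best, best_score))
    return pairs


# Assumed toy dictionary: Finnish word -> set of possible English words.
TOY_DICT = {
    "helsinki": {"helsinki"},
    "on": {"is"},
    "suomen": {"finland", "finnish"},
    "pääkaupunki": {"capital"},
}

finnish = ["Helsinki on Suomen pääkaupunki", "Kaupunki perustettiin vuonna 1550"]
english = ["Helsinki is the capital of Finland", "It has many islands"]

for pair in align_comparable(finnish, english, TOY_DICT):
    print(pair)
```

Sentences without a good enough match are simply dropped, which is the 
main difference from parallel texts, where every sentence is expected to 
have a counterpart.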

Best wishes!
Kristian

[1] http://opus.lingfil.uu.se/
[2] http://yalign.machinalis.com/
[3] https://github.com/machinalis/yalign

22.05.2014 19:03, Lars Aronsson wrote:
...
  On 05/22/2014 05:41 PM, Petr Bena wrote:
  I was looking for a free (possibly open source) provider of automatic
  translations for my open source application I am working on and had
  quite some trouble finding one. Then I realized we have a project called
  "wiktionary" which could possibly (I was assuming it's an open
  dictionary) help me here, but I was quite disappointed as I couldn't
  find any simple way to perform simple queries like: 

 There are several open-source machine translation projects.
 They are either rule-based or statistics-based. One of the
 rule-based projects is Apertium.

 When you start from zero, building a rule-based system
 gives you a useful system quite fast, especially if the
 two languages are similar. A statistics-based system (such
 as Google Translate) requires enormous amounts of
 data to become useful.

 It's not something that you can start as a subproject
 within Wiktionary, not even as a separate WMF project.
 It's a very large task.

 One naive approach is to base a statistics-based
 machine translator (SMT) on the European Union's
 freely available parallel text corpus. When you try
 to translate Finnish "terve" (which means: hello!)
 into English in such a system, it will say "health",
 since the same word also means health, and EU
 texts only talk about healthcare, never "hello".
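
 The "terve" effect can be reproduced with a toy sketch. The Python below
 is not a real SMT system; it only counts word pairs and picks the most
 frequent translation, and the function names and counts are invented for
 illustration.

```python
# Toy sketch of domain bias: the "translation" of a word is just the most
# frequent target word it was paired with in the training data.
# The word pairs and their counts below are made up for illustration.
from collections import Counter, defaultdict


def train_lexicon(aligned_pairs):
    """Count how often each source word is paired with each target word."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    return counts


def translate_word(counts, word):
    """Return the most frequently seen translation, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


# Hypothetical EU-style corpus: "terve" only ever shows up in a health sense.
eu_like = [("terve", "health")] * 5
# Hypothetical broader corpus that also contains greetings.
mixed = eu_like + [("terve", "hello")] * 8

print(translate_word(train_lexicon(eu_like), "terve"))
print(translate_word(train_lexicon(mixed), "terve"))
```

 With only the EU-style pairs the model has never seen the greeting sense,
 so it cannot produce it no matter how much of that data it gets.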
