On 08/07/2010 02:23 AM, Andreas Kolbe wrote:
Word-processing the Google output to arrive at a readable, written text creates more work than it saves.
This is where our experience differs. I'm working faster with the Google Translator Toolkit than without.
If Google want to build up their translation memory, I suggest they pay publishers for permission to analyse existing, published translations, and read those into their memory. This will give them a database of translations that the market judged good enough to publish, written by people who (presumably) understood the subject matter they were working in.
If we forget Google for a while, this is actually something we could do on our own. There are enough texts in Wikisource (out-of-copyright books) that are available in more than one language. In some cases we will run into old spelling and usage, but that is better than nothing. The result could be good input for Wiktionary.
Here is the Norwegian original of Nansen's Eskimoliv, http://no.wikisource.org/wiki/Indeks:Nansen-Eskimoliv.djvu
And here is the Swedish translation, both from 1891, http://sv.wikisource.org/wiki/Index:Eskim%C3%A5lif.djvu
Norwegian: Grønland er paa en eiendommelig vis knyttet til vort land og folk.
Swedish: Grönland är på ett egendomligt sätt knutet till vårt land och vårt folk.
As you can see, there is one difference already in this first sentence: The original ends "to our country and people", while the translation ends "to our country and our people".
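For what it's worth, the standard first step here is length-based sentence alignment in the spirit of Gale & Church (1993): translated sentences tend to have proportional lengths, so a dynamic program over character counts recovers the pairing even when the wording drifts. A rough sketch in Python (the function name and the flat skip penalty are my own simplifications, not taken from any existing tool):

    def align_sentences(src, tgt):
        """Align two lists of sentences; return a list of
        (source indices, target indices) groups."""
        INF = float("inf")
        n, m = len(src), len(tgt)

        def cost(a, b):
            # Zero for equal character counts, growing with mismatch.
            return abs(a - b) / float(max(a, b, 1))

        # dp[i][j] = cheapest alignment of src[:i] with tgt[:j]
        dp = [[INF] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = 0.0
        # Allowed moves: 1-1 match, 1-0 / 0-1 skip, 2-1 / 1-2 merge.
        moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
        SKIP = 1.0  # flat penalty for a sentence with no counterpart

        for i in range(n + 1):
            for j in range(m + 1):
                if dp[i][j] == INF:
                    continue
                for di, dj in moves:
                    ni, nj = i + di, j + dj
                    if ni > n or nj > m:
                        continue
                    a = sum(len(s) for s in src[i:ni])
                    b = sum(len(s) for s in tgt[j:nj])
                    c = SKIP if di == 0 or dj == 0 else cost(a, b)
                    if dp[i][j] + c < dp[ni][nj]:
                        dp[ni][nj] = dp[i][j] + c
                        back[ni][nj] = (i, j)

        # Walk back from the end to recover the grouping.
        groups, i, j = [], n, m
        while (i, j) != (0, 0):
            pi, pj = back[i][j]
            groups.append((list(range(pi, i)), list(range(pj, j))))
            i, j = pi, pj
        return list(reversed(groups))

Fed the two Nansen chapters, the extra "vårt" in the Swedish sentence above just makes that 1-1 match slightly more expensive; it does not break the alignment.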
Is there any good free software for aligning parallel texts and extracting translations? Looking around, I found NAtools, TagAligner, and Bitextor, but they require texts to be marked up already. Are these the best and most modern tools available?
On Sun, Aug 8, 2010 at 2:10 PM, Lars Aronsson lars@aronsson.se wrote:
... Is there any good free software for aligning parallel texts and extracting translations? Looking around, I found NAtools, TagAligner, and Bitextor, but they require texts to be marked up already. Are these the best and most modern tools available?
There is a MediaWiki extension which is supposed to provide this:
http://wikisource.org/wiki/Wikisource:DoubleWiki_Extension
It is enabled on all wikisource subdomains.
http://en.wikisource.org/wiki/Crito?match=el
It doesn't work very well because our Wikisource projects have different layouts, especially templates such as the header on each page.
-- John Vandenberg
On August 9, John Vandenberg wrote:
On Sun, Aug 8, 2010 at 2:10 PM, Lars Aronsson lars@aronsson.se wrote:
Is there any good free software for aligning parallel texts and extracting translations? Looking around, I found NAtools, TagAligner, and Bitextor, but they require texts to be marked up already. Are these the best and most modern tools available?
there is a Mediawiki extension which is supposed to provide this: http://wikisource.org/wiki/Wikisource:DoubleWiki_Extension
It is enabled on all wikisource subdomains. http://en.wikisource.org/wiki/Crito?match=el
This is a wonderful feature I didn't know about until now, but it is not what I'm looking for. In computational linguistics and natural language processing (NLP), a "text aligner" is a piece of software that identifies which words and phrases in a text correspond to which in its translation. The input is a pair of texts (an original and its translation) and the output is a dictionary. It's like a more advanced "diff" tool.
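To make that concrete, here is a toy version of the dictionary-extraction step (not how NAtools or the other tools actually work, just an illustration of the shape of the task): count which words co-occur across aligned sentence pairs and keep the strongest associations.

    from collections import Counter
    from itertools import product

    def extract_dictionary(sentence_pairs, min_count=2):
        """Score (source word, target word) pairs by how often
        they co-occur in aligned sentences, using the Dice
        coefficient. Real aligners (e.g. the IBM models) do far
        better; this only sketches the input/output contract."""
        src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
        for src, tgt in sentence_pairs:
            s_words = set(src.lower().split())
            t_words = set(tgt.lower().split())
            src_count.update(s_words)
            tgt_count.update(t_words)
            pair_count.update(product(s_words, t_words))

        entries = []
        for (s, t), c in pair_count.items():
            if c < min_count:
                continue
            dice = 2.0 * c / (src_count[s] + tgt_count[t])
            entries.append((dice, s, t))
        return sorted(entries, reverse=True)

With enough sentence pairs from the Nansen texts, entries like "land" : "land" and "folk" : "folk" float to the top, which is exactly the raw material Wiktionary could use.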
This extension is not working well. It requires users to manually insert tags in the text, which the extension then uses to align the texts.
This approach has failed, because:
* adding tags to the text is difficult;
* the method requires coordination between subdomains, which is difficult to obtain, as you can see here: http://en.wikisource.org/wiki/Crito?match=it
* the tags are often deleted because they are not self-explanatory enough;
* the alignment is sensitive to text formatting: since most users do not know how the extension works, they destroy the alignment when they modify a page.
So I guess it would be better to remove all the alignment code from this extension and use an automated method instead. A text aligner, as you mention, could run on the toolserver and be called via Ajax. Are there good free software text aligners?
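To sketch what I mean (the /align URL, its parameters and the align() stub are hypothetical, not an existing toolserver service):

    import json
    from urllib.parse import urlparse, parse_qs
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def align(text_a, text_b):
        # Placeholder: plug in a real aligner here. Returns a
        # list of (offset in a, offset in b) sync points.
        return [(0, 0)]

    class AlignHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Hypothetical API: /align?a=<text>&b=<text>
            q = parse_qs(urlparse(self.path).query)
            points = align(q.get("a", [""])[0], q.get("b", [""])[0])
            body = json.dumps({"syncpoints": points}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8000), AlignHandler).serve_forever()

The DoubleWiki gadget would then only need to fetch the JSON and scroll the two columns to the matching sync points.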
Thomas
On Tue, Aug 17, 2010 at 9:34 PM, thomasV1@gmx.de wrote:
This is a wonderful feature I didn't know about until now, but it is not what I'm looking for. In computational linguistics and natural language processing (NLP), a "text aligner" is a piece of software that identifies which words and phrases in a text correspond to which in its translation. The input is a pair of texts (an original and its translation) and the output is a dictionary. It's like a more advanced "diff" tool.
This extension is not working well. It requires users to manually insert tags in the text, which the extension then uses to align the texts.
This approach has failed, because:
* adding tags to the text is difficult;
* the method requires coordination between subdomains, which is difficult to obtain, as you can see here: http://en.wikisource.org/wiki/Crito?match=it
I think the DoubleWiki extension needs to ignore UI blocks in the content of the page, such as the header and footer templates, which would mean fewer non-textual differences between sub-domains.
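A cheap first step would be to strip template transclusions from the wikitext before the pages are compared, so headers and footers no longer show up as spurious differences. A naive sketch, assuming we only care about {{...}} blocks (real wikitext has corner cases that need a proper parser):

    def strip_templates(wikitext):
        """Remove {{...}} transclusions, handling nesting naively."""
        out, depth, i = [], 0, 0
        while i < len(wikitext):
            if wikitext.startswith("{{", i):
                depth += 1
                i += 2
            elif wikitext.startswith("}}", i) and depth > 0:
                depth -= 1
                i += 2
            else:
                if depth == 0:
                    out.append(wikitext[i])
                i += 1
        return "".join(out)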
The lack of coordination is 'fixable', especially if there is some grand goal we share.
Maybe we need to start a meta project to get this working: pick a few texts which are available in a few languages, and focus our attention on getting one text that can be used as a demo for others to follow.
* the tags are often deleted because they are not self-explanatory enough;
* the alignment is sensitive to text formatting: since most users do not know how the extension works, they destroy the alignment when they modify a page.
We may be able to automate the sync points in mainspace by adding sync points in the footers of the page namespace, where they are less susceptible to breakage, or in the body of the page namespace for precise sync points.
http://it.wikisource.org/wiki/Pagina:Critone.djvu/27?match=fr http://fr.wikisource.org/wiki/Page:Platon_-_%C5%92uvres,_trad._Cousin,_I_et_...
I expect that having hand-coded alignment at the page level will help any free 'automatic' tools, as they will have smaller chunks to work with, and any errors will be limited to a few paragraphs.
So I guess it would be better to remove all the alignment code from this extension and use an automated method instead. A text aligner, as you mention, could run on the toolserver and be called via Ajax. Are there good free software text aligners?
Lars mentioned three, which are all sf.net projects: NAtools, TagAligner, and Bitextor.
The last one looks really useful for our purposes:
http://bitextor.sourceforge.net/
If we could ask it to index all Wikisource sub-domains at once, and it can guess which pages are translations across the sub-domains, it could be fairly autonomous, and might even help us find translations that are not linked via interwikis.
-- John Vandenberg
John Vandenberg wrote:
On Tue, Aug 17, 2010 at 9:34 PM, thomasV1@gmx.de wrote:
The lack of coordination is 'fixable', especially if there is some grand goal we share.
It depends how much coordination is required by the software. The amount of coordination currently required by DoubleWiki is too high; we need to change that first. The grand goal has always been here.
We may be able to automate the sync points in mainspace by adding sync points in the footers of the page namespace, where they are less susceptible to breakage, or in the body of the page namespace for precise sync points.
http://it.wikisource.org/wiki/Pagina:Critone.djvu/27?match=fr http://fr.wikisource.org/wiki/Page:Platon_-_%C5%92uvres,_trad._Cousin,_I_et_...
I expect that having hand-coded alignment at the page level will help any free 'automatic' tools, as they will have smaller chunks to work with, and any errors will be limited to a few paragraphs.
I agree that manually adding sync points will be needed; however, I do not think that sync points should be inserted directly into the text. That approach has already failed: it is too complicated, because users need to search the text for the existing sync points.
What I had in mind is a single tag hook (or a hidden div) that centralizes all the sync points related to a given text. It would look like this:
<alignment target="fr">
  "Why have you come at this hour, Crito" : "Pourquoi déjà venu, Criton"
  "Why, indeed, Socrates" : "Par Jupiter ! Je m'en serais bien gardé"
  "There can be no doubt about the meaning Crito, I think." : "Le sens est très clair, à ce qu'il me semble, Criton."
</alignment>
This would centralize all the information needed for text alignment on the page being matched. It would thus avoid having users mess around with dozens of pages in the Page namespace, and prevent interference with the validation process that takes place there.
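Parsing such a block would be trivial for the tag hook; a sketch of the extraction step (the tag itself is of course only a proposal at this point):

    import re

    # One  "source phrase" : "target phrase"  pair in the body
    # of the proposed <alignment target="..."> tag.
    PAIR = re.compile(r'"([^"]+)"\s*:\s*"([^"]+)"')

    def parse_alignment(body):
        """Return the list of (source, target) sync phrases."""
        return PAIR.findall(body)

    # parse_alignment('"Why have you come at this hour, Crito" : '
    #                 '"Pourquoi déjà venu, Criton"')
    # -> [('Why have you come at this hour, Crito',
    #      'Pourquoi déjà venu, Criton')]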
This centralized approach works; we already use it at fr.ws for the modernisation of old French texts. Instead of letting users mess with pages in the Page namespace, we use a modernisation dictionary, complemented by a hidden div placed on the ns0 page to be modernised, where users translate expressions that cannot be added to the main dictionary because their translation is context-dependent. The result is that all the work related to modernisation takes place on a single page, so it is much easier to manage and it does not interfere with proofreading.
Thomas