On Tue, Aug 17, 2010 at 9:34 PM, <thomasV1(a)gmx.de> wrote:
This is a wonderful feature I didn't know about until now.
But it is not what I was looking for. In computational
linguistics and natural language processing (NLP), a "text
aligner" is a piece of software that identifies which words
and phrases in a text correspond to which in its translation.
The input is a text together with its translation, and the
output is a bilingual dictionary. It's like a more advanced
"diff" tool.
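To make the "text in, dictionary out" idea concrete, here is a toy sketch. It assumes the two texts have already been split into matching sentence pairs, and it scores word pairs with a simple co-occurrence (Dice) measure. The function name and the sample sentences are made up for illustration; real aligners use much stronger statistical models.

```python
from collections import Counter
from itertools import product


def toy_dictionary(sentence_pairs):
    """Map each source word to the target word it co-occurs with most.

    sentence_pairs: list of (source_sentence, target_sentence) strings.
    This is an illustrative heuristic, not the method of any real aligner.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))

    best = {}  # source word -> (score, target word)
    for (s, t), n in pair_count.items():
        # Dice coefficient: 1.0 when the two words always appear together.
        score = 2 * n / (src_count[s] + tgt_count[t])
        if score > best.get(s, (0.0, ""))[0]:
            best[s] = (score, t)
    return {s: t for s, (_, t) in best.items()}


pairs = [("the cat", "le chat"), ("the dog", "le chien"), ("a cat", "un chat")]
print(toy_dictionary(pairs))
```

With only three sentence pairs the counts are tiny, but the shape of the problem is visible: aligned text goes in and a word-to-word mapping comes out.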
This extension is not working well. It requires users
to manually insert tags in the text, which the extension
then uses to align it.
This approach has failed, because:
*adding tags to the text is difficult.
*the method requires coordination between subdomains.
This is difficult to obtain, as you can see here:
http://en.wikisource.org/wiki/Crito?match=it
I think the doublewiki extension needs to ignore UI blocks in the
content of the page, such as the header & footer templates, which would
mean that there are fewer non-textual differences between sub-domains.
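A rough sketch of what "ignoring UI blocks" could look like, assuming the comparison is done on raw wikitext and that the UI blocks are all templates written as {{...}}. I have not checked how the extension actually sees the page, and strip_templates is a hypothetical helper, not part of doublewiki. A real version would probably drop only a known list of header and footer templates rather than every template.

```python
def strip_templates(wikitext):
    """Remove every {{...}} template call, including nested ones.

    Assumption: UI blocks such as headers and footers are templates,
    and nothing inside a template is part of the text to be aligned.
    """
    out, depth, i = [], 0, 0
    while i < len(wikitext):
        two = wikitext[i:i + 2]
        if two == "{{":
            depth += 1          # entering a (possibly nested) template
            i += 2
        elif two == "}}" and depth > 0:
            depth -= 1          # leaving the innermost open template
            i += 2
        else:
            if depth == 0:      # keep only characters outside templates
                out.append(wikitext[i])
            i += 1
    return "".join(out).strip()


page = "{{header|title={{PAGENAME}}}}\nSocrates speaks.\n{{footer}}"
print(strip_templates(page))
```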
The lack of coordination is 'fixable', especially if there is some
grand goal we share.
Maybe we need to start a meta project to get this working, pick a few
texts which are available in a few languages, and focus our attention
on getting one text that can be used as a demo for others to follow.
*the tags are often deleted because they are not
self-explanatory enough
*the alignment is sensitive to text formatting. Since
most users do not know how the extension works, they
destroy the alignment when they modify a page.
We may be able to automate the sync points in mainspace by adding sync
points in the footers of the page namespace, where they are less
susceptible to breakage, or in the body of the page namespace for
precise sync points.
http://it.wikisource.org/wiki/Pagina:Critone.djvu/27?match=fr
http://fr.wikisource.org/wiki/Page:Platon_-_%C5%92uvres,_trad._Cousin,_I_et…
I expect that having hand-coded alignment at the page level will help
any free 'automatic' tools, as they will have smaller chunks to work
with, and any errors will be limited to a few paragraphs.
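As a sketch of why page-level anchors help: if each language version is stored as a mapping from a hand-coded page anchor to the paragraphs on that page, the automatic step only ever runs inside one page. The data layout and the function name are assumptions for illustration, and pairing paragraphs by position is a placeholder for whatever real aligner would be used.

```python
from itertools import zip_longest


def align_by_page(pages_a, pages_b):
    """Align paragraphs only within pages that share a hand-coded anchor.

    pages_a, pages_b: dict mapping an anchor (e.g. a page number) to a
    list of paragraphs. Paragraphs are paired by position as a stand-in
    for a real aligner; a missing partner is shown as an empty string.
    """
    aligned = []
    for anchor, paras_a in pages_a.items():
        if anchor not in pages_b:
            continue  # no sync point on the other side, skip this page
        for pa, pb in zip_longest(paras_a, pages_b[anchor], fillvalue=""):
            aligned.append((anchor, pa, pb))
    return aligned


it_pages = {"27": ["a1", "a2"], "28": ["a3"]}
fr_pages = {"27": ["b1"], "28": ["b3"]}
for row in align_by_page(it_pages, fr_pages):
    print(row)
```

In this example page 27 has a paragraph with no partner, but page 28 still lines up correctly, because the mismatch cannot spill past the next anchor.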
So I guess it would be better to remove all the alignment
code from this extension, and to use an automated method
for that. A text aligner, as you mention, could run
on the toolserver and be called using ajax. Are there good
free software text aligners?
Lars mentioned three which are all sf.net projects:
NAtools, TagAligner, and Bitextor
The last one looks really useful for our purposes:
http://bitextor.sourceforge.net/
If we could ask it to index all wikisource sub-domains at once, and it
can guess which pages are translations across the sub-domains, it may
be able to run fairly autonomously, and may even help us find
translations which are not linked via interwikis.
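To give a feel for what "guessing which pages are translations" might involve, here is a deliberately crude stand-in: compare pages by paragraph count and relative paragraph lengths. This is not a description of how Bitextor works; the function names, the scoring and the sample titles are all assumptions for illustration.

```python
def length_profile_score(paras_a, paras_b):
    """Score in [0, 1] for how similar two pages are in paragraph lengths.

    Returns 0.0 when the paragraph counts differ or a page is empty.
    """
    if not paras_a or len(paras_a) != len(paras_b):
        return 0.0
    ratios = [min(len(a), len(b)) / max(len(a), len(b), 1)
              for a, b in zip(paras_a, paras_b)]
    return sum(ratios) / len(ratios)


def guess_translation(page, candidates):
    """Return the title of the best-scoring candidate, or None if all score 0.

    page: list of paragraphs; candidates: dict of title -> list of paragraphs.
    """
    best_title, best_score = None, 0.0
    for title, paras in candidates.items():
        score = length_profile_score(page, paras)
        if score > best_score:
            best_title, best_score = title, score
    return best_title


page = ["aaaa", "bbbbbbbb"]
candidates = {
    "Criton": ["cccc", "dddddddd"],
    "Phedon": ["e" * 20, "f"],
    "Other": ["x"],
}
print(guess_translation(page, candidates))
```

A heuristic this simple would produce many false matches on a full index, so it would only be useful for proposing candidates for a person to confirm.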
--
John Vandenberg