On Tue, Aug 17, 2010 at 9:34 PM, thomasV1@gmx.de wrote:
This is a wonderful feature I didn't know about until now. But it is not what I was looking for. In computational linguistics and natural language processing (NLP), a "text aligner" is a piece of software that identifies which words and phrases in a text correspond to which in its translation. The input is a translated text and the output is a dictionary. It's like a more advanced "diff" tool.
This extension is not working well. It requires users to manually insert tags into the text, which the extension then uses to align it.
This approach has failed, because:
*adding tags to the text is difficult.
*the method requires coordination between subdomains. This is difficult to obtain, as you can see here: http://en.wikisource.org/wiki/Crito?match=it
I think the DoubleWiki extension needs to ignore UI blocks in the content of the page, such as the header and footer templates, so that there are fewer non-textual differences between sub-domains.
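To make the idea of ignoring UI blocks concrete, here is a minimal Python sketch that drops selected template calls from wikitext before two pages are compared. The function name `strip_templates` and the template names `header` and `footer` are assumptions for the example, and this is not how the extension itself is implemented.

```python
def strip_templates(wikitext, names):
    """Remove {{name ...}} template calls whose name is listed in `names`.

    Template names are compared case-insensitively. Nested templates inside
    a removed call are removed along with it.
    """
    wanted = {n.lower() for n in names}
    out = []
    i = 0
    while i < len(wikitext):
        if wikitext.startswith("{{", i):
            # Scan forward to the matching "}}", counting nested templates.
            depth = 0
            j = i
            while j < len(wikitext):
                if wikitext.startswith("{{", j):
                    depth += 1
                    j += 2
                elif wikitext.startswith("}}", j):
                    depth -= 1
                    j += 2
                    if depth == 0:
                        break
                else:
                    j += 1
            if depth == 0:
                body = wikitext[i + 2 : j - 2]
                name = body.split("|", 1)[0].strip().lower()
                if name in wanted:
                    i = j  # skip the whole template call
                    continue
            # Unbalanced braces or a template we keep: copy "{{" and go on.
            out.append("{{")
            i += 2
            continue
        out.append(wikitext[i])
        i += 1
    return "".join(out)


# Hypothetical page content, for illustration only.
page = "{{header|title=Crito}}\nSocrates: Why have you come?\n{{footer}}"
text_only = strip_templates(page, ["header", "footer"])
```

Running both language versions through a filter like this before comparison would leave only the text itself to be aligned.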
The lack of coordination is 'fixable', especially if there is some grand goal we share.
Maybe we need to start a meta project to get this working: pick a few texts which are available in a few languages, and focus our attention on getting one text that can be used as a demo for others to follow.
*the tags are often deleted because they are not self-explanatory enough
*the alignment is sensitive to text formatting. Since most users do not know how the extension works, they destroy the alignment when they modify a page.
We may be able to automate the sync points in mainspace by adding sync points in the footers of the page namespace, where they are less susceptible to breakage, or in the body of the page namespace for precise sync points.
http://it.wikisource.org/wiki/Pagina:Critone.djvu/27?match=fr
http://fr.wikisource.org/wiki/Page:Platon_-_%C5%92uvres,_trad._Cousin,_I_et_...
I expect that having hand-coded alignment at the page level will help any free 'automatic' tools, as they will have smaller chunks to work with, and any errors will be limited to a few paragraphs.
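A rough Python sketch of why smaller chunks help: if pages are already paired by hand, an automatic aligner only has to match paragraphs inside each page pair, so a mistake cannot spread beyond that page. The aligner below is a deliberately simple length-based dynamic program, not one of the existing tools; the function names, the `skip_cost` value, and the assumption that a paragraph and its translation have similar lengths are all mine.

```python
def align_paragraphs(src, dst, skip_cost=0.5):
    """Align two paragraph lists by character length.

    Returns a list of (src_index, dst_index) pairs in order; None marks a
    paragraph that has no partner on the other side.
    """
    def match_cost(a, b):
        # 0.0 for equal lengths, approaching 1.0 as the lengths diverge.
        total = len(a) + len(b)
        return abs(len(a) - len(b)) / total if total else 0.0

    n, m = len(src), len(dst)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i and j:  # pair src[i-1] with dst[j-1]
                step = match_cost(src[i - 1], dst[j - 1])
                candidates.append((cost[i - 1][j - 1] + step, (i - 1, j - 1)))
            if i:        # leave src[i-1] unpaired
                candidates.append((cost[i - 1][j] + skip_cost, (i - 1, j)))
            if j:        # leave dst[j-1] unpaired
                candidates.append((cost[i][j - 1] + skip_cost, (i, j - 1)))
            for c, prev in candidates:
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, prev

    # Walk back from the end to recover the chosen pairs.
    pairs = []
    i, j = n, m
    while i or j:
        pi, pj = back[i][j]
        pairs.append((i - 1 if pi < i else None, j - 1 if pj < j else None))
        i, j = pi, pj
    pairs.reverse()
    return pairs


def align_by_page(src_pages, dst_pages):
    """Align each hand-matched page pair independently of the others."""
    return [align_paragraphs(s, d) for s, d in zip(src_pages, dst_pages)]


# Made-up paragraphs: the middle one is missing from the translation.
example = align_paragraphs(["a" * 10, "b" * 30, "c" * 50],
                           ["x" * 10, "z" * 50])
```

Because `align_by_page` restarts the alignment at every page boundary, a paragraph missing from one translation can only disturb the pairs on its own page.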
So I guess it would be better to remove all the alignment code from this extension, and to use an automated method for that. A text aligner, as you mention, could be running on the toolserver and called using ajax. Are there good free software text aligners?
Lars mentioned three, which are all sf.net projects: NAtools, TagAligner, and Bitextor.
The last one looks really useful for our purposes:
http://bitextor.sourceforge.net/
If we could ask it to index all wikisource sub-domains at once, and it can guess which pages are translations of each other across the sub-domains, it could run fairly autonomously, and may even help us find translations which are not linked via interwikis.
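I have not checked how Bitextor actually decides that two pages are translations, so the following Python sketch only illustrates the general idea with a crude heuristic of my own: pages whose paragraph counts and total sizes are similar are proposed as candidate pairs. The function names, the threshold, and the sample data are all hypothetical.

```python
def structure_similarity(a, b):
    """Score in [0, 1] for how alike two pages are in paragraph count and size."""
    if not a or not b:
        return 0.0
    count_ratio = min(len(a), len(b)) / max(len(a), len(b))
    size_a = sum(len(p) for p in a)
    size_b = sum(len(p) for p in b)
    largest = max(size_a, size_b)
    size_ratio = min(size_a, size_b) / largest if largest else 1.0
    return count_ratio * size_ratio


def guess_translations(pages_a, pages_b, threshold=0.8):
    """Map each title in pages_a to its best-scoring title in pages_b.

    Both arguments map a page title to its list of paragraphs. Titles whose
    best score is below `threshold` are left out of the result.
    """
    guesses = {}
    for title_a, paras_a in pages_a.items():
        best_title, best_score = None, 0.0
        for title_b, paras_b in pages_b.items():
            score = structure_similarity(paras_a, paras_b)
            if score > best_score:
                best_title, best_score = title_b, score
        if best_title is not None and best_score >= threshold:
            guesses[title_a] = best_title
    return guesses


# Made-up pages, for illustration only.
pages_it = {"Critone": ["a" * 100, "b" * 200]}
pages_fr = {"Criton": ["x" * 110, "y" * 190], "Phedon": ["z" * 50] * 10}
candidates = guess_translations(pages_it, pages_fr)
```

A heuristic this simple would produce many false matches on a real index, so its guesses would need to be checked against existing interwiki links or reviewed by hand.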
-- John Vandenberg