Hi,

I am working on the mass migration tools project as part of Google Summer of Code. One part of the project is to import old translations into the Translate extension.

We have finished a basic import that splits the old pages on double newlines (\n\n) and does some further alignment based on h2 headers. We are now looking at improving the alignment.
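
To make the current approach concrete, here is a minimal Python sketch of the idea. It is only an illustration, not our actual import code: the names split_units and group_by_h2 are hypothetical, and it assumes h2 headers are written as "== Title ==" on their own line.

```python
import re

# Assumption: a level-2 heading looks like "== Title ==" (exactly two '=' to open).
H2_RE = re.compile(r"^==[^=].*?==\s*$")


def split_units(wikitext):
    """Split page text into units on blank lines (two or more newlines)."""
    return [u.strip() for u in re.split(r"\n{2,}", wikitext) if u.strip()]


def group_by_h2(units):
    """Group units into sections, opening a new section at each h2 heading."""
    sections = [[]]
    for unit in units:
        first_line = unit.splitlines()[0]
        if H2_RE.match(first_line):
            sections.append([])
        sections[-1].append(unit)
    # Drop the leading section if the page starts with a heading.
    return [s for s in sections if s]
```

Sections from the source page and the old translation can then be paired up in order, so that a mismatch inside one section does not shift every unit after it.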

Has any work already been done on the subject mentioned? For each unit, what I would like to do is strip out all the linguistic content and leave only the bare markup. I could then compare the markup of the source and target units and align them accordingly.
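
A rough Python sketch of what I have in mind follows. Everything in it is an assumption for illustration: the token list covers only a few common wikitext constructs, the names skeleton and similarity are hypothetical, and difflib from the standard library is just one possible way to compare the two token sequences.

```python
import re
from difflib import SequenceMatcher

# Assumed, incomplete list of wikitext markup tokens to keep.
MARKUP_TOKENS = [
    r"^={2,6}",            # opening heading delimiter
    r"={2,6}[ \t]*$",      # closing heading delimiter
    r"^[*#:;]+",           # list and indent markers at line start
    r"\[\[|\]\]",          # internal link brackets
    r"\{\{|\}\}",          # template braces
    r"'{2,5}",             # italic and bold quotes
    r"</?[A-Za-z][^>]*>",  # HTML-like tags
]
MARKUP_RE = re.compile("|".join(MARKUP_TOKENS), re.MULTILINE)


def skeleton(unit):
    """Return the unit's markup tokens in order, with all other text dropped."""
    return [m.group(0).strip() for m in MARKUP_RE.finditer(unit)]


def similarity(source_unit, target_unit):
    """Score from 0.0 to 1.0 for how closely the two markup skeletons match."""
    return SequenceMatcher(None, skeleton(source_unit), skeleton(target_unit)).ratio()
```

For example, similarity("== Introduction ==", "== Einführung ==") gives 1.0 because both skeletons are two heading delimiters. Two plain paragraphs with no markup also score 1.0, so a score like this could only supplement positional alignment, not replace it.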

Are there any APIs available that already do this? Any guidance on how to accomplish this task would be appreciated.

--
Warm Regards,
Pratik Lahoti
GSoC Intern | Wikimedia
User:BPositive