Dear Gerard, Scott, Lucie and Amir & everyone,
Thank you for the helpful responses!
Gerard - great to hear about your work, and thank you for the reference on the Cebuano Wikipedia. We weren't familiar with that case, but had similar fears. We are putting our machine-translated content on a separate site (Wikibabel) rather than on Wikipedia, and we never expect it to rank nearly as high in search engine results, so the native content will always take precedence.
Scott - for the time being, the Wikibabel experiment will not have a Wikidata or Lexicographical portion; for now this is an experiment to see whether machine-translated content in certain categories is useful given the current state of machine translation (and we are doing this on a separate site in order to avoid any contamination). If we measure good user engagement and survey results, we'd love to think about how to integrate this better. We'd eventually love to offer an editing interface on top of the translation, alongside existing native content if it exists but is less detailed. If this editing interface becomes popular, we'll start accumulating a dataset which may be useful to machine translation services. We are not, however, developing anything novel in the machine translation algorithm space.
Lucie - ah, that's what this is! We noticed the large number of recent one-sentence articles and were wondering what project that was. Those are awesome! Both in terms of information availability and because they allow us to measure relative interest within the generated set. I would love to discuss your plans to expand beyond the introductory sentence, and whether we can be helpful in any way. Thank you for the publication links as well.
Amir - thank you for the pointers to these projects. Your two points of feedback, if I understand them correctly, are:
1. Machine translation might not be good enough to yield useful information.
2. People can translate the pages for free.
Those are excellent points that we thought about deeply before starting the project (though we have recently observed much higher translation quality than you perhaps have). Here's what we think:
1. This is exactly the central question of our experiment, and it is very much still open. Machine translation (or at least Google Translate) has improved significantly in the last 12 months or so. The quality of the translation, particularly when there is context (longer sentences), has improved by leaps and bounds. For the Wikibabel project, we spot-checked with Swahili speakers: some pages translate very well (not perfect human level, but very understandable with a few awkward turns of phrase) and some are bad enough not to be useful. Given how little information there is on the Internet in Swahili, particularly on technical topics (which are in some ways easier to translate), and that there aren't many participants in the Swahili Wikipedia, we hypothesize that the best X% of translations would be useful, and that we can distinguish well-translated pages from poorly translated ones using page analytics and surveys.
We are, however, careful to host this on a separate site (Wikibabel) rather than checking any of it into Wikipedia, because, as you mentioned, that would be misleading.
2. That is absolutely true, and we're fundamentally solving a discoverability problem. If a Swahili speaker currently searches for a term in Swahili, they will not land on the English results for it. They will get some potentially bad results in their language and may give up. In general, they would need to know about Wikipedia (or some other good source), know that its coverage is better in English, and know that Google Translate exists. Some of the folks we're targeting are fairly new to the internet, so this is not a low bar.
Olya & the Wikibabel crew