Hello, I would like to participate in GSoC this year for the first time, but I am a little worried about choosing an idea. I have one and I am not sure whether it suits this program. I would be very glad if you could take a quick look at my idea and share your thoughts. I will be happy with any feedback. Thank you.
Project Idea
What is the purpose?
Help people read complex texts by providing inline translations for unknown words. As a non-native-English-speaking student, I sometimes find it hard to read complicated texts or articles, which is why I have to search for a translation or definition every time. Why not simplify this and change the flow from "translate and understand" to "translate, learn and understand"?
How will the inline translation appear?
While reading an article, a user may come across unknown words or words whose meaning is confusing. At that point they click on the word and the inline translation appears.
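A very rough sketch of the widget side (everything here is illustrative and nothing exists yet; "/define" stands for whatever lookup service ends up behind it):

  // Very rough sketch (all names are placeholders, nothing exists yet).
  // On double click, look the selected word up and show it inline.
  document.addEventListener('dblclick', function () {
    var word = window.getSelection().toString().trim();
    if (!word) {
      return;
    }
    var xhr = new XMLHttpRequest();
    // "/define" is a hypothetical endpoint of the planned lookup service.
    xhr.open('GET', '/define?word=' + encodeURIComponent(word));
    xhr.onload = function () {
      var data = JSON.parse(xhr.responseText);
      var tip = document.createElement('div');
      tip.className = 'inline-translation';
      tip.textContent = data.definitions.join('; ');
      document.body.appendChild(tip);
      // Positioning the tip next to the clicked word is left out here.
    };
    xhr.send();
  });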
What should be included in the inline translation?
It should not be just a translator: it should include not one translation but a couple or more. More data, such as synonyms, could also be included; the details can be discussed during the project.
From which source should the data be gathered?
Wiktionary is the best candidate: it is open content and it has a wide database. It is also a good fit for growing the project by adding more languages.
Implementation options
There are two approaches in my mind right now. The first is to build a website on Node.js with an open API for users. Parsoid, which works well with Node, could be used for parsing data from the Wiktionary API. A small JavaScript widget is also needed for the front-end.
The second is to build a standalone library that could be used on other sites as an add-on or in browser extensions. Unfortunately, this option is less clear to me at this point.
Growth opportunities
I am living in Finland right now and I do not know Finnish as well as I should to understand the locals, so this project can be expanded by adding support for more languages, helping people like me read, learn and understand texts in foreign languages.
On 2/28/14, Roman Zaynetdinov romanznet@gmail.com wrote:
Interesting.
I actually did something kind of like this a long time ago, where the user could double-click on a word and the definition would pop up from Wiktionary. (The thing I made was very hacky and icky, and stopped working quite some time ago. Some people might like to have a similar tool, but a version that doesn't suck.) You can see a screenshot at https://meta.wikimedia.org/wiki/Wiktionary/Look_Up_tool
Parsoid, which works well with Node, could be used for parsing data from the Wiktionary API.
Just as a warning: parsing Wiktionary data into a usable form is a lot harder than it looks, so don't underestimate this step. (Or at least it was several years ago when I last tried.)
--bawolff
Hi Roman!
On 02/28/2014 01:24 AM, Brian Wolff wrote:
On 2/28/14, Roman Zaynetdinov romanznet@gmail.com wrote:
Help people read complex texts by providing inline translations for unknown words. As a non-native-English-speaking student, I sometimes find it hard to read complicated texts or articles, which is why I have to search for a translation or definition every time. Why not simplify this and change the flow from "translate and understand" to "translate, learn and understand"?
This sounds like a great idea.
There are two approaches in my mind right now. The first is to build a website on Node.js with an open API for users. Parsoid, which works well with Node, could be used for parsing data from the Wiktionary API. A small JavaScript widget is also needed for the front-end.
You could basically write a Node service that pulls in the Parsoid HTML for a given Wiktionary term, extracts the info you need from the DOM, and returns it in a JSON response to a client-side library. Alternatively (or as a first step), you could download the Parsoid HTML of the Wiktionary article on the client and extract the info there. This could even be implemented as a gadget. We recently set liberal CORS headers to make this easy.
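Just to make that concrete, a minimal sketch of such a service (not working code; cheerio is one possible server-side HTML parser, and the 'ol > li' selector is a guess that would need per-wiki tuning):

  // Sketch: fetch the Parsoid HTML for a Wiktionary page, pull out the
  // definition list items, and answer with JSON.
  var http = require('http');
  var request = require('request');
  var cheerio = require('cheerio');

  http.createServer(function (req, res) {
    var term = decodeURIComponent(req.url.slice(1));
    var url = 'http://parsoid-lb.eqiad.wikimedia.org/enwiktionary/' +
      encodeURIComponent(term);
    request(url, function (err, response, body) {
      if (err || response.statusCode !== 200) {
        res.writeHead(502);
        return res.end();
      }
      var $ = cheerio.load(body);
      var definitions = [];
      // Definitions are rendered as ordered-list items; this selector
      // is approximate, and per-wiki differences will need handling.
      $('ol > li').each(function () {
        definitions.push($(this).text().trim());
      });
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ term: term, definitions: definitions }));
    });
  }).listen(8000);

The client-side variant would be the same extraction run in the browser against Parsoid HTML fetched via CORS.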
Parsoid, which works well with Node, could be used for parsing data from the Wiktionary API.
Just as a warning: parsing Wiktionary data into a usable form is a lot harder than it looks, so don't underestimate this step. (Or at least it was several years ago when I last tried.)
The Parsoid rendering (e.g. [1]) has pretty much all semantic information in the DOM. There might still be Wiktionary-specific issues that we don't know about yet, but tasks like extracting template parameters or the rendering of specific templates (IPA, ...) are already straightforward. Also see the DOM spec [2] for background.
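For instance, template parameters can be read straight off the data-mw attribute of transclusion nodes. A sketch (the exact shape of data-mw is in [2]; cheerio as above):

  // Sketch: list template transclusions and their parameters found in
  // Parsoid HTML, using the data-mw attribute described in [2].
  var cheerio = require('cheerio');

  function extractTemplates(parsoidHtml) {
    var $ = cheerio.load(parsoidHtml);
    var templates = [];
    $('[typeof~="mw:Transclusion"]').each(function () {
      var dataMw = JSON.parse($(this).attr('data-mw') || '{}');
      (dataMw.parts || []).forEach(function (part) {
        if (part.template) {
          templates.push({
            name: part.template.target.wt, // e.g. "IPA"
            params: part.template.params   // wikitext values keyed by name
          });
        }
      });
    });
    return templates;
  }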
Gabriel
[1]: http://parsoid-lb.eqiad.wikimedia.org/enwiktionary/foo (other languages via frwiktionary, fiwiktionary, ...)
[2]: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec
Thanks a lot for the feedback. I can discuss these options with my mentor, I hope :).
On Feb 28, 2014 12:52 PM, "Gabriel Wicke" gwicke@wikimedia.org wrote:
The Parsoid rendering (e.g. [1]) has pretty much all semantic information in the DOM. There might still be Wiktionary-specific issues that we don't know about yet, but tasks like extracting template parameters or the rendering of specific templates (IPA, ...) are already straightforward. Also see the DOM spec [2] for background.
Gabriel
The last time I tried doing anything like this was before Parsoid existed, and I'll admit my approach was probably the worst possible. However, the issue was that each language formatted its pages differently, and some languages did not format things consistently. I think there is a limit to how much Parsoid (or anything that's not AI) can help with that situation.
-bawolff
2014-02-28 11:09 GMT+02:00 Roman Zaynetdinov romanznet@gmail.com:
From which source should the data be gathered?
Wiktionary is the best candidate: it is open content and it has a wide database. It is also a good fit for growing the project by adding more languages.
It's not obvious why you have reached this conclusion.
1) There are many Wiktionaries, and they do not all work the same way or have the same content.
2) Wiktionary data is relatively free-form text, so it is hard to parse out the relevant bits.
3) Dozens of people have mined Wiktionary already. It would make sense to check whether they have made the resulting databases available.
4) There are many other sources of data, some of them also open, which can have better coverage, or coverage of speciality areas where the Wiktionaries are lacking.
5) I expect that the best results will be achieved by using multiple data sources.
Growth opportunities
I am living in Finland right now and I do not know Finnish as well as I should to understand the locals, so this project can be expanded by adding support for more languages, helping people like me read, learn and understand texts in foreign languages.
I hope you have enjoyed your stay here. I do not know how much Finnish you have learned, but after a while it should be obvious that just searching for the exact string the user clicked or selected will not work, because of the agglutinative nature of the language. I advocate for features that work in all languages (or at least in many :). If you implement this for English only first, it is likely that you will have to rewrite it to support other languages.
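To illustrate the problem (this is deliberately naive; a real implementation needs a proper morphological analyser, e.g. Omorfi for Finnish, not suffix guessing):

  // Why exact-string lookup fails: stripping case endings naively does
  // not recover the lemma. E.g. "taloissa" ("in the houses") minus
  // "ssa" gives "taloi", while the Wiktionary entry is "talo".
  // "lookup" stands for any dictionary query function.
  function findEntry(word, lookup) {
    var candidates = [word, word.toLowerCase()];
    ['ssa', 'ssä', 'lla', 'llä', 'sta', 'stä'].forEach(function (suffix) {
      if (word.length > suffix.length &&
          word.slice(-suffix.length) === suffix) {
        candidates.push(word.slice(0, -suffix.length));
      }
    });
    for (var i = 0; i < candidates.length; i++) {
      var entry = lookup(candidates[i]);
      if (entry) {
        return entry;
      }
    }
    return null; // inflected forms of most words fall through to here
  }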
-Niklas
Hi Niklas, I know that in Finnish every word is inflected, just as in Russian, and that is what causes problems with translation. Right now I am looking for solutions that can help find the base form of a word. I used Finnish as an example to show the purpose of the tool; of course, after implementing English, other languages could be added with wider support.