Hello!
I am an OPW intern for Round 9 and will be working on a spelling dictionary project. The proposal is available here: https://www.mediawiki.org/wiki/User:Ankitashukla/Proposal. We will use this GitHub repository for version control: https://github.com/ankitashukla/spelling-dictionary-opw
Before we start coding, my mentors Kartik and Amir and I thought it would be a great idea to ask everyone for suggestions that might turn out to be useful during the development of the project. We welcome all ideas: your expectations of the project, specific design advice, particular implementation suggestions, or any other advice, big or small.
Thanks and regards, Ankita Shukla
On 12/2/14, Ankita Shukla ankitashukla707@gmail.com wrote:
Hello!
I am an OPW intern for Round 9 and will be working on a spelling dictionary project. The proposal is available here: https://www.mediawiki.org/wiki/User:Ankitashukla/Proposal. We will use this GitHub repository for version control: https://github.com/ankitashukla/spelling-dictionary-opw
Before we start coding, my mentors Kartik and Amir and I thought it would be a great idea to ask everyone for suggestions that might turn out to be useful during the development of the project. We welcome all ideas: your expectations of the project, specific design advice, particular implementation suggestions, or any other advice, big or small.
Thanks and regards,
Ankita Shukla
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
The most immediate thing that comes to mind is: why create a new interface where users can "add" words, instead of just scraping Wiktionary? (I take it from your proposal that you plan to create a new project where users can submit words for consideration for inclusion in the dictionary.)
Additionally, as for experts rejecting or accepting words:
* Is that actually needed?
* Do experts actually exist who would be willing to do that sort of thing? (This varies depending on your definition of "expert". For example, if you mean people with PhDs in said language who will verify that the word is proper, the answer would be no. If you mean people who are XX-3 or XX-N in the language then maybe, but I'm not really sure how much of a benefit the review would provide relative to the costs.)
I recognize that scraping is difficult for a whole host of reasons (mostly that the content being semi-structured turns it into an NLP project, and that standards aren't consistent across languages. However, in this case it seems like the information needed would not be that hard [famous last words] to extract simply by looking at categories). It seems like making users add data to a new project duplicates effort already going on in Wiktionary.
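To make the category idea concrete, here is a minimal sketch using the MediaWiki API's `list=categorymembers` query. The category name "Category:English lemmas", the helper names, and the sample response are assumptions for illustration only; the proposal does not specify any of them. The sketch only builds the request URL and parses a response, so it can be tried without network access.

```python
from urllib.parse import urlencode

# English Wiktionary's MediaWiki API endpoint.
API_URL = "https://en.wiktionary.org/w/api.php"


def build_category_query(category, limit=500, cmcontinue=None):
    """Build a categorymembers query URL for one page of results."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,   # hypothetical example: "Category:English lemmas"
        "cmnamespace": 0,      # main namespace only, i.e. dictionary entries
        "cmlimit": limit,
        "format": "json",
    }
    if cmcontinue:
        # Continuation token returned by the previous page of results.
        params["cmcontinue"] = cmcontinue
    return API_URL + "?" + urlencode(params)


def extract_titles(response):
    """Return (page titles, continuation token or None) from a decoded response."""
    members = response.get("query", {}).get("categorymembers", [])
    titles = [member["title"] for member in members]
    next_token = response.get("continue", {}).get("cmcontinue")
    return titles, next_token


# Hand-written sample in the shape the API returns (not real data).
sample_response = {
    "continue": {"cmcontinue": "page|ABC|123", "continue": "-||"},
    "query": {
        "categorymembers": [
            {"pageid": 1, "ns": 0, "title": "apple"},
            {"pageid": 2, "ns": 0, "title": "banana"},
        ]
    },
}

url = build_category_query("Category:English lemmas")
titles, next_token = extract_titles(sample_response)
```

A real harvester would fetch each URL, pass the returned `cmcontinue` token back into `build_category_query`, and stop when no token is returned. Whether every language's Wiktionary categorizes entries consistently enough for this is exactly the open question raised above.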
Even if this project can't use Wiktionary for some reason, it seems to overlap slightly with either Wikidata or OmegaWiki, and could perhaps reuse some work from those projects in terms of storing data.
Last of all, in your proposal you give some potential DB schemas. I imagine the schema should have a language column indicating which language the word belongs to (not to mention that things get more complicated with related languages, e.g. EN vs EN-US vs EN-CA vs EN-GB). Also, words can have multiple meanings, so you might want to split the meaning from the word. It's not really needed if the meaning is "immutable", but if meanings can be modified, you may want some way to identify which individual meaning was edited. (And then there are issues with history, etc., which again leads back to seeing whether an existing project that has already solved those issues could be the source of the data, instead of making a new one.)
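As a rough sketch of the two schema suggestions above (a language column, and meanings split out from words), something like the following could work. The table names, column names, and sample rows are all hypothetical and are not taken from the proposal; SQLite is used only so the example is self-contained.

```python
import sqlite3

# Hypothetical schema: one row per (language, spelling), and one row per meaning.
SCHEMA = """
CREATE TABLE word (
    id       INTEGER PRIMARY KEY,
    lang     TEXT NOT NULL,      -- language tag, e.g. 'en-GB'
    spelling TEXT NOT NULL,
    UNIQUE (lang, spelling)
);
CREATE TABLE meaning (
    id      INTEGER PRIMARY KEY, -- lets a single meaning be edited on its own
    word_id INTEGER NOT NULL REFERENCES word(id),
    gloss   TEXT NOT NULL
);
"""


def add_word(conn, lang, spelling, glosses=()):
    """Insert a word and its meanings; return the new word id."""
    cur = conn.execute(
        "INSERT INTO word (lang, spelling) VALUES (?, ?)", (lang, spelling)
    )
    word_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO meaning (word_id, gloss) VALUES (?, ?)",
        [(word_id, gloss) for gloss in glosses],
    )
    return word_id


def edit_meaning(conn, meaning_id, new_gloss):
    """Change exactly one meaning, leaving the word's other meanings untouched."""
    conn.execute("UPDATE meaning SET gloss = ? WHERE id = ?", (new_gloss, meaning_id))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
colour_id = add_word(conn, "en-GB", "colour", ["a visual property", "to add pigment"])
color_id = add_word(conn, "en-US", "color", ["a visual property"])
```

Because each meaning has its own id, an edit can be attributed to one meaning rather than to the whole word. This sketch does not address edit history, which is the harder problem mentioned above.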
--bawolff
Responding inline:
Brian Wolff <bawolff <at> gmail.com> writes:
The most immediate thing that comes to mind is: why create a new interface where users can "add" words, instead of just scraping Wiktionary? (I take it from your proposal that you plan to create a new project where users can submit words for consideration for inclusion in the dictionary.)
Additionally, as for experts rejecting or accepting words:
* Is that actually needed?
* Do experts actually exist who would be willing to do that sort of thing? (This varies depending on your definition of "expert". For example, if you mean people with PhDs in said language who will verify that the word is proper, the answer would be no. If you mean people who are XX-3 or XX-N in the language then maybe, but I'm not really sure how much of a benefit the review would provide relative to the costs.)
This should not be a problem. Since we expect the public to collaborate, those who know a given language can help "verify" a word without needing any official degree.
I recognize that scraping is difficult for a whole host of reasons (mostly that the content being semi-structured turns it into an NLP project, and that standards aren't consistent across languages. However, in this case it seems like the information needed would not be that hard [famous last words] to extract simply by looking at categories). It seems like making users add data to a new project duplicates effort already going on in Wiktionary.
Since the project is a "spelling dictionary", our main concern is with spellings. This distinction between providing details about a word (Wiktionary) and providing spellings (our project) is what sets the two apart.
Even if this project can't use Wiktionary for some reason, it seems to overlap slightly with either Wikidata or OmegaWiki, and could perhaps reuse some work from those projects in terms of storing data.
Yes, scraping Wiktionary in different languages would be a cumbersome task. We are exploring other possibilities, trying to ensure that no work is duplicated and that we stay consistent with the existing resources in the Wikimedia community.
Last of all, in your proposal you give some potential DB schemas. I imagine the schema should have a language column indicating which language the word belongs to (not to mention that things get more complicated with related languages, e.g. EN vs EN-US vs EN-CA vs EN-GB).
For this, I was considering a separate table for each language. I am not sure whether including a language column in a single table would be a good idea.
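For comparison with the table-per-language idea, here is a minimal sketch of how a single table with a language column could handle related variants such as en-GB and en-US. The table name, column names, hyphenated tag format, and fallback rule (try the variant, then the base language) are all assumptions made for illustration.

```python
import sqlite3


def fallback_chain(tag):
    """Return a tag followed by its shorter prefixes: 'en-GB' gives ['en-GB', 'en']."""
    parts = tag.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def is_known(conn, spelling, tag):
    """Check whether a spelling is listed for the variant or for its base language."""
    chain = fallback_chain(tag)
    placeholders = ",".join("?" for _ in chain)
    row = conn.execute(
        f"SELECT 1 FROM spelling WHERE word = ? AND lang IN ({placeholders}) LIMIT 1",
        (spelling, *chain),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
# Hypothetical single table; the composite key doubles as the lookup index.
conn.execute(
    "CREATE TABLE spelling ("
    " lang TEXT NOT NULL,"
    " word TEXT NOT NULL,"
    " PRIMARY KEY (lang, word))"
)
conn.executemany(
    "INSERT INTO spelling (lang, word) VALUES (?, ?)",
    [("en", "dictionary"), ("en-GB", "colour"), ("en-US", "color")],
)
```

With this layout, words shared by all variants are stored once under `en`, and only variant-specific spellings need their own rows. With one table per language, the same lookup would have to choose table names at query time, and adding a new language would mean a schema change rather than new rows.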
Also, words can have multiple meanings, so you might want to split the meaning from the word. It's not really needed if the meaning is "immutable", but if meanings can be modified, you may want some way to identify which individual meaning was edited. (And then there are issues with history, etc., which again leads back to seeing whether an existing project that has already solved those issues could be the source of the data, instead of making a new one.)
While browsing various dictionary structures (existing ones and ones proposed in research papers), I came across Dicollecte (which has also been mentioned once before). It has quite an elaborate structure that also covers the possibility of spellings changing.
We are really grateful for the wonderful feedback from your side! I am discussing the various possibilities you mentioned with my mentor now. We shall keep you all updated about the progress of the project. :)
Thanks a lot! Ankita Shukla