Dear Oren Bochman,
I am very pleased to hear from you.
My familiarity with the requirements, *on a scale of 5*, is as follows:
1. Java and other programming languages :: *4.5* ... I have done courses on Java, C, and C++, and I have used Python extensively in my projects. I am very comfortable with the syntax and semantics, and understanding different libraries won't be difficult.
2. PHP :: *3.5* ... I have used PHP in a project and am currently taking a course on it at my university.
3. Apache Lucene :: *2* ... I was not very familiar with this library until recently. However, I am very willing to learn it as soon as possible and to be comfortable with it before the coding period starts.
4. Natural Language Processing :: *4* ... Language processing and data are my major interests, and I have done all my projects on NLP. I have taken the NLP course offered on coursera.org, and NLP is also what I discuss with my professors at my university.
5. Computational Linguistics and WordNet :: *4* ... I am using the principles of computational linguistics and WordNet in my current project, an automatic essay grader. I have also chosen Data Mining as an elective and am comfortable with the field.
I was looking for some clarifications regarding the proposed ideas:
1. Regarding the first project, "a framework for handling different languages": how exactly should we be looking at 'handling' languages? What kind of framework is expected?
2. Regarding the second project, "Make a Lucene filter which uses such a wordnet to expand search terms": does this project aim at building everything from scratch, or at revamping the existing code?
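To make sure I have understood the second project correctly, here is a rough sketch of the core idea as I see it. The class, method, and map names are my own invention; a real implementation would of course be a Lucene TokenFilter backed by a proper wordnet rather than a hard-coded map, but the underlying logic (emit each query term plus its synonyms) should be the same:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy sketch of wordnet-based query expansion. Names and data are
// invented for illustration; a real version would be a Lucene
// TokenFilter reading synonyms from a wordnet file.
public class ExpandDemo {
    // Stand-in for a wordnet: word -> synonyms.
    static final Map<String, List<String>> WORDNET = Map.of(
        "car", List.of("automobile", "auto"),
        "fast", List.of("quick", "rapid"));

    static List<String> expand(List<String> queryTerms) {
        List<String> out = new ArrayList<>();
        for (String t : queryTerms) {
            out.add(t);                                     // keep the original term
            out.addAll(WORDNET.getOrDefault(t, List.of())); // append its synonyms
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("fast", "car")));
    }
}
```

Please tell me if this is not the kind of expansion you have in mind.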
My understanding of proposed idea 1 is: "to extract the corpus from Wikipedia and to apply the deliverables to it." Please correct me if I am missing something. Also, I was wondering whether you have a specific approach in mind, or whether it would be OK if I come up with an approach and propose it in my proposal.
Some more details regarding my essay grader project: the grader does take the essay's coherence into account. Spelling and grammar are, as you pointed out, important, but not very informative when it comes to the "relatedness" of the essay. The essays are also graded on structure; we analysed the statistics of each essay to come up with a measure for grading its structure.
I am very excited about this and am eagerly looking forward to hearing from you.
Thank you.
Best Regards, Karthik
Date: Mon, 2 Apr 2012 11:46:21 +0200
From: "Oren Bochman" <orenbochman@gmail.com>
To: wikitech-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] GSOC 2012 - Text Processing and Data Mining
Dear Karthik Prasad and other GSOC candidates,
I was not receiving this list before, but I am now.
The GSOC proposal should be specified by the student.
I can expand on the details of these projects, and I can answer specific questions you have about expectations.
To match you optimally with a suitable high-impact project, to what extent are you familiar with:
*Java and other programming languages?
*PHP?
*Apache Lucene?
*Natural Language Processing?
*Corpus Linguistics?
*WordNet?
The listed projects would be either wrapped as services, or consumed by downstream projects or both.
The corpus is the simplest, but it requires a lot of attention to detail. When successful, it would be picked up by many researchers and companies who do not have the resources for such CPU-intensive tasks.
For WMF, it would provide us with a standardized body of text for future NLP work. A part-of-speech-tagged corpus would be immediately useful for roughly 80%-accurate word-sense disambiguation in the search engine.
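To illustrate why POS tags help with disambiguation: for many ambiguous words, the part-of-speech tag alone is enough to separate the senses. A toy sketch (the class name, tag set, and sense inventory here are all invented for illustration, not from any real system):

```java
import java.util.Map;

// Toy sketch: a POS tag narrowing the sense of an ambiguous word.
// Tag names and sense glosses are invented for illustration.
public class PosWsdDemo {
    // Stand-in sense inventory keyed by "word|POS".
    static final Map<String, String> SENSES = Map.of(
        "saw|NOUN", "cutting tool",
        "saw|VERB", "past tense of 'see'");

    static String sense(String word, String posTag) {
        return SENSES.getOrDefault(word + "|" + posTag, "unknown");
    }

    public static void main(String[] args) {
        System.out.println(sense("saw", "NOUN")); // cutting tool
    }
}
```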
Automatic summaries are not a strategic priority, AFAIK: most articles provide a kind of abstract in their introduction, and something like this is already provided in the dumps for Yahoo.
I have been using a great pop-up preview widget on Wiktionary for a year or so.
I do think it would be a great project for learning how to become a MediaWiki developer, but it is too small for a GSOC project. However, I cannot speak for Jebald and the other mentors on the cellular and other teams who might be interested in this.
If your essay grader is working, it could be the basis of another very exciting GSOC project aimed at article quality.
An NLP-savvy "smart" article-quality assessment service could improve and expand on the current bots that grade articles. Grammar and spelling are two good indicator features. However, a full assessment of Wikipedia articles would require richer features, both stylistic and information-based. Once you have covered sufficient features, building discriminators from samples of graded articles would require some data-mining ability.
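To sketch the discriminator idea (everything here, the features, the weights, and the names, is invented for illustration; in a real system the feature set would be much richer and the weights would be learned from samples of graded articles):

```java
// Minimal sketch of feature-based article quality scoring.
// Features and weights are invented; real ones would be learned.
public class QualityDemo {
    // Map article wikitext to a small feature vector.
    static double[] features(String text) {
        String[] words = text.trim().split("\\s+");
        double length = Math.min(words.length / 500.0, 1.0); // length, capped at 500 words
        long refs = text.split("<ref", -1).length - 1;       // count of <ref citations
        double refDensity = Math.min(refs / 10.0, 1.0);      // citations, capped at 10
        return new double[] { length, refDensity };
    }

    // Linear score in [0, 1]; higher = better quality.
    static double score(String text) {
        double[] w = { 0.6, 0.4 };  // weights: hand-set here, learned in a real system
        double[] f = features(text);
        double s = 0;
        for (int i = 0; i < f.length; i++) s += w[i] * f[i];
        return s;
    }
}
```

The point is only the shape of the pipeline: text in, feature vector, weighted score out; the data-mining work is in choosing features and fitting the weights.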
However, since there is an existing bot undergoing upgrades, we would have to check with its small dev team what it is currently doing, and any new work would be subject to community oversight.
Yours Sincerely,
Oren Bochman
MediaWiki Search Developer