Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools

4 Apr 2012


You do understand correctly!
The main idea behind the NLP components, taking the POS tagger as an example, is:
1. a fallback system that does unsupervised POS tagging;
2. the ability to plug in an existing POS tagger as these become available for specific languages.
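To illustrate the two points above, here is a minimal sketch of a tagger registry with a fallback. All names in it (TaggerRegistry, fallback_tagger, toy_english_tagger) are hypothetical and not part of the proposal, and the fallback is only a crude placeholder for a real unsupervised tagger.

```python
from typing import Callable, Dict, List, Tuple

Tagged = List[Tuple[str, str]]
Tagger = Callable[[List[str]], Tagged]


def fallback_tagger(tokens: List[str]) -> Tagged:
    """Placeholder for the unsupervised tagger (hypothetical).

    A real fallback would induce word classes from the corpus itself;
    here each token only gets a crude class label built from its
    last two characters.
    """
    return [(tok, "C_" + tok[-2:].lower()) for tok in tokens]


def toy_english_tagger(tokens: List[str]) -> Tagged:
    """Stand-in for an existing language-specific tagger (hypothetical)."""
    lexicon = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}
    return [(tok, lexicon.get(tok.lower(), "X")) for tok in tokens]


class TaggerRegistry:
    """Maps language codes to taggers; unknown languages use the fallback."""

    def __init__(self, fallback: Tagger) -> None:
        self._fallback = fallback
        self._by_language: Dict[str, Tagger] = {}

    def register(self, language: str, tagger: Tagger) -> None:
        # Plug in a dedicated tagger once one is available for a language.
        self._by_language[language] = tagger

    def tag(self, language: str, tokens: List[str]) -> Tagged:
        tagger = self._by_language.get(language, self._fallback)
        return tagger(tokens)


registry = TaggerRegistry(fallback_tagger)
registry.register("en", toy_english_tagger)
```

With this layout, adding a language later only requires one more register() call; callers of tag() do not change.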
As supervisor, I would recommend working with three languages: English, Hebrew, and the GSoC student's native language.
If we could get QA from other native speakers, we would incorporate them into the workflow.
I think that by using a deletion/reversion-based heuristic we may also be able to build a spam corpus to boost the accuracy of the corpora.
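As a rough sketch of the reversion half of that heuristic: a revision can be treated as a spam candidate when a later revision restores the exact text of an earlier one. The Revision class, the window parameter and the function name below are assumptions for illustration only. Reverts also catch vandalism and content disputes, so the output would still need filtering or QA.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Revision:
    rev_id: int
    text: str


def _digest(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def reverted_revisions(history: List[Revision], window: int = 3) -> List[Revision]:
    """Return revisions that a later edit undid by restoring an earlier state.

    history is ordered oldest to newest. window is the largest number of
    consecutive revisions that a single revert may undo (an assumption).
    """
    digests = [_digest(rev.text) for rev in history]
    flagged = set()
    for i, current in enumerate(digests):
        oldest = max(0, i - window - 1)
        # Search backwards, skipping the direct parent, for an identical state.
        for j in range(i - 2, oldest - 1, -1):
            if digests[j] == current:
                # Everything between j and i was undone, except no-op edits.
                flagged.update(
                    k for k in range(j + 1, i) if digests[k] != current
                )
                break
    return [history[k] for k in sorted(flagged)]
```

The text added by each returned revision, relative to its parent, would then be the raw material for the spam corpus.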
Operation Manager 
E-mail: oren@romai-horizon.com
Mobil: +36 30 866 6706
Római Horizon Kft. 
H-1039 Budapest 
Királyok útja  291. D. ép. fszt. 2.
Tel:   +36 1 492 1492
Fax:  +36 1 266 5529
-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Amir E. Aharoni
Sent: Tuesday, April 03, 2012 10:19 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] GSoC 2012: Proposal-Wikipedia Corpus Tools
2012/4/3 karthik prasad karthikprasad008@gmail.com:
...
Hello,
I am a GSoC aspirant and have compiled a proposal for one of the
project ideas - Wikipedia Corpus Tools. [Mentor: Oren Bochman] I
would sincerely appreciate it if you could kindly go through it and
suggest corrections/additions so that I can settle on a coherent proposal.
Link to my proposal :
https://www.mediawiki.org/wiki/User:Karthikprasad/gsoc2012proposal
Nice, but why only English?
If I understand the proposal correctly, this project is supposed to be able to work with almost any language with very little effort.
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces, I want to live in peace.” – T. Moore
