Hi Amir,
With all due enthusiasm, please evaluate your results (with humans!) before making automated edits. In fact, I would contradict Magnus here and say that such an approach is best suited to provide meaningful (pre-filtered) *input* to the people who play a Wikidata game, rather than bypassing the game (and the humans) altogether. The expected error rates of such an approach are quite high, but it can still save humans a lot of work.
As for next steps, I would suggest that you have a look at the work others have already done. Try Google Scholar:
https://scholar.google.com/scholar?q=machine+learning+wikipedia
As you can see, there are countless works on using machine learning techniques on Wikipedia, both for information extraction (e.g., understanding link semantics) and for things like vandalism detection. I am sure that one could get a lot of inspiration from there, both on potential applications and on technical hints on how to improve result quality.
You will find that people use many different approaches in these works. The good old ANN is still relevant in practice, but there are many other techniques, such as SVMs (support vector machines), Markov models, and random forests, which have been found to work better than ANNs in many cases. I am not saying that a three-layer feed-forward ANN cannot do some jobs as well, but I would not restrict myself to one ML approach when a whole arsenal of algorithms is available, most of them pre-implemented in libraries (the first Google hit lists a lot of relevant projects: http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/m...). I would certainly recommend that you don't implement any of the standard ML algorithms from scratch.
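To make the library point concrete, here is a minimal sketch, assuming Python with scikit-learn (one of the usual libraries on such lists); the toy data is a stand-in for real Wikipedia-derived features, and the point is only that trying several pre-implemented learners costs almost nothing:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Toy stand-in for (features, labels) extracted from Wikipedia.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Switching algorithms is a one-line change; pick by cross-validation.
    for name, clf in [("random forest", RandomForestClassifier(random_state=0)),
                      ("SVM", SVC())]:
        print(name, cross_val_score(clf, X, y, cv=5).mean())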
In practice, the most challenging task for successful ML is often feature engineering: the question of which features to use as input to your learning algorithm. This is far more important than the choice of algorithm. Wikipedia in particular offers many relevant pieces of information with each article beyond mere keywords (links, categories, in-links, ...), and it is not easy to decide which of these to feed into your learner. This will differ for each task you solve (subject classification is fundamentally different from vandalism detection, and even different types of vandalism would require very different techniques). You should pick hard or very large tasks, to make sure that the tweaking needed in each case takes less time than solving the task manually as a human would ;-)
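As a small illustration of the feature side, again assuming Python with a recent scikit-learn (the categories and counts below are made up; in practice they would come from queries like the one in Amir's mail further down):

    from sklearn.feature_extraction import DictVectorizer

    # Each article becomes a sparse "bag" of page properties; deciding
    # which properties to include is exactly the feature-engineering question.
    articles = [
        {"cat:1879 births": 1, "cat:German physicists": 1, "out_links": 120},
        {"cat:Rivers of France": 1, "out_links": 35},
    ]
    vec = DictVectorizer(sparse=True)
    X = vec.fit_transform(articles)   # one column per distinct feature
    print(X.shape, vec.get_feature_names_out())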
Anyway, it's an interesting field, and we could certainly use some effort to exploit the extensive existing work in this area for Wikidata. But you should be aware that this is no small challenge, and that there is no universal solution that will work well for all of the tasks you mentioned in your email.
Best wishes,
Markus
On 07.03.2015 18:21, Magnus Manske wrote:
Congratulations for this bold step towards the Singularity :-)
As for tasks, basically everything us mere humans do in the Wikidata game: https://tools.wmflabs.org/wikidata-game/
Some may require text parsing. Not sure how to get that working; haven't spent much time with (artificial) neural nets in a while.
On Sat, Mar 7, 2015 at 12:36 PM Amir Ladsgroup <ladsgroup@gmail.com> wrote:
Some useful tasks that I'm looking for a way to do are:
* Anti-vandal bot (or how we can quantify an edit).
* Auto-labeling for humans (that's the next task).
* Add more :)

On Sat, Mar 7, 2015 at 3:54 PM, Amir Ladsgroup <ladsgroup@gmail.com> wrote:

Hey,
I spent the last few weeks working on this with the lights off [1], and now it's ready to work! Kian is a three-layered neural network with a flexible number of inputs and outputs, so if we can parametrize a job, we can teach him easily and get the job done.

As a first example job, we want to add P31:Q5 (human) to Wikidata items based on the categories of the corresponding articles in Wikipedia. The only thing we need to do is get a list of items with P31:Q5 and a list of items that are not human (P31 exists but does not contain Q5), then get the list of category links in any wiki we want [2], and finally feed these files to Kian and let him learn. Afterwards, if we give Kian other articles and their categories, he classifies each one as human, not human, or failed to determine.

As a test I gave him the categories of ckb wiki (a small wiki) and it worked pretty well; now I'm creating the training set from the German Wikipedia, and the next step will be the English Wikipedia. The number of P31:Q5 statements will increase drastically this week.

I would love comments or ideas for tasks that Kian can do.

[1]: Because I love surprises
[2]: "select pp_value, cl_to from page_props join categorylinks on pp_page = cl_from where pp_propname = 'wikibase_item';"

Best
--
Amir
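A minimal sketch of the pipeline Amir describes, assuming Python with scikit-learn: MLPClassifier stands in for Kian's three-layer feed-forward network, and the example categories and the 0.9 confidence cut-off are made up. The "failed to determine" bucket is what could be handed to human reviewers, e.g. via the Wikidata game:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.neural_network import MLPClassifier

    # Training pairs (categories -> is-human) derived from items that
    # already carry P31; real training data would be far larger.
    train = [({"Category:1879 births": 1, "Category:Physicists": 1}, 1),
             ({"Category:Rivers of Germany": 1}, 0)]
    vec = DictVectorizer()
    X = vec.fit_transform([cats for cats, _ in train])
    y = [label for _, label in train]

    # One hidden layer, i.e. a three-layer feed-forward network.
    net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000).fit(X, y)

    # Classify a new article; anything below the (made-up) 0.9 cut-off
    # is "failed to determine" and left to humans.
    p_not_human, p_human = net.predict_proba(
        vec.transform([{"Category:Physicists": 1}]))[0]
    if p_human > 0.9:
        print("human", p_human)
    elif p_not_human > 0.9:
        print("not human", p_not_human)
    else:
        print("failed to determine")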
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l