On Sat, Mar 7, 2015 at 10:25 PM, Emw <emw.wiki@gmail.com> wrote:
Amir,

What is the false positive rate of your algorithm when dealing with fictitious humans and (non-fictitious) non-human organisms?  That is, how often does your program classify such non-humans as humans?

I will give you an exact number for German Wikipedia in several hours.
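
For reference, the false positive rate asked about here is the share of actual non-humans that the classifier labels as human. A minimal sketch of how it could be measured on a hand-checked sample follows; the function name and the boolean encoding are assumptions made for illustration, not part of Kian.

```python
def false_positive_rate(predicted, actual):
    """Share of true non-humans that were predicted to be human.

    predicted, actual: parallel lists of booleans, True meaning "human".
    Both names are hypothetical; the thread does not describe Kian's API.
    """
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    true_neg = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    negatives = false_pos + true_neg
    if negatives == 0:
        raise ValueError("no non-human items in the evaluation sample")
    return false_pos / negatives
```

Fictitious humans and non-human organisms would both count as "not human" in the `actual` list, so the sample has to include enough of them for the number to mean anything.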
Regarding the latter, note that items about individual dogs, elephants, chimpanzees and even trees can use properties that are otherwise extremely skewed towards humans.  For example, Prometheus (Q590010) [1], an extremely old tree, has claims for date of birth (P569), date of death (P570), even killed by (P157).  Non-human animals can also have kinship claims (e.g. mother, brother, child), among other properties typically used on humans.

The trick to avoid such errors is to give a big negative score for having a group D or E category.

Feature engineering for this task is a little complicated. First, I group the categories of a wiki by the share of their members that are known to be humans. If more than 80% of the members of a category are known to be humans, it is a group A category, and so on (group D = 0%). An article can then be parameterized by the number of categories it has in each group. For example, an article about a human usually looks like 5,3,2,0,0 and an article about a tree can look like 1,0,0,6,7, so having one or several group A categories alongside several group D categories prevents the bot from making such false statements. How can a bot do that? Because of the huge set of data (the training set) we already have, and because of neural network algorithms.
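
A minimal sketch of this parameterization is given here. Only the 80% cut-off for group A and the 0% rule for group D come from the description above; the cut-offs for groups B and C, and the reading of group E as "no labelled members at all", are placeholders assumed for illustration, as are the function names.

```python
from collections import Counter

GROUPS = "ABCDE"

def category_group(n_human, n_known):
    """Assign a category to a group from its labelled members.

    n_human: members known to be humans; n_known: members with any known label.
    """
    if n_known == 0:
        return "E"          # assumption: nothing is known about this category
    ratio = n_human / n_known
    if ratio > 0.8:
        return "A"          # from the thread: more than 80% humans
    if ratio > 0.5:
        return "B"          # placeholder cut-off
    if ratio > 0:
        return "C"          # placeholder cut-off
    return "D"              # from the thread: 0% humans

def article_features(categories, stats):
    """Count how many of an article's categories fall into each group.

    stats maps a category name to a (n_human, n_known) pair.
    """
    counts = Counter(category_group(*stats.get(c, (0, 0))) for c in categories)
    return [counts[g] for g in GROUPS]
```

With made-up statistics such as `{"Living people": (95, 100), "Individual trees": (0, 40)}`, an article in both categories would get the vector `[1, 0, 0, 1, 0]`, which is the mixed pattern the network is meant to learn to reject.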

Best
 
Best,
1.  Prometheus.  https://www.wikidata.org/wiki/Q590010 

On Sat, Mar 7, 2015 at 1:44 PM, Amir Ladsgroup <ladsgroup@gmail.com> wrote:
Hey Markus,
Thanks for your insight :)

On Sat, Mar 7, 2015 at 9:52 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Hi Amir,

In spite of all due enthusiasm, please evaluate your results (with humans!) before making automated edits. In fact, I would contradict Magnus here and say that such an approach would best be suited to provide meaningful (pre-filtered) *input* to people who play a Wikidata game, rather than bypassing the game (and humans) altogether. The expected error rates are quite high for such an approach, but it can still save a lot of work for humans.

There is a "certainty factor", and by using it the bot can save a lot of work without making such errors.
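
One way such a certainty factor could be used is sketched here: the bot only acts when the network output is close to 0 or 1 and leaves everything in between for humans. The two thresholds and the function name are assumptions for illustration; the actual values Kian uses are not stated in this thread.

```python
def classify(score, accept=0.9, reject=0.1):
    """Turn a network output in [0, 1] into one of three decisions.

    accept and reject are hypothetical thresholds, not Kian's real settings.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score >= accept:
        return "human"
    if score <= reject:
        return "not human"
    return "undetermined"   # leave the item for a human, e.g. in the Wikidata game
```

Raising `accept` trades coverage for fewer false positives, and the "undetermined" items are natural pre-filtered input for human review.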
 
As for the next steps, I would suggest that you have a look at the works that others have done already. Try Google Scholar:

https://scholar.google.com/scholar?q=machine+learning+wikipedia

As you can see, there are countless works on using machine learning techniques on Wikipedia, both for information extraction (e.g., understanding link semantics) and for things like vandalism detection. I am sure that one could get a lot of inspiration from there, both on potential applications and on technical hints on how to improve result quality.

Yes, definitely I would use them, thanks.
 
You will find that people are using many different approaches in these works. The good old ANN is still a relevant algorithm in practice, but there are many other techniques, such as SVMs, Markov models, or random forests, which have been found to work better than ANNs in many cases. Not saying that a three-layer feed-forward ANN cannot do some jobs as well, but I would not restrict myself to one ML approach if you have a whole arsenal of algorithms available, most of them pre-implemented in libraries (the first Google hit has a lot of relevant projects listed: http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/). I would certainly recommend that you don't implement any of the standard ML algorithms from scratch.

I use the backpropagation algorithm. I use Octave for ML in my personal work, but for Wikipedia I use Python, for two main reasons: integration with other Wikipedia-related tools like pywikibot, and the bad performance of Octave and MATLAB on big data sets. I had to write those parts from scratch since I couldn't find any related library in Python. Even algorithms like BFGS are not there (I could find one in SciPy, but I wasn't sure it works correctly because there is no documentation).

In practice, the most challenging task for successful ML is often feature engineering: the question of which features you use as an input to your learning algorithm. This is far more important than the choice of algorithm. Wikipedia in particular offers you so many relevant pieces of information with each article that are not just mere keywords (links, categories, in-links, ...) and it is not easy to decide which of these to feed into your learner. This will be different for each task you solve (subject classification is fundamentally different from vandalism detection, and even different types of vandalism would require very different techniques). You should pick hard or very large tasks to make sure that the tweaking you need in each case takes less time than you would need as a human to solve the task manually ;-)

Yes, feature engineering is the most important thing and it can be tricky, but feature engineering in Wikidata is a lot easier (it's easier than in Wikipedia, and Wikipedia itself is easier than other places). Anti-vandalism bots are a lot easier in Wikidata than in Wikipedia: editing in Wikidata is limited to certain kinds of changes (like removing a sitelink, etc.), but it's not that simple in Wikipedia.
 
Anyway, it's an interesting field, and we could certainly use some effort to exploit the countless works in this field for Wikidata. But you should be aware that this is no small challenge and that there is no universal solution that will work well even for all the tasks that you have mentioned in your email.

Of course, I spent lots of time studying this and I would be happy if anyone who knows about neural networks or AI can contribute too.
 
Best wishes,

Markus


On 07.03.2015 18:21, Magnus Manske wrote:
Congratulations for this bold step towards the Singularity :-)

As for tasks, basically everything us mere humans do in the Wikidata game:
https://tools.wmflabs.org/wikidata-game/

Some may require text parsing. Not sure how to get that working; haven't
spent much time with (artificial) neural nets in a while.



On Sat, Mar 7, 2015 at 12:36 PM Amir Ladsgroup <ladsgroup@gmail.com> wrote:

    Some useful tasks that I'm looking for a way to do are:
    *Anti-vandal bot (or how we can quantify an edit).
    *Auto labeling for humans (That's the next task).
    *Add more :)


    On Sat, Mar 7, 2015 at 3:54 PM, Amir Ladsgroup <ladsgroup@gmail.com> wrote:

        Hey,
        I spent the last few weeks working on this lights-off [1], and now
        it's ready to work!

        Kian is a three-layered neural network with a flexible number of
        inputs and outputs. So if we can parametrize a job, we can teach
        him easily and get the job done.
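
To make "three-layered" concrete, here is a minimal pure-Python sketch of a feed-forward network with one hidden layer, trained by backpropagation. It is not Kian's code: the class name, the single sigmoid output, the cross-entropy error term, the learning rate and the weight initialisation are all assumptions chosen to keep the example short.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class ThreeLayerNet:
    """Input layer, one hidden layer, one sigmoid output unit (hypothetical)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # One row of weights per hidden unit; the last entry of a row is its bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        # Output weights, again with the bias stored as the last entry.
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w1]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.w2[-1])
        return hidden, out

    def train_step(self, x, y, lr=0.5):
        hidden, out = self.forward(x)
        # With a sigmoid output and cross-entropy loss the output error is out - y.
        delta_out = out - y
        # Propagate the error back to the hidden layer before updating w2.
        delta_hidden = [delta_out * self.w2[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w2[j] -= lr * delta_out * h
        self.w2[-1] -= lr * delta_out
        for j, row in enumerate(self.w1):
            for i, v in enumerate(x):
                row[i] -= lr * delta_hidden[j] * v
            row[-1] -= lr * delta_hidden[j]

def train(net, data, epochs=1000, lr=0.5):
    """Plain stochastic gradient descent over (features, label) pairs."""
    for _ in range(epochs):
        for x, y in data:
            net.train_step(x, y, lr)
```

A possible use: build `ThreeLayerNet(n_in=5, n_hidden=3)`, call `train` on a few scaled feature vectors labelled 1 (human) or 0 (not human), and read the second value returned by `forward` as the score for a new article.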

        As an example, and as the first job, we want to add P31:5 (human)
        to items of Wikidata based on the categories of articles in
        Wikipedia. The only thing we need to do is get a list of items with
        P31:5 and a list of non-human items (P31 exists but 5 is not in
        it), then get the list of category links in any wiki we want [2],
        and at last feed these files to Kian and let him learn.
        Afterwards, if we give Kian other articles and their categories,
        he classifies them as human, not human, or failed to determine.
        As a test I gave him the categories of ckb wiki (a small wiki) and
        it worked pretty well. Now I'm creating the training set from
        German Wikipedia, and the next step will be English Wikipedia.
        The number of P31:5 statements will drastically increase this week.
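
The data preparation described in this paragraph could look roughly like the sketch below. It assumes the category links arrive as (item, category) pairs, as the query in footnote [2] would return them, and that the two item lists have been loaded into sets; the function name and the tuple layout are made up for illustration.

```python
from collections import defaultdict

def build_training_set(category_links, human_items, non_human_items):
    """Split items into labelled training rows and rows still to classify.

    category_links: iterable of (item_id, category) pairs.
    human_items: set of item ids that already have P31:5.
    non_human_items: set of item ids that have P31 but not the value 5.
    """
    categories = defaultdict(set)
    for item, category in category_links:
        categories[item].add(category)

    labelled, unlabelled = [], []
    for item, cats in categories.items():
        if item in human_items:
            labelled.append((item, sorted(cats), 1))
        elif item in non_human_items:
            labelled.append((item, sorted(cats), 0))
        else:
            # No P31 at all: this is what the trained network gets to classify.
            unlabelled.append((item, sorted(cats)))
    return labelled, unlabelled
```

The labelled rows are what the network learns from; the unlabelled rows are the candidates for a new P31:5 statement.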

        I would love comments or ideas for tasks that Kian can do.


        [1]: Because I love surprises
        [2]: "select pp_value, cl_to from page_props join categorylinks
        on pp_page = cl_from where pp_propname = 'wikibase_item';"
        Best
        --
        Amir




    --
    Amir

    _________________________________________________
    Wikidata-l mailing list
    Wikidata-l@lists.wikimedia.org
    https://lists.wikimedia.org/mailman/listinfo/wikidata-l


