Amir,
What is the false positive rate of your algorithm when dealing with
fictitious humans and (non-fictitious) non-human organisms? That is, how
often does your program classify such non-humans as humans?
Regarding the latter, note that items about individual dogs, elephants,
chimpanzees and even trees can use properties that are otherwise extremely
skewed towards humans. For example, Prometheus (Q590010) [1], an extremely
old tree, has claims for *date of birth* (P569), *date of death* (P570),
even *killed by* (P157). Non-human animals can also have kinship claims
(e.g. *mother*, *brother*, *child*), among other properties typically used on
humans.
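(For concreteness, the false positive rate asked about here would be FP / (FP + TN) over a labelled evaluation set: of all items that are truly not humans, the fraction the classifier tags as human. A toy sketch with made-up counts:)

```python
# False positive rate on a labelled evaluation set: of all items that
# are truly NOT humans, how many did the classifier label as human?
def false_positive_rate(false_positives, true_negatives):
    return false_positives / (false_positives + true_negatives)

# Hypothetical counts: 12 non-human items misclassified as human,
# 988 non-human items correctly left alone.
print(false_positive_rate(12, 988))  # 0.012
```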
Best,
Eric
1. Prometheus.
On Sat, Mar 7, 2015 at 1:44 PM, Amir Ladsgroup <ladsgroup(a)gmail.com> wrote:
Hey Markus,
Thanks for your insight :)
On Sat, Mar 7, 2015 at 9:52 PM, Markus Krötzsch <markus(a)semantic-mediawiki.org> wrote:
Hi Amir,
In spite of all due enthusiasm, please evaluate your results (with
humans!) before making automated edits. In fact, I would contradict Magnus
here and say that such an approach would best be suited to provide
meaningful (pre-filtered) *input* to people who play a Wikidata game,
rather than bypassing the game (and humans) altogether. The expected error
rates are quite high for such an approach, but it can still save a lot of
work for humans.
There is a "certainty factor", and by using it Kian can save a lot of work
without making such errors.
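A minimal sketch of what such a certainty-factor filter could look like (hypothetical names and threshold, not Kian's actual code): treat the network's sigmoid output as a confidence score and only act automatically on very confident predictions, deferring the rest to humans.

```python
# Hypothetical "certainty factor" filter: the network's sigmoid output
# in [0, 1] is treated as a confidence score; only very confident
# predictions trigger automated edits, the rest go to human review.

def triage(score, threshold=0.9):
    """Map a model output in [0, 1] to an action."""
    if score >= threshold:
        return "human"          # confidently a human
    if score <= 1.0 - threshold:
        return "not-human"      # confidently not a human
    return "defer"              # uncertain: send to a human reviewer

# Hypothetical item scores.
predictions = {"Q42": 0.98, "Q590010": 0.55, "Q146": 0.03}
for item, score in predictions.items():
    print(item, triage(score))
```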
As for the next steps, I would suggest that you have a look at the work
that others have done already. Try Google Scholar:
https://scholar.google.com/scholar?q=machine+learning+wikipedia
As you can see, there are countless works on using machine learning
techniques on Wikipedia, both for information extraction (e.g.,
understanding link semantics) and for things like vandalism detection. I am
sure that one could get a lot of inspiration from there, both on potential
applications and on technical hints on how to improve result quality.
Yes, definitely I would use them, thanks.
You will find that people are using many
different approaches in these
works. The good old ANN is still a relevant algorithm in practice, but
there are many other techniques, such as SVMs, Markov models, or random
forests, which have been found to work better than ANNs in many cases. Not
saying that a three-layer feed-forward ANN cannot do some jobs as well, but
I would not restrict to one ML approach if you have a whole arsenal of
algorithms available, most of them pre-implemented in libraries (the first
Google hit has a lot of relevant projects listed:
http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/). I would certainly
recommend that you don't implement any of the standard ML algorithms from
scratch.
I use the backpropagation algorithm, and I use Octave for my personal ML
work, but for Wikipedia I use Python, for two main reasons: integration
with other Wikipedia-related tools like pywikibot, and the bad performance
of Octave and MATLAB on big data sets. I had to write those parts from
scratch since I couldn't find any related library in Python. Even
algorithms like BFGS aren't there (I could find one in SciPy, but I wasn't
sure it works correctly, and there was no documentation).
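For what it's worth, SciPy's BFGS can be sanity-checked in a few lines by minimizing a convex quadratic with a known minimum (a standalone sketch, not part of the bot):

```python
import numpy as np
from scipy.optimize import minimize

# Sanity check: minimize f(x) = (x0 - 3)^2 + (x1 + 1)^2,
# whose unique minimum is at (3, -1).
def f(x):
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

def grad(x):
    return np.array([2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0)])

result = minimize(f, x0=np.zeros(2), jac=grad, method="BFGS")
print(result.x)  # approximately [3, -1]
```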
In practice, the most challenging task for
successful ML is often feature
engineering: the question which features you use as an input to your
learning algorithm. This is far more important than the choice of
algorithm. Wikipedia in particular offers you so many relevant pieces of
information with each article that are not just mere keywords (links,
categories, in-links, ...) and it is not easy to decide which of these to
feed into your learner. This will be different for each task you solve
(subject classification is fundamentally different from vandalism
detection, and even different types of vandalism would require very
different techniques). You should pick hard or very large tasks to make
sure that the tweaking you need in each case takes less time than you would
need as a human to solve the task manually ;-)
Yes, feature engineering is the most important thing and it can be tricky,
but feature engineering in Wikidata is a lot easier (easier than in
Wikipedia, and Wikipedia itself is easier than other places). Anti-vandalism
bots are a lot easier in Wikidata than in Wikipedia: edits in Wikidata are
limited to certain kinds of actions (like removing a sitelink, etc.), which
is not the case in Wikipedia.
Anyway, it's an interesting field, and we
could certainly use some effort
to exploit the countless works in this field for Wikidata. But you should
be aware that this is no small challenge and that there is no universal
solution that will work well even for all the tasks that you have mentioned
in your email.
Of course, I spent lots of time studying this and I would be happy if
anyone who
knows about neural networks or AI can contribute too.
Best wishes,
Markus
On 07.03.2015 18:21, Magnus Manske wrote:
Congratulations for this bold step towards the
Singularity :-)
As for tasks, basically everything us mere humans do in the Wikidata
game:
https://tools.wmflabs.org/wikidata-game/
Some may require text parsing. Not sure how to get that working; haven't
spent much time with (artificial) neural nets in a while.
On Sat, Mar 7, 2015 at 12:36 PM Amir Ladsgroup <ladsgroup(a)gmail.com> wrote:
Some useful tasks that I'm looking for a way to do are:
* Anti-vandal bot (or how we can quantify an edit).
* Auto-labelling for humans (that's the next task).
* Add more :)
On Sat, Mar 7, 2015 at 3:54 PM, Amir Ladsgroup <ladsgroup(a)gmail.com> wrote:
Hey,
I spent the last few weeks working on this with the lights off [1] and now
it's ready to work!
Kian is a three-layer neural network with a flexible number of inputs and
outputs. So if we can parametrize a job, we can teach him easily and get
the job done.
For example, as the first job, we want to add P31:Q5 (human) to Wikidata
items based on the categories of articles in Wikipedia. The only thing we
need to do is get a list of items with P31:Q5 and a list of items that are
not humans (P31 exists but without Q5), then get a list of category links
in any wiki we want [2], and finally feed these files to Kian and let him
learn. Afterwards, if we give Kian other articles and their categories, he
classifies them as human, not human, or failed to determine.
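Not Kian's actual code, but a minimal sketch of the setup described above, assuming binary category-membership features and a hypothetical category vocabulary: a three-layer feed-forward network whose sigmoid output is thresholded into human / not human / failed to determine.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ThreeLayerNet:
    """Input layer -> one hidden layer -> sigmoid output in [0, 1]."""
    def __init__(self, n_inputs, n_hidden):
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=n_hidden)
        self.b2 = 0.0

    def predict(self, x):
        h = sigmoid(self.W1 @ x + self.b1)
        return sigmoid(self.W2 @ h + self.b2)

# One binary input per category the article belongs to (hypothetical
# vocabulary; the real feature set comes from the category-link dump).
categories = ["1952 births", "Living people", "Rivers of Germany"]
net = ThreeLayerNet(n_inputs=len(categories), n_hidden=4)

def classify(x, threshold=0.9):
    p = net.predict(np.asarray(x, dtype=float))
    if p >= threshold:
        return "human"
    if p <= 1.0 - threshold:
        return "not human"
    return "failed to determine"

# Untrained small weights give p near 0.5, so the net abstains.
print(classify([1, 1, 0]))  # "failed to determine"
```

In a real run the weights would of course come from backpropagation over the P31:Q5 / not-Q5 training sets rather than random initialisation.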
As a test I gave him the categories of ckb wiki (a small wiki) and it
worked pretty well; now I'm creating the training set from the German
Wikipedia, and the next step will be the English Wikipedia. The number of
P31:Q5 claims will drastically increase this week.
I would love comments or ideas for tasks that Kian can do.
[1]: Because I love surprises
[2]: "select pp_value, cl_to from page_props join categorylinks
on pp_page = cl_from where pp_propname = 'wikibase_item';"
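A sketch (hypothetical rows, not Kian's code) of how the (item, category) pairs returned by a query like the one in [2] could be turned into fixed-length binary feature vectors for the network:

```python
from collections import defaultdict

# Hypothetical (pp_value, cl_to) rows as returned by the
# page_props/categorylinks query.
rows = [
    ("Q42", "1952_births"), ("Q42", "English_writers"),
    ("Q590010", "Individual_trees"),
]

item_cats = defaultdict(set)
for item, cat in rows:
    item_cats[item].add(cat)

vocab = sorted({cat for _, cat in rows})  # category vocabulary

def vectorize(item):
    """One binary slot per known category."""
    return [1 if c in item_cats[item] else 0 for c in vocab]

print(vectorize("Q42"))      # [1, 1, 0]
print(vectorize("Q590010"))  # [0, 0, 1]
```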
Best
--
Amir
_______________________________________________
Wikidata-l mailing list
Wikidata-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
--
Amir