Hello,
Yesterday I published a new version of Kian: https://github.com/Ladsgroup/Kian. I ran it to add statements to claimless items from the Japanese and German Wikipedias, and it is working: https://www.wikidata.org/w/index.php?title=Special:Contributions/Dexbot&offset=20150828171607&target=Dexbot. I'm planning to add French and English Wikipedia next. You can install and run it too.
Another thing I did is report possible mistakes: cases where Wikipedia and Wikidata don't agree on a statement. These are the results: https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes. I was able to find all kinds of errors, such as: an item that is a human in Wikidata (https://www.wikidata.org/wiki/Q272267) but a disambiguation page in German Wikipedia (https://de.wikipedia.org/wiki/%C3%86thelwald) (it's in this list: https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/deHuman); an item that is a film in Wikidata (https://www.wikidata.org/wiki/Q699321) but a TV series in ja.wp (https://ja.wikipedia.org/wiki/%E3%81%94%E3%81%8F%E3%81%9B%E3%82%93) (the item actually seems to be a mess; it's from this list: https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/jaFilm); or an Iranian mythological character in several wikis (https://www.wikidata.org/wiki/Q1300562) that is an "actor and model from the U.S." in Wikidata (from this list: https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/faHuman). Please go through the lists (https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes) and fix as much as you can, and also suggest wikis and statements to run this code on.
How does the new version of Kian work? I introduced the concept of a "model" in Kian. A model consists of four properties: a wiki (such as "enwikivoyage" or "fawiki"), a name (an arbitrary name), a property (like "P31" or "P27"), and a value of that property (like "Q5" for P31 or "Q31" for P27). Kian then trains that model, and once the model is ready, you can use it to add statements to any kind of list of articles (more technically, page generators of pywikibot). For example, you can add the statement to new articles by running something like:

    python scripts/parser.py -lang:ja -newpages:100 -n jaHuman

where jaHuman is the name of the model. Kian caches all data related to that model in data/jaHuman/. Or you can find possible mistakes in that wiki:

    python scripts/possible_mistakes.py -n jaHuman

etc.
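To make the concept concrete, here is a rough sketch in Python of what a model bundles together. This is an illustration only; Kian's actual classes and method names differ:

    # Illustration of the "model" concept described above;
    # not Kian's real API.
    class Model:
        def __init__(self, wiki, name, prop, value):
            self.wiki = wiki    # e.g. "jawiki"
            self.name = name    # arbitrary label, e.g. "jaHuman"
            self.prop = prop    # e.g. "P31"
            self.value = value  # e.g. "Q5"

        def data_dir(self):
            # Everything related to the model is cached under data/<name>/.
            return "data/%s/" % self.name

    model = Model("jawiki", "jaHuman", "P31", "Q5")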
Other things worth mentioning:
* The scripts of Kian and the library (the part that actually does the work) are separated, so you can easily write your own scripts for Kian.
* Since it uses autolists to train and to find possible mistakes, the results are live.
* Kian now caches the results of SQL queries in a separate folder per model, so the first model you build for Spanish Wikipedia may take a while to complete, but a second model for Spanish Wikipedia takes much less time.
* I doubled the number of features in a way that makes Kian's accuracy really high [1] (e.g. P31:Q5 for German Wikipedia has an AUC of 99.75%, with precision and recall of 99.11% and 98.31% at a threshold of 63%).
* Thresholds are chosen automatically based on F-beta scores (https://en.wikipedia.org/wiki/F1_score) to get optimal accuracy and high recall; see the sketch after this list.
* It can give results in different classes of certainty, and we can send these results to semi-automated tools. If anyone is willing to help, please do tell.
* I try to follow dependency injection principles, so it is possible to train any kind of model using Kian and get the results (since we don't have really good libraries for ANN training: https://www.quora.com/What-is-the-best-neural-network-library-for-Python).
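To illustrate the threshold selection, here is a minimal sketch in Python. This is not Kian's actual code, and the beta value is only an example (beta > 1 favors recall over precision):

    # Minimal sketch of F-beta-based threshold selection; not Kian's code.
    # scores: model outputs in [0, 1]; labels: the true boolean per item.

    def f_beta(precision, recall, beta):
        if precision + recall == 0:
            return 0.0
        return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

    def best_threshold(scores, labels, beta=2.0):
        best_t, best_f = 0.5, -1.0
        for t in (i / 100.0 for i in range(1, 100)):
            tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
            fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
            fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
            precision = tp / float(tp + fp) if tp + fp else 0.0
            recall = tp / float(tp + fn) if tp + fn else 0.0
            f = f_beta(precision, recall, beta)
            if f > best_f:
                best_t, best_f = t, f
        return best_t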
A crazy idea: what do you think if I make a web service for Kian, so you can go to a page on Labs, register a model, and after a while get the results, or use OAuth to add the statements?
Last thing: suggest models to me and I will work on them :)
Best

[1]: The old Kian worked this way: it labeled all categories based on the percentage of members that already have the statement, and then labeled articles based on the number of categories from each class that the article has. The new Kian does this too, but it also labels categories based on the percentage of members that have the property but not that value (e.g. "Category:Fictional characters" would have a high percentage in a model of P31:Q5), and again labels articles based on the number of categories in each class. A rough sketch of this scoring follows.
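To make this concrete, here is a rough sketch of the two category signals and the doubled article features. The data structures here are hypothetical; this is not Kian's actual code:

    # Hypothetical sketch of the category scoring described in [1].
    # members: category -> list of (has_property, has_value) pairs, one per
    # member article, e.g. (True, False) = has P31 but not P31:Q5.

    def category_scores(members):
        scores = {}
        for cat, pairs in members.items():
            if not pairs:
                continue
            n = float(len(pairs))
            with_value = sum(1 for has_p, has_v in pairs if has_v) / n
            wrong_value = sum(1 for has_p, has_v in pairs if has_p and not has_v) / n
            scores[cat] = (with_value, wrong_value)
        return scores

    def article_features(article_cats, scores, bins=(0.0, 0.25, 0.5, 0.75, 1.01)):
        # Count the article's categories that fall into each score class,
        # once per signal; this is what doubles the number of features.
        k = len(bins) - 1
        features = [0] * (2 * k)
        for cat in article_cats:
            with_value, wrong_value = scores.get(cat, (0.0, 0.0))
            for i in range(k):
                if bins[i] <= with_value < bins[i + 1]:
                    features[i] += 1
                if bins[i] <= wrong_value < bins[i + 1]:
                    features[k + i] += 1
        return features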
Thanks Nemo!
I added new reports: https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes
If you check them, you can easily find tons of errors: some are mis-categorizations in Wikipedia, some are mistakes in connecting a Wikipedia article to the wrong item, some are vandalism in Wikidata, and some are mistakes by bots or Widar users. Please check them if you want better quality in Wikidata.
Best
On Sun, Aug 30, 2015 at 12:16 PM Federico Leva (Nemo) nemowiki@gmail.com wrote:
Amir Ladsgroup, 28/08/2015 20:17:
Another thing I did is reporting possible mistakes, when Wikipedia and Wikidata don't agree on one statement,
Nice, with this Wikidata has better quality control systems than Wikipedia. ;-)
Nemo
After glancing at https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/frFilm, it doesn't appear to me that either Wikidata type hierarchy or Wikipedia category hierarchy is being considered when evaluating type mismatches. Is that intentional?
For example:
Grave of the Fireflies (Q274520) https://www.wikidata.org/wiki/Q274520 No/Yes (0.731427666987)
is an instance of animated film, which is a subtype of film.
Conversely, this "téléfilm d'horreur" (horror TV movie)
Le Collectionneur de cerveaux (Q579355) https://www.wikidata.org/wiki/Q579355 Yes/No (0.239868037957)
is part of a subcategory of "film d'horreur" (horror film) -> "film de fiction" (fiction film).
The other one I glanced at, https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/frHuman, seems to have systematic issues with the correct classification of Wikipedia pages about multiple people (e.g. brothers), which Wikidata correctly identifies as not people.
It also, strangely, seems to think that Wikidata atomic elements are humans, and I can't see why:
calcium (Q706) https://www.wikidata.org/wiki/Q706 Yes/No (0.0225392419603)
Have you considered using other signals as inputs to your models? For example, Freebase types should be a pretty reliable signal for things like humans and films.
Tom
Hey Tom,
Thanks for your review. Note that this is a list of *possible* errors; it doesn't mean all of the entries are wrong :) (if that were the case, I would go ahead and remove them all)
On Mon, Aug 31, 2015 at 1:22 AM Tom Morris tfmorris@gmail.com wrote:
After glancing at https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/frFilm, it doesn't appear to me that either Wikidata type hierarchy or Wikipedia category hierarchy is being considered when evaluating type mismatches. Is that intentional?
Not yet; it can be done with some programming hassle. If people like the reports and are willing to work on them, I promise to take that into account.
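Just to sketch what taking the Wikidata type hierarchy into account could look like, here is an illustration using the Wikidata Query Service (this is not something Kian does today, and Q11424 below is just "film" as an example):

    # Illustration only: check whether an item is an instance of a class
    # or of any of its subclasses, via the Wikidata Query Service.
    import requests

    def is_instance_of(item, cls):
        # e.g. is_instance_of("Q274520", "Q11424") should be True, because
        # animated film is a subclass (P279) of film.
        query = "ASK { wd:%s wdt:P31/wdt:P279* wd:%s . }" % (item, cls)
        r = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": query, "format": "json"},
        )
        r.raise_for_status()
        return r.json()["boolean"]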
For example:
Grave of the Fireflies (Q274520) https://www.wikidata.org/wiki/Q274520 No/Yes (0.731427666987)
is an instance of animated film, which is a subtype of film.
Conversely, this "téléfilm d'horreur" (horror TV movie)
Le Collectionneur de cerveaux (Q579355) https://www.wikidata.org/wiki/Q579355 Yes/No (0.239868037957)
is part of a subcategory of "film d'horreur" (horror film) -> "film de fiction" (fiction film).
The other one I glanced at, https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/frHuman, seems to have systematic issues with the correct classification of Wikipedia pages about multiple people (e.g. brothers), which Wikidata correctly identifies as not people.
It can be considered a mis-classification of articles in Wikipedia, even though I'm not a big fan of this idea. It seems these Wikipedia articles lack proper categories, like sibling- and duo-related ones, and if those categories were there, Kian would know.
It also, strangely, seems to think that Wikidata atomic elements are humans, and I can't see why:
calcium (Q706) https://www.wikidata.org/wiki/Q706 Yes/No (0.0225392419603)
That's a bug in autolist; I don't know why autolist included Q706 in humans. Maybe Magnus can tell. I need to dig deeper.
Have you considered using other signals as inputs to your models? For example, Freebase types should be a pretty reliable signal for things like humans and films.
No, but I will think about it and investigate using them :)