Yesterday I published a new version of Kian
<https://github.com/Ladsgroup/Kian>. I ran it to add statements to claimless
items from the Japanese and German Wikipedias, and it is working. I'm
planning to add the French and English Wikipedias. You can install it and
run it too.
Another thing I did is reporting possible mistakes: cases where Wikipedia
and Wikidata don't agree on a statement. The results are at
<https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes>, and I
was able to find all kinds of errors, such as: a human in Wikidata
<https://www.wikidata.org/wiki/Q272267> that is a disambiguation page in
German Wikipedia <https://de.wikipedia.org/wiki/%C3%86thelwald> (it's in
this list); a film in Wikidata <https://www.wikidata.org/wiki/Q699321> that
is a TV series in ja.wp
<https://ja.wikipedia.org/wiki/%E3%81%94%E3%81%8F%E3%81%9B%E3%82%93> (the
item actually seems to be a mess; from this list); and an Iranian
mythological character in several wikis
<https://www.wikidata.org/wiki/Q1300562> but an "actor and model from the
U.S." in Wikidata (it came from this list). Please go through the lists and
fix as much as you can, and also give me suggestions on wikis and
statements to run this code on.
How does the new version of Kian work? I introduced the concept of a
"model" in Kian. A model consists of four properties: a wiki (such as
"enwikivoyage"), a name (an arbitrary name), a property (like "P31" or
"P27"), and a value of that property (like "Q5" for P31 or "Q31" for P27).
Kian then trains that model, and once the model is ready, you can use it to
add statements to any kind of list of articles (more technically, page
generators of pywikibot). For example, you can add this statement to new
articles by running something like this:
python scripts/parser.py -lang:ja -newpages:100 -n jaHuman
where jaHuman is the name of that model. It caches all data related to that
model in data/jaHuman/.
Or, to find possible mistakes in that wiki:
python scripts/possible_mistakes.py -n jaHuman
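The four properties of a model described above can be sketched as a small data structure. This is a hypothetical illustration only; the `Model` class, its field names, and `cache_dir` are my own, not Kian's actual code:

```python
from dataclasses import dataclass

@dataclass
class Model:
    """Illustrative sketch of Kian's 'model' concept (not Kian's real API)."""
    wiki: str    # which wiki, e.g. "jawiki" or "enwikivoyage"
    name: str    # arbitrary label; also used as the cache directory name
    prop: str    # Wikidata property, e.g. "P31"
    value: str   # target value of that property, e.g. "Q5"

    def cache_dir(self) -> str:
        # Kian keeps a model's data under data/<name>/
        return "data/%s/" % self.name

ja_human = Model(wiki="jawiki", name="jaHuman", prop="P31", value="Q5")
print(ja_human.cache_dir())  # prints data/jaHuman/
```

So `-n jaHuman` on the command line simply selects the model whose data lives in data/jaHuman/.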
Other things worth mentioning:
* The scripts of Kian and the library (the part that actually does the
work) are separated, so you can easily write your own scripts for Kian.
* Since it uses autolists to train and find possible mistakes, the results
are always up to date.
* Kian now caches the results of SQL queries separately from each model's
folder, so the first model you build for Spanish Wikipedia may take a while
to complete, but a second model for Spanish Wikipedia takes much less time.
* I doubled the number of features in a way that makes Kian's accuracy
really high (e.g., P31:Q5 for German Wikipedia has an AUC of 99.75%, and
precision and recall are 99.11% and 98.31% at a threshold of 63%).
* Thresholds are chosen automatically based on F-beta scores
<https://en.wikipedia.org/wiki/F1_score> to get optimal accuracy.
* It can give results in different classes of certainty, and we can send
these results to semi-automated tools. If anyone is willing to help, please
do.
* I try to follow dependency injection principles, so it is possible to
train any kind of model using Kian and get the results (since we don't have
really good libraries to do ANN training).
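To illustrate the automatic threshold selection mentioned above, here is a minimal sketch (not Kian's actual code; `f_beta` and `best_threshold` are illustrative names) of picking the probability cut-off that maximises the F-beta score over labelled training scores:

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta=1 gives the familiar F1."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_threshold(scored, beta=1.0):
    """scored: list of (probability, is_true_label) pairs.
    Returns (best_f_score, threshold) by brute force over candidate cut-offs."""
    total_pos = sum(1 for _, y in scored if y)
    best = (0.0, 0.0)
    for t in sorted({p for p, _ in scored}):
        predicted = [(p, y) for p, y in scored if p >= t]
        if not predicted or not total_pos:
            continue
        tp = sum(1 for _, y in predicted if y)
        prec = tp / len(predicted)
        rec = tp / total_pos
        best = max(best, (f_beta(prec, rec, beta), t))
    return best

scores = [(0.9, True), (0.8, True), (0.4, False), (0.3, False)]
print(best_threshold(scores))  # prints (1.0, 0.8)
```

Choosing beta below or above 1 would bias the chosen threshold toward precision or recall, respectively, which is one way to get the "different classes of certainty" mentioned above.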
A crazy idea: what do you think if I make a web service for Kian, so you
can go to a page in Labs, register a model, and after a while get the
results, or use OAuth to add statements?
Last thing: suggest models to me and I will work on them :)
: The old Kian worked this way: it labeled all categories based on the
percentage of members that already have that statement, then labeled
articles based on the number of categories of each class the article has.
The new Kian does this too, but it also labels categories based on the
percentage of members that have that property but not that value (e.g.
"Category:Fictional characters" would have a high percentage in a model of
P31:Q5) and also labels articles based on the number of categories in each
class.
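The category-scoring idea above can be sketched as follows. This is a hedged illustration, not Kian's real code: for each category we compute the fraction of member items that already carry the statement (e.g. P31:Q5), and the fraction that carry the property with a different value:

```python
def category_scores(members, prop="P31", value="Q5"):
    """members: each item's claims as a flat dict, e.g. {"P31": "Q5"} or {}.
    Returns (share with prop=value, share with prop but a different value).
    Illustrative only -- real Wikidata claims are richer than one value."""
    n = len(members)
    if n == 0:
        return 0.0, 0.0
    has_value = sum(1 for m in members if m.get(prop) == value)
    has_other = sum(1 for m in members if prop in m and m[prop] != value)
    return has_value / n, has_other / n

# Two humans, one film, one item with no P31 claim at all:
items = [{"P31": "Q5"}, {"P31": "Q5"}, {"P31": "Q11424"}, {}]
print(category_scores(items))  # prints (0.5, 0.25)
```

The second score is what's new: a category like "Fictional characters" gets a high "property but not value" share for a P31:Q5 model, which lets Kian use it as negative evidence when labeling articles.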