Hello,

Yesterday I published a new version of Kian: https://github.com/Ladsgroup/Kian
I ran it to add statements to claimless items from Japanese Wikipedia and German Wikipedia, and it is working (see https://www.wikidata.org/w/index.php?title=Special:Contributions/Dexbot&offset=20150828171607&target=Dexbot ). I'm planning to add French and English Wikipedia next. You can install it and run it too.
Another thing I did is report possible mistakes, that is, cases where Wikipedia and Wikidata don't agree on a statement. The results are at https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes and I was able to find all kinds of errors, such as:
* Human in Wikidata https://www.wikidata.org/wiki/Q272267 but a disambiguation page in German Wikipedia https://de.wikipedia.org/wiki/%C3%86thelwald (from the list https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/deHuman )
* Film in Wikidata https://www.wikidata.org/wiki/Q699321 but TV series in ja.wp https://ja.wikipedia.org/wiki/%E3%81%94%E3%81%8F%E3%81%9B%E3%82%93 (the item actually seems to be a mess; from the list https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/jaFilm )
* Iranian mythological character in several wikis https://www.wikidata.org/wiki/Q1300562 but "actor and model from U.S." in Wikidata (from the list https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/faHuman )

Please go through the lists at https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes and fix as many as you can. Also, please suggest wikis and statements to run this code on.
How does the new version of Kian work? I introduced the concept of a "model" in Kian. A model consists of four properties: wiki (such as "enwikivoyage" or "fawiki"), name (an arbitrary name), property (like "P31" or "P27"), and the value of that property (like "Q5" for P31 or "Q31" for P27). Kian trains the model, and once the model is ready you can use it to add statements to any kind of list of articles (more technically, any pywikibot page generator). For example, to add this statement to new articles, run something like:

    python scripts/parser.py -lang:ja -newpages:100 -n jaHuman

where jaHuman is the name of the model. Kian caches all data related to that model in data/jaHuman/. Or, to find possible mistakes in that wiki:

    python scripts/possible_mistakes.py -n jaHuman
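To make the four properties concrete, here is a minimal sketch of a model written down as plain data. The class name `ModelSpec`, its field names, and the `cache_dir` helper are hypothetical and only for illustration; they are not taken from Kian's actual code. The wiki code "jawiki" is assumed by analogy with "fawiki".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical container for the four properties of a Kian model."""
    wiki: str   # wiki database name, e.g. "jawiki" or "enwikivoyage"
    name: str   # arbitrary model name, e.g. "jaHuman"
    prop: str   # Wikidata property, e.g. "P31"
    value: str  # value of that property, e.g. "Q5"

    def cache_dir(self) -> str:
        # The post says data for a model is cached under data/<name>/
        return f"data/{self.name}/"


# The jaHuman model from the example commands: "instance of: human" on ja.wp
ja_human = ModelSpec(wiki="jawiki", name="jaHuman", prop="P31", value="Q5")
print(ja_human.cache_dir())
```

Only the name is passed to the scripts (via -n), so the name is what ties a trained model to its cached data.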
Other things worth mentioning:
* The scripts of Kian and the library (the part that actually does the work) are separated, so you can easily write your own scripts for Kian.
* Since it uses autolists to train and to find possible mistakes, results are live.
* Kian now caches the results of SQL queries in a folder separate from the model's own, so the first model you build for Spanish Wikipedia may take a while to complete, but a second model for Spanish Wikipedia would take much less time.
* I doubled the number of features in a way that made the accuracy of Kian really high [1] (e.g. P31:Q5 for German Wikipedia has an AUC of 99.75%, and precision and recall are 99.11% and 98.31% at a threshold of 63%).
* Thresholds are chosen automatically based on F-beta scores https://en.wikipedia.org/wiki/F1_score to get optimum accuracy and high recall.
* It can give results in different classes of certainty, and we can send these results to semi-automated tools. If anyone is willing to help, please do tell.
* I try to follow dependency injection principles, so it is possible to train any kind of model using Kian and get the results (since we don't have really good libraries for ANN training https://www.quora.com/What-is-the-best-neural-network-library-for-Python ).
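As an illustration of choosing a threshold from F-beta scores, here is a rough sketch in plain Python. It is not Kian's actual code: the function names, the grid of 101 candidate thresholds, and beta = 2 (which weights recall more than precision) are all assumptions made for the example.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pick_threshold(scores, labels, beta=2.0, steps=100):
    """Scan thresholds 0, 1/steps, ..., 1 and return (threshold, F-beta)
    for the lowest threshold that achieves the best F-beta score."""
    best_t, best_f = 0.0, -1.0
    for i in range(steps + 1):
        t = i / steps
        # An article is predicted positive when its score reaches the threshold
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = f_beta(precision, recall, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Toy data: model outputs and whether the item really has the statement
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
labels = [0, 0, 1, 1, 1]
best_t, best_f = pick_threshold(scores, labels)
print(best_t, best_f)
```

With a larger beta the chosen threshold moves lower (more recall); with beta below 1 it moves higher (more precision).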
A crazy idea: what do you think about me making a web service for Kian, so you can go to a page on Labs, register a model, and after a while get the results, or use OAuth to add statements?
Last thing: suggest models to me and I will work on them :)
[1]: The old Kian worked this way: it labeled all categories based on the percentage of members that already have that statement, then labeled articles based on the number of categories in each class that the article belongs to. The new Kian does this too, but it also labels categories based on the percentage of members that have that property but not that value (e.g. "Category:Fictional characters" would have a high percentage in the model of P31:Q5), and labels articles based on the number of categories in each of those classes as well.

Best