Hello,
Yesterday I published a new version of Kian: https://github.com/Ladsgroup/Kian. I ran it to add statements to claimless items from the Japanese and German Wikipedias, and it is working: https://www.wikidata.org/w/index.php?title=Special:Contributions/Dexbot&offset=20150828171607&target=Dexbot. I'm planning to add French and English Wikipedia next. You can install and run it too.
Another thing I did is report possible mistakes: cases where Wikipedia and Wikidata don't agree on a statement. These are the results: https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes. I was able to find all kinds of errors, such as: an item that is a human in Wikidata (https://www.wikidata.org/wiki/Q272267) but a disambiguation page in German Wikipedia (https://de.wikipedia.org/wiki/%C3%86thelwald) (it's in this list: https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/deHuman); an item that is a film in Wikidata (https://www.wikidata.org/wiki/Q699321) but a TV series in ja.wp (https://ja.wikipedia.org/wiki/%E3%81%94%E3%81%8F%E3%81%9B%E3%82%93) (the item actually seems to be a mess; it's from this list: https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/jaFilm); or an Iranian mythological character in several wikis (https://www.wikidata.org/wiki/Q1300562) that is an "actor and model from the U.S." in Wikidata (from this list: https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/faHuman). Please go through the lists (https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes) and fix as much as you can, and also suggest wikis and statements to run this code on.
How does the new version of Kian work? I introduced the concept of a "model" in Kian. A model consists of four properties: a wiki (such as "enwikivoyage" or "fawiki"), a name (an arbitrary name), a property (like "P31" or "P27"), and a value of that property (like "Q5" for P31 or "Q31" for P27). Kian then trains that model, and once the model is ready, you can use it to add statements to any kind of list of articles (more technically, page generators of pywikibot). For example, you can add the statement to new articles by running something like:

    python scripts/parser.py -lang:ja -newpages:100 -n jaHuman

where jaHuman is the name of the model. Kian caches all data related to that model in data/jaHuman/. Or you can find possible mistakes in that wiki:

    python scripts/possible_mistakes.py -n jaHuman

etc.
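To make the concept concrete, here is a rough sketch in Python of what a model bundles together. This is an illustration only; Kian's actual classes and method names differ:

    # Illustration of the "model" concept described above;
    # not Kian's real API.
    class Model:
        def __init__(self, wiki, name, prop, value):
            self.wiki = wiki    # e.g. "jawiki"
            self.name = name    # arbitrary label, e.g. "jaHuman"
            self.prop = prop    # e.g. "P31"
            self.value = value  # e.g. "Q5"

        def data_dir(self):
            # Everything related to the model is cached under data/<name>/.
            return "data/%s/" % self.name

    model = Model("jawiki", "jaHuman", "P31", "Q5")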
Other things worth mentioning:
* The scripts of Kian and the library (the part that actually does the work) are separated, so you can easily write your own scripts for Kian.
* Since it uses autolists to train and to find possible mistakes, the results are live.
* Kian now caches the results of SQL queries in a separate folder per model, so the first model you build for Spanish Wikipedia may take a while to complete, but a second model for Spanish Wikipedia takes much less time.
* I doubled the number of features in a way that makes Kian's accuracy really high [1] (e.g. P31:Q5 for German Wikipedia has an AUC of 99.75%, with precision and recall of 99.11% and 98.31% at a threshold of 63%).
* Thresholds are chosen automatically based on F-beta scores (https://en.wikipedia.org/wiki/F1_score) to get optimal accuracy and high recall; see the sketch after this list.
* It can give results in different classes of certainty, and we can send these results to semi-automated tools. If anyone is willing to help, please do tell.
* I try to follow dependency injection principles, so it is possible to train any kind of model using Kian and get the results (since we don't have really good libraries for ANN training: https://www.quora.com/What-is-the-best-neural-network-library-for-Python).
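To illustrate the threshold selection, here is a minimal sketch in Python. This is not Kian's actual code, and the beta value is only an example (beta > 1 favors recall over precision):

    # Minimal sketch of F-beta-based threshold selection; not Kian's code.
    # scores: model outputs in [0, 1]; labels: the true boolean per item.

    def f_beta(precision, recall, beta):
        if precision + recall == 0:
            return 0.0
        return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

    def best_threshold(scores, labels, beta=2.0):
        best_t, best_f = 0.5, -1.0
        for t in (i / 100.0 for i in range(1, 100)):
            tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
            fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
            fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
            precision = tp / float(tp + fp) if tp + fp else 0.0
            recall = tp / float(tp + fn) if tp + fn else 0.0
            f = f_beta(precision, recall, beta)
            if f > best_f:
                best_t, best_f = t, f
        return best_t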
A crazy idea: what do you think if I make a web service for Kian, so you can go to a page on Labs, register a model, and after a while get the results, or use OAuth to add the statements?
Last thing: suggest models to me and I will work on them :)
Best

[1]: The old Kian worked this way: it labeled all categories based on the percentage of members that already have the statement, and then labeled articles based on the number of categories from each class that the article has. The new Kian does this too, but it also labels categories based on the percentage of members that have the property but not that value (e.g. "Category:Fictional characters" would have a high percentage in a model of P31:Q5), and again labels articles based on the number of categories in each class. A rough sketch of this scoring follows.
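To make this concrete, here is a rough sketch of the two category signals and the doubled article features. The data structures here are hypothetical; this is not Kian's actual code:

    # Hypothetical sketch of the category scoring described in [1].
    # members: category -> list of (has_property, has_value) pairs, one per
    # member article, e.g. (True, False) = has P31 but not P31:Q5.

    def category_scores(members):
        scores = {}
        for cat, pairs in members.items():
            if not pairs:
                continue
            n = float(len(pairs))
            with_value = sum(1 for has_p, has_v in pairs if has_v) / n
            wrong_value = sum(1 for has_p, has_v in pairs if has_p and not has_v) / n
            scores[cat] = (with_value, wrong_value)
        return scores

    def article_features(article_cats, scores, bins=(0.0, 0.25, 0.5, 0.75, 1.01)):
        # Count the article's categories that fall into each score class,
        # once per signal; this is what doubles the number of features.
        k = len(bins) - 1
        features = [0] * (2 * k)
        for cat in article_cats:
            with_value, wrong_value = scores.get(cat, (0.0, 0.0))
            for i in range(k):
                if bins[i] <= with_value < bins[i + 1]:
                    features[i] += 1
                if bins[i] <= wrong_value < bins[i + 1]:
                    features[k + i] += 1
        return features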
Thanks Nemo!
I added new reports: https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes
If you check them, you can easily find tons of errors: some are mis-categorizations in Wikipedia, some are mistakes in connecting a Wikipedia article to the wrong item, some are vandalism in Wikidata, and some are mistakes by bots or Widar users. Please check them if you want better quality in Wikidata.
Best
On Sun, Aug 30, 2015 at 12:16 PM Federico Leva (Nemo) nemowiki@gmail.com wrote:
Amir Ladsgroup, 28/08/2015 20:17:
Another thing I did is reporting possible mistakes, when Wikipedia and Wikidata don't agree on one statement,
Nice, with this Wikidata has better quality control systems than Wikipedia. ;-)
Nemo
After glancing at https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/frFilm, it doesn't appear to me that either Wikidata type hierarchy or Wikipedia category hierarchy is being considered when evaluating type mismatches. Is that intentional?
For example:
Grave of the Fireflies (Q274520) https://www.wikidata.org/wiki/Q274520 No/Yes (0.731427666987)
is an instance of animated film, which is a subtype of film.
Conversely, this "téléfilm d'horreur" (horror TV movie)
Le Collectionneur de cerveaux (Q579355) https://www.wikidata.org/wiki/Q579355 Yes/No (0.239868037957)
is part of a subcategory of "film d'horreur" (horror film) -> "film de fiction" (fiction film).
The other one I glanced at, https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/frHuman, seems to have systematic issues with the correct classification of Wikipedia pages about multiple people (e.g. brothers), which Wikidata correctly identifies as not people.
It also, strangely, seems to think that Wikidata atomic elements are humans, and I can't see why:
calcium (Q706) https://www.wikidata.org/wiki/Q706 Yes/No (0.0225392419603)
Have you considered using other signals as inputs to your models? For example, Freebase types should be a pretty reliable signal for things like humans and films.
Tom
Hey Tom,
Thanks for your review. Note that this is a list of *possible* errors; it doesn't mean all of the entries are wrong :) (if that were the case, I would go ahead and remove them all)
On Mon, Aug 31, 2015 at 1:22 AM Tom Morris tfmorris@gmail.com wrote:
After glancing at https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/frFilm, it doesn't appear to me that either Wikidata type hierarchy or Wikipedia category hierarchy is being considered when evaluating type mismatches. Is that intentional?
Not yet; it can be done with some programming hassle. If people like the reports and are willing to work on them, I promise to take that into account.
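Just to sketch what taking the Wikidata type hierarchy into account could look like, here is an illustration using the Wikidata Query Service (this is not something Kian does today, and Q11424 below is just "film" as an example):

    # Illustration only: check whether an item is an instance of a class
    # or of any of its subclasses, via the Wikidata Query Service.
    import requests

    def is_instance_of(item, cls):
        # e.g. is_instance_of("Q274520", "Q11424") should be True, because
        # animated film is a subclass (P279) of film.
        query = "ASK { wd:%s wdt:P31/wdt:P279* wd:%s . }" % (item, cls)
        r = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": query, "format": "json"},
        )
        r.raise_for_status()
        return r.json()["boolean"]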
For example:
Grave of the Fireflies (Q274520) https://www.wikidata.org/wiki/Q274520 No/Yes (0.731427666987)
is an instance of animated film, which is a subtype of film.
Conversely, this "téléfilm d'horreur" (horror TV movie)
Le Collectionneur de cerveaux (Q579355) https://www.wikidata.org/wiki/Q579355 Yes/No (0.239868037957)
is part of a subcategory of "film d'horreur" (horror film) -> "film de fiction" (fiction film).
The other one I glanced at, https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/frHuman, seems to have systematic issues with the correct classification of Wikipedia pages about multiple people (e.g. brothers), which Wikidata correctly identifies as not people.
It can be considered a mis-classification of articles in Wikipedia, even though I'm not a big fan of this idea. It seems these Wikipedia articles lack proper categories, like sibling- and duo-related ones, and if those categories were there, Kian would know.
It also, strangely, seems to think that Wikidata atomic elements are humans, and I can't see why:
calcium (Q706) https://www.wikidata.org/wiki/Q706 Yes/No (0.0225392419603)
That's a bug in autolist; I don't know why autolist included Q706 in humans. Maybe Magnus can tell. I need to dig deeper.
Have you considered using other signals as inputs to your models? For example, Freebase types should be a pretty reliable signal for things like humans and films.
No, but I will think about it and investigate using them :)