Hello,
Yesterday I published a new version of Kian
<https://github.com/Ladsgroup/Kian>. I ran it to add statements to claimless
items from the Japanese and German Wikipedias, and it is working
<https://www.wikidata.org/w/index.php?title=Special:Contributions/Dexbot&off…>.
I'm planning to add the French and English Wikipedias next. You can install
it and run it yourself.
Another thing I did is report possible mistakes: cases where Wikipedia and
Wikidata don't agree on a statement. These are the results
<https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes>, and I
was able to find all kinds of errors, such as: a human in Wikidata
<https://www.wikidata.org/wiki/Q272267> that is a disambiguation page in the
German Wikipedia <https://de.wikipedia.org/wiki/%C3%86thelwald> (it's in this
list
<https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/deHuman>);
a film in Wikidata <https://www.wikidata.org/wiki/Q699321> that is a TV
series in ja.wp
<https://ja.wikipedia.org/wiki/%E3%81%94%E3%81%8F%E3%81%9B%E3%82%93> (the
item actually seems to be a mess; it comes from this list
<https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/jaFilm>);
or an Iranian mythological character in several wikis
<https://www.wikidata.org/wiki/Q1300562> that is an "actor and model from
U.S." in Wikidata (it comes from this list
<https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes/faHuman>).
Please go through the lists
<https://www.wikidata.org/wiki/User:Ladsgroup/Kian/Possible_mistakes> and fix
as much as you can, and also suggest wikis and statements for me to run this
code on.
How does the new version of Kian work? I introduced the concept of a "model"
in Kian. A model consists of four properties: a wiki (such as "enwikivoyage"
or "fawiki"), a name (an arbitrary name), a property (like "P31" or "P27"),
and a value of that property (like "Q5" for P31 or "Q31" for P27). Kian then
trains that model, and once the model is ready, you can use it to add
statements to any kind of list of articles (more technically, page
generators of pywikibot). For example, you can add this statement to new
articles by running something like this:
python scripts/parser.py -lang:ja -newpages:100 -n jaHuman
where jaHuman is the name of that model. Kian caches all data related to that
model in data/jaHuman/
Or find possible mistakes in that wiki:
python scripts/possible_mistakes.py -n jaHuman
etc.
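Since the article lists come from pywikibot page generators, other
generators should work the same way. For example (a hypothetical invocation,
assuming parser.py passes generator options through to pywikibot, and an
illustrative category name):
python scripts/parser.py -lang:ja -cat:"存命人物" -n jaHuman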
Other things worth mentioning are:
* The scripts of Kian and the library (the part that actually does the work)
are separated, so you can easily write your own scripts for Kian.
* Since it uses autolists to train and to find possible mistakes, the
results are live.
* Kian now caches the results of SQL queries in a separate folder per model,
so the first model you build for the Spanish Wikipedia may take a while to
complete, but a second model for the Spanish Wikipedia will take much less
time.
* I doubled the number of features in a way that makes Kian's accuracy
really high [1] (e.g. P31:Q5 for the German Wikipedia has an AUC of 99.75%,
and precision and recall are 99.11% and 98.31% at a threshold of 63%).
* Thresholds are chosen automatically based on F-beta scores
<https://en.wikipedia.org/wiki/F1_score> to get optimal accuracy and high
recall (see the sketch after this list).
* It can report results in different classes of certainty, and we can send
these results to semi-automated tools. If anyone is willing to help, please
do tell.
* I try to follow dependency injection principles, so it is possible to
train any kind of model using Kian and get the results (since we don't have
really good libraries for ANN training
<https://www.quora.com/What-is-the-best-neural-network-library-for-Python>).
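To illustrate the threshold selection: a minimal sketch of picking the
threshold that maximises the F-beta score. The function names, the candidate
grid, and beta=2 are my own assumptions for illustration, not necessarily
what Kian does:

# Sketch: choose a classification threshold by maximising F-beta.
# `scores` are model outputs in [0, 1]; `labels` are 0/1 ground truths.
def f_beta(precision, recall, beta=2.0):
    # beta > 1 weights recall more heavily than precision
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

def pick_threshold(scores, labels, beta=2.0):
    best_t, best_f = 0.5, 0.0
    for t in (i / 100.0 for i in range(1, 100)):
        pred = [s >= t for s in scores]
        tp = sum(1 for p, l in zip(pred, labels) if p and l)
        fp = sum(1 for p, l in zip(pred, labels) if p and not l)
        fn = sum(1 for p, l in zip(pred, labels) if not p and l)
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        f = f_beta(precision, recall, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t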
A crazy idea: what do you think if I make a web service for Kian, so you can
go to a page on Labs, register a model, and after a while get the results,
or use OAuth to add statements?
Last thing: suggest models to me and I will work on them :)
[1]: The old Kian worked this way: it labeled all categories based on the
percentage of members that already have the statement, then labeled articles
based on the number of categories the article has in each class. The new
Kian does this too, but it also labels categories based on the percentage of
members that have the property but not the value (e.g. "Category:Fictional
characters" would have a high percentage in a model of P31:Q5), and likewise
labels articles based on the number of categories in each class. A rough
sketch of this feature construction follows.
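To make that concrete, here is my own reconstruction of the idea; the names
and data layout are illustrative, not Kian's actual code:

# Sketch of Kian-style category signals for a model like P31:Q5.
# has_statement(m): article m already has P31:Q5
# has_property(m): article m has some P31 statement (any value)
def category_ratios(members, has_statement, has_property):
    n = len(members) or 1
    pos = sum(1 for m in members if has_statement(m))
    neg = sum(1 for m in members if has_property(m) and not has_statement(m))
    return pos / float(n), neg / float(n)

# Each category is bucketed into a class by these two ratios; an article's
# feature vector is then the count of its categories falling in each class.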
Best
A few days ago I made the following post to Project Chat, which some people
on the list may have seen and which has now been archived. It looks at how
people are linking from Wikidata items to Commons categories and galleries,
compared to a year ago:
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2015/08#Trends_…
A couple of headlines:
* Category <-> commonscat identifications:
** There was a net increase of 61,784 Commons categories that can now be
identified with category-like items, to 323,825 Commons categories in all.
** 96.4% of category <-> commonscat identifications (312,266 items) now have
sitelinks. This represents a rise in sitelinks (60,463 items) amounting to
97.8% of the increase in identifications.
** 80.0% of category <-> commonscat identifications (259,164 items) now have
P373 statements. This represents a rise in P373 statements (8,774 items)
amounting to 14.2% of the increase in identifications.
* Article <-> commonscat identifications:
** There was a net increase of 176,382 Commons categories that can now be
identified with article-like items, to 884,439 Commons categories in all.
** 23.4% of article <-> commonscat identifications (207,494 items) now have
(deprecated) sitelinks. This represents a rise in sitelinks (112,595 items)
amounting to 63.8% of the increase in identifications.
** 91.3% of article <-> commonscat identifications (807,776 items) now have
P373 statements. This represents a rise in P373 statements (110,727 items)
amounting to 62.8% of the increase in identifications.
* In addition, a recent RfC showed considerable confusion as to what the
current operational Wikidata policy on sitelinks to Commons actually is:
https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Category_common…
In view of the trends above; the need for predictability and consistency for
queries, templates, and scripts to depend on; and particularly the apparent
confusion as to what the operational policy currently is, may I suggest that
the time has come for a bot to monitor all new sitelinks to Commons
categories,
* adding a corresponding P373 statement if there is not one already, and
* removing the sitelink if it is from an article-like item to a commonscat
(a rough sketch of such a bot follows below).
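For illustration only, here is a minimal pywikibot-style sketch of what such
a bot could do for a single item. The Q4167836 test for category-like items
and the function name are my assumptions; a real bot would also need edit
summaries, rate limiting, and community approval:

import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def enforce_commons_sitelink_policy(item):
    # Sketch: add a missing P373; drop article-item -> commonscat sitelinks.
    item.get()
    link = item.sitelinks.get("commonswiki")
    # depending on the pywikibot version, a sitelink is a title or an object
    title = link.title if hasattr(link, "title") else link
    if not title or not title.startswith("Category:"):
        return
    # Add a corresponding P373 statement if there is not one already.
    if "P373" not in item.claims:
        claim = pywikibot.Claim(repo, "P373")
        claim.setTarget(title[len("Category:"):])
        item.addClaim(claim)
    # Remove the sitelink if the item is article-like, here crudely
    # approximated as "not an instance of Wikimedia category (Q4167836)".
    category_like = any(c.getTarget() and c.getTarget().id == "Q4167836"
                        for c in item.claims.get("P31", []))
    if not category_like:
        item.removeSitelinks(["commonswiki"])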
I believe we have a clear policy of only sitelinking Commons categories to
category-like items, and Commons galleries to article-like items; but
confusion and unpredictability are currently being caused because these
relationships are not being enforced -- breaking scripts and queries.
It's time to fix this.
All best,
James.
Hi, I can only say that I agree 100% with Steinsplitter.
Yann
2015-08-29 15:39 GMT+02:00 Steinsplitter Wiki <steinsplitter-wiki(a)live.com>:
> Wikidata needs to ask the Commons Community before doing commons related
> changes.
>
> It is so hard to understand what the wikidata people like to do with
> commons. Tons of text, hard to read. I don't understand what they like to
> do, but if this change is affecting commons then commons community consensus
> is needed.
[Splitting the general (Wikidata reasoning; this thread) from the
specific (Wikidata family relationships for horses; original thread).]
Many issues have been brought up, and we cannot solve them all with one big
hammer. I have now started a WikiProject (see below) to address one of the
key points raised by Peter:
'''
Nobody has ever defined which inferences can/should be drawn from the
content of Wikidata.
'''
We do in fact use several properties that seem to ask for inferencing.
Probably the clearest is "subclass of" (P279). It has been related to
rdfs:subClassOf in many community discussions, so it seems clear that a
similar meaning is intended. This would lead to the following rule:
'''
If an item A has a "subclass of" statement with value B,
and if item B has a "subclass of" statement with value C,
then it should follow
that item A has a "subclass of" statement with value C.
'''
I think there is wide agreement on this idea. Constraints rely on it
(constraint checking traverses the P279 hierarchy), and it is a main
motivation for Wikidata Query's "tree" feature. There are similarly clear
intentions for the properties "instance of" (P31) and "subproperty of"
(P1647); I am not spelling them out here.
Nevertheless, Peter is right that even in these cases, the intention is
not fully clear, for two reasons:
(1) There is no machine-readable specification of the intended
behaviour. It's part of user discussions, not of the data or templates.
Even the user discussions are distributed over several pages, so a lot
of wiki archaeology is needed to get a full picture of what we, the
community, might have intended.
(2) The informal discussions on the intended semantics are not precise
about all relevant cases. Many questions remain open, such as what to do
if qualifiers are used on a statement (rarely the case for "subclass
of", but not so uncommon for "instance of").
To address these issues, I propose to come up with a format that allows us
to clearly specify inference rules such as the one for "subclass of" above.
Each rule should have one page where it is specified (for humans and
machines), explained (to humans), and discussed. It is not possible to
encode such rules as property values on data pages (for a start, it would
not be clear which page a rule should be on, because rules typically refer
to several properties and items). Therefore, the best we can do now seems
to be to have standard wiki pages for this. They could still be linked from
the talk pages of all relevant properties and items, though.
Even if we do not have any reasoner to compute all the results, writing
down the intended rules would be useful documentation to clarify for other
users what we expect (see the original family relationship discussion).
I propose to start by gathering use cases, that is, examples of rules that
we might want to express. From this, we can then extract a suitable
template structure. I have created a WikiProject to get us started:
https://www.wikidata.org/wiki/Wikidata:WikiProject_Reasoning
Feel free to contribute.
Best regards,
Markus
On 27.08.2015 06:26, Peter F. Patel-Schneider wrote:
>
> On 08/26/2015 06:01 PM, Svavar Kjarrval wrote:
>> On Wed, 26 Aug 2015 23:05, James Heald wrote:
>>> There are a *lot* of problems with P279 (subclass), right across
>>> Wikidata.
>>>
>>> These will only be corrected once people start doing searches in a
>>> systematic way and addressing the anomalies they find.
>>>
>>> In this case, politician (Q82955) should *not* be a subclass of human
>>> (Q5), instead it should be a subclass of something like occupation
>>> (Q13516667), or alternatively perhaps profession (Q28640).
>>>
>>>
>>> My understanding is that currently there are a vast number of
>>> incorrect subclass relationships in the project, messing up tree
>>> searches, and so far it is something that has simply not yet been
>>> systematically addressed.
>>>
>>> -- James.
>>>
>>>
>> For now, what's the best way to find (and perhaps correct) incorrect
>> declarations like these?
>>
>> If I were to just change items for commonly used items like politician
>> (Q82955), it might be construed as vandalism, or someone who doesn't
>> care about or understand the Stubbs-declared-as-a-human problem might
>> just add that declaration back later.
>>
>> When it comes to the gender property (P21), the human-readable
>> description indicates that it's to define genders in general, yet it's
>> declared as an instance of an item (Q18608871) which only applies to
>> humans, which of course has consequences further up in the hierarchy,
>> since the maintainers of item Q18608871 faithfully assume it only
>> applies to humans.
>
> Well, the situation with respect to Wikidata property for items about
> people (Q18608871) is very difficult. There is absolutely no
> machine-interpretable information associated with this class that can be
> used to determine that instances of it are only supposed to be used for
> people. So, at the bare minimum, such machine-interpretable information
> needs to be added.
>
> Then there is the issue that there is no theory of how the
> machine-interpretable information that is associated with entities in
> Wikidata is to be processed. All the processing is currently done using
> uninterpretable procedures. For example, on
> https://www.wikidata.org/wiki/Property_talk:P22 there is information that
> is used to control some piece of code that checks to see that the subject
> of https://www.wikidata.org/wiki/Property:P21 belongs to person (Q215627)
> or fictional character (Q95074). However, there is no theory showing how
> this interacts with other parts of Wikidata, even such inherent parts of
> Wikidata as https://www.wikidata.org/wiki/Property:P31
>
> In fact, there is even difficulty in determining simple truth in
> Wikidata. Two sources can conflict, and Wikidata is not in the position
> of being an arbiter for such conflicts, certainly not in general. To make
> the situation even more complex, Wikidata has a temporal aspect as well
> and has a need to admit exceptions to general statements.
>
> So what can be done? Any solution is going to be tricky. That is not to
> say that some solutions cannot be found by looking at systems and
> standards that are already being used for storing large amounts of
> complex information. However, any solution is going to have to be
> carefully tailored to meet the requirements of Wikidata and Wikidatans.
> (Is there an official term for the people who are putting Wikidata and
> Wikidata information together?)
>
> There is also a big chicken-and-egg problem here - a good solution to
> reliable machine-interpretation of Wikidata information requires, for
> example, consistent use of instance of, subclass, and subproperty; but
> what counts as a consistent use of these fundamental properties depends
> on a formal theory of what they mean.
>
>
> I, for one, would find even just the attempt to solve this problem vastly
> interesting, and I have been doing some exploration as to what might be
> needed. My company is interested in using Wikidata as a source of
> background information, but finds that the lack of a good theory of
> Wikidata information is problematic, so I have some cover for spending
> time on this problem.
>
> Anyway, if there is interest in machine interpretation of Wikidata
> information, if only to detect potential anomalies, I, and probably
> others, would be motivated to spend more time on trying to come up with
> potential solutions, hopefully in a collaborative effort that includes
> not just theoreticians but also Wikidatans.
>
>> In the case of the hierarchy Stubbs is associated with, the maintainers
>> have assumed all mayors are, without exception, humans, or they somehow
>> thought that if there were exceptions to this, the machines could
>> somehow detect and apply them in each case. Both of those methods are, I
>> think we agree, wrong and we should find out why it's happening.
>>
>> Is there a tool where one can put in a Wikidata item and it extracts
>> declarations based on "higher" properties like subclass or instance of?
>> Like if I were to input the item for Stubbs, it would travel the
>> hierarchy and tell me what would be assumed about Stubbs based on the
>> declarations further up in the tree.
>
> Yes, it is called a reasoner. The design of a reasoner would very likely
> be one result of the sort of work described above, but without such work
> it is very hard to figure out just what is supposed to be done in any
> except the simple cases.
>
>> - Svavar Kjarrval
>
> Peter F. Patel-Schneider
> Nuance Communications
--
Markus Kroetzsch, Departmental Lecturer
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529 http://korrekt.org/
Hi all,
I've recently been thinking about how we handle family/genealogical
relationships in Wikidata - this is, potentially, a really valuable source
of information for researchers to have available in structured form,
especially now that we're bringing together so many biographical databases.
We currently have the following properties to link people together:
* spouses (P26) and cohabitants (P451) - not gendered
* parents (P22/P25) and step-parents (P43/P44) - gendered
* siblings (P7/P9) - gendered
* children (P40) - not gendered (and oddly no step-children?)
* a generic "related to" (P1038) for more distant relationships
There are two big things that jump out here.
** First, gender. Parents are split by gender while children are not (we
have mother/father, not son/daughter). Siblings are likewise gendered, and
spouses are not. These are all very early properties - does anyone remember
how we got this way?
This makes for some odd results. For example, if we want to use our data to
identify all the male-line *descendants* of a person, we have to do some
complicated inference from [P40 + target is male]. However, to identify all
the male-line *ancestors*, we can just run back up the P22 chain. It feels
quite strange to have this difference, and I wonder if we should
standardise one way or the other - split P40, or merge the others.
In some ways, merging seems more elegant. We do have fairly good gender
metadata (and it is getting better all the time!), so we can still do
gender-specific relationship searches where needed. Merging also avoids
having to force a binary gender approach - we are in the odd position of
being able to give a nuanced entry in P21, while only being able to say
that someone is a "sister" or a "brother".
** Secondly, symmetry. Siblings, spouses, and parent-child pairs are
by definition symmetric. If A has P26:B, then B should also have
P26:A. The gendered cases are a little more complicated, as if A has
P40:B, then B has P22:A or P25:A, but there is still a degree of
symmetry - one of those must be true.
However, Wikidata doesn't really help us make use of this symmetry. If I
list A as spouse of B, I need to add (separately) that B is spouse of A. If
they have four children C, D, E, and F, this gets very complicated - we
have six articles with *30* links between them (2 spouse links, 16
parent-child links, and 12 sibling links), all of which need to be made
manually. It feels like automatically making symmetric links for these
properties would save a lot of work, and produce a much more reliable
dataset.
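To illustrate what that help could look like, here is a minimal sketch for
the symmetric case; the flat claim layout is an assumption for illustration,
not how Wikidata stores its data:

# Sketch: find the missing halves of a symmetric property such as P26.
# `claims` maps an item ID to {property ID: [target item IDs]}.
def missing_symmetric(claims, prop="P26"):
    missing = []
    for item, props in claims.items():
        for target in props.get(prop, []):
            if item not in claims.get(target, {}).get(prop, []):
                missing.append((target, prop, item))  # target should link back
    return missing

claims = {"Q1": {"P26": ["Q2"]}, "Q2": {}}
print(missing_symmetric(claims))  # -> [('Q2', 'P26', 'Q1')]

The gendered parent/child case is similar, except that the statement to
fill in (P22 or P25) depends on the parent's P21 value.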
I believe we decided early on not to do symmetric links because they would
swamp commonly linked items (imagine what Q5 would look like by now!). On
the other hand, these are properties with a very narrowly defined scope,
and we actively *want* them to be comprehensively symmetric - every
parent's item should list all their children on Wikidata, and every child's
item should list their parents and all their siblings.
Perhaps it's worth reconsidering whether to allow symmetry for a
specifically defined class of properties - would an automatically
symmetric P26 really swamp the system? It would be great if the system
could match up relationships and fill in missing parent/child,
sibling, and spouse links. I can't be the only one who regularly adds
one half of the relationship and forgets to include the other!
A bot looking at all of these and filling in the gaps might be a useful
approach... but it would break down if someone tried to remove one of the
symmetric entries without also removing the other, as the bot would
probably (eventually) fill it back in. Ultimately, automatic symmetry in
the software would seem best.
Thoughts on either of these? If there is interest I will write up a
formal proposal on-wiki.
--
- Andrew Gray
andrew.gray(a)dunelm.org.uk
In case you missed it, there is a great post by Magnus about descriptions
[1].
The case is often made that descriptions as they exist are evil. They are
atrocious, and for whatever reason it makes no difference that a much
better solution exists. This was discussed at the London Wikimania, and it
seems as if people have a religious belief that people will do better.
Automated descriptions can easily be improved, continually and for every
item, in three ways:
- by improving the algorithm for automated descriptions
- by improving the algorithm to take language-specific issues into account
- by adding statements to items where they are lacking, which improves the
algorithm's results
(A toy sketch of such an algorithm follows below.)
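For a sense of how such an algorithm works, here is a toy sketch. This is
not Magnus's actual code; the claim layout and choice of properties are my
own illustrative assumptions:

# Toy sketch: build a description from a few common statements.
# `claims` maps property IDs to lists of value item IDs; `labels` maps
# item IDs to labels in the reader's language.
def describe(claims, labels):
    parts = []
    if "P31" in claims:                 # instance of
        parts.append(labels[claims["P31"][0]])
    if "P106" in claims:                # occupation
        parts.append(labels[claims["P106"][0]])
    if "P27" in claims:                 # country of citizenship
        parts.append("from " + labels[claims["P27"][0]])
    return " ".join(parts) if parts else None

# e.g. P31:Q5, P106:Q82955, P27:Q183 with English labels would yield
# "human politician from Germany".

Each of the three improvements above maps onto such a sketch: better
phrasing rules, per-language templates, and more statements on the item
itself.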
I have blogged about this issue in the past. The arguments against the
current crop of descriptions are convincing. Why do we not get rid of all
that rubbish? The only argument I know of that has some merit is that
people have invested time in them. The sad thing is that it was a waste:
the results are not good, they are not convincing, and they will never
cover all 280+ languages Wikidata supports.
Thanks,
GerardM
[1] http://magnusmanske.de/wordpress/?p=342