Hey folks :)
We have just deployed the entity suggester. This helps you with suggesting properties. So when you now add a new statement to an item it will suggest what should most likely be added to that item. One example: You are on an item about a person but it doesn't have a date of birth yet. Since a lot of other items about persons have a date of birth it will suggest you also add one to this item. This will make it a lot easier for you to figure out what the hell is missing on an item and which property to use.
Thank you so much to the student team who worked on this as part of their bachelor thesis over the last months as well as everyone who gave feedback and helped them along the way.
I'm really happy to see this huge improvement towards making Wikidata easier to use. I hope you are too.
Cheers Lydia
On 1 July 2014 20:20, Lydia Pintscher lydia.pintscher@wikimedia.de wrote:
We have just deployed the entity suggester. This helps you with suggesting properties. So when you now add a new statement to an item it will suggest what should most likely be added to that item. One example: You are on an item about a person but it doesn't have a date of birth yet. Since a lot of other items about persons have a date of birth it will suggest you also add one to this item.
This is a great idea, but I've just tried it on Q4810979 (about an historic building) and it prompted me for a date of birth, gender, taxon rank or taxon name.
Teething troubles?
On Tue, Jul 1, 2014 at 9:44 PM, Andy Mabbett andy@pigsonthewing.org.uk wrote:
On 1 July 2014 20:20, Lydia Pintscher lydia.pintscher@wikimedia.de wrote:
We have just deployed the entity suggester. This helps you with suggesting properties. So when you now add a new statement to an item it will suggest what should most likely be added to that item. One example: You are on an item about a person but it doesn't have a date of birth yet. Since a lot of other items about persons have a date of birth it will suggest you also add one to this item.
This is a great idea, but I've just tried it on Q4810979 (about an historic building) and it prompted me for a date of birth, gender, taxon rank or taxon name.
Teething troubles?
We still need to tweak it a bit here and there, yeah. We're working on that right now. Also it will get smarter as more statements are added to items.
Cheers Lydia
We still need to tweak it a bit here and there, yeah. We're working on that right now. Also it will get smarter as more statements are added to items.
Even with some somewhat off suggestions this will be a wonderful tool.
Thank you to everyone who worked on making this happen! This really is going to make Wikidata so much easier to use.
Is there any documentation on how it chooses which entities to suggest?
Thank you, Derric Atzrott
On Tue, Jul 1, 2014 at 9:52 PM, Derric Atzrott datzrott@alizeepathology.com wrote:
We still need to tweak it a bit here and there, yeah. We're working on that right now. Also it will get smarter as more statements are added to items.
Even with some somewhat off suggestions this will be a wonderful tool.
\o/ We've just changed the threshold a bit so it should give fewer but more fitting suggestions. We'll play around with that setting a bit more over the next few days to find the one that's right for us.
Thank you to everyone who worked on making this happen! This really is going to make Wikidata so much easier to use.
Is there any documentation on how it chooses which entities to suggest?
It basically creates a table of correlations for properties over all items in Wikidata. So if say date of birth and place of birth are used together a lot they get a high correlation. When you then have an item with no place of birth but a date of birth it will suggest that because of the high correlation.
Cheers Lydia
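[Editor's sketch] The co-occurrence idea described above can be illustrated in a few lines of Python. This is only a minimal sketch under stated assumptions, not the PropertySuggester implementation: the function names, the averaging of conditional frequencies, and the `threshold` value are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(items):
    """Count property usage and ordered property pairs over all items.

    `items` maps an item ID to the list of properties used on that item.
    """
    single = Counter()  # property -> number of items using it
    pair = Counter()    # (present, candidate) -> number of items using both
    for props in items.values():
        props = set(props)
        single.update(props)
        pair.update(permutations(props, 2))
    return single, pair


def suggest(item_props, single, pair, threshold=0.5):
    """Rank missing properties by how often they co-occur with present ones.

    The score is the average of count(present, candidate) / count(present)
    over the properties already on the item (an assumed scoring rule).
    """
    present = set(item_props)
    if not present:
        return []
    scores = {}
    for candidate in single:
        if candidate in present:
            continue
        total = sum(pair[(p, candidate)] / single[p] for p in present if single[p])
        score = total / len(present)
        if score >= threshold:
            scores[candidate] = score
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Toy data, invented for this example.
items = {
    "Q1": ["date of birth", "place of birth", "sex or gender"],
    "Q2": ["date of birth", "place of birth"],
    "Q3": ["date of birth", "sex or gender", "place of birth"],
    "Q4": ["taxon name", "taxon rank"],
}
single, pair = build_cooccurrence(items)
suggestions = suggest(["date of birth"], single, pair)
```

With this toy data, an item that only has "date of birth" gets "place of birth" ranked first, and the taxon properties fall below the threshold. Raising `threshold` trades fewer suggestions for better-fitting ones, which is the kind of tuning mentioned above.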
I noticed that it is sometimes a bit sluggish when showing the suggestions; maybe it is my network, no idea. It would also be nice if it could suggest the three basic properties (instance of, subclass of, part of) when the item is empty, and then suggest further properties based on that initial value.
Other than that, good job!
On Tue, Jul 1, 2014 at 10:00 PM, Lydia Pintscher < lydia.pintscher@wikimedia.de> wrote:
On Tue, Jul 1, 2014 at 9:52 PM, Derric Atzrott datzrott@alizeepathology.com wrote:
We still need to tweak it a bit here and there, yeah. We're working on that right now. Also it will get smarter as more statements are added to items.
Even with some somewhat off suggestions this will be a wonderful tool.
\o/ We've just changed the threshold a bit so it should give fewer but more fitting suggestions. We'll play around with that setting a bit more over the next few days to find the one that's right for us.
Thank you to everyone who worked on making this happen! This really is going to make Wikidata so much easier to use.
Is there any documentation on how it chooses which entities to suggest?
It basically creates a table of correlations for properties over all items in Wikidata. So if say date of birth and place of birth are used together a lot they get a high correlation. When you then have an item with no place of birth but a date of birth it will suggest that because of the high correlation.
Cheers Lydia
-- Lydia Pintscher - http://about.me/lydia.pintscher Product Manager for Wikidata
Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
On 01/07/14 22:00, Lydia Pintscher wrote: ...
Is there any documentation on how it chooses which entities to suggest?
It basically creates a table of correlations for properties over all items in Wikidata. So if say date of birth and place of birth are used together a lot they get a high correlation. When you then have an item with no place of birth but a date of birth it will suggest that because of the high correlation.
Oh! I have a suggestion to make ...
Looking at properties that co-occur is good, but for P31 and P279, you must use the values instead (assuming that you can cope with the size: there are about 20k different values for these properties right now; seems doable). It does not tell you much if an item has "instance of" (P31), but it is very informative to know that you have "instance of: historic house museum".
If you look at Q4810979, you can see that it really has no property that suggests that we are looking at an historic building: instance of, Commons category, coordinate location, country, Freebase identifier, image. Based on properties alone, this could really be anything, including a person. Note that even the new suggestions seem to miss most of the "typical" properties that I listed in my other email ("English Heritage list number" being the most obvious one for Q4810979).
My algorithm uses values of P31 as its main information. Maybe this is why it performs better at first sight. Should be fixable with some feature engineering using the infrastructure you have now (where I trust that your recommender system backend has no problem with a slightly bigger number of features).
Cheers,
Markus
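[Editor's sketch] The suggestion above, using the values of P31 and P279 as features rather than only their presence, could look roughly like the following. This is a hypothetical feature extraction step, not code from either system; the function name and the `(property, value)` tuple encoding are assumptions made for illustration.

```python
# Properties whose values (not just their presence) are treated as features.
VALUE_PROPERTIES = {"P31", "P279"}


def extract_features(statements):
    """Turn an item's statements into a set of features.

    `statements` is an iterable of (property_id, value) pairs. For P31 and
    P279 the value is kept, so "instance of: Q15700818" and "instance of: Q5"
    become different features. For every other property only its presence
    is recorded, encoded here as (property_id, None).
    """
    features = set()
    for prop, value in statements:
        if prop in VALUE_PROPERTIES:
            features.add((prop, value))
        else:
            features.add((prop, None))
    return features


# Toy items: a Grade I listed building (Q15700818) and a human (Q5).
building = extract_features([("P31", "Q15700818"), ("P17", "Q145")])
person = extract_features([("P31", "Q5"), ("P17", "Q145")])
```

Both toy items share the presence feature for P17 (country), but they no longer share a P31 feature, so a correlation table built over these features can tell buildings and people apart.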
On Tue, Jul 1, 2014 at 10:38 PM, Markus Krötzsch markus@semantic-mediawiki.org wrote:
On 01/07/14 22:00, Lydia Pintscher wrote: ...
Is there any documentation on how it chooses which entities to suggest?
It basically creates a table of correlations for properties over all items in Wikidata. So if say date of birth and place of birth are used together a lot they get a high correlation. When you then have an item with no place of birth but a date of birth it will suggest that because of the high correlation.
Oh! I have a suggestion to make ...
Looking at properties that co-occur is good, but for P31 and P279, you must use the values instead (assuming that you can cope with the size: there are about 20k different values for these properties right now; seems doable). It does not tell you much if an item has "instance of" (P31), but it is very informative to know that you have "instance of: historic house museum".
If you look at Q4810979, you can see that it really has no property that suggests that we are looking at an historic building: instance of, Commons category, coordinate location, country, Freebase identifier, image. Based on properties alone, this could really be anything, including a person. Note that even the new suggestions seem to miss most of the "typical" properties that I listed in my other email ("English Heritage list number" being the most obvious one for Q4810979).
My algorithm uses values of P31 as its main information. Maybe this is why it performs better at first sight. Should be fixable with some feature engineering using the infrastructure you have now (where I trust that your recommender system backend has no problem with a slightly bigger number of features).
Yep, work is already under way to also take values into account :) Suggestions for qualifiers and sources will hopefully come very soon.
Cheers Lydia
On 01/07/14 21:47, Lydia Pintscher wrote:
On Tue, Jul 1, 2014 at 9:44 PM, Andy Mabbett andy@pigsonthewing.org.uk wrote:
On 1 July 2014 20:20, Lydia Pintscher lydia.pintscher@wikimedia.de wrote:
We have just deployed the entity suggester. This helps you with suggesting properties. So when you now add a new statement to an item it will suggest what should most likely be added to that item. One example: You are on an item about a person but it doesn't have a date of birth yet. Since a lot of other items about persons have a date of birth it will suggest you also add one to this item.
This is a great idea, but I've just tried it on Q4810979 (about an historic building) and it prompted me for a date of birth, gender, taxon rank or taxon name.
Teething troubles?
We still need to tweak it a bit here and there, yeah. We're working on that right now. Also it will get smarter as more statements are added to items.
I hope tweaking will suffice. At least it seems that there is already enough data to find slightly more related properties ;-). Here is the list of properties that I get for the two classes of Q4810979 (recall that I compute related properties for each class).
(1) "historic house museum" http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q208...
Related properties: English Heritage list number, OS grid reference, owned by, inspired by, coordinate location, visitors per year, Commons category, architect, mother house, manager/director, country, commissioned by, architectural style, MusicBrainz place ID, use, date of foundation or creation, street
(2) "Grade I listed building" http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q157...
Related properties: English Heritage list number, masts, Minor Planet Center observatory code, home port, coordinate location, OS grid reference, mother house, architect, manager/director, Emporis ID, MusicBrainz place ID, country, architectural style, visitors per year, Commons category, Structurae ID (structure), officially opened by, floors above ground, inspired by, religious order, number of platforms, street, owned by, diocese
These are computed fully automatically from the data, with no manual filtering or user input. But don't get me wrong -- great work! Brilliant to have such a thing integrated into the UI. In any case, my algorithm for computing the related properties is certainly very different from theirs; I am sure it also has its glitches.
Cheers,
Markus
On 01/07/14 22:14, Markus Krötzsch wrote: ...
(2) "Grade I listed building" http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q157...
Related properties: English Heritage list number, masts, Minor Planet Center observatory code, home port, coordinate location, OS grid reference, mother house, architect, manager/director, Emporis ID, MusicBrainz place ID, country, architectural style, visitors per year, Commons category, Structurae ID (structure), officially opened by, floors above ground, inspired by, religious order, number of platforms, street, owned by, diocese
These are computed fully automatically from the data, with no manual filtering or user input. But don't get me wrong -- great work! Brilliant to have such a thing integrated into the UI. In any case, my algorithm for computing the related properties is certainly very different from theirs; I am sure it also has its glitches.
P.S. One weakness of my algorithm you can already see: it has trouble estimating the relevance of very rare properties, such as "Minor Planet Center observatory code" above. A single wrong annotation may then lead to wrong suggestions. Also, it seems from my list under (2) that some Grade I listed buildings are ships. This seems to be an error that is amplified by the fact that the property "masts" is used only 11 times in the dataset I evaluated (last week's data). I guess the new property suggester rather errs on the other side, being tricked into suggesting very frequent properties even in places that don't need them.
-- Markus
Markus, could your algorithm work together with human direction? Like, if we entered which properties are common for a class, and then a user creates an instance of that class, would the algorithm be able to sort those properties based on how often they appear on the database?
Thanks, Micru
On Tue, Jul 1, 2014 at 10:23 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
On 01/07/14 22:14, Markus Krötzsch wrote: ...
(2) "Grade I listed building" http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q15700818
Related properties: English Heritage list number, masts, Minor Planet Center observatory code, home port, coordinate location, OS grid reference, mother house, architect, manager/director, Emporis ID, MusicBrainz place ID, country, architectural style, visitors per year, Commons category, Structurae ID (structure), officially opened by, floors above ground, inspired by, religious order, number of platforms, street, owned by, diocese
These are computed fully automatically from the data, with no manual filtering or user input. But don't get me wrong -- great work! Brilliant to have such a thing integrated into the UI. In any case, my algorithm for computing the related properties is certainly very different from theirs; I am sure it also has its glitches.
P.S. One weakness of my algorithm you can already see: it has troubles estimating the relevance of very rare properties, such as "Minor Planet Center observatory code" above. A single wrong annotation may then lead to wrong suggestions. Also, it seems from my list under (2) that some Grade I listed buildings are ships. This seems to be an error that is amplified by the fact that property "masts" is used only 11 times in the dataset I evaluated (last week's data). I guess the new property suggester rather errs on the other side, being tricked into suggesting very frequent properties even in places that don't need them.
-- Markus
On 01/07/14 22:33, David Cuenca wrote:
Markus, could your algorithm work together with human direction? Like, if we entered which properties are common for a class, and then a user creates an instance of that class, would the algorithm be able to sort those properties based on how often they appear on the database?
My algorithm is all about *detecting* "which properties are common for a class". If you want this to be entered by humans instead, that's fine too, but then you don't need an algorithm. Sorting a list of properties by how often they appear in the database is easy to do. My algorithm does not do this though, because the most often used property is usually not the most interesting one. For instance, many classes are associated with Freebase IDs, but you don't want this to be the first suggestion you get. I want the things that are "special" for the instances of a class as compared to the rest of the data, not the things that are most common overall.
Cheers,
Markus
Thanks, Micru
On Tue, Jul 1, 2014 at 10:33 PM, David Cuenca dacuetu@gmail.com wrote:
Markus, could your algorithm work together with human direction? Like, if we entered which properties are common for a class, and then a user creates an instance of that class, would the algorithm be able to sort those properties based on how often they appear on the database?
The whole idea of this suggester is that we don't need to give it a class structure. Instead it uses what is in the data already, which imho is much nicer because it needs no human intervention beyond entering statements, which we already do anyway. Once it takes values into account as well you should pretty much get what you want.
Cheers Lydia
On 01.07.2014 22:23, Markus Krötzsch wrote:
P.S. One weakness of my algorithm you can already see: it has troubles estimating the relevance of very rare properties, such as "Minor Planet Center observatory code" above. A single wrong annotation may then lead to wrong suggestions. Also, it seems from my list under (2) that some Grade I listed buildings are ships. This seems to be an error that is amplified by the fact that property "masts" is used only 11 times in the dataset I evaluated (last week's data). I guess the new property suggester rather errs on the other side, being tricked into suggesting very frequent properties even in places that don't need them.
However, it is obviously better if the algorithm performs well for frequently used properties. Isn't it possible to combine those two systems so they improve each other? One could check how often the property is used and then rely on either Markus' or the students' algorithm accordingly.
Best regards, Bene
On 01/07/14 22:43, Bene* wrote:
On 01.07.2014 22:23, Markus Krötzsch wrote:
P.S. One weakness of my algorithm you can already see: it has troubles estimating the relevance of very rare properties, such as "Minor Planet Center observatory code" above. A single wrong annotation may then lead to wrong suggestions. Also, it seems from my list under (2) that some Grade I listed buildings are ships. This seems to be an error that is amplified by the fact that property "masts" is used only 11 times in the dataset I evaluated (last week's data). I guess the new property suggester rather errs on the other side, being tricked into suggesting very frequent properties even in places that don't need them.
However, it is obviously better if the algorithm performs well for frequently used properties. Isn't it possible to combine those two systems so they improve each other. One could check how often the property is used and then rely on Markus' or the students' algorithm.
My hope is that with my other suggestion (using P31 values as features to correlate with), the property suggester will already be able to outperform my little toy algorithm anyway. One could also combine the two (my algorithm is really simple [1]), but maybe this is not needed.
Cheers
Markus
[1] For each class C and property P, I count:
* #C: the number of items in class C
* #P: the number of items using property P
* #PC: the number of items in class C using the property P
* #items: the total number of items
Then I compute two rates:
* rateCP = #PC / #C (fraction of items in a class with the property)
* rateP = #P / #items (fraction of all items with the property)
I then rank the properties for each class by the ratio rateCP/rateP (intuitively: by what factor does the probability of P increase for items in C?). Moreover, I apply two sigmoid functions [2] to the rates as additional factors, so as to ensure that properties are less "relevant" if they have very high or very low values for the rates. I don't care about things that almost everything/almost nothing has. Obviously, one can tweak this if one wants to include properties that "almost everything" has anyway.
[2] https://www.google.com/search?sclient=psy-ab&q=1+%2F+%281+%2B+exp%286+*+...
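[Editor's sketch] The ranking described in [1] can be written down as a short Python sketch. The counts and the rateCP/rateP ratio follow the description above. The exact sigmoid functions are not recoverable from the truncated link in [2], so the `logistic` helper, its `low`, `high` and `steepness` parameters, and the choice of which rate each factor is applied to are all assumptions for illustration only.

```python
import math


def logistic(x, midpoint, steepness):
    """Standard logistic curve: near 0 below `midpoint`, near 1 above it."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def relevance(n_pc, n_c, n_p, n_items, low=0.05, high=0.95, steepness=50.0):
    """Score property P for class C from the four counts in [1].

    n_pc: items in C using P, n_c: items in C,
    n_p: items using P, n_items: all items.
    """
    rate_cp = n_pc / n_c    # fraction of class members with the property
    rate_p = n_p / n_items  # fraction of all items with the property
    if rate_p == 0:
        return 0.0
    ratio = rate_cp / rate_p
    # Assumed damping: shrink the score when almost no class member has P ...
    damp_rare = logistic(rate_cp, low, steepness)
    # ... and when almost every item has P anyway.
    damp_common = 1.0 - logistic(rate_p, high, steepness)
    return ratio * damp_rare * damp_common


def rank_properties(class_counts, n_c, property_counts, n_items):
    """Return the property names of one class, most relevant first."""
    scores = {
        prop: relevance(n_pc, n_c, property_counts[prop], n_items)
        for prop, n_pc in class_counts.items()
    }
    return sorted(scores, key=lambda p: (-scores[p], p))


# Toy counts, invented for this example: a class of 100 items among 10000.
property_counts = {"list number": 100, "rare code": 1, "external ID": 9000}
class_counts = {"list number": 80, "rare code": 1, "external ID": 90}
ranking = rank_properties(class_counts, 100, property_counts, 10000)
```

In this toy example the class-specific "list number" ranks first, the very rare "rare code" is damped but still ranks above "external ID", which nearly every item has. That also shows the weakness noted in the P.S.: a property used only once can still score noticeably unless the damping is tuned more aggressively.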
On Tue, Jul 1, 2014 at 11:07 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
My hope is that with my other suggestion (using P31 values as features to correlate with), the property suggester will already be able to outperform my little toy algorithm anyway. One could also combine the two (my algorithm is really simple [1]), but maybe this is not needed.
Interesting. That could also help to identify values with a high deviation, and perhaps even do a better job than some template constraints. I was trying to check more classes, but the server seems to have trouble: "Error: could not load file 'classes/Classes.csv'" http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q208...
Anyhow, many thanks for working on this.
Micru
On 02/07/14 16:29, David Cuenca wrote:
On Tue, Jul 1, 2014 at 11:07 PM, Markus Krötzsch <markus@semantic-mediawiki.org mailto:markus@semantic-mediawiki.org> wrote:
My hope is that with my other suggestion (using P31 values as features to correlate with), the property suggester will already be able to outperform my little toy algorithm anyway. One could also combine the two (my algorithm is really simple [1]), but maybe this is not needed.
Interesting. That could also help to identify values with a high deviation, and perhaps even do a better job than some template constraints. I was trying to check more classes, but the server seems to have trouble: "Error: could not load file 'classes/Classes.csv'" http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q208...
Strange. Works for me. But we had some temporary service problems at WMF Labs recently, so maybe there was some aftermath of these.
In any case, I should update the software -- Yaron has further improved Miga to lower the initial load times significantly. I'll send another email when I have new code/new data there.
Anyhow, many thanks for working on this.
My pleasure. :-)
Markus
Micru
"Markus Krötzsch" markus@semantic-mediawiki.org writes:
I guess the new property suggester rather errs on the other side, being tricked into suggesting very frequent properties even in places that don't need them.
I found some, most notably "Date of death" for living people, which is likely inevitable even if you add a filter of the type "too young to be dead", I presume ;-)
Yet I am wondering more about properties that are not suggested where they could be, such as various IDs for people (like VIAF, ORCID, etc.)
Purodha
Hi Lydia,
Two questions:
* Is it possible to see if an added statement was suggested to the user?
* Do you have some kind of versioning for the UI, to see if features you add have a positive effect on the way Wikidata is used?
I'm not familiar with the tag system, but it looks like it can be exploited to see where the users edit Wikidata. I can see that edits made with the Wikidata Game or other tools from Magnus have a "Widar[1.3]" tag; maybe this should be split into one tag for each application.
* Can you send a link to the thesis you mentioned?
Lukas
On Wed, Jul 2, 2014 at 12:40 PM, Lukas Benedix benedix@zedat.fu-berlin.de wrote:
Hi Lydia,
Two questions:
- Is it possible to see if an added statement was suggested to the user?
Not currently and I don't think it'd be trivial to do unfortunately.
- Do you have some kind of versioning for the UI, to see if features you add have a positive effect on the way Wikidata is used?
I have the dates we rolled out certain features and I can look at how some key metrics change over time. So yes I can see if this has the effect we all intend it to have.
I'm not familiar with the tag system, but it looks like it can be exploited to see where the users edit Wikidata. I can see edits made with the wikidata game or other tools from Magnus have a "Widar[1.3]" Tag, maybe this should be splitted into one tag for each application.
I think it is mostly a matter of convenience so you don't have to reauthenticate each of those individual applications. Technically it could be done. But this is something for Magnus to decide if he wants it.
- Can you send a link to the thesis you mentioned?
I don't think it is published yet. Once the students have published it they can send an email here.
Cheers Lydia
Hi Lukas,
the PropertySuggester extension is available on Github: https://github.com/Wikidata-lib/PropertySuggester It is documented on the corresponding wiki: https://github.com/Wikidata-lib/PropertySuggester/wiki
The thesis is currently under review, but we will hopefully also put it online soon (German only, though).
Cheers, Anja