Hi!
Today we are indexing in ElasticSearch almost all string properties (except a few) and select item properties (P31 and P279). We've been asked to extend this set and index more item properties (https://phabricator.wikimedia.org/T199884). We did not do it from the start because we did not want to add too much data to the index at once, and wanted to see how the index behaves. To evaluate what this change would mean, some statistics:
All usage of item properties in statements is about 231 million uses (according to sqid tool database). Of those, about 50M uses are "instance of" which we are already indexing. Another 98M uses belong to two properties - published in (P1433) and cites (P2860). Leaving about 86M for the rest of the properties.
So, if we index all the item properties except P2860 and P1433, we'll be a little more than doubling the amount of data we're storing for this field, which seems OK. But if we index those too, we'll be essentially quadrupling it, which may be OK too, but it is a bigger jump and one that could cause some issues.
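As a rough check of the arithmetic above, here is a small sketch that uses only the figures quoted in this message. It assumes the current size of the field can be approximated by the "instance of" uses alone; P279 is also indexed already and is counted inside the 86M, so the real growth would be somewhat smaller. Note also that the three parts sum to 234M rather than 231M, so the quoted figures are evidently rounded.

```python
# Figures quoted above, in millions of statement uses (sqid tool database).
INSTANCE_OF_USES = 50   # P31, already indexed
CITATION_USES = 98      # P1433 and P2860 combined
OTHER_USES = 86         # "the rest of the properties", as stated


def growth_factor(added_millions, baseline_millions=INSTANCE_OF_USES):
    """Return (new size / current size), taking P31 uses as the baseline (an assumption)."""
    return (baseline_millions + added_millions) / baseline_millions


without_citations = growth_factor(OTHER_USES)
with_citations = growth_factor(OTHER_USES + CITATION_USES)

print(f"everything except P1433/P2860: x{without_citations:.2f}")
print(f"everything:                    x{with_citations:.2f}")
```

Under that assumption the two scenarios come out at a bit under 3x and a bit under 5x, which is in the same range as the "doubling" and "quadrupling" estimates.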
So, we have two questions:
1. Do we want to enable indexing for all item properties? Note that if you just want to find items with certain statement values, Wikidata Query Service matches this use case best. It's only in combination with actual fulltext search where on-wiki search is better.
2. Do we need to index P2860 and P1433 at all, and if so, would it be ok if we omit indexing for now?
Would be glad to hear thoughts on the matter.
Thanks,
I could definitely see a use case for 1) and maybe for 2). For example, let's say I remember that one movie that Rutger Hauer played in. Just searching for 'movie rutger hauer' gives back nothing:
https://www.wikidata.org/w/index.php?search=movie+rutger+hauer
While Wikipedia gives back quite a nice list of options:
https://en.wikipedia.org/w/index.php?search=movie+rutger+hauer
If we would index item properties as well, you could get back Blade Runner (Q184843) which has Rutger Hauer as one of its 'cast member' values.
Of course, you could also do this by using the Query Service, but writing SPARQL is a really high barrier, and it's a lot easier to just type in a few keywords if you're looking for something.
I can understand that 2) might be a bit too hard on the system, and I think there's some value in that too, but having only 1) would be a big improvement over the current state of things.
Groet,
-- Hay

On Thu, Jul 26, 2018 at 11:48 PM Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
Today we are indexing in ElasticSearch almost all string properties (except a few) and select item properties (P31 and P279). We've been asked to extend this set and index more item properties (https://phabricator.wikimedia.org/T199884). We did not do it from the start because we did not want to add too much data to the index at once, and wanted to see how the index behaves. To evaluate what this change would mean, some statistics:
All usage of item properties in statements is about 231 million uses (according to sqid tool database). Of those, about 50M uses are "instance of" which we are already indexing. Another 98M uses belong to two properties - published in (P1433) and cites (P2860). Leaving about 86M for the rest of the properties.
So, if we index all the item properties except P2860 and P1433, we'll be a little more than doubling the amount of data we're storing for this field, which seems OK. But if we index those too, we'll be essentially quadrupling it - which may be OK too, but is bigger jump and one that may potentially cause some issues.
So, we have two questions:
- Do we want to enable indexing for all item properties? Note that if you just want to find items with certain statement values, Wikidata Query Service matches this use case best. It's only in combination with actual fulltext search where on-wiki search is better.
- Do we need to index P2860 and P1433 at all, and if so, would it be ok if we omit indexing for now?
Would be glad to hear thoughts on the matter.
Thanks,
Stas Malyshev smalyshev@wikimedia.org
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi!
I could definitely see a usecase for 1) and maybe for 2). For example, let's say i remember that one movie that Rutger Hauer played in, just searching for 'movie rutger hauer' gives back nothing:
https://www.wikidata.org/w/index.php?search=movie+rutger+hauer
While Wikipedia gives back quite a nice list of options:
https://en.wikipedia.org/w/index.php?search=movie+rutger+hauer
Well, this is not going to change with the work we're discussing. The reason you don't get anything from Wikidata is that "movie" and "rutger hauer" are labels from different documents, and ElasticSearch does not do joins. We only index each document by itself, possibly with some additional data, but indexing labels from other documents is beyond what we're doing now. We could certainly discuss it, but that would be a separate (and much bigger) discussion.
If we would index item properties as well, you could get back Blade Runner (Q184843) which has Rutger Hauer as one of its 'cast member' values.
You could, but not by asking something like "movie rutger hauer", at least not without a lot of additional work. Indexing "cast member" would get you a step closer, but only a tiny step and there are a number of other steps to take before that can work.
Hi Stas and Hay,
On 28-07-18 02:12, Stas Malyshev wrote:
Hi!
I could definitely see a usecase for 1) and maybe for 2). For example, let's say i remember that one movie that Rutger Hauer played in, just searching for 'movie rutger hauer' gives back nothing:
https://www.wikidata.org/w/index.php?search=movie+rutger+hauer
While Wikipedia gives back quite a nice list of options:
https://en.wikipedia.org/w/index.php?search=movie+rutger+hauer
Well, this is not going to change with the work we're discussing. The reason you don't get anything from Wikidata is because "movie" and "rutger hauer" are labels from different documents and ElasticSearch does not do joins. We only index each document in itself, and possibly some additional data, but indexing labels from other documents is now beyond what we're doing. We could certainly discuss it but that would be separate (and much bigger) discussion.
Changing the topic because I would like to start this separate and bigger discussion. Query and search are quite similar, but also very different (if you search you'll run into nice articles like https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently our query service is a very strong and complete service, but Wikidata search is very poor. Let's take Blade Runner.
* https://www.wikidata.org/wiki/Q184843 is what a human sees
* http://www.wikidata.org/entity/Q184843.json our internal JSON structure
* http://www.wikidata.org/entity/Q184843.rdf source for the query engine
* https://www.wikidata.org/w/index.php?title=Q184843&action=cirrusdump what's indexed in the search engine
In my ideal world, everything I see as a human gets indexed into the search engine, preferably in a per-language index. For example, for Dutch, something like a text_nl field with the label, description, aliases, statements and references in there. So index *everything* and never see a Qnumber or Pnumber in there (extra incentive for people to add labels in their language). Probably also everything duplicated in the text field to fall back to. In this index you would have the "movie Rutger Hauer", you would have the cast members ("rolverdeling: Harrison Ford" etc.). Yes, this will give a significant increase in index size, but it will make it much easier to actually find things.
As for implementation: We already have the logic to serialize our json to the RDF format. Maybe also add a serialization format for this that is easy to ingest by search engines? I noticed Google having a hard time indexing some of our items, see for example https://www.google.com/search?q=The+Feast+of+the+Seagods+site%3Awikidata.org... . Duck Duck Go seems to be doing a better job https://duckduckgo.com/?q=The+Feast+of+the+Seagods+site%3Awikidata.org&t... . Making it easier to index not only for our own search would be a nice added benefit.
How feasible is this? Do we already have one or multiple tasks for this on Phabricator? Phabricator has gotten a bit unclear when it comes to Wikidata search, I think because people misunderstand each other about what the goal of a task is. It might be worthwhile spending some time on structuring that.
Maarten
Hi!
https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently our query service is a very strong and complete service, but Wikidata search is very poor. Let's take Blade Runner.
I don't think it's *very* poor anymore, but it certainly can be better.
In my ideal world, everything I see as a human gets indexed into the search engine preferably in a per language index. For example for Dutch
Err.... The problem is that what you see as a human and what search engine uses for lookups are very different things. While for text articles it is similar, for structured data it's quite different, and treating structured data the same way as text is not going to produce good results, partially because most search algorithms make assumptions that come from text world, partially because we'd be ignoring useful clues present in structured data.
something like a text_nl field with the, label, description, aliases, statements and references in there. So index *everything* and never see
There are such fields, but it makes no sense to put references there, because there's no such thing as "Dutch reference". References do not change with language.
a Qnumber or Pnumber in there (extra incentive for people to add labels in their language). Probably also everything duplicated in the text
That presents a problem. While you see "instance of": "human", the data is P31:Q5. We can, of course, put "instance of": "human" in the index. But what if label for Q5 changes? Now we have to re-index 10 million records. And while we're doing it, what if another label for such item changes again? We'd have to start another million-size reindex. In a week, we'd have a backlog of hopeless size, or will require processing power that we just don't have. Note also that ElasticSearch doesn't really do document updates - it just writes a new document. So frequent updates to the same document is not its optimal scenario, and we're talking about propagating each label edit to each item that is linked to that one. I'm afraid that would explode on us very quickly.
The problem is not indexing labels, the problem is keeping them up-to-date on 50 million interlinked items.
When displaying, it's easy - you don't need to worry until you show it, and most items are shown only rarely. Even then you see a label out of date now and then. But with search, you can't update label on use - when you want to use it (i.e. look up), it should already be up-to-date, otherwise it's useless.
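The fan-out described above can be pictured with a toy model. This is only a sketch under stated assumptions: the tiny `STATEMENTS` data set, the function names and the in-memory reverse map are all made up for illustration and have nothing to do with how the real indexing pipeline is built.

```python
from collections import defaultdict

# Hypothetical miniature data set: item ID -> IDs of the items it links to.
STATEMENTS = {
    "Q42": {"Q5"},
    "Q23": {"Q5"},
    "Q184843": {"Q11424"},
}


def build_reverse_links(statements):
    """Map each linked-to item to the set of items that reference it."""
    reverse = defaultdict(set)
    for item, targets in statements.items():
        for target in targets:
            reverse[target].add(item)
    return reverse


def documents_to_reindex(changed_item, reverse, labels_denormalized):
    """Return the documents that must be rewritten after a label edit.

    With IDs only, just the edited item changes. With labels copied into
    every referencing document, each of those documents changes as well.
    """
    docs = {changed_item}
    if labels_denormalized:
        docs |= reverse.get(changed_item, set())
    return docs


reverse = build_reverse_links(STATEMENTS)
ids_only = documents_to_reindex("Q5", reverse, labels_denormalized=False)
with_labels = documents_to_reindex("Q5", reverse, labels_denormalized=True)
print(len(ids_only), len(with_labels))
```

In this toy set the difference is one document versus three; with millions of items pointing at Q5, the second number is what turns a single label edit into a bulk reindex.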
As for implementation: We already have the logic to serialize our json to the RDF format. Maybe also add a serialization format for this that is easy to ingest by search engines?
I don't know any such special format, do you? We of course have JSON updates to ElasticSearch, but as I noted before, updates are the problem there, not format. RDF of course also does not carry denormalized data, so we also update only entries that need updating, and fetch labels on use. We can not do it for search index. I don't think format here is the problem.
. Making it easier to index not only for our own search would be a nice added benefit.
Sure, but experience has shown that the strategy of "dump everything into one huge text" works very poorly in Wikidata. That's why we implemented specialized search that knows how the structured data works. If the search sucks less now than it did before, that's the reason.
How feasible is this? Do we already have one or multiple tasks for this on Phabricator? Phabricator has gotten a bit unclear when it comes to Wikidata search, I think because of misunderstanding between people what the goal of the task is. Might be worthwhile spending some time on structuring that.
Wikidata search tasks would be under "Wikidata" + "Discovery-Search". There are multiple tasks for it, but if you want to add any, please feel welcome to browse and add.
Hoi, May I remind you all that as it is, particularly the "descriptions" are really problematic. They are often created based on Wikipedia categories and it is quite rare that they get updated. Compare this with the "automated descriptions" that have been around for years.
When new properties are added to an item, the automated description may change as a result, and this is reflected in any language. These changed descriptions may be stored until the next update on the item, they may be generated when needed, and obviously they may be cached. They may be used in the build-up of a search, and this will be a much bigger incentive for people to update labels.
Contrary to what some think, labels are updated based on a "need", and this need is hardly there because Wikidata only appeals to geeks. That is why the Reasonator approach to labelisation makes so much sense. You see the missing labels, you add them, and the next item will show the new labels. Given that people work in domains, it is a sound approach, and it will also quite quickly improve the quality of "automated descriptions" in any language.
Did I tell you that I disambiguate items by adding labels and properties in Wikidata? In Reasonator, when you refresh a "search" you will see, for instance, a date of birth or death added, making John Smith *that* John Smith.
Obviously, search could be a lot better, and using "automated descriptions" will make a positive difference.
Thanks, GerardM
On Wed, Aug 15, 2018 at 7:20 AM, Stas Malyshev smalyshev@wikimedia.org wrote:
a Qnumber or Pnumber in there (extra incentive for people to add labels in their language). Probably also everything duplicated in the text
That presents a problem. While you see "instance of": "human", the data is P31:Q5. We can, of course, put "instance of": "human" in the index. But what if label for Q5 changes? Now we have to re-index 10 million records.
I haven't thought this through, but would it be possible to index just Q5, and then when someone searches on "human" to see what are all the items with the label "human", so that the search becomes "human OR Q5"?
Hi!
That presents a problem. While you see "instance of": "human", the data is P31:Q5. We can, of course, put "instance of": "human" in the index. But what if label for Q5 changes? Now we have to re-index 10 million records.
I haven't thought this through, but would it be possible to index just Q5, and then when someone searches on "human" to see what are all the items with the label "human", so that the search becomes "human OR Q5"?
That has a potential to explode pretty quickly. Consider query like "movie Bruce Willis" - where obviously you want all movies where Bruce Willis starred. Now, if we search for "movie", we get tons of potential matches. If we search for "Bruce" and "Willis" - even more. Now if we stuff all those IDs we've received in our query we'll get something very far from what you intended, and the relevance would be pretty bad. Not to mention you have to actually run four queries instead of one (4x load) and the last one is pretty fat, stuffed with all the IDs we've gathered.
But that's not the end of it - you don't just want any item that is somehow related to movies - you want items that *are* movies. And you don't want any item that is somehow related to somebody named "Bruce" or "Willis". You want the ones where the famous actor Bruce Willis played (or maybe directed). But there's no such information in the query.
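To make the fan-out concrete, here is a toy sketch of the "human OR Q5" style expansion applied to "movie Bruce Willis". The label index and the item IDs (`Qm1`, `Qb1`, and so on) are made up for illustration, and the function is hypothetical; it only counts lookups and clauses and says nothing about how relevance would be scored.

```python
# Hypothetical label index: lowercase token -> made-up IDs of items whose label contains it.
LABEL_INDEX = {
    "movie": {"Qm1", "Qm2", "Qm3"},
    "bruce": {"Qb1", "Qb2", "Qb3", "Qb4"},
    "willis": {"Qw1", "Qw2"},
}


def expand_query(query, label_index):
    """Rewrite each token as (token OR id1 OR id2 ...).

    Returns the expanded clauses and the total number of searches needed:
    one label lookup per token plus the final expanded query.
    """
    tokens = query.lower().split()
    clauses = []
    for token in tokens:
        ids = sorted(label_index.get(token, set()))
        clauses.append([token] + ids)
    total_searches = len(tokens) + 1
    return clauses, total_searches


clauses, total_searches = expand_query("movie Bruce Willis", LABEL_INDEX)
expanded = " ".join("(" + " OR ".join(clause) + ")" for clause in clauses)
print(total_searches)
print(expanded)
```

Even with a handful of fake matches per word, the final query carries every collected ID, and nothing in it says that the item must *be* a movie or that "Bruce" and "Willis" refer to the same person.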
On 8/15/18 7:20 AM, Stas Malyshev wrote:
As for implementation: We already have the logic to serialize our json to the RDF format. Maybe also add a serialization format for this that is easy to ingest by search engines?
I don't know any such special format, do you?
This is a bit tangential to the topic, but isn’t that basically what schema.org was developed for? (I’m not sure if that’s still its primary purpose, but as far as I know it was started by a group of search engines to develop a unified format websites could use to make their semantics more accessible to those search engines.)
Hi!
This is a bit tangential to the topic, but isn’t that basically what schema.org was developed for? (I’m not sure if that’s still its primary purpose, but as far as I know it was started by a group of search engines to develop a unified format websites could use to make their semantics more accessible to those search engines.)
There are a number of schemas, like Dublin Core, that try to address issues like that. However, none is even close to what we're talking about: covering several thousand properties that change all the time. They have very basic things covered, but AFAIK not much beyond. And I think those vocabularies still do not solve our problem with updating labels in multiple languages and keeping them in sync.
That said, this would be quite offtopic for *this* thread, but still if anybody has any ideas on how to present Wikidata content better to search engines using well-known metadata vocabularies, I think it would be a very welcome effort.
Hi, and thanks for working on this!
My subjective view:
* We don't need P2860/P1433 indexed, at least not at the moment
* I would really like dates (mainly, born/died), especially if they work for "greater units", that is, I search for a year and get an item back, even though the statement is month- or day-precise
Cheers, Magnus
+1 with Magnus on years of birth and death (but perhaps /only/ years of birth and death, or close surrogates eg years of baptism and burial, and inception or publication date for things, otherwise the search specificity would become useless with too many other 'significant event' dates)
I have found in the last few weeks I have been using the External ID value search a lot, from its search-box on the talk page of the main page for a property.
I'm finding this works very well, so I wonder whether people think that the ability to search for one of these strings directly in the general search box would actually add anything, or is the custom search eg via the talk-page search box already enough?
-- James.
On 27/07/2018 12:49, Magnus Manske wrote:
Hi, and thanks for working on this!
My subjective view:
- We don't need P2860/P1433 indexed, at least not at the moment
- I would really like dates (mainly, born/died), especially if they work for "greater units", that is, I search for a year and get an item back, even though the statement is month- or day-precise
Cheers, Magnus
Please, could anyone tell me about the Wikidata project? I don't know about it.
On Fri, Jul 27, 2018 at 14:00, James Heald jpm.heald@gmail.com wrote:
+1 with Magnus on years of birth and death (but perhaps /only/ years of birth and death, or close surrogates eg years of baptism and burial, and inception or publication date for things, otherwise the search specificity would become useless with too many other 'significant event' dates)
Maybe just dates as declarations on the item? On 'significant event', dates are qualifiers.
Ash.
On 27/07/2018 18:34, Stas Malyshev wrote:
Hi!
- I would really like dates (mainly, born/died), especially if they work for "greater units", that is, I search for a year and get an item back, even though the statement is month- or day-precise
What would be the use case for this?
The use case is to be able to look up "John Smith 1820 1897" and have some hope of finding the one you want...
Hi!
- I would really like dates (mainly, born/died), especially if they work for "greater units", that is, I search for a year and get an item back, even though the statement is month- or day-precise
This is something I've been thinking about for a while, mainly because the way we index dates now does not serve some important use cases. Even in the Query Service we treat dates as fixed instants on the time scale, whereas some dates are not instants but intervals (which is captured in the Wikidata precision, but we are currently not paying any attention to it). In fact, many of the dates we use are more interval-like than instant-like.
This makes searching for "somebody that was born in 1820" possible but laborious (you need to do intervals manually) and inefficient since we can't just look up by year.
There are certainly improvements possible in this area; I'm not yet sure how to do it, though.
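One way to picture the interval idea, purely as a sketch and not a proposal for the actual index mapping: turn each date plus its Wikibase precision into a half-open interval, and treat "born in 1820" as a containment test. The precision codes 9, 10 and 11 (year, month, day) are the ones used in the Wikibase data model; the function names and the restriction to those three precisions are assumptions made here for illustration.

```python
from datetime import date, timedelta

# Wikibase precision codes for the three cases handled in this sketch.
PRECISION_YEAR, PRECISION_MONTH, PRECISION_DAY = 9, 10, 11


def to_interval(year, month, day, precision):
    """Return a half-open [start, end) interval covered by a date at the given precision."""
    if precision == PRECISION_YEAR:
        return date(year, 1, 1), date(year + 1, 1, 1)
    if precision == PRECISION_MONTH:
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        return date(year, month, 1), date(next_year, next_month, 1)
    if precision == PRECISION_DAY:
        start = date(year, month, day)
        return start, start + timedelta(days=1)
    raise ValueError(f"precision {precision} is not handled in this sketch")


def falls_in_year(interval, year):
    """True if the whole interval lies inside the given calendar year."""
    start, end = interval
    return date(year, 1, 1) <= start and end <= date(year + 1, 1, 1)


born = to_interval(1820, 5, 12, PRECISION_DAY)
print(falls_in_year(born, 1820))
```

With intervals like these, a day-precise or month-precise statement can answer a year-level lookup such as "John Smith 1820 1897" without the searcher spelling out the range by hand.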
Hi,
I think we already index way more than P31 and P279. For instance we have 102.301.706 (approximation) distinct values in the term lexicon for statement_keywords. Sadly I can't extract the list of unique PIDs used (we'd have to enable field_data on statement_keywords.property). The top 1000 is: https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaB...
I think this is because we not only index statements by PID but also by data type. So I think that the increase is smaller than what you anticipate.
What I'd try to avoid in general is indexing terms that have only doc since they are pretty useless. I think we should investigate what kind of data we may have here, and at least for statement_keywords I would not index data that contain random text (esp. natural language) since they are prone to be unique and impossible to search.
On Fri, Jul 27, 2018 at 3:31 PM David Causse dcausse@wikimedia.org wrote:
What I'd try to avoid in general is indexing terms that have only doc since they are pretty useless.
I meant: that have only *one* doc
Hi!
I think we already index way more than P31 and P279.
Oh yes, all the string properties.
So I think that the increase is smaller than what you anticipate. What I'd try to avoid in general is indexing terms that have only doc since they are pretty useless.
For unique string properties, that would be a frequent occurrence. But I am not sure why it's useless - won't it be a legit use case to look up something by external ID?
I think we should investigate what kind of data we may have here, and at least for statement_keywords I would not index data that contain random text (esp. natural language) since they are prone to be unique and impossible to search.
Yes, we definitely should not do that. I tried to exclude such properties, but if you notice more of them, let's add them to the exclusion config.
Hi!
The top 1000 is: https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaB...
This one is pretty interesting, how do I extract this data? It may be useful independently of what we're discussing here.
On Sat, Jul 28, 2018 at 2:02 AM Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
The top 1000 is:
https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaB...
This one is pretty interesting, how do I extract this data? It may be useful independently of what we're discussing here.
This can be extracted from Elasticsearch using aggregations. To obtain the top 1000 of the terms that do not match P31= or P279=, you can run this:
curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0&pretty' -d '{"aggs": {"item_usage": { "terms": { "field": "statement_keywords", "exclude": "P(31|279)=.*", "size": 1000 }}}}' > top1k.json
To obtain an approximation of the cardinality (unique terms) of a field:
curl -XPOST localhost:9200/wikidatawiki_content/_search?size=0 -d '{"aggs": {"item_usage": { "cardinality": { "field": "statement_keywords" }}}}'
Note that I used the spare cluster to run these. As for property usage, I just realized that we have the outgoing_link field, which contains an array like:
"outgoing_link": ["Q1355298","Q1379672","Q15241312","Q8844594","Property:P18","Property:P1889","Property:P248","Property:P2612","Property:P279","Property:P3221","Property:P3417","Property:P373","Property:P3827","Property:P577","Property:P646","Property:P910"],
We don't have doc values enabled for this one, so we can't extract aggregations, but if the list of terms is known, it could easily be extracted by running X count queries, where X is the number of possible properties.
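The "X count queries" idea could be sketched roughly as follows. This only builds the request paths and bodies and does not contact a cluster. The helper name `count_requests` is hypothetical, and using a plain term query against outgoing_link is an assumption about how that field is mapped.

```python
import json


def count_requests(property_ids, index="wikidatawiki_content", field="outgoing_link"):
    """Build one (path, body) pair per property for the _count endpoint."""
    requests = []
    for pid in property_ids:
        # Assumption: the field stores exact tokens like "Property:P18".
        body = {"query": {"term": {field: f"Property:{pid}"}}}
        requests.append((f"/{index}/_count", json.dumps(body)))
    return requests


# Print the requests that would be POSTed, one per known property.
for path, body in count_requests(["P18", "P279", "P910"]):
    print(path, body)
```

Each body can be sent with the same kind of curl -XPOST call as above; the "count" value in each response gives the number of items using that property.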
Thanks a lot for looking into this, Stas!
On Thu, Jul 26, 2018 at 11:49 PM Stas Malyshev smalyshev@wikimedia.org wrote:
So, we have two questions:
- Do we want to enable indexing for all item properties? Note that if
you just want to find items with certain statement values, Wikidata Query Service matches this use case best. It's only in combination with actual fulltext search where on-wiki search is better.
I would say yes.
- Do we need to index P2860 and P1433 at all, and if so, would it be ok
if we omit indexing for now?
Yes it should be perfectly fine to go without these for now - maybe always. They're mostly (only?) used on the large corpus of scientific papers.
Cheers Lydia
Is this indexing now complete?
I tried searching for a few DOIs today, which are string properties (e.g. 10.1371/JOURNAL.PCBI.1002947), and didn't get any results.
Is this the phabricator task for this: https://phabricator.wikimedia.org/T163642 ?
Cheers, Marielle
On Sat, Jul 28, 2018 at 2:29 PM Lydia Pintscher < Lydia.Pintscher@wikimedia.de> wrote:
Thanks a lot for looking into this, Stas!
On Thu, Jul 26, 2018 at 11:49 PM Stas Malyshev smalyshev@wikimedia.org wrote:
So, we have two questions:
- Do we want to enable indexing for all item properties? Note that if
you just want to find items with certain statement values, Wikidata Query Service matches this use case best. It's only in combination with actual fulltext search where on-wiki search is better.
I would say yes.
- Do we need to index P2860 and P1433 at all, and if so, would it be ok
if we omit indexing for now?
Yes it should be perfectly fine to go without these for now - maybe always. They're mostly (only?) used on the large corpus of scientific papers.
Cheers Lydia
-- Lydia Pintscher - http://about.me/lydia.pintscher Product Manager for Wikidata
Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de
Hi!
I tried searching for a few DOIs today, which are string properties (e.g. 10.1371/JOURNAL.PCBI.1002947), and didn't get any results.
Statements are indexed, but you have to use haswbstatement with a specific property to look for them.
Is this the phabricator task for this: https://phabricator.wikimedia.org/T163642 ?
This is the task to make strings searchable _without_ haswbstatement keyword.
On 15 August 2018 at 18:32, Stas Malyshev smalyshev@wikimedia.org wrote:
I tried searching for a few DOIs today, which are string properties (e.g. 10.1371/JOURNAL.PCBI.1002947), and didn't get any results.
Statements are indexed, but you have to use haswbstatement with a specific property to look for them.
Specifically, search for:
haswbstatement:P356=10.1371/JOURNAL.PCBI.1002947
like this:
https://www.wikidata.org/wiki/Special:Search?search=haswbstatement%3AP356%3D...
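For anyone building such links programmatically, a small sketch of composing the search URL from a property ID and a value follows. The helper name `statement_search_url` is made up for illustration, and the output is simply the percent-encoded form of the query above, not necessarily identical to the shortened link.

```python
from urllib.parse import urlencode


def statement_search_url(property_id, value,
                         base="https://www.wikidata.org/wiki/Special:Search"):
    """Return a Special:Search URL for a haswbstatement query."""
    query = f"haswbstatement:{property_id}={value}"
    # urlencode percent-encodes ':', '=' and '/' inside the query string.
    return f"{base}?{urlencode({'search': query})}"


print(statement_search_url("P356", "10.1371/JOURNAL.PCBI.1002947"))
```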