On Thu, Mar 7, 2013 at 12:50 PM, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
As you probably know, the search in Wikidata sucks big time.
Until we have created a proper Solr-based search and deployed it on that infrastructure, we would like to implement and set up a reasonable stopgap solution.
The simplest and most obvious signal for sorting the items would be to
- make a prefix search
- weight each result by the number of Wikipedias it links to
This should usually provide the item you are looking for. Currently, the search order is random. Good luck with finding items like California, Wellington, or Berlin.
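To make the two steps above concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the ranking idea, not the Wikibase implementation: the `Term` record, the `prefix_search` helper, the item identifiers and the sitelink counts are all made up for the example.

```python
from typing import NamedTuple


class Term(NamedTuple):
    text: str       # label or alias of an item
    item_id: str    # hypothetical item identifier, not a real Wikidata ID
    sitelinks: int  # number of Wikipedias the item links to (the "weight")


def prefix_search(terms, prefix, limit=10):
    """Return up to `limit` terms starting with `prefix`, heaviest first."""
    needle = prefix.casefold()
    # Step 1: prefix search (case-insensitive here, which is an assumption).
    matches = [t for t in terms if t.text.casefold().startswith(needle)]
    # Step 2: order by weight, i.e. by number of linked Wikipedias.
    matches.sort(key=lambda t: t.sitelinks, reverse=True)
    return matches[:limit]


# Invented sample data, only to show the ordering.
TERMS = [
    Term("Berlin", "item-city", 200),
    Term("Berlin", "item-town", 3),
    Term("Berlin Wall", "item-wall", 90),
    Term("Bern", "item-bern", 150),
]
```

With this sample data, `prefix_search(TERMS, "berl")` puts the heavily linked "Berlin" first and the barely linked one last, which is the behaviour the weighting is meant to give.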
Now, what I want to ask is: what would be the appropriate index structure for that table? The data is saved in the wb_terms table, which would need to be extended by a "weight" field. There is already a suggestion (based on discussions between Tim and Daniel K, if I understood correctly) to change the wb_terms table index structure (see https://bugzilla.wikimedia.org/show_bug.cgi?id=45529), but since we are changing the index structure anyway, it would be great to get it right this time.
Anyone who can jump in? (Looking especially at Asher and Tim)
Any help would be appreciated.
Cheers, Denny
-- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
AFAIK SQL isn't particularly good at indexing for that type of query.
You could maybe have a bunch of indexes for the first couple of letters of a term, and then after some point hope that things are narrowed down enough that just doing a prefix search is acceptable. For example, you might have indexes on (wb_term(1), wb_weight), (wb_term(2), wb_weight), ..., (wb_term(7), wb_weight) and one on just wb_term. That way (I believe) you would be able to do efficient searches for a prefix ordered by weight, provided the prefix is at most 7 characters long. (7 was chosen arbitrarily out of a hat. Performance goes down as you add more indexes, from what I understand. I'm not sure how far you would be able to take this scheme before that becomes an issue. You could maybe enhance this by only updating the search suggestions for every 2 characters the user enters or something, so that only every other prefix length needs its own index.)
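Here is a rough sketch of that scheme, written against SQLite from Python so it can be run anywhere. Assumptions to note: SQLite has no MySQL-style column-prefix indexes such as wb_term(3), so indexes on the expression substr(wb_term, 1, n) stand in for them; the two-column table layout, the index names and the `build_db` and `search` helpers are invented for the example; and whether MySQL's optimizer would use real prefix indexes the same way is not something this shows.

```python
import sqlite3

MAX_PREFIX = 7  # the arbitrary cutoff from the text above


def build_db(rows):
    """Create an in-memory table with one (prefix, weight) index per length."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wb_terms (wb_term TEXT NOT NULL, wb_weight INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO wb_terms VALUES (?, ?)", rows)
    for n in range(1, MAX_PREFIX + 1):
        # Expression index standing in for MySQL's (wb_term(n), wb_weight).
        conn.execute(
            f"CREATE INDEX term_prefix_{n} "
            f"ON wb_terms (substr(wb_term, 1, {n}), wb_weight)"
        )
    conn.execute("CREATE INDEX term_full ON wb_terms (wb_term)")
    return conn


def search(conn, prefix, limit=10):
    """Case-sensitive prefix search, heaviest terms first."""
    n = len(prefix)
    if 1 <= n <= MAX_PREFIX:
        # Equality on the indexed expression, ordered by the second index column.
        sql = (
            f"SELECT wb_term, wb_weight FROM wb_terms "
            f"WHERE substr(wb_term, 1, {n}) = ? "
            f"ORDER BY wb_weight DESC LIMIT ?"
        )
        return conn.execute(sql, (prefix, limit)).fetchall()
    # Longer (or empty) prefix: range condition on the plain wb_term index,
    # then sort whatever is left. The "\uffff" upper bound is a simplification
    # that misses terms continuing with characters outside the BMP.
    sql = (
        "SELECT wb_term, wb_weight FROM wb_terms "
        "WHERE wb_term >= ? AND wb_term < ? "
        "ORDER BY wb_weight DESC LIMIT ?"
    )
    return conn.execute(sql, (prefix, prefix + "\uffff", limit)).fetchall()
```

To check whether a given query actually uses one of the indexes, prefixing the statement with SQLite's EXPLAIN QUERY PLAN shows the plan chosen; the sketch itself returns the same rows either way.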
--bawolff
p.s. Have not tested this, and talking a bit outside my knowledge area, so ymmv