Re: [Wikitech-l] Indexing structures for Wikidata

8 Mar 2013


On Thu, Mar 7, 2013 at 12:50 PM, Denny Vrandečić
<denny.vrandecic@wikimedia.de> wrote:
...
As you probably know, the search in Wikidata sucks big time.
Until we have created a proper Solr-based search and deployed on that
infrastructure, we would like to implement and set up a reasonable stopgap
solution.
The simplest and most obvious signal for sorting the items would be to

- make a prefix search
- weight all results by the number of Wikipedias it links to

This should usually provide the item you are looking for. Currently, the
search order is random. Good luck with finding items like California,
Wellington, or Berlin.
Now, what I want to ask is, what would be the appropriate index structure
for that table. The data is saved in the wb_terms table, which would need
to be extended by a "weight" field. There is already a suggestion (based on
discussions between Tim and Daniel K if I understood correctly) to change
the wb_terms table index structure (see here
<https://bugzilla.wikimedia.org/show_bug.cgi?id=45529>), but since we are
changing the index structure anyway it would be great to get it right this
time.
Anyone who can jump in? (Looking especially at Asher and Tim)
Any help would be appreciated.
Cheers,
Denny
--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/681/51985.

AFAIK SQL isn't particularly good at indexing that type of query.
You could maybe have a bunch of indexes on the first couple of letters
of a term, and then after some point hope that things are narrowed
down enough that just doing a prefix search is acceptable. For
example, you might have indexes on (wb_term(1), wb_weight),
(wb_term(2), wb_weight), ..., (wb_term(7), wb_weight) and one on just
wb_term. That way (I believe) you would be able to do efficient
searches for a prefix ordered by weight, provided the prefix is no
longer than 7 characters. (7 was chosen arbitrarily out of a hat.
Performance goes down as you add more indexes, from what I understand.
I'm not sure how far you would be able to take this scheme before that
becomes an issue. You could maybe enhance this by only showing search
suggestion updates for every 2 characters the user enters, or something.)
--bawolff
P.S. Have not tested this, and I'm talking a bit outside my knowledge area, so YMMV.
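
A minimal sketch of the scheme described above, for illustration only. It uses Python's bundled sqlite3 rather than the production database, so SQLite expression indexes on substr() stand in for the (wb_term(n), wb_weight) prefix indexes; that substitution is an assumption, and it needs SQLite 3.9 or later. The table layout, the column names wb_term and wb_weight, and the helpers build_db and search are hypothetical and not the real wb_terms schema. Whether a given database engine actually uses such indexes to avoid a sort is not verified here.

```python
import sqlite3

MAX_PREFIX = 7  # the arbitrary cutoff suggested in the message


def build_db(rows):
    """Create an in-memory table with one (prefix, weight) index per length."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wb_terms (wb_term TEXT NOT NULL, wb_weight INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO wb_terms VALUES (?, ?)", rows)
    for n in range(1, MAX_PREFIX + 1):
        # Expression index: first n characters of the term, then the weight.
        conn.execute(
            f"CREATE INDEX wb_terms_prefix_{n} "
            f"ON wb_terms (substr(wb_term, 1, {n}), wb_weight)"
        )
    # Plain index on the full term, used for longer prefixes.
    conn.execute("CREATE INDEX wb_terms_term ON wb_terms (wb_term)")
    return conn


def search(conn, prefix, limit=10):
    """Return (term, weight) rows matching the prefix, heaviest first."""
    n = len(prefix)
    if n == 0:
        return []
    if n <= MAX_PREFIX:
        # Equality on the indexed expression, ordered by the second index column.
        sql = (
            f"SELECT wb_term, wb_weight FROM wb_terms "
            f"WHERE substr(wb_term, 1, {n}) = ? "
            f"ORDER BY wb_weight DESC LIMIT ?"
        )
        return conn.execute(sql, (prefix, limit)).fetchall()
    # Longer prefix: range scan on the plain index, then sort what is left.
    # The upper bound bumps the last character; edge cases at the end of
    # the Unicode range are ignored in this sketch.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    sql = (
        "SELECT wb_term, wb_weight FROM wb_terms "
        "WHERE wb_term >= ? AND wb_term < ? "
        "ORDER BY wb_weight DESC LIMIT ?"
    )
    return conn.execute(sql, (prefix, upper, limit)).fetchall()


# Made-up weights, purely illustrative (not real sitelink counts).
SAMPLE_ROWS = [
    ("Berlin", 200),
    ("Berlin, New Hampshire", 5),
    ("Bern", 150),
    ("Bergen", 90),
    ("California", 180),
]
```

Calling search(build_db(SAMPLE_ROWS), "Ber") returns the matching terms ordered by descending weight. Prefixes longer than MAX_PREFIX fall back to the plain index on wb_term, on the hope expressed above that the candidate set is small enough by then to sort cheaply.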
