A very simplified version of the post:
Do we need a query result cache at the tag or item level? That is a design question, and the answer is: yes.
If we have it, we can and should pre-fill it with server-generated data for all kinds of queries that haven't been received and executed yet.
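A minimal sketch of that idea in Python, assuming a simple dict-backed cache; the names (QueryResultCache, warm_cache, run_query) and the notion of a list of anticipated queries are hypothetical, for illustration only:

    # Sketch only: a dict-backed result cache that can be warmed with
    # server-generated results before the queries ever arrive.
    class QueryResultCache:
        def __init__(self):
            self._results = {}               # query key -> result set

        def prefill(self, query, result):
            """Store a pre-computed result before anyone asks for it."""
            self._results[query] = result

        def get(self, query, compute):
            """Return the cached result, computing it only on a miss."""
            if query not in self._results:
                self._results[query] = compute(query)
            return self._results[query]

    def warm_cache(cache, anticipated_queries, run_query):
        """Pre-fill results for queries we expect but have not received yet."""
        for q in anticipated_queries:
            cache.prefill(q, run_query(q))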
Whether we assign idle resources or an entire high-performance server cluster to it is an ops question.
Purodha
"James Heald" j.heald@ucl.ac.uk wrote:
Hi Thomas,
I'm not really talking about the specific query *engine* that will work on the file topic data. (Well, maybe a little, in general terms about some of the functionality we might want in such a search).
What I'm more talking about is the kind of data that will likely need to be stored on the CommonsData wikibase to make any such query engine *possible* with reasonable speed -- in particular, not just the most specific Q-numbers that apply to a file, but (IMO) *any* Q-number for which the file should be returned if the topic corresponding to that Q-number were searched for.
I'm saying that such a Q-number needs to be included on the item on CommonsData for the file -- it's not enough that, if one used Wikidata to look up the more specific Q-number, the less specific Q-number would be returned. I'm saying that lookup already needs to have been done (and maintained), so the less specific Q-number is already sitting on CommonsData when someone comes to search for it.
This doesn't need to be a manual process (though the presence of a Q-number on a CommonsData item perhaps needs to be subject to manual overrule, in case the inference chain has gone wrong and it really isn't relevant); but what I'm saying is that you can't wait to do the inference when the search request comes in -- instead, the relevant Q-numbers for each file need to be pre-computed and stored on the CommonsData item, so that when the search request comes in they are already there to be searched on. That denormalisation of information really needs to be in place whatever the fine coding of the engine -- it's data design, rather than engine coding.
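To make that pre-computation concrete, here is a minimal Python sketch. The "broader" mapping (e.g. derived offline from Wikidata's "subclass of" claims) and all function names are assumptions for illustration, not an actual CommonsData API:

    def ancestors(qid, broader):
        """Every less specific Q-number reachable from qid."""
        seen, stack = set(), [qid]
        while stack:
            for parent in broader.get(stack.pop(), ()):
                if parent not in seen:       # also guards against cycles
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def denormalise_topics(specific_qids, broader, overruled=frozenset()):
        """Q-numbers to store on the CommonsData item: the specific topics
        plus every inferred ancestor, minus manually overruled ones."""
        stored = set(specific_qids)
        for qid in specific_qids:
            stored |= ancestors(qid, broader)
        return stored - set(overruled)

The "overruled" parameter is where the manual overrule mentioned above would plug in: an inferred Q-number a human has marked as not actually relevant stays off the item even though the inference chain produces it.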
-- James.
On 13/09/2014 20:56, Thomas Douillard wrote:
Hi James, I don't understand (I must admit I did not read the whole topic). Are we talking about a specific query engine? The one the development team will implement in Wikibase, or are we talking about something else?
If we do not know that, it seems difficult to have this conversation at this point.
2014-09-13 21:51 GMT+02:00 James Heald j.heald@ucl.ac.uk:
"Let the ops worry about time" is not an answer.
We're talking about something we're hoping to turn into a world-class, mass-use image bank, and its front-line public-facing search capability.
That's on an altogether different scale to WDQ running a few hundred searches a day.
Moreover, we're talking about a public-facing search capability, where your user clicks a tag and wants an updated result set *instantly* -- sitting around while the server makes a cup of tea, or while it declares the query too complex and goes into a sulk, is not an option.
If the user wants a search on "palace" and "soldier", there simply is not time for the server to first recursively build a list of every palace it knows about, then every image related to each of those palaces, then every soldier it knows about and every image related to each of those soldiers, and only then intersect the two (very big) lists before it can start delivering any image hits at all. That is not acceptable. A random internet user wants those hits straight away.
The only way to routinely be able to deliver that is denormalisation.
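As a sketch of the difference, assuming a pre-built inverted index from Q-number to the set of files already carrying it (all names here are hypothetical), the query-time work collapses to a set intersection:

    # files_by_qid is maintained offline from the denormalised Q-numbers
    # on each item, so no recursive expansion happens at query time.
    def search(files_by_qid, *qids):
        """Intersect pre-built postings lists, smallest first."""
        postings = sorted((files_by_qid.get(q, set()) for q in qids), key=len)
        if not postings:
            return set()
        result = postings[0].copy()
        for p in postings[1:]:
            result &= p        # e.g. files tagged "palace" AND "soldier"
        return result

Starting from the smallest postings list keeps the intersection cheap even when one of the tags is very common.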
It's not a question of just buying some more blades and filling up some more racks. That doesn't get you a big enough factor of speedup.
What we have is a design challenge, which needs a design solution.
-- James.
Let the ops worry about time; I have not heard them complain about a search dystopia yet. Even WikiData Query has reasonable response times compared to the power it offers in queries. And that is on wmflabs, not a production server. You're saying that even when we make the effort to get structured linked data, we should not exploit the single most important advantage it offers. It does not make sense. It's almost like just repeating the category system again with another software (albeit one that offers multilinguality).
/Jan
Multimedia mailing list Multimedia@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/multimedia
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l