Hey, all!
This new post is to respond to various points from GerardM and P. Blissenbach, previously responding to me on wikidata-l. I'm also cross-posting it to multimedia-l, who no doubt will be able to put me straight about lots of things.
In particular, where will "topics" to be associated with image files be stored, and how will they be searched ?
* Where will topics be stored ? *
On the question of where the list of topics will be stored, the initial thoughts of the Structured Data team would seem to be clear: they are to be stored on the new CommonsData wikibase.
See eg: https://commons.wikimedia.org/w/index.php?title=File%3AStructured_Data_-_Sli... ("topic links") https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjm... (API design and class diagram)
* How would topics be searched ? *
Gerard wrote:
I am really interested how you envision searching when all those topics are isolated and attached to each file..
The trite answer is: in the same way you would search any other database -- by building an index.
It should be very very simple to pull the identities of all files on CommonsData related to topic Qnnnnn.
* Why not store information with the Q-items on WikiData, regarding what files are related ? *
One could do this. Essentially what we have here is a many-many join. Each file can have many topics. Each topic can have many files. So the classic relational approach would be a separate join table.
Moving the information out of main Wikidata makes Wikidata smaller and leaner to query, particularly for queries that simply aren't interested in images.
As to whether you really do have a join table, or whether you just consider it all part of CommonsData, that's really up to the developers.
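For concreteness, here's a toy sketch of such a join table and its indexes (SQLite via Python -- the table and column names are my own invention, not anything from the design docs):

```python
import sqlite3

# Hypothetical many-to-many join between files and topics.
# Schema names are illustrative only, not from the actual design.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE file_topics (
        file_id  TEXT NOT NULL,   -- e.g. 'File:Mona_Lisa.jpg'
        topic_id TEXT NOT NULL    -- e.g. 'Q12418' (a Wikidata Q-number)
    );
    -- An index on topic_id makes 'all files for topic Qnnnnn' one fast lookup;
    -- the index the other way round serves 'all topics for this file'.
    CREATE INDEX idx_by_topic ON file_topics (topic_id);
    CREATE INDEX idx_by_file  ON file_topics (file_id);
""")
db.executemany("INSERT INTO file_topics VALUES (?, ?)", [
    ("File:Mona_Lisa.jpg",        "Q12418"),   # Mona Lisa
    ("File:Mona_Lisa_detail.jpg", "Q12418"),
    ("File:Mona_Lisa.jpg",        "Q762"),     # Leonardo da Vinci
])

# All files on topic Q12418 -- a single indexed lookup, no tree-walking.
files = [row[0] for row in db.execute(
    "SELECT file_id FROM file_topics WHERE topic_id = ?", ("Q12418",))]
print(sorted(files))   # -> ['File:Mona_Lisa.jpg', 'File:Mona_Lisa_detail.jpg']
```

Whether that lives as a literal table or inside the wikibase's own storage is, as I say, a call for the developers.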
* What about the natural hierarchical structure ? *
eg
Leonardo da Vinci --> Mona Lisa --> Files depicting the Mona Lisa
Shouldn't the fact that it was Leonardo that painted the Mona Lisa only be stored in one place, on Mona Lisa, (or perhaps on Leonardo); but *not* multiple times, separately on every single depiction file?
*A*: Probably not, for several reasons.
Trying to find things (and also, to accurately represent things) in a hierarchical structure is the bane of Commons at the moment; it also makes searching Wikidata significantly non-trivial.
So the most significant reason is retrieval.
Suppose we have an image with topic "Gloucestershire Old Spot" (a breed of pig). We also want to be able to retrieve the image rapidly if somebody keys in "Pig".
Similarly, if we have an image of the "Mona Lisa", we also want it to be in the set of images with somebody keying in "Leonardo"
For simple searches, one could imagine walking down the wikidata tree from "Pig" or from "Leonardo", compiling a list of derived search terms, and then building a union set of hits. Slightly more cumbersome than just pulling everything tagged "Pig" from a relational database, but not so different from what WDQ manages.
However, suppose one is combining "pig" and "country house", does one then have to go down the tree to first identify every single country house, and unify the hits for each one of those searches, before computing the intersection with "pig" ? Or does one instead simply go through the hitset for "pig" and see if it is also tagged "country house" ?
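To make the cost concrete, here's a toy sketch of the walk-the-tree-and-union approach (Python; the miniature hierarchy and tag data are invented purely for illustration):

```python
# Invented miniature subclass hierarchy and tag index, for illustration only.
subclasses = {
    "pig": ["Gloucestershire Old Spot", "Berkshire pig"],
    "country house": ["Chatsworth House", "Blenheim Palace"],
}
tagged = {
    "Gloucestershire Old Spot": {"pig1.jpg", "farm3.jpg"},
    "Berkshire pig":            {"pig2.jpg"},
    "Chatsworth House":         {"farm3.jpg", "house1.jpg"},
    "Blenheim Palace":          {"house2.jpg"},
}

def derived_terms(term):
    """Walk down the tree from `term`, compiling it and every term below it."""
    terms = [term]
    for child in subclasses.get(term, []):
        terms.extend(derived_terms(child))
    return terms

def hits(term):
    """Union of the hit-sets of every derived term."""
    result = set()
    for t in derived_terms(term):
        result |= tagged.get(t, set())
    return result

# A combination search has to expand BOTH trees before it can even start
# intersecting -- and a real tree of country houses is thousands of items wide.
print(sorted(hits("pig") & hits("country house")))   # -> ['farm3.jpg']
```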
Now it's not a bad idea to identify "lead topics" and "implied topics" associated with an image. Each time a new topic was added to an image, one would want a lookup to be made on Wikidata and a list of implied topics also to be added. Similarly if a topic identified as a "lead topic" was changed (eg perhaps a country house had been mis-identified), one would also want the list of implied topics to be updated (eg what county it was in, which family it was associated with, etc).
Also, the system would need to be looking out for relevant changes on Wikidata -- eg as a result of a new claim being added ("Gloucestershire Old Spot is a type of Pig"), what was previously an independent lead topic "Pig" might become an implied topic.
Similarly, if something in the chain of implications was changed, the consequences of that change would need to be reflected (eg if a parish that the country house was in had been assigned to the wrong county; or a work that the work was derivative of had been assigned to the wrong painter).
Having to monitor such things is the price of denormalisation.
The question one has to ask is what is more troublesome: having to propagate changes like this to multiple places in a denormalised structure where multiple copies of the same information need to be present (which can be done in quite a lazy background way); or, alternatively, having to navigate the normalised structure every time a user wants to build a results set, an overhead which directly affects the speed at which the user can be returned those results ?
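To illustrate what that maintenance involves, a toy sketch (Python -- the `implies` lookup stands in for a live query against Wikidata, and all the names and helpers here are invented):

```python
# Invented stand-in for a Wikidata lookup: topic -> the topics it implies.
implies = {
    "Gloucestershire Old Spot": {"pig"},
    "Mona Lisa": {"Leonardo da Vinci", "painting"},
}

file_tags = {}   # file -> {"lead": set of topics, "implied": set of topics}

def add_lead_topic(filename, topic):
    """Adding a lead topic also denormalises its implied topics onto the file."""
    entry = file_tags.setdefault(filename, {"lead": set(), "implied": set()})
    entry["lead"].add(topic)
    entry["implied"] |= implies.get(topic, set())

def on_claim_changed(topic, new_implied):
    """A change upstream on Wikidata: lazily re-derive every affected file."""
    implies[topic] = set(new_implied)
    for entry in file_tags.values():
        if topic in entry["lead"]:
            entry["implied"] = set().union(
                *(implies.get(t, set()) for t in entry["lead"]))

add_lead_topic("spot.jpg", "Gloucestershire Old Spot")
# A new claim upstream propagates to the denormalised copy on the file:
on_claim_changed("Gloucestershire Old Spot", {"pig", "rare breed"})
print(sorted(file_tags["spot.jpg"]["implied"]))   # -> ['pig', 'rare breed']
```

The key point being that the re-deriving can run lazily in the background, while the user-facing query stays a single indexed lookup.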
* How will searching by users likely be done in practice ? *
A classic approach in combinatorial searching is to give the user an initial set of hits, and then encourage them to refine that set.
This implies, on the basis of the current query and hit-set, trying to identify the best refinement options to offer them.
There may be classic properties like location and time-period. Or there may be tags that can be identified as particularly rich in the return set. Or properties which those tags are the values of that are particularly rich in the return set.
But a really classic approach in image searching is simpler than that.
It simply shows a random selection of images from the current hit set, lets the user reveal the tags that are associated with any one of them, and then lets the user add one of those tags to the user's query.
This is how, in the first instance, I would expect an image search on topics to be first implemented -- because it's such a well-known technique, often works so well, and is so (comparatively) straightforward to implement.
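As a toy sketch of that sample-and-refine loop (Python; the image index is invented for illustration):

```python
import random

# Invented toy image index: file -> set of topic tags.
index = {
    "pig1.jpg":   {"pig", "farm"},
    "pig2.jpg":   {"pig", "country house"},
    "house1.jpg": {"country house", "garden"},
}

def search(query_tags):
    """Current hit-set: every file carrying all of the query's tags."""
    return {f for f, tags in index.items() if query_tags <= tags}

def sample_with_tags(hit_set, k=2):
    """Show a random selection from the hits, with each file's tags revealed,
    so the user can pick one tag to add to the query."""
    chosen = random.sample(sorted(hit_set), min(k, len(hit_set)))
    return {f: index[f] for f in chosen}

query = {"pig"}
print(sorted(search(query)))         # -> ['pig1.jpg', 'pig2.jpg']
sample_with_tags(search(query))      # user reveals tags, picks 'country house'
query.add("country house")
print(sorted(search(query)))         # -> ['pig2.jpg']
```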
So that's why (IMO) the ability to refine searches by adding another topic needs to be so fast and responsive. In terms of design, this is the optimisation that will affect user experience.
* What about images stored on local language wikis? *
Gerard wrote:
I also am really interested to know when you have all those files isolated on Commons, how you will include media files that are NOT on Commons.. This is a normal use case.
The project is called Structured Data for Commons, and the wikibase being built for it is quite often being called CommonsData.
But it seems to me there is no particular reason why it should not be straightforward to roll out essentially the same structure to local language wikis as well.
I would have thought it would be fairly easy to then implement a federated search, that finds all files matching these criteria on *either* Commons *or* en-wiki (say).
Would one actually implement that all in one wikibase (ImagesData, say, rather than CommonsData) ? That's a call I'd leave to the experts.
On the one hand, it probably would make it easy to search for all files matching the criteria on *any* wiki.
What I suspect is more likely, and probably makes more sense, is to converge the images themselves to all live in one place. So if the same fair-use image was used on multiple fair-use wikis, it would only be stored once (though each fair-use wiki would retain its own File page for it). Such a structure should also make transfers to Commons much easier -- compared to the copy-and-paste by bot at the moment, which loses all the file-page history and most of the upload history.
But there are blockers in the way of that at the moment -- in particular, blockers that need to be addressed for image patrollers and fair-use enforcement specialists still to be able to do their job in such a set-up. To start with, there are lots of tools they use that at the moment run on only one wiki but would need effectively to run on two (or perhaps the fact that it was two wikis would need to be hidden). They would need equivalent admin and deletion rights on both xx-wiki and the xx partition of Images wiki. Ideally they would be able to see changes to the two on the same watchlist. And so on.
So it may be some time before running the same image search across all wikis can be supported by the system itself. But it will surely be supported through middleware sooner than that.
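Roughly speaking, such a middleware layer needs to be little more than this (Python sketch -- the per-wiki search functions are stand-ins for real API calls; everything here is invented for illustration):

```python
# Stand-ins for per-wiki topic-index queries; all data invented.
def search_commons(topic):
    data = {"Q12418": ["File:Mona_Lisa.jpg"]}
    return [("commons", f) for f in data.get(topic, [])]

def search_enwiki(topic):
    data = {"Q12418": ["File:Mona_Lisa_crop.jpg"]}
    return [("enwiki", f) for f in data.get(topic, [])]

def federated_search(topic):
    """Fan the query out to each wiki's (hypothetical) topic index and
    merge the hits, remembering which wiki each file lives on."""
    hits = []
    for backend in (search_commons, search_enwiki):
        hits.extend(backend(topic))
    return hits

print(federated_search("Q12418"))
# -> [('commons', 'File:Mona_Lisa.jpg'), ('enwiki', 'File:Mona_Lisa_crop.jpg')]
```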
So that's some thoughts (or maybe some mis-thoughts) about file-topic searching and storage.
Now, tell me what I've got wrong. :-)
All best,
James.
Just answering a few bits now.
2014-09-13 18:14 GMT+02:00 James Heald j.heald@ucl.ac.uk:
- Where will topics be stored ? *
On the question of where the list of topics will be stored, the initial thoughts of the Structured Data team would seem to be clear: they are to be stored on the new CommonsData wikibase.
See eg: https://commons.wikimedia.org/w/index.php?title=File%3AStructured_Data_-_Slides.pdf&page=17 ("topic links") https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjmQqs (API design and class diagram)
I read that as topics will be stored on Wikidata. That is, on Commons, you say that file DouglasAdams.jpg is about topic Q42, which is referring to an object on Wikidata. Everything about Q42 is stored on Wikidata.
* What about the natural hierarchical structure ? *
eg
Leonardo da Vinci --> Mona Lisa --> Files depicting the Mona Lisa
Shouldn't the fact that it was Leonardo that painted the Mona Lisa only be stored in one place, on Mona Lisa, (or perhaps on Leonardo); but *not* multiple times, separately on every single depiction file?
*A*: Probably not, for several reasons.
If the topics are on Wikidata, you will have this "for free", meaning that the hierarchy is already there, ready to be exploited.
/Jan Ainali
On 13/09/2014 18:15, Jan Ainali wrote:
2014-09-13 18:14 GMT+02:00 James Heald j.heald@ucl.ac.uk:
- Where will topics be stored ? *
On the question of where the list of topics will be stored, the initial thoughts of the Structured Data team would seem to be clear: they are to be stored on the new CommonsData wikibase.
See eg: https://commons.wikimedia.org/w/index.php?title=File%3AStructured_Data_-_Slides.pdf&page=17 ("topic links")
I read that as topics will be stored on Wikidata. That is, on Commons, you say that file DouglasAdams.jpg is about topic Q42, which is referring to an object on Wikidata. Everything about Q42 is stored on Wikidata.
Yes, I imagine you would store say
Q42182 (pointing to Buckingham Palace), probably with P180 ("depicts" -- as opposed to "signature of", or "chemical structure for").
But I suspect you would also store eg
Q16560 (palace) etc; even though this is implied by Buckingham Palace
- What about the natural hierarchical structure ? *
eg
Leonardo da Vinci --> Mona Lisa --> Files depicting the Mona Lisa
Shouldn't the fact that it was Leonardo that painted the Mona Lisa only be stored in one place, on Mona Lisa, (or perhaps on Leonardo); but *not* multiple times, separately on every single depiction file?
*A*: Probably not, for several reasons.
If the topics are on Wikidata, you will have this "for free", meaning that the hierarchy is already there, ready to be exploited.
Yes the hierarchy is there, ready to be exploited.
But exploiting it costs time.
The point I'm making in my post is that, especially when the user request is a combination search on two quite general topics, you don't want to be hanging around *waiting* while the system works out how to exploit it.
Instead you want the answer then and there -- and that means denormalisation.
-- James.
2014-09-13 20:15 GMT+02:00 James Heald j.heald@ucl.ac.uk:
Yes the hierarchy is there, ready to be exploited.
But exploiting it costs time.
The point I'm making in my post is that, especially when the user request is a combination search on two quite general topics, you don't want to be hanging around *waiting* while the system works out how to exploit it.
Instead you want the answer then and there -- and that means denormalisation.
Let the ops worry about time, I have not heard them complain about a search dystopia yet. Even the Wikidata Query has reasonable response time compared to the power it offers in the queries. And that is on wmflabs, not a production server. You're saying that even when we make the effort to get structured linked data we should not exploit the single most important advantage it offers. It does not make sense. It is almost like just repeating the category system again but with another software (albeit one that offers multilinguality).
/Jan
"Let the ops worry about time" is not an answer.
We're talking about something we're hoping to turn into a world-class mass-use image bank, and its front-line public-facing search capability.
That's on an altogether different scale to WDQ running a few hundred searches a day.
Moreover, we're talking about a public-facing search capability, where your user clicks a tag and wants an updated results set *instantly* -- sitting around while the server makes a cup of tea, or declares the query too complex and goes into a sulk, is not an option.
If the user wants a search on "palace" and "soldier", there simply is not time for the server to first recursively build a list of every palace it knows about, then every image related to each of those palaces, then every soldier it knows about, every image related to each of those soldiers, then intersect the two (very big) lists before it can start delivering any image hits at all. That is not acceptable. A random internet user wants those hits straight away.
The only way to routinely be able to deliver that is denormalisation.
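That is, with implied topics denormalised onto each file, the combination search collapses to intersecting pre-built sets -- one cheap operation per extra topic (toy Python sketch, data invented):

```python
# Because 'palace' has been denormalised onto every image of every palace,
# its postings set is already complete -- no recursive expansion at query time.
postings = {
    "palace":  {"buck1.jpg", "versailles2.jpg", "guard3.jpg"},
    "soldier": {"guard3.jpg", "parade1.jpg"},
}

def combined_search(*topics):
    """Intersect the pre-built postings sets, narrowing as each topic is added."""
    result = None
    for t in topics:
        hits = postings.get(t, set())
        result = hits if result is None else result & hits
    return result if result is not None else set()

print(sorted(combined_search("palace", "soldier")))   # -> ['guard3.jpg']
```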
It's not a question of just buying some more blades and filling up some more racks. That doesn't get you a big enough factor of speedup.
What we have is a design challenge, which needs a design solution.
-- James.
2014-09-13 21:51 GMT+02:00 James Heald j.heald@ucl.ac.uk:
What we have is a design challenge, which needs a design solution.
And I am just saying that you are jumping to conclusions, because there is no evidence saying that the servers that would be set up to handle this would not be able to handle the load.
Before restricting ourselves from the functionality that we really, really would like to have -- functionality that would avoid replicating the manual labour that we hate and want to get away from -- could we please let the ops chip in?
/Jan
Just a word of caution about collecting all images on Commons. A while ago, at least, some local wikis had images with license terms incompatible with Commons and vice versa. I recall very simple logos of companies, and several types of "fair use" derivatives.
If that is still so, we have an obstacle that may prevent us from both moving images, and even linking to them under some local laws.
Technically, I agree with the idea quoted below.
Purodha
"James Heald" j.heald@ucl.ac.uk wrote:
What I suspect is more likely, and probably makes more sense, is to converge the images themselves to all live in one place. So if the same fair-use image was used on multiple fair-use wikis, it would only be stored once (though each fair-use wiki would retain its own File page for it). Such a structure should also make transfers to Commons much easier -- compared to the copy-and-paste by bot at the moment, which loses all the file-page history and most of the upload history.
Yes.
Just to be clear, if we did "converge all the images to live in one place", I am not suggesting they would all be free, and I'm not suggesting they would all belong to Commons.
Just that they would all physically live in the same integrated structure; but one that would still appear to the external browser to have different 'partitions', corresponding to the different language wikis, each with a different base url.
(But inside the server all part of one integrated system, making it easy to move a file from a national partition to the Commons partition, or vice-versa -- *if* that was legally appropriate).
-- James.
2014-09-13 23:53 GMT+02:00 James Heald j.heald@ucl.ac.uk:
Just to be clear, if we did "converge all the images to live in one place", I am not suggesting they would all be free, and I'm not suggesting they would all belong to Commons.
If I can dream, I would say that we should only have Commons as file storage. All "local" exceptions would need to be deleted. Today I suspect that some users unknowingly make copyright infringements because we allow "local" uploads. For example, if a Swede in Sweden uploaded an image to enwp claiming American fair use, it would probably not be okay, since we do not have a similar law in Sweden (and since around 40% of the traffic from Sweden goes to enwp, Swedes would be expected as an audience).
But I realize I am in a minority here and will not pursue this further (but would cheer if WMF legal would).
/Jan