Hey, all!
This new post is to respond to various points from GerardM and P. Blissenbach, previously responding to me on wikidata-l. I'm also cross-posting it to multimedia-l, who no doubt will be able to put me straight about lots of things.
In particular, where will the "topics" associated with image files be stored, and how will they be searched?
* Where will topics be stored ? *
On the question of where the list of topics will be stored, the initial thoughts of the Structured Data team would seem to be clear: they are to be stored on the new CommonsData wikibase.
See eg: https://commons.wikimedia.org/w/index.php?title=File%3AStructured_Data_-_Sli... ("topic links") https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjm... (API design and class diagram)
* How would topics be searched ? *
Gerard wrote:
I am really interested how you envision searching when all those topics are isolated and attached to each file..
The trite answer is: in the same way you would search any other database -- by building an index.
With an index on topic, it should be very simple to pull the identities of all files on CommonsData related to topic Qnnnnn.
* Why not store information with the Q-items on WikiData, regarding what files are related ? *
One could do this. Essentially what we have here is a many-to-many relationship. Each file can have many topics. Each topic can have many files. So the classic relational approach would be a separate join table.
Moving the information out of main Wikidata makes Wikidata smaller and leaner to query, particularly for queries that simply aren't interested in images.
As to whether you really do have a join table, or whether you just consider it all part of CommonsData, that's really up to the developers.
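To make the join-table idea concrete, here is a minimal sketch using SQLite from Python. It is only an illustration, not a description of how CommonsData will actually be built: the table name `file_topic`, the index name, and the example file names are all made up, and plain labels stand in for the Q-ids a real system would store.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory join table linking files to topics (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE file_topic (
            file_id TEXT NOT NULL,   -- e.g. a file page title
            topic   TEXT NOT NULL,   -- would be a Wikidata Q-id in practice
            PRIMARY KEY (file_id, topic)
        );
        -- Index in the other direction, so "all files for topic X" is a cheap lookup.
        CREATE INDEX file_topic_by_topic ON file_topic (topic, file_id);
    """)
    conn.executemany(
        "INSERT INTO file_topic (file_id, topic) VALUES (?, ?)",
        [
            ("File:Example_1.jpg", "Mona Lisa"),
            ("File:Example_1.jpg", "Leonardo da Vinci"),
            ("File:Example_2.jpg", "Mona Lisa"),
            ("File:Example_3.jpg", "Pig"),
        ],
    )
    return conn

def files_for_topic(conn, topic):
    """Each topic can have many files."""
    rows = conn.execute(
        "SELECT file_id FROM file_topic WHERE topic = ? ORDER BY file_id", (topic,)
    )
    return [row[0] for row in rows]

def topics_for_file(conn, file_id):
    """Each file can have many topics."""
    rows = conn.execute(
        "SELECT topic FROM file_topic WHERE file_id = ? ORDER BY topic", (file_id,)
    )
    return [row[0] for row in rows]
```

Calling `files_for_topic(build_demo_db(), "Mona Lisa")` returns the two example files tagged with that topic. The primary key serves lookups by file, and the extra index serves lookups by topic.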
* What about the natural hierarchical structure ? *
eg
Leonardo da Vinci --> Mona Lisa --> Files depicting the Mona Lisa
Shouldn't the fact that it was Leonardo that painted the Mona Lisa only be stored in one place, on Mona Lisa (or perhaps on Leonardo), but *not* multiple times, separately on every single depiction file?
*A*: Probably not, for several reasons.
Trying to find things (and also, to accurately represent things) in a hierarchical structure is the bane of Commons at the moment; it also makes searching Wikidata significantly non-trivial.
So the most significant reason is retrieval.
Suppose we have an image with topic "Gloucestershire Old Spot" (a breed of pig). We also want to be able to retrieve the image rapidly if somebody keys in "Pig".
Similarly, if we have an image of the "Mona Lisa", we also want it to be in the set of images returned when somebody keys in "Leonardo".
For simple searches, one could imagine walking down the Wikidata tree from "Pig" or from "Leonardo", compiling a list of derived search terms, and then building a union set of hits. Slightly more cumbersome than just pulling everything tagged "Pig" from a relational database, but not so different from what WDQ manages.
However, suppose one is combining "pig" and "country house". Does one then have to go down the tree to first identify every single country house, and unify the hits for each one of those searches, before computing the intersection with "pig"? Or does one instead simply go through the hitset for "pig" and see if it is also tagged "country house"?
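The two alternatives can be put side by side in a few lines of Python. This is a toy sketch under loud assumptions: the `PARENT` links and the three files are invented for illustration, each topic has at most one parent (the real Wikidata graph is not that tidy), and nothing here reflects how WDQ or CommonsData actually work.

```python
# Hypothetical "is a kind of / is an instance of" links: child -> parent.
PARENT = {
    "Gloucestershire Old Spot": "Pig",
    "Tamworth pig": "Pig",
    "Chatsworth House": "Country house",
    "Blenheim Palace": "Country house",
}

# Made-up files, each tagged only with its most specific ("lead") topics.
LEAD_TOPICS = {
    "File:A.jpg": {"Gloucestershire Old Spot", "Chatsworth House"},
    "File:B.jpg": {"Tamworth pig"},
    "File:C.jpg": {"Blenheim Palace"},
}

def descendants(topic):
    """Return the topic plus everything below it in the tree."""
    found = {topic}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

def search_by_walking(*topics):
    """Normalised: expand every query term down the tree, union, then intersect."""
    result = set(LEAD_TOPICS)
    for topic in topics:
        terms = descendants(topic)
        result &= {f for f, tags in LEAD_TOPICS.items() if tags & terms}
    return result

def with_ancestors(topic):
    """Return the topic plus everything above it in the tree."""
    out = {topic}
    while topic in PARENT:
        topic = PARENT[topic]
        out.add(topic)
    return out

def build_denormalised_index():
    """Store every file under its lead topics AND all the topics they imply."""
    index = {}
    for file_id, tags in LEAD_TOPICS.items():
        for tag in tags:
            for topic in with_ancestors(tag):
                index.setdefault(topic, set()).add(file_id)
    return index

INDEX = build_denormalised_index()

def search_by_tags(*topics):
    """Denormalised: one set lookup per query term, then intersect."""
    result = set(LEAD_TOPICS)
    for topic in topics:
        result &= INDEX.get(topic, set())
    return result
```

Both `search_by_walking("Pig", "Country house")` and `search_by_tags("Pig", "Country house")` find the same single file, but the second does no tree-walking at query time; that work was done once, when the index was built.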
Now it's not a bad idea to identify "lead topics" and "implied topics" associated with an image. Each time a new topic was added to an image, one would want a lookup to be made on Wikidata and a list of implied topics also to be added. Similarly if a topic identified as a "lead topic" was changed (eg perhaps a country house had been mis-identified), one would also want the list of implied topics to be updated (eg what county it was in, which family it was associated with, etc).
Also the system would need to be looking out for relevant changes on Wikidata -- eg as a result of a new claim being added ("Gloucestershire Old Spot is a type of Pig"), what was previously an independent lead topic "Pig" might become an implied topic.
Similarly, if something in the chain of implications was changed, the consequences of that change would need to be reflected (eg if a parish that the country house was in had been assigned to the wrong county; or if the original that a derivative work was based on had been attributed to the wrong painter).
Having to monitor such things is the price of denormalisation.
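As a rough sketch of what that monitoring involves, the Python below recomputes the lead/implied split for a file, and works out which files need refreshing when a claim about a topic changes. The function names and the `parents` mapping (topic -> set of broader topics) are hypothetical, invented for this example; a real implementation would read the relevant claims from Wikidata.

```python
def ancestors(topic, parents):
    """All topics reachable upwards from `topic`, excluding `topic` itself."""
    seen = set()
    stack = [topic]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def split_lead_and_implied(tagged, parents):
    """A tagged topic is only 'implied' if some other tagged topic leads to it."""
    implied = set()
    for topic in tagged:
        implied |= ancestors(topic, parents)
    lead = set(tagged) - implied
    return lead, implied

def files_to_refresh(changed_topic, file_tags, parents):
    """Files whose implied topics may be stale after a claim on `changed_topic` changes."""
    return {
        file_id
        for file_id, tags in file_tags.items()
        if any(
            changed_topic == tag or changed_topic in ancestors(tag, parents)
            for tag in tags
        )
    }

# Before the claim exists, both tags look like independent lead topics.
tagged = {"Gloucestershire Old Spot", "Pig"}
before = split_lead_and_implied(tagged, {})

# After "Gloucestershire Old Spot is a type of Pig" is added, "Pig" becomes implied.
after = split_lead_and_implied(tagged, {"Gloucestershire Old Spot": {"Pig"}})
```

The refresh step is the part that can run lazily in the background: nothing breaks if the implied topics lag a little behind Wikidata.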
The question one has to ask is what is more troublesome: having to propagate changes like this to multiple places in a denormalised structure where multiple copies of the same information need to be present (which can be done in quite a lazy background way); or, alternatively, having to navigate the normalised structure every time a user wants to build a results set, an overhead which directly affects the speed at which the user can be returned those results ?
* How will searching by users likely be done in practice ? *
A classic approach in combinatorial searching is to give the user an initial set of hits, and then encourage them to refine that set.
This implies, on the basis of the current query and hit-set, trying to identify the best refinement options to offer them.
There may be classic properties like location and time-period. Or there may be tags that can be identified as particularly common in the return set. Or there may be properties that many of those tags are values of, which would make those properties good candidates for refinement.
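For the "tags that are particularly common in the return set" option, a simple count is probably enough to start with. The sketch below is my own assumption about how that might look, with an invented function name and invented data; it skips tags already in the query and tags carried by every hit, since neither would narrow the search.

```python
from collections import Counter

def richest_tags(hit_set, file_tags, query, limit=5):
    """Rank tags by how many files in the current hit set carry them."""
    counts = Counter()
    for file_id in hit_set:
        counts.update(file_tags[file_id] - set(query))
    total = len(hit_set)
    # A tag present on every hit cannot refine the result, so leave it out.
    useful = [(tag, n) for tag, n in counts.items() if n < total]
    # Most frequent first; ties broken alphabetically so the order is stable.
    useful.sort(key=lambda item: (-item[1], item[0]))
    return useful[:limit]

# Made-up example data.
FILE_TAGS_DEMO = {
    "File:A.jpg": {"Pig", "Country house", "Gloucestershire"},
    "File:B.jpg": {"Pig", "Farm"},
    "File:C.jpg": {"Pig", "Country house"},
}
suggestions = richest_tags(set(FILE_TAGS_DEMO), FILE_TAGS_DEMO, query=["Pig"])
```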
But a really classic approach in image searching is simpler than that.
It simply shows a random selection of images from the current hit set, lets the user reveal the tags that are associated with any one of them, and then lets the user add one of those tags to the user's query.
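In code, that loop is about as simple as it sounds. The following is a sketch only, with hypothetical function names and made-up files and tags, and with an in-memory dictionary standing in for whatever store the real system would use.

```python
import random

# Made-up files and the full set of topics stored against each one.
FILE_TAGS = {
    "File:A.jpg": {"Pig", "Country house"},
    "File:B.jpg": {"Pig", "Farm"},
    "File:C.jpg": {"Pig", "Country house", "Garden"},
}

def hits(query):
    """Files tagged with every topic in the query."""
    wanted = set(query)
    return sorted(f for f, tags in FILE_TAGS.items() if wanted <= tags)

def sample_hits(query, k=2, rng=None):
    """Step 1: show a random selection of images from the current hit set."""
    rng = rng or random.Random()
    found = hits(query)
    return rng.sample(found, min(k, len(found)))

def reveal_tags(file_id, query):
    """Step 2: let the user see the tags on one image (minus those already queried)."""
    return sorted(FILE_TAGS[file_id] - set(query))

def refine(query, tag):
    """Step 3: add the chosen tag to the user's query."""
    return tuple(query) + (tag,)

query = ("Pig",)
shown = sample_hits(query, k=2, rng=random.Random(0))
query = refine(query, "Country house")   # user picked a tag seen on one image
narrowed = hits(query)
```

Each refinement is just one more intersection against the topics stored with each file, which is why it matters that those stored topics include the implied ones.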
This is how I would expect an image search on topics to be implemented in the first instance -- because it's such a well-known technique, often works so well, and is so (comparatively) straightforward to implement.
So that's why (IMO) the ability to refine searches by adding another topic needs to be so fast and responsive. In terms of design, this is the optimisation that will affect user experience.
* What about images stored on local language wikis? *
Gerard wrote:
I also am really interested to know when you have all those files isolated on Commons, how you will include media files that are NOT on Commons.. This is a normal use case.
The project is called Structured Data for Commons, and the wikibase being built for it is quite often being called CommonsData.
But it seems to me there is no particular reason why it should not be straightforward to roll out essentially the same structure to local language wikis as well.
I would have thought it would be fairly easy to then implement a federated search, that finds all files matching these criteria on *either* Commons *or* en-wiki (say).
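By "federated search" I mean something like the sketch below: run the same topic query against each wiki's own index and label each result with the wiki it came from. The wiki keys, file names, and the shape of the per-wiki index (topic -> set of files) are assumptions of mine for the sake of the example, not anything the Structured Data team has specified.

```python
def search_one_wiki(index, topics):
    """Files on a single wiki that carry every requested topic."""
    sets = [index.get(topic, set()) for topic in topics]
    return set.intersection(*sets) if sets else set()

def federated_search(indexes, topics):
    """Run the same query on every wiki and tag each hit with its origin."""
    results = []
    for wiki, index in indexes.items():
        for file_id in sorted(search_one_wiki(index, topics)):
            results.append((wiki, file_id))
    return results

# Made-up per-wiki indexes.
INDEXES = {
    "commons": {
        "Pig": {"File:A.jpg", "File:B.jpg"},
        "Country house": {"File:A.jpg"},
    },
    "enwiki": {
        "Pig": {"File:Local.jpg"},
        "Country house": {"File:Local.jpg", "File:Other.jpg"},
    },
}
found = federated_search(INDEXES, ["Pig", "Country house"])
```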
Would one actually implement that all in one wikibase (ImagesData, say, rather than CommonsData) ? That's a call I'd leave to the experts.
On the one hand, it probably would make it easy to search for all files matching the criteria on *any* wiki.
What I suspect is more likely, and probably makes more sense, is to converge the images themselves to all live in one place. So if the same fair-use image was used on multiple fair-use wikis, it would only be stored once (though each fair-use wiki would retain its own File page for it). Such a structure should also make transfers to Commons much easier -- compared to the copy-and-paste by bot at the moment, which loses all the file-page history and most of the upload history.
But there are blockers in the way of that at the moment -- in particular, blockers that need to be addressed for image patrollers and fair-use enforcement specialists still to be able to do their job in such a set-up. To start with, there are lots of tools they use that at the moment only run on one wiki but would effectively need to run on two (or perhaps the fact that it was two wikis would need to be hidden). They would need equivalent admin and deletion rights on both xx-wiki and the xx partition of Images wiki. Ideally they would be able to see changes to the two on the same watchlist. etc etc.
So it may be some time before running the same image search across all wikis can be supported by the system itself. But it will surely be supported through middleware sooner than that.
So those are some thoughts (or maybe some mis-thoughts) about file-topic searching and storage.
Now, tell me what I've got wrong. :-)
All best,
James.