Hey, all!
This new post is to respond to various points from GerardM and P. Blissenbach, previously responding to me on wikidata-l. I'm also cross-posting it to multimedia-l, who no doubt will be able to put me straight about lots of things.
In particular, where will "topics" to be associated with image files be stored, and how will they be searched ?
* Where will topics be stored ? *
On the question of where the list of topics will be stored, the initial thoughts of the Structured Data team would seem to be clear: they are to be stored on the new CommonsData wikibase.
See eg: https://commons.wikimedia.org/w/index.php?title=File%3AStructured_Data_-_Sli... ("topic links") https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjm... (API design and class diagram)
* How would topics be searched ? *
Gerard wrote:
I am really interested how you envision searching when all those topics are isolated and attached to each file..
The trite answer is: in the same way you would search any other database -- by building an index.
It should be very very simple to pull the identities of all files on CommonsData related to topic Qnnnnn.
* Why not store information with the Q-items on WikiData, regarding what files are related ? *
One could do this. Essentially what we have here is a many-many join. Each file can have many topics. Each topic can have many files. So the classic relational approach would be a separate join table.
Moving the information out of main Wikidata makes Wikidata smaller and leaner to query, particularly for queries that simply aren't interested in images.
As to whether you really do have a join table, or whether you just consider it all part of CommonsData, that's really up to the developers.
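For concreteness, here's a toy sketch of such a join table and its indexes (SQLite via Python -- the table and column names are my own invention, not anything from the design docs):

```python
import sqlite3

# Hypothetical many-to-many join between files and topics.
# Schema names are illustrative only, not from the actual design.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE file_topics (
        file_id  TEXT NOT NULL,   -- e.g. 'File:Mona_Lisa.jpg'
        topic_id TEXT NOT NULL    -- e.g. 'Q12418' (a Wikidata Q-number)
    );
    -- An index on topic_id makes 'all files for topic Qnnnnn' one fast lookup;
    -- the index the other way round serves 'all topics for this file'.
    CREATE INDEX idx_by_topic ON file_topics (topic_id);
    CREATE INDEX idx_by_file  ON file_topics (file_id);
""")
db.executemany("INSERT INTO file_topics VALUES (?, ?)", [
    ("File:Mona_Lisa.jpg",        "Q12418"),   # Mona Lisa
    ("File:Mona_Lisa_detail.jpg", "Q12418"),
    ("File:Mona_Lisa.jpg",        "Q762"),     # Leonardo da Vinci
])

# All files on topic Q12418 -- a single indexed lookup, no tree-walking.
files = [row[0] for row in db.execute(
    "SELECT file_id FROM file_topics WHERE topic_id = ?", ("Q12418",))]
print(sorted(files))   # -> ['File:Mona_Lisa.jpg', 'File:Mona_Lisa_detail.jpg']
```

Whether that lives as a literal table or inside the wikibase's own storage is, as I say, a call for the developers.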
* What about the natural hierarchical structure ? *
eg
Leonardo da Vinci --> Mona Lisa --> Files depicting the Mona Lisa
Shouldn't the fact that it was Leonardo that painted the Mona Lisa only be stored in one place, on Mona Lisa, (or perhaps on Leonardo); but *not* multiple times, separately on every single depiction file?
*A*: Probably not, for several reasons.
Trying to find things (and also, to accurately represent things) in a hierarchical structure is the bane of Commons at the moment; it also makes searching Wikidata significantly non-trivial.
So the most significant reason is retrieval.
Suppose we have an image with topic "Gloucestershire Old Spot" (a breed of pig). We also want to be able to retrieve the image rapidly if somebody keys in "Pig".
Similarly, if we have an image of the "Mona Lisa", we also want it to be in the set of images with somebody keying in "Leonardo"
For simple searches, one could imagine walking down the wikidata tree from "Pig" or from "Leonardo", compiling a list of derived search terms, and then building a union set of hits. Slightly more cumbersome than just pulling everything tagged "Pig" from a relational database, but not so different from what WDQ manages.
However, suppose one is combining "pig" and "country house", does one then have to go down the tree to first identify every single country house, and unify the hits for each one of those searches, before computing the intersection with "pig" ? Or does one instead simply go through the hitset for "pig" and see if it is also tagged "country house" ?
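To make the cost concrete, here's a toy sketch of the walk-the-tree-and-union approach (Python; the miniature hierarchy and tag data are invented purely for illustration):

```python
# Invented miniature subclass hierarchy and tag index, for illustration only.
subclasses = {
    "pig": ["Gloucestershire Old Spot", "Berkshire pig"],
    "country house": ["Chatsworth House", "Blenheim Palace"],
}
tagged = {
    "Gloucestershire Old Spot": {"pig1.jpg", "farm3.jpg"},
    "Berkshire pig":            {"pig2.jpg"},
    "Chatsworth House":         {"farm3.jpg", "house1.jpg"},
    "Blenheim Palace":          {"house2.jpg"},
}

def derived_terms(term):
    """Walk down the tree from `term`, compiling it and every term below it."""
    terms = [term]
    for child in subclasses.get(term, []):
        terms.extend(derived_terms(child))
    return terms

def hits(term):
    """Union of the hit-sets of every derived term."""
    result = set()
    for t in derived_terms(term):
        result |= tagged.get(t, set())
    return result

# A combination search has to expand BOTH trees before it can even start
# intersecting -- and a real tree of country houses is thousands of items wide.
print(sorted(hits("pig") & hits("country house")))   # -> ['farm3.jpg']
```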
Now it's not a bad idea to identify "lead topics" and "implied topics" associated with an image. Each time a new topic was added to an image, one would want a lookup to be made on Wikidata and a list of implied topics also to be added. Similarly if a topic identified as a "lead topic" was changed (eg perhaps a country house had been mis-identified), one would also want the list of implied topics to be updated (eg what county it was in, which family it was associated with, etc).
Also, the system would need to be looking out for relevant changes on Wikidata -- eg as a result of a new claim being added ("Gloucestershire Old Spot is a type of Pig"), what was previously an independent lead topic "Pig" might become an implied topic.
Similarly, if something in the chain of implications was changed, the consequences of that change would need to be reflected (eg if a parish that the country house was in had been assigned to the wrong county; or a work that the work was derivative of had been assigned to the wrong painter).
Having to monitor such things is the price of denormalisation.
The question one has to ask is what is more troublesome: having to propagate changes like this to multiple places in a denormalised structure where multiple copies of the same information need to be present (which can be done in quite a lazy background way); or, alternatively, having to navigate the normalised structure every time a user wants to build a results set, an overhead which directly affects the speed at which the user can be returned those results ?
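To illustrate what that maintenance involves, a toy sketch (Python -- the `implies` lookup stands in for a live query against Wikidata, and all the names and helpers here are invented):

```python
# Invented stand-in for a Wikidata lookup: topic -> the topics it implies.
implies = {
    "Gloucestershire Old Spot": {"pig"},
    "Mona Lisa": {"Leonardo da Vinci", "painting"},
}

file_tags = {}   # file -> {"lead": set of topics, "implied": set of topics}

def add_lead_topic(filename, topic):
    """Adding a lead topic also denormalises its implied topics onto the file."""
    entry = file_tags.setdefault(filename, {"lead": set(), "implied": set()})
    entry["lead"].add(topic)
    entry["implied"] |= implies.get(topic, set())

def on_claim_changed(topic, new_implied):
    """A change upstream on Wikidata: lazily re-derive every affected file."""
    implies[topic] = set(new_implied)
    for entry in file_tags.values():
        if topic in entry["lead"]:
            entry["implied"] = set().union(
                *(implies.get(t, set()) for t in entry["lead"]))

add_lead_topic("spot.jpg", "Gloucestershire Old Spot")
# A new claim upstream propagates to the denormalised copy on the file:
on_claim_changed("Gloucestershire Old Spot", {"pig", "rare breed"})
print(sorted(file_tags["spot.jpg"]["implied"]))   # -> ['pig', 'rare breed']
```

The key point being that the re-deriving can run lazily in the background, while the user-facing query stays a single indexed lookup.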
* How will searching by users likely be done in practice ? *
A classic approach in combinatorial searching is to give the user an initial set of hits, and then encourage them to refine that set.
This implies, on the basis of the current query and hit-set, trying to identify the best refinement options to offer them.
There may be classic properties like location and time-period. Or there may be tags that can be identified as particularly rich in the return set. Or properties which those tags are the values of that are particularly rich in the return set.
But a really classic approach in image searching is simpler than that.
It simply shows a random selection of images from the current hit set, lets the user reveal the tags that are associated with any one of them, and then lets the user add one of those tags to the user's query.
This is how, in the first instance, I would expect an image search on topics to be first implemented -- because it's such a well-known technique, often works so well, and is so (comparatively) straightforward to implement.
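As a toy sketch of that sample-and-refine loop (Python; the image index is invented for illustration):

```python
import random

# Invented toy image index: file -> set of topic tags.
index = {
    "pig1.jpg":   {"pig", "farm"},
    "pig2.jpg":   {"pig", "country house"},
    "house1.jpg": {"country house", "garden"},
}

def search(query_tags):
    """Current hit-set: every file carrying all of the query's tags."""
    return {f for f, tags in index.items() if query_tags <= tags}

def sample_with_tags(hit_set, k=2):
    """Show a random selection from the hits, with each file's tags revealed,
    so the user can pick one tag to add to the query."""
    chosen = random.sample(sorted(hit_set), min(k, len(hit_set)))
    return {f: index[f] for f in chosen}

query = {"pig"}
print(sorted(search(query)))         # -> ['pig1.jpg', 'pig2.jpg']
sample_with_tags(search(query))      # user reveals tags, picks 'country house'
query.add("country house")
print(sorted(search(query)))         # -> ['pig2.jpg']
```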
So that's why (IMO) the ability to refine searches by adding another topic needs to be so fast and responsive. In terms of design, this is the optimisation that will affect user experience.
* What about images stored on local language wikis? *
Gerard wrote:
I also am really interested to know when you have all those files isolated on Commons, how you will include media files that are NOT on Commons.. This is a normal use case.
The project is called Structured Data for Commons, and the wikibase being built for it is quite often being called CommonsData.
But it seems to me there is no particular reason why it should not be straightforward to roll out essentially the same structure to local language wikis as well.
I would have thought it would be fairly easy to then implement a federated search, that finds all files matching these criteria on *either* Commons *or* en-wiki (say).
Would one actually implement that all in one wikibase (ImagesData, say, rather than CommonsData) ? That's a call I'd leave to the experts.
On the one hand, it probably would make it easy to search for all files matching the criteria on *any* wiki.
What I suspect is more likely, and probably makes more sense, is to converge the images themselves to all live in one place. So if the same fair-use image was used on multiple fair-use wikis, it would only be stored once (though each fair-use wiki would retain its own File page for it). Such a structure should also make transfers to Commons much easier -- compared to the copy-and-paste by bot at the moment, which loses all the file-page history and most of the upload history.
But there are blockers in the way of that at the moment -- in particular, blockers that need to be addressed for image patrollers and fair-use enforcement specialists still to be able to do their job in such a set-up. To start with, there are lots of tools they use that at the moment run on only one wiki but would need effectively to run on two (or perhaps the fact that it was two wikis would need to be hidden). They would need equivalent admin and deletion rights on both xx-wiki and the xx partition of Images wiki. Ideally they would be able to see changes to the two on the same watchlist. And so on.
So it may be some time before running the same image search across all wikis can be supported by the system itself. But it will surely be supported through middleware sooner than that.
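Roughly speaking, such a middleware layer needs to be little more than this (Python sketch -- the per-wiki search functions are stand-ins for real API calls; everything here is invented for illustration):

```python
# Stand-ins for per-wiki topic-index queries; all data invented.
def search_commons(topic):
    data = {"Q12418": ["File:Mona_Lisa.jpg"]}
    return [("commons", f) for f in data.get(topic, [])]

def search_enwiki(topic):
    data = {"Q12418": ["File:Mona_Lisa_crop.jpg"]}
    return [("enwiki", f) for f in data.get(topic, [])]

def federated_search(topic):
    """Fan the query out to each wiki's (hypothetical) topic index and
    merge the hits, remembering which wiki each file lives on."""
    hits = []
    for backend in (search_commons, search_enwiki):
        hits.extend(backend(topic))
    return hits

print(federated_search("Q12418"))
# -> [('commons', 'File:Mona_Lisa.jpg'), ('enwiki', 'File:Mona_Lisa_crop.jpg')]
```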
So that's some thoughts (or maybe some mis-thoughts) about file-topic searching and storage.
Now, tell me what I've got wrong. :-)
All best,
James.
Just answering a few bits now.
2014-09-13 18:14 GMT+02:00 James Heald j.heald@ucl.ac.uk:
- Where will topics be stored ? *
On the question of where the list of topics will be stored, the initial thoughts of the Structured Data team would seem to be clear: they are to be stored on the new CommonsData wikibase.
See eg: https://commons.wikimedia.org/w/index.php?title=File%3AStructured_Data_-_Slides.pdf&page=17 ("topic links") https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjmQqs (API design and class diagram)
I read that as topics will be stored on Wikidata. That is, on Commons, you say that file DouglasAdams.jpg is about topic Q42, which is referring to an object on Wikidata. Everything about Q42 is stored on Wikidata.
* What about the natural hierarchical structure ? *
eg
Leonardo da Vinci --> Mona Lisa --> Files depicting the Mona Lisa
Shouldn't the fact that it was Leonardo that painted the Mona Lisa only be stored in one place, on Mona Lisa, (or perhaps on Leonardo); but *not* multiple times, separately on every single depiction file?
*A*: Probably not, for several reasons.
If the topics are on Wikidata, you will have this "for free", meaning that the hierarchy is already there, ready to be exploited.
/Jan Ainali
On 13/09/2014 18:15, Jan Ainali wrote:
2014-09-13 18:14 GMT+02:00 James Heald j.heald@ucl.ac.uk:
- Where will topics be stored ? *
On the question of where the list of topics will be stored, the initial thoughts of the Structured Data team would seem to be clear: they are to be stored on the new CommonsData wikibase.
See eg: https://commons.wikimedia.org/w/index.php?title=File%3AStructured_Data_-_Slides.pdf&page=17 ("topic links")
I read that as topics will be stored on Wikidata. That is, on Commons, you say that file DouglasAdams.jpg is about topic Q42, which is referring to an object on Wikidata. Everything about Q42 is stored on Wikidata.
Yes, I imagine you would store say
Q42182 (pointing to Buckingham Palace), probably with P180 ("depicts" -- as opposed to "signature of", or "chemical structure for").
But I suspect you would also store eg
Q16560 (palace) etc; even though this is implied by Buckingham Palace
- What about the natural hierarchical structure ? *
eg
Leonardo da Vinci --> Mona Lisa --> Files depicting the Mona Lisa
Shouldn't the fact that it was Leonardo that painted the Mona Lisa only be stored in one place, on Mona Lisa, (or perhaps on Leonardo); but *not* multiple times, separately on every single depiction file?
*A*: Probably not, for several reasons.
If the topics are on Wikidata, you will have this "for free", meaning that the hierarchy is already there, ready to be exploited.
Yes the hierarchy is there, ready to be exploited.
But exploiting it costs time.
The point I'm making in my post is that, especially when the user request is a combination search on two quite general topics, you don't want to be hanging around *waiting* while the system works out how to exploit it.
Instead you want the answer then and there -- and that means denormalisation.
-- James.
2014-09-13 20:15 GMT+02:00 James Heald j.heald@ucl.ac.uk:
Yes the hierarchy is there, ready to be exploited.
But exploiting it costs time.
The point I'm making in my post is that, especially when the user request is a combination search on two quite general topics, you don't want to be hanging around *waiting* while the system works out how to exploit it.
Instead you want the answer then and there -- and that means denormalisation.
Let the ops worry about time, I have not heard them complain about a search dystopia yet. Even the Wikidata Query has reasonable response time compared to the power it offers in the queries. And that is on wmflabs, not a production server. You're saying that even when we make the effort to get structured linked data we should not exploit the single most important advantage it offers. It does not make sense. It is almost like just repeating the category system again but with another software (albeit one that offers multilinguality).
/Jan
"Let the ops worry about time" is not an answer.
We're talking about something we're hoping to turn into a world-class mass-use image bank, and its front-line public-facing search capability.
That's on an altogether different scale to WDQ running a few hundred searches a day.
Moreover, we're talking about a public-facing search capability, where your user clicks a tag and wants an updated results set *instantly* -- sitting around while the server makes a cup of tea, or declares the query too complex and goes into a sulk, is not an option.
If the user wants a search on "palace" and "soldier", there simply is not time for the server to first recursively build a list of every palace it knows about, then every image related to each of those palaces, then every soldier it knows about, every image related to each of those soldiers, then intersect the two (very big) lists before it can start delivering any image hits at all. That is not acceptable. A random internet user wants those hits straight away.
The only way to routinely be able to deliver that is denormalisation.
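That is, with implied topics denormalised onto each file, the combination search collapses to intersecting pre-built sets -- one cheap operation per extra topic (toy Python sketch, data invented):

```python
# Because 'palace' has been denormalised onto every image of every palace,
# its postings set is already complete -- no recursive expansion at query time.
postings = {
    "palace":  {"buck1.jpg", "versailles2.jpg", "guard3.jpg"},
    "soldier": {"guard3.jpg", "parade1.jpg"},
}

def combined_search(*topics):
    """Intersect the pre-built postings sets, narrowing as each topic is added."""
    result = None
    for t in topics:
        hits = postings.get(t, set())
        result = hits if result is None else result & hits
    return result if result is not None else set()

print(sorted(combined_search("palace", "soldier")))   # -> ['guard3.jpg']
```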
It's not a question of just buying some more blades and filling up some more racks. That doesn't get you a big enough factor of speedup.
What we have is a design challenge, which needs a design solution.
-- James.
2014-09-13 21:51 GMT+02:00 James Heald j.heald@ucl.ac.uk:
What we have is a design challenge, which needs a design solution.
And I am just saying that you are jumping to conclusions, because there is no evidence saying that the servers that would be set up to handle this would not be able to handle the load.
Before restricting ourselves from the functionality that we really, really would like to have -- functionality that would avoid replicating the manual labour that we hate and want to get away from -- could we please let the ops chip in?
/Jan
Just a word of caution about collecting all images on Commons. A while ago, at least, some local wikis had images with license terms incompatible with Commons and vice versa. I recall very simple logos of companies, and several types of "fair use" derivatives.
If that is still so, we have an obstacle that may prevent us from both moving images, and even linking to them under some local laws.
Technically, I agree with the idea quoted below.
Purodha
"James Heald" j.heald@ucl.ac.uk wrote:
What I suspect is more likely, and probably makes more sense, is to converge the images themselves to all live in one place. So if the same fair-use image was used on multiple fair-use wikis, it would only be stored once (though each fair-use wiki would retain its own File page for it). Such a structure should also make transfers to Commons much easier -- compared to the copy-and-paste by bot at the moment, which loses all the file-page history and most of the upload history.
Yes.
Just to be clear, if we did "converge all the images to live in one place", I am not suggesting they would all be free, and I'm not suggesting they would all belong to Commons.
Just that they would all physically live in the same integrated structure; but one that would still appear to the external browser to have different 'partitions', corresponding to the different language wikis, each with a different base url.
(But inside the server all part of one integrated system, making it easy to move a file from a national partition to the Commons partition, or vice-versa -- *if* that was legally appropriate).
-- James.
2014-09-13 23:53 GMT+02:00 James Heald j.heald@ucl.ac.uk:
Just to be clear, if we did "converge all the images to live in one place", I am not suggesting they would all be free, and I'm not suggesting they would all belong to Commons.
If I can dream, I would say that we should only have Commons as file storage. All "local" exceptions would need to be deleted. Today I suspect that some users unknowingly make copyright infringements because we allow "local" uploads. For example, if a Swede in Sweden uploaded an image to enwp claiming American fair use, it would probably not be okay, since we do not have a similar law in Sweden (and since around 40% of the traffic from Sweden goes to enwp, Swedes would be expected as an audience).
But I realize I am in a minority here and will not pursue this further (but would cheer if WMF legal would).
/Jan