Hi everybody,
With the Structured Data for Commons project about to move into high gear, it seems to me that there's something the Wikidata community needs to have a serious discussion about, before APIs start getting designed and set in stone.
Specifically: when should an object have an item with its own Q-number created for it on Wikidata? What are the limits? (Are there any limits?)
The position so far seems to be essentially that a Wikidata item has only been created when an object either already has a fully-fledged Wikipedia article written for it, or reasonably could have.
So objects that aren't particularly notable typically have not had Wikidata items made for them.
Indeed, practically the first message Lydia sent to me when I started trying to work on Commons and Wikidata was to underline to me that Wikidata objects should generally not be created for individual Commons files.
But, if I'm reading the initial plans and API thoughts of the Multimedia team correctly, eg https://commons.wikimedia.org/w/index.php?title=File%3AStructured_Data_-_Sli... and https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjm...
there seems to be the key assumption that, for any image that contains information relating to something beyond the immediate photograph or scan, there will be some kind of 'original work' item on main Wikidata that the file page will be able to reference, such that the 'original work' Wikidata item will be able to act as a place to locate any information specifically relating to the original work.
Now in many ways this is a very clean division to be able to make. It removes any question of having to judge "notability"; and it removes any ambiguity or diversity of where information might be located -- if the information relates to the original work, then it will be stored on Wikidata.
But it would appear to imply a potentially *huge* increase in the inclusion criteria for Wikidata, and the number of Wikidata items potentially creatable.
So it seems appropriate that the Wikidata community should discuss and sign off just what should and should not be considered appropriate, before things get much further.
For example, a year ago the British Library released 1 million illustrations from out-of-copyright books, which increasingly have been uploaded to Commons. Recently the Internet Archive has announced plans to release a further 12 million, with more images either already uploading or to follow from other major repositories including eg the NYPL, the Smithsonian, the Wellcome Foundation, etc, etc.
How many of these images, all scanned from old originals, are going to need new Q-numbers for those originals? Is this okay? Or are some of them too much?
For example, for maps, cf this data schema https://docs.google.com/spreadsheets/d/1Hn8VQ1rBgXj3avkUktjychEhluLQQJl5v6WR... , each map sheet will have its own northernmost, southernmost, easternmost and westernmost bounding co-ordinates. Does that mean each map sheet should have its own Wikidata item?
For book illustrations, perhaps it would be enough just to reference the edition of the book. But if individual illustrations have their own artist and engraver details, does that mean the illustration needs to have its own Wikidata item? Similarly, if the same engraving has appeared in many books, is that also a sign that it should have its own Wikidata item?
What about old photographs, or old postcards, similarly? When should these have their own Wikidata item? If they have their own known creator, and creation date, then is it simplest just to give them a Wikidata item, so that such information about an original underlying work is always looked for on Wikidata? What if multiple copies of the same postcard or photograph are known, published or re-published at different times? But the potential number of old postcards and photographs, like the potential number of old engravings, is *huge*.
What if an engraving was re-issued in different "states" (eg a re-issued engraving of a place might have been modified if a tower had been built)? When should these get different items?
At https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Visual_arts#Wikidata... where I raised some of these issues a couple of weeks ago, there has even been the suggestion that particular individual impressions of an engraving might deserve their own separate items; or even everything with a separate accession number, so if a museum had three copies of an engraving, we would make three separate items, each carrying their own accession number, identifying the accession number that belonged to a particular File.
(See also other sections at https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Visual_arts for further relevant discussions on how to represent often quite complicated relations with Wikidata properties).
With enough items, we could re-create and represent essentially the entire FRBR tree.
We could do this. We may even need to do this, if the MM team's outline for Commons is to be implemented in its apparent current form.
But it seems to me that we shouldn't just sleepwalk into it.
It does seem to me that this represents (at least potentially) a *very* large expansion in the number of items, and a widening of the inclusion criteria, for what Wikidata is going to encompass.
I'm not saying it isn't the right thing to do, but given the potential scale of the implications, I do think it is something we do need to have properly worked through as a community, and confirmed that it is indeed what we *want* to do.
All best,
James.
(Note that this is a slightly different discussion, though related, to the one I raised a few weeks ago as to whether Commons categories -- eg for particular sets of scans -- should necessarily have their own Q-number on Wikidata. Or whether some -- eg some intersection categories -- should just have an item on Commons data. But it's clearly related: is the simplest thing just to put items for everything on Wikidata? Or does one try to keep Wikidata lean, and no larger than it absolutely needs to be; albeit then having to cope with the complexity that some categories would have a Q-number, and some would not.)
Hi James,
thanks for starting this conversation! It is indeed important and has been overlooked by us.
there seems to be the key assumption that, for any image that contains information relating to something beyond the immediate photograph or scan, there will be some kind of 'original work' item on main Wikidata that the file page will be able to reference, such that the 'original work' Wikidata item will be able to act as a place to locate any information specifically relating to the original work.
While we would like to keep track of some sort of "original work" entity when the file is a derived work, that entity doesn't necessarily have to be on Wikidata. When a Commons image is the derivative of another image, it makes more sense to refer to the CommonsData item of that image. One possibility would be to generalize that and allow data items on Commons which are not attached to any file but instead refer to some external work such as a Flickr image.
Gergo,
Thanks for this -- and hoping you have a very productive set of sessions, to all of you in Berlin this week.
Yes, where one has a derivative work of another Commons work (a restoration, or a cropping, say), I can see it makes sense to point to the CommonsData entry for that other work.
But I guess you need to ask yourselves how much of a chain you're prepared to walk, if that file in turn points to data on an underlying physical work. Do you extract the "original creator" through the chain, or link directly? And what if there are multiple original creators?
I don't know whether it makes sense to have "original work" items on CommonsData rather than WikiData, more generally. (And here I'm talking about an "original work" in the sense of an original physical old photograph, or map sheet, or manuscript folio). As someone has pointed out, there are issues about having to deal with things in two places, and questions whether it would still be findable, if for example one were trying to search WikiData for all objects created by a particular creator.
I think this is something we simply have to defer to you, the technical designers of the system, for your considered view on what is the best way forward.
But I would commend some of the discussions at https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Visual_arts to you, not least for some example file-cases that you might want to consider, for questions like:
* How to treat engravings from books?
* What "original works" should (or should not) get their own WikiData items?
* Could an edition item on Wikidata be enough to contain all the relevant information not held on CommonsData about an illustration?
* What if the same engraving appears in different books?
* If there are a number of pictures from a particular scan-set, should the scan-set have a Wikidata item? (Because there is information we may want to store about the scan-set, eg its source; and we may want to filter by it; and sort members of it according to a qualifier, numerical position in sequence)
Also I would very much commend to you the existing Wikidata schemas for artworks, for book editions, and for book works, as well as the work done by the old maps project:
https://www.wikidata.org/wiki/Wikidata:WikiProject_Visual_arts/Item_structure#Describing_individual_objects
https://www.wikidata.org/wiki/Wikidata:WikiProject_Books#Edition_item_properties
https://www.wikidata.org/wiki/Wikidata:WikiProject_Books#Work_item_properties
https://docs.google.com/spreadsheets/d/1Hn8VQ1rBgXj3avkUktjychEhluLQQJl5v6WRlI0LJho/edit#gid=0
which already go a long way towards creating a structure for storing information about many sorts of objects.
Finally, on a more general point, I would beg all of you in Berlin this week: don't despise wikitext.
It's easy to think that a shiny new system will displace everything. But there is lots of information, and tools, based on old-fashioned wikitext that it will need to integrate with for the foreseeable future. Wikitext is a straightforward API that a lot of content, and a lot of tools have been built on. So please do think how you can work with that, rather than simply deprecate it.
Vandal-fighting tools in particular are based on changes to Wikitext, so please do consider that it would be good to be able to represent changes in the WikiData or CommonsData in ways that tools based on existing wikitext file pages can pick up and if necessary revert.
Thanks, and all best for this week,
-- James.
On 05/10/2014 11:14, Gergo Tisza wrote:
[...]
Multimedia mailing list Multimedia@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/multimedia
Thanks for the pointers, James! I'll try to digest them.
Our thoughts on the issue of representing relationships between works are not fully formed yet, but the current idea is loosely that
* if the original work has a Wikidata item (according to whatever notability guidelines the community prefers), link to that
* otherwise if it is a Commons image, link to the local data item of that image
* otherwise representing the relationships in full detail is probably not that important, so it's fine to just add the authors of the originals as contributors to the CommonsData entry with some generic role such as "author of a source work", without trying to represent the accurate relationship between them.
So, if there is a chain of "derivative of" relationships between works which have Wikidata or CommonsData items, we can walk the chain upon extraction and collect the authors. Where the theoretical chain extends outside Wikidata+CommonsData, the actual (as stored in Wikibase) chain would have author information from the outlying nodes "squashed" into the edge nodes.
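To make the idea a bit more concrete, here is a rough sketch in Python of the three-step fallback and of walking the chain to collect authors. It is only an illustration under stated assumptions: the `Work` class, the `link_original` and `collect_authors` functions and the item identifiers are all made up for this example, and none of it is an actual Wikibase or CommonsData API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Work:
    """A work that has a data item, on Wikidata or on CommonsData."""
    item_id: str  # made-up identifier, for illustration only
    authors: List[str] = field(default_factory=list)
    # Authors of source works that have no item anywhere,
    # "squashed" into this node with a generic role.
    source_authors: List[str] = field(default_factory=list)
    derivative_of: Optional["Work"] = None


def link_original(derived: Work, original_authors: List[str],
                  wikidata_item: Optional[Work] = None,
                  commons_item: Optional[Work] = None) -> None:
    """Apply the three-step fallback to one "derivative of" relationship."""
    if wikidata_item is not None:
        derived.derivative_of = wikidata_item   # 1. link to the Wikidata item
    elif commons_item is not None:
        derived.derivative_of = commons_item    # 2. link to the local data item
    else:
        # 3. nothing to link to: only record the authors of the original
        derived.source_authors.extend(original_authors)


def collect_authors(work: Work) -> List[str]:
    """Walk the "derivative of" chain and collect authors, nearest first."""
    found: List[str] = []
    visited = set()  # guards against accidental loops in the chain
    node: Optional[Work] = work
    while node is not None and node.item_id not in visited:
        visited.add(node.item_id)
        for name in node.authors + node.source_authors:
            if name not in found:
                found.append(name)
        node = node.derivative_of
    return found


# A restoration on Commons, of a scan on Commons, of an engraving with no item.
scan = Work("scan", authors=["Scanning institution"])
link_original(scan, ["Engraver", "Sketching artist"])
restored = Work("restored", authors=["Commons user"])
link_original(restored, [], commons_item=scan)
print(collect_authors(restored))
```

In this sketch the engraver and the sketching artist survive only as names on the scan's item, which is the "squashing" described above: the relationship between them is not recorded anywhere.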
On Mon, Oct 6, 2014 at 11:08 AM, James Heald j.heald@ucl.ac.uk wrote:
[...]
Gergo
One of the big advantages of CommonsData over wikitext is that CommonsData is internationalised and ready for localisation.
For this reason alone I believe it is worth looking closely at all wikitext to see if it can be expressed as a CommonsData statement.
Joe
On 10 Oct 2014 17:09, "Gergo Tisza" gtisza@wikimedia.org wrote:
[...]
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
One case that particularly comes to mind is where we have multiple different scans of the same work -- eg we have multiple (incomplete) sets of the early 1800s colour engravings from Ackermann's Microcosm of London, or Pyne's Royal Palaces, or Audubon's Birds of America etc.
It seems a shame not to be able to abstract the duplicated information between different scans -- eg the creatorship, the publication history, the topic list of items depicted -- given that they are versions of the same work.
However, if the different scans have been made independently, there is no chain of derivation between them. And - probably - the individual engravings would not pass WD notability, so would not have separate items, though the book they were collected in probably would.
So it doesn't seem that there would be an item on which to store the data that would be common between the different versions of the image.
(Similarly, multiple reproductions of the same vintage photograph, etc).
Perhaps there might be a case for CommonsData items for works that belong to a sequence, where the sequence has an item on Wikidata?
Or perhaps they should just have items on Wikidata?
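As a rough illustration of what such a shared item would buy us, here is a small Python sketch in which two independently made scans point at one record for the underlying engraving, so that the common data is stored once. Everything here is hypothetical: `SharedWork`, `FileItem`, the `version_of` link, the statement names and the placeholder values are invented for the example, and it deliberately leaves open whether the shared record would live on Wikidata or on CommonsData.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SharedWork:
    """Hypothetical item for the underlying engraving (wherever it is stored)."""
    work_id: str
    statements: Dict[str, str] = field(default_factory=dict)


@dataclass
class FileItem:
    """Hypothetical per-file data item on Commons."""
    filename: str
    statements: Dict[str, str] = field(default_factory=dict)
    version_of: Optional[SharedWork] = None


def effective_statements(item: FileItem) -> Dict[str, str]:
    """Combine the shared statements with the file's own ones."""
    merged = dict(item.version_of.statements) if item.version_of else {}
    merged.update(item.statements)  # file-specific values take precedence
    return merged


# Placeholder values only; no real catalogue data is implied.
plate = SharedWork("example-plate", {
    "published in": "Microcosm of London",
    "engraver": "(placeholder engraver)",
})
scan_a = FileItem("Scan A.jpg", {"source": "Institution A"}, version_of=plate)
scan_b = FileItem("Scan B.jpg", {"source": "Institution B"}, version_of=plate)

print(effective_statements(scan_a))
print(effective_statements(scan_b))
```

Without the shared record, the "published in" and "engraver" statements would have to be repeated on both file items and kept in sync by hand.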
-- James.
On 10/10/2014 17:08, Gergo Tisza wrote:
[...]
I think the place for all data about an image should be Wikidata. It will be trivial to update a Wikidata item with an image when that image becomes available on Commons. Until that time, the item can point to a catalog's online or offline entry where the image can be viewed. I am thinking for example of a Salvador Dali work that cannot be included on Wikipedia due to copyright constraints. In this case the catalog entry at least points the user in a useful direction.
On Fri, Oct 10, 2014 at 7:33 PM, James Heald j.heald@ucl.ac.uk wrote:
[...]
On 13.10.2014 00:17, Jane Darnell wrote:
I think the place for all data about an image should be Wikidata.
Do you really mean *any* image?
E.g., if we have a scan of an old book with 50 engravings, do you want to make a wikidata item for each engraving? Or just for the book? Engravings are often simple illustrations, not notable in and of themselves, and there is frequently very little we can say about them, except for which book they were published in.
It seems to me that it makes more sense to just model the book on Wikidata, not each illustration (or even every page, including the text-only ones, in case they are extracted to a png file or something).
-- daniel
On 13/10/2014 13:03, Daniel Kinzler wrote:
[...]
Thinking about books of engravings, eg a set like this: https://commons.wikimedia.org/wiki/Category:Views_of_the_Seats_of_Noblemen_a...
There is a fair amount one can say about each of these engravings: what the subject is; and where that location is; who was the artist, and who was the engraver; when the engraving was first published (which may or may not be the same as the date at which it was first collected).
We probably also want to identify the *edition* of the book it was taken from, and probably also the scan-set -- each with a page number or sequence number, so the set can be easily retrieved and displayed in the right order.
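To illustrate the retrieval side of this, here is a minimal Python sketch of membership statements that carry a sequence number, so that a scan-set can be pulled out and displayed in order. The `Membership` record, its field names and the file names are assumptions made up for the example, not an existing Wikidata or Commons property.

```python
from typing import List, NamedTuple


class Membership(NamedTuple):
    """Hypothetical "part of scan-set" statement with a sequence qualifier."""
    filename: str
    scan_set: str
    sequence: int  # page number or position within the scan-set


def files_in_order(memberships: List[Membership], scan_set: str) -> List[str]:
    """Return the files of one scan-set, sorted by their sequence number."""
    members = [m for m in memberships if m.scan_set == scan_set]
    return [m.filename for m in sorted(members, key=lambda m: m.sequence)]


memberships = [
    Membership("Plate C.jpg", "scan-set-1", 3),
    Membership("Plate A.jpg", "scan-set-1", 1),
    Membership("Other.jpg", "scan-set-2", 1),
    Membership("Plate B.jpg", "scan-set-1", 2),
]
print(files_in_order(memberships, "scan-set-1"))
```

The same shape would work for membership of an edition; a file could then carry two such statements, one for the edition and one for the scan-set, each with its own sequence number.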
In terms of items required, at the moment membership of a scan-set or an edition of the book might be handled by membership of a category. It's not clear how such categories and their memberships are intended to be represented in the new structured approach. Does one associate the scan-set item directly with a category? Or is the scan-set item its own thing, that one maps the category onto? And is the scan-set an item on Wikidata, or an item somewhere else?
A further issue arises when we have more than one copy of the same engraving.
eg:
https://commons.wikimedia.org/wiki/File:Neale%281818%29_p6.190_-_Fleurs,_Rox...
https://www.wikidata.org/wiki/File:MA%281829%29_p.340_-_Fleurs_-_John_Presto...
At the moment on Commons one can make a gallery of "other versions" on the filepage, each with a short footer to explain what that version is.
So it probably makes sense to be able to record that we have multiple representations or versions of the same basic thing, which presumably means some kind of object to represent that basic thing - here an engraving.
Turning to Gergo's model of "squashing" all of the information onto a limited number of nodes (ie an item per file, plus some floating items on Wikidata), and just making information into properties of those items, I think there is a problem.
The specific thing is that we want to associate various properties together, as all being tied to a particular stage of development of the work -- ie a distinguishable "work" entity, in the language of the draft "Multimedia data model" API at https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjm...
In particular, in the case of rights information, we need to carefully associate the rights information with the other fields it relates to: the author, the date, the nature of the contribution, the act of licensing or release or assessment.
This is tricky because there may be multiple "stages of development" associated with a single file, each with its own author/date/contribution/license information. Yet there may nevertheless only be the one file on Commons.
Even if the image has been 'restored' by a Commons user, this will not necessarily generate a separate file -- standard practice for many restorers (myself included) is to upload the restored version over the previous version, so the reader can easily compare the two by looking at the file history (and access an earlier version to download, if they so wish).
(Another example could be where we may want to associate a particular music file of a piece of classical music with a particular modern edition of the score, even if the piece was originally from the 18th century. Even if the only file we have is the recording, we still need to be able to reflect the rights in the score.)
Another important class of data is date information. There may be multiple dates associated with an image -- and we may want to sort, or filter, or order by any of them. But to be meaningful, we don't really want to associate the dates with the image, but rather with a stage of development in the derivative chain that has led to the image. So again, the idea of what the API calls the "work" comes forward, but again there cannot be presumed to be a unique 1 <-> 1 correspondence between a "work" in this sense and any image on Commons, nor (unless decreed otherwise) an item on Wikidata.
I don't know the right way to go forward, which is why I started this thread.
On the one hand, I'd like to avoid if possible a vast multiplication of items on Wikidata, for all the reasons I brought up a couple of months ago, when I wondered whether there should be an item created on Wikidata for every present Commons Category -- something which made me uneasy.
But on the other hand, there is a huge virtue in consistency -- on there being a particular place where you know a particular piece of information will be (if it exists); rather than there being a complexity of multiple places it could be, depending on whether this has an item or not, or that has an item or not, or the other.
I think something we definitely do need is worked-through examples of how data might be stored for some quite complicated cases, for people to be able to discuss and critique, rather than only the most simple type of cases discussed so far.
So, for example, suppose we had as a particular test-case the following:
An image that has been enhanced & overwritten by a User in 2014 -- based on a scan from a set made and released by an Institution in 2012 -- of an engraving published in an 1850s book -- but created and first published in the 1830s, by an engraver after a sketching artist -- after an oil painting (since destroyed) painted by an important painter in the 1540s.
How in detail do we think that might be stored, identifying the different contributors and dates and contributions, so one could sort by
* (a) contributor and the nature of their contribution
  -- eg best surviving representation of every known painting associated with Holbein.
* (b) date and the nature of the contribution
  -- eg best surviving representation of every known painting made in the 1540s
  -- eg engravings first published in the 1830s
I don't know what the way forward is, but I think this is the kind of information we ought to be able to represent, and the kind of sorting we ought to be able to do.
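As a starting point for discussion, the test case above could be sketched roughly as follows. This is only an illustration in Python: the `Stage` and `FileRecord` classes, the field names, the `find_files` helper and the placeholder contributor names are all assumptions made for the example, not part of any proposed Wikidata or Commons data model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Stage:
    """One stage in the derivative chain behind a file (hypothetical structure)."""
    kind: str                              # e.g. "painting", "engraving", "scan"
    year: Optional[int]                    # first year of the decade if only a decade is known
    contributions: List[Tuple[str, str]]   # (contributor, role) pairs


@dataclass
class FileRecord:
    name: str
    chain: List[Stage]                     # oldest stage first


# The test case above; every name here is a placeholder.
example = FileRecord(
    name="Example_engraving.jpg",
    chain=[
        Stage("painting", 1540, [("Important Painter", "painter")]),
        Stage("engraving", 1830, [("Sketching Artist", "artist"),
                                  ("Engraver", "engraver")]),
        Stage("book edition", 1850, []),
        Stage("scan", 2012, [("Institution", "scanner")]),
        Stage("restoration", 2014, [("User", "restorer")]),
    ],
)


def find_files(files, kind, contributor=None, decade=None):
    """Return names of files whose chain has a stage of `kind` matching the filters."""
    hits = []
    for f in files:
        for s in f.chain:
            if s.kind != kind:
                continue
            if contributor is not None and contributor not in [c for c, _ in s.contributions]:
                continue
            if decade is not None and (s.year is None or not decade <= s.year < decade + 10):
                continue
            hits.append(f.name)
            break  # one matching stage is enough for this file
    return hits


# (a) by contributor and nature of contribution; (b) by date and nature of contribution
by_painter = find_files([example], "painting", contributor="Important Painter")
engraved_1830s = find_files([example], "engraving", decade=1830)
```

The point of the sketch is that each date and each contributor hangs off a stage rather than off the file, so the same file can be found both as a painting of the 1540s and as an engraving of the 1830s.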
-- James.
Daniel, and Gergo,
I've been thinking some more about Daniel's replies in IRC chat last week, about using qualifiers to handle underlying works that images are derived from, if the underlying work isn't either (i) another image, or (ii) something with its own Wikidata item
https://tools.wmflabs.org/meetbot/wikimedia-office/2014/wikimedia-office.201...
I think the issue I'm stuck on is: what property would the qualifier be attached to ?
For sets of images to have in mind, we might consider
https://commons.wikimedia.org/wiki/Category:Twenty-four_Views_by_Henry_Salt_...
https://commons.wikimedia.org/wiki/Category:Pyne%27s_Royal_Residences
The first choice might be attaching the information to a "Creator" property.
But for the underlying works of these engravings, there are typically *two* creators, both of which are significant -- the artist, and the engraver.
So instead, we might consider an "Underlying work" property, analogous to the "Work" class in the Multimedia API development, "a creation to which copyright, authorship, etc is attached", as per
https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjm...
But can we then capture the whole of the work class in such a property?
There seem to me to be a couple of issues:
(1) What should be the value of the property? There doesn't seem to be an obvious choice (eg if one were importing from a repository or catalogue). What would be the datatype, and what should we store for this field?
(2) It seems to me that we would need to enable qualifiers on qualifiers -- for example, if we represented the creator of an underlying engraving using a qualifier, we would then seem to need another qualifier to indicate whether the role was as artist, or as engraver.
Similarly, if there is sourcing, there are sources that might apply to one (1st level) qualifier, but not another. But normally the WD sourcing model is for a whole statement, not part of it.
What we would really be doing, if we did this in full, would be in effect to store the contents of what might otherwise be an entire item in a property.
That has some attractiveness, if at a future time one wanted to promote the 'underlying work' to have a Wikidata item in its own right -- the two structures would then match exactly.
But it would mean CommonsData having a slightly different data structure to Wikidata.
It's maybe worth thinking also about what happens if information is sometimes stored on the Commons file item, and sometimes on a Wikidata item.
For example, if we are looking for views similar to the categories above, we might be searching for
* Best version of each engraving from the book-title, ordered by sequence number; or
* Best version of each engraving from the book-edition, ordered by page number; or
* Best version of each engraving from a scan-set, ordered by scan number
Alternatively we might want to sort by artist; or engraver; or date of first publication (the engravings were often issued first as individual prints, or partwork sets, sometimes well before publication of a final volume).
If we're looking to support these searches and orderings, does it matter that a particular field may sometimes be on the file item, but sometimes on a Wikidata item ?
(For example, in the Henry Salt set, suppose we were to have the policy that engravings we only have one copy of only get a file item, but engravings we have multiple copies of get a Wikidata item to store their common information.
Would it matter that for one of the engravings we have two copies, so the information that we would be wanting for search and selection and ordering would be stored on a Wikidata item; whereas for the rest, with only a single copy, it would be stored on a Commons item? )
None of these questions are without solutions. But it does, I think, require a decisive view to be reached, as to what we propose to do.
Thanks for all your work on this,
All best,
James.
On 16/10/2014 18:56, James Heald wrote:
On 13/10/2014 13:03, Daniel Kinzler wrote:
Am 13.10.2014 00:17, schrieb Jane Darnell:
I think the place for all data about an image should be Wikidata.
Do you really mean *any* image?
E.g., if we have a scan of an old book with 50 engravings, do you want to make a wikidata item for each engraving? Or just for the book? Engravings are often simple illustrations, not notable of and by themselves, and there is frequently very little we can say about them, except for which book they were published in.
It seems to me that it makes more sense to just model the book on Wikidata, not each illustration (or even every page, including the text-only ones, in case they are extracted to a png file or something).
Thinking about books of engravings, eg a set like this: https://commons.wikimedia.org/wiki/Category:Views_of_the_Seats_of_Noblemen_a...
There is a fair amount one can say about each of these engravings: what the subject is; and where that location is; who was the artist, and who was the engraver; when the engraving was first published (which may or may not be the same as the date at which it was first collected).
We probably also want to identify the *edition* of the book it was taken from, and probably also the scan-set -- each with a page number or sequence number, so the set can be easily retrieved and displayed in the right order.
In terms of items required, at the moment membership of a scan-set or an edition of the book might be handled by membership of a category. It's not clear how it is intended to represent such categories and their memberships in the new structured approach. Does one associate the scanset item directly with a category? Or is the scanset item its own thing, that one maps the category onto? And is the scanset an item on Wikidata, or an item somewhere else?
A further issue arises when we have more than one copy of the same engraving.
eg:
https://commons.wikimedia.org/wiki/File:Neale%281818%29_p6.190_-_Fleurs,_Rox...
https://www.wikidata.org/wiki/File:MA%281829%29_p.340_-_Fleurs_-_John_Presto...
At the moment on Commons one can make a gallery of "other versions" on the filepage, each with a short footer to explain what that version is.
So it probably makes sense to be able to record that we have multiple representations or versions of the same basic thing, which presumably means some kind of object to represent that basic thing - here an engraving.
Turning to Gergo's model of "squashing" all of the information onto a limited number of nodes (ie an item per file, plus some floating items on Wikidata), and just making information into properties of those items, I think there is a problem.
The specific thing is that we want to associate various properties together, as all being tied to a particular stage of development of the work -- ie a distinguishable "work" entity, in the language of the draft "Multimedia data model" API at https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjm...
In particular, in the case of rights information, we need to carefully associate the rights information with the other fields it relates to: the author, the date, the nature of the contribution, the act of licensing or release or assessment.
This is tricky because there may be multiple "stages of development" associated with a single file, each with its own author/date/contribution/license information. Yet there may nevertheless only be the one file on Commons.
Even if the image has been 'restored' by a Commons user, this will not necessarily generate a separate file -- standard practice for many restorers (myself included) is to upload the restored version over the previous version, so the reader can easily compare the two by looking at the file history (and access an earlier version to download, if they so wish).
(Another example could be where we may want to associate a particular music file of a piece of classical music with a particular modern edition of the score, even if the piece was originally from the 18th century. Even if the only file we have is the recording, we still need to be able to reflect the rights in the score.)
Another important class of data is date information. There may be multiple dates associated with an image -- and we may want to sort, or filter, or order by any of them. But to be meaningful, we don't really want to associate the dates with the image, but rather with a stage of development in the derivative chain that has led to the image. So again, the idea of what the API calls the "work" comes forward, but again there cannot be presumed to be a bi-directionally unique 1 <-> 1 identity between a "work" in this sense and any image on Commons, nor (unless decreed otherwise) an item on Wikidata.
I don't know the right way to go forward, which is why I started this thread.
On the one hand, I'd like to avoid if possible a vast multiplication of items on Wikidata, for all the reasons I brought up a couple of months ago, when I wondered whether there should be an item created on Wikidata for every present Commons Category -- something which made me uneasy.
But on the other hand, there is a huge virtue in consistency -- on there being a particular place where you know a particular piece of information will be (if it exists); rather than there being a complexity of multiple places it could be, depending on whether this has an item or not, or that has an item or not, or the other.
I think something we definitely do need is worked-through examples of how data might be stored for some quite complicated cases, for people to be able to discuss and critique, rather than only the most simple type of cases discussed so far.
So, for example, suppose we had as a particular test-case the following:
An image that has been enhanced & overwritten by a User in 2014 -- based on a scan from a set made and released by an Institution in 2012 -- of an engraving published in an 1850s book -- but created and first published in the 1830s, by an engraver after a sketching artist -- after an oil painting (since destroyed) painted by an important painter in the 1540s.
How in detail do we think that might be stored, identifying the different contributors and dates and contributions, so one could sort by
- (a) contributor and the nature of their contribution
-- eg best surviving representation of every known painting associated with Holbein.
- (b) date and the nature of the contribution
-- eg best surviving representation of every known painting made in the 1540s -- eg engravings first published in the 1830s
I don't know what the way forward is, but I think this is the kind of information we ought to be able to represent, and the kind of sorting we ought to be able to do.
-- James.
Am 24.10.2014 02:17, schrieb James Heald:
I think the issue I'm stuck on is: what property would the qualifier be attached to ?
...
The first choice might be attaching the information to a "Creator" property.
I would prefer "Contributor", but yea, something like that.
But for the underlying works of these engravings, there are typically *two* creators, both of which are significant -- the artist, and the engraver.
You can have any number of Statements about a Property, and each of these Statements has its own set of Qualifiers (and Source References). E.g.

Contributor: Henry Foo
    Point in time: 1872
    Role: Engraver

Contributor: Melissa Bar
    Point in time: 1870
    Role: Illustrator
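To make the shape of this concrete, here is a minimal sketch in Python of two top-level Contributor statements, each carrying its own qualifiers. The `Statement` class and the `contributors_with_role` helper are hypothetical names chosen for illustration; they are not Wikibase's actual classes or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Statement:
    """A single statement: one property, one value, its own qualifiers (illustrative only)."""
    prop: str
    value: str
    qualifiers: Dict[str, str] = field(default_factory=dict)


# Two separate statements for the same property, as in the example above.
statements: List[Statement] = [
    Statement("Contributor", "Henry Foo",
              {"Point in time": "1872", "Role": "Engraver"}),
    Statement("Contributor", "Melissa Bar",
              {"Point in time": "1870", "Role": "Illustrator"}),
]


def contributors_with_role(stmts: List[Statement], role: str) -> List[str]:
    """Return the values of Contributor statements whose Role qualifier matches."""
    return [s.value for s in stmts
            if s.prop == "Contributor" and s.qualifiers.get("Role") == role]


engravers = contributors_with_role(statements, "Engraver")
```

Because the role is a qualifier on each statement rather than on another qualifier, no nesting is needed to tell the engraver from the illustrator.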
So instead, we might consider an "Underlying work" property, analogous to the "Work" class in the Multimedia API development, "a creation to which copyright, authorship, etc is attached", as per
https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjm...
But can we then capture the whole of the work class in such a property?
No. Using "Underlying work" (or, as I would prefer to call it, "Derivative of"), the Work has to be modeled as an Entity in its own right - either a Wikidata Item or a MediaInfo entity.
There seem to me to be a couple of issues: (1) What should be the value of the property? There doesn't seem to be an obvious choice (eg if one were importing from a repository or catalogue). What would be the datatype, and what should we store for this field?
It would be a reference to another Entity. Only the ID would be stored.
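A small sketch of what "only the ID would be stored" could look like, in Python. The entity IDs (`M1`, `Q1001`, `Q1002`), the `derivative_of` key and the `derivation_chain` helper are all made up for this illustration and do not describe an existing schema.

```python
# Hypothetical entity store: a MediaInfo-style entity for the file and
# Wikidata-style items for the works it derives from. IDs are invented.
entities = {
    "M1": {"label": "Scan of engraving (file on Commons)", "derivative_of": "Q1001"},
    "Q1001": {"label": "Engraving", "derivative_of": "Q1002"},
    "Q1002": {"label": "Oil painting"},
}


def derivation_chain(entity_id, store):
    """Follow 'derivative_of' references from an entity back to the original work."""
    chain = []
    seen = set()  # guards against accidental reference loops
    while entity_id is not None and entity_id not in seen:
        seen.add(entity_id)
        chain.append(entity_id)
        entity_id = store.get(entity_id, {}).get("derivative_of")
    return chain


chain = derivation_chain("M1", entities)
```

Each entity holds only the ID of the work it derives from; everything else about that work lives on the referenced entity.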
(2) It seems to me that we would need to enable qualifiers on qualifiers -- for example, if we represented the creator of an underlying engraving using a qualifier, we would then seem to need another qualifier to indicate whether the role was as artist, or as engraver.
See above: there is no need for this, since we can have any number of "top level" Creator/Contributor entries.
In some cases, the contributor's role may be implicit by using a more specific Property, like Painter, Director, etc.
Similarly, if there is sourcing, there are sources that might apply to one (1st level) qualifier, but not another. But normally the WD sourcing model is for a whole statement, not part of it.
They would apply to one *Statement* but not the other:
Contributor: Henry Foo
    ...
    Reference:
        Title: Detailed Research On That Book
        DOI: ...
    Reference:
        Title: My Art Website
        URL: ...

Contributor: Melissa Bar
    ...
    Reference:
        Title: Awesome Art Book
        Author: R.N. Dewy
        ISBN: ...
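The same idea as a rough Python sketch, with references attached to each statement separately. The dictionary layout and the `reference_titles` helper are assumptions for illustration only; the elided DOI, URL and ISBN values are left elided.

```python
# Each statement carries its own list of references (illustrative layout).
statements = [
    {
        "property": "Contributor",
        "value": "Henry Foo",
        "references": [
            {"Title": "Detailed Research On That Book", "DOI": "..."},
            {"Title": "My Art Website", "URL": "..."},
        ],
    },
    {
        "property": "Contributor",
        "value": "Melissa Bar",
        "references": [
            {"Title": "Awesome Art Book", "Author": "R.N. Dewy", "ISBN": "..."},
        ],
    },
]


def reference_titles(stmts, contributor):
    """List the titles of the references that support one contributor's statement."""
    titles = []
    for s in stmts:
        if s["property"] == "Contributor" and s["value"] == contributor:
            titles.extend(ref["Title"] for ref in s["references"])
    return titles


foo_sources = reference_titles(statements, "Henry Foo")
```

A source for the engraver therefore never has to be read as a source for the illustrator, since the two sit on different statements.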
What we would really be doing, if we did this in full, would be in effect to store the contents of what might otherwise be an entire item in a property.
If we have that much relevant information, it might be worth creating a data item. Especially if we end up repeating that info for multiple files (e.g. engravings from the same book).
This can and should be decided on a case by case basis. Just like on Wikipedia, it makes sense to create a separate Article when a section of some more general article grows too big.
That has some attractiveness, if at a future time one wanted to promote the 'underlying work' to have a Wikidata item in its own right -- the two structures would then match exactly.
But it would mean CommonsData having a slightly different data structure to Wikidata.
Slightly different isn't a problem, but the ability to "nest" entities and/or qualifiers is a fundamental structural incompatibility. That's not good.
...
If we're looking to support these searches and orderings, does it matter that a particular field may sometimes be on the file item, but sometimes on a Wikidata item ?
Searching (or rather: querying) across both datasets at once would be nice, but that's pretty far off. First, we need decent query capabilities for the individual datasets.
I would imagine that for all files based on a specific book, the same approach would be chosen (e.g. a Wikidata Item for the Book, and MediaInfo for each file).
Note that Queries are different from Searches. Searches are ranked and potentially open-ended. Queries have a definite result set, and may be sorted. Queries will (in the future) be pre-defined and cached, and can be used on wiki pages via Lua, to create a list or table based on whatever logic you like. On that level, it would also be possible to combine information from two repositories (Wikidata and Commons), but at that point, we are talking about proper programming in Lua.
Would it matter that for one of the engravings we have two copies, so the information that we would be wanting for search and selection and ordering would be stored on a Wikidata item; whereas for the rest, with only a single copy, it would be stored on a Commons item? )
It would be tricky to manage this nicely for the general case. For your specific book, you may write some specialized Lua code that deals with this.
However, I would not recommend creating a data item just because you have two files in a single case. If the relevant data is not too extensive, it's fine to duplicate it.
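For the mixed case, the kind of specialized code meant here might look roughly like the following. It is written in Python rather than Lua purely as a sketch, and the record layout, the `item` key, the file names and the `resolve` helper are all hypothetical.

```python
# Hypothetical data: one engraving has a shared Wikidata-style item because
# two copies exist; the others keep their data on the file record itself.
items = {
    "Q1": {"engraver": "Henry Foo", "plate": 3},
}

files = [
    {"name": "Plate_2.jpg", "data": {"engraver": "Henry Foo", "plate": 2}},
    {"name": "Plate_3_copy_b.jpg", "item": "Q1", "data": {}},
    {"name": "Plate_1.jpg", "data": {"engraver": "Henry Foo", "plate": 1}},
    {"name": "Plate_3_copy_a.jpg", "item": "Q1", "data": {}},
]


def resolve(file_record, field_name, item_store):
    """Look for a field on the file record first, then on the item it references."""
    if field_name in file_record["data"]:
        return file_record["data"][field_name]
    item_id = file_record.get("item")
    if item_id is not None:
        return item_store.get(item_id, {}).get(field_name)
    return None


# Order by plate number wherever it is stored; ties are broken by file name.
ordered = sorted(files, key=lambda f: (resolve(f, "plate", items), f["name"]))
ordered_names = [f["name"] for f in ordered]
```

The sort itself does not care where the plate number lives; the cost is that every field lookup has to go through a resolver like this one.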
None of these questions are without solutions. But it does, I think, require a decisive view to be reached, as to what we propose to do.
I think there are two main parts to your questions:
a) How to model contributions without modeling all the "base works" separately. I think multiple Contributor statements with separate lists of qualifiers and source references cover this.
b) How to best integrate the information that lies partially on Wikidata, and partially on Commons. This is indeed tricky, and perhaps there is no general, one-size-fits-all solution.
One thing that may help is the planned "high level media info API", which provides license/attribution/legal information about files in a unified form, drawing from structured data both on Commons and Wikidata.
Hi Daniel, thanks for getting back to me.
A couple of quick points, while I digest in more detail what you've written:
(1) I don't think that one can say a-priori that for all files based on a specific book, the same approach would be chosen.
If you look at either of the categories I cited, you can see we have files being contributed by multiple different uploaders, from multiple different sources:
https://commons.wikimedia.org/wiki/Category:Twenty-four_Views_by_Henry_Salt_...
https://commons.wikimedia.org/wiki/Category:Pyne%27s_Royal_Residences
Some have uploaded a single file, some have uploaded multiple files. There's no reason to assume they would all adopt the same approach without strong guidance, and even then it would be questionable.
(2) It's really important to link together all the people that made a contribution, what that contribution is, when it was made, and what rights there are on it.
I don't think this can be done using qualifiers on a contributor property, because there may be more than one contributor involved.
(3) If we make it hard to move work-information easily, and as a unit, between file-data and item-data we're going to really make things difficult.
(4) It is essential to be able to order the hit-sets of searches, eg analogues of the current category view, and there are likely to be a number of different standard orderings that should be available.
If this requires coping with the fact that some of the information will be stored in file-data and some will be on Wikidata items referenced from file-data, that needs to be designed in right from the start as a basic requirement.
Cheers,
James.
I'm not sure if this quite fits here but it's related.
A few months ago I went to a meeting of natural history organisations in the UK; they were looking for a way of creating a centralised directory of specimens held in different institutions in the UK.
Wikidata seems like a possible place for this to happen: for each species there could be a record of where specimens are held. However, the number of organisations holding specimens would vary greatly from species to species, as would the types of specimen, e.g. jaw bones or whole skeletons. I also wonder whether this would include other organisations such as zoos, where the specimens would be alive.
Any thoughts would be welcome
Thanks
John
On 24 October 2014 22:09, James Heald j.heald@ucl.ac.uk wrote:
Hi Daniel, thanks for getting back to me.
A couple of quick points, which I digest in more detail what you've written:
(1) I don't think that one can say a-priori that for all files based on a specific book, the same approach would be chosen.
If you look at either of the categories I cited, you can see we have files being contributed by multiple different uploaders, from multiple different sources:
https://commons.wikimedia.org/wiki/Category:Twenty-four_ Views_by_Henry_Salt_%281809%29
https://commons.wikimedia.org/wiki/Category:Pyne%27s_Royal_Residences
Some have uploaded a single file, some have uploaded multiple files. There's no reason to assume they would all adopt the same approach without strong guidance, and even then it would be questionable.
(2) It's really important to link together all the people that made a contribution, what that contribution is, when it was made, and what rights there are on it.
I don't think this can be done using qualifiers on a contributor property, because there may be more than one contributor involved.
(3) If we make it hard to move work-information easily, and as a unit, between file-data and item-data we're going to really make things difficult.
(4) It is essential to be able to order the hit-sets of searches, eg analogues of the current category view, and there are likely to be a number of different standard orderings that should be available.
If this requires coping with the fact that some of the information will be stored in file-data and some will be on Wikidata items referenced from file-data, that needs to be designed in right from the start as a basic requirement.
Cheers,
James.
On 24/10/2014 20:51, Daniel Kinzler wrote:
Am 24.10.2014 02:17, schrieb James Heald:
I think the issue I'm stuck on is: what property would the qualifier be attached to ?
...
The first choice might be attaching the information to a "Creator" property.
I would prefer "Contributor", but yea, something like that.
But for the underlying works of these engravings, there are typically
*two* creators, both of which are significant -- the artist, and the engraver.
You can have any number of Statements about a Property, and each of these Statement has it's own set of Qualifiers (and Source References). E.g.
Contributor: Henry Foo Point in time: 1872 Role: Engraver
Contributor: Melissa Bar Point in time: 1870 Role: Illustrator
So instead, we might consider an "Underlying work" property, analogous
to the "Work" class in the Multimedia API development, "a creation to which copyright, authorship, etc is attached", as per
https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf 9EkqdJ0zVjmQqs/edit#heading=h.akjw1xj0kfpf
But can we then capture the whole of the work class in such a property?
No. Using "Underlying work" (or, as I would prefer to call it "Derivative of"), the Work has to be modeled as an Entity in it's own right - either a Wikidata Item or a MediaInfo entity.
There seem to me a couple of issues:
(1) What should be the value of the property? There doesn't seem to be an obvious choice (eg if one were importing from a repository or catalogue). What would be the datatype, and what should we store for this field.
It would be a reference to another Entity. Only the ID would be stored.
(2) It seems to me that we would need to enable qualifiers on qualifiers
-- for example, if we represented the creator of an underlying engraving using a qualifier, we would then seem to need another qualifier to indicate whether the role was as artist, or as engraver.
See above: there is no need for this, since we can have any number of "top level" Creator/Contributor entries.
In some cases, the contributor's role may be implicit by using a more specific Property, like Painter, Director, etc.
Similarly, if there is sourcing, there are sources that might apply to
one (1st level) qualifier, but not another. But normally the WD sourcing model is for a whole statement, not part of it.
They would apply to one *Statement* but not the other:
Contributor: Henry Foo ... Reference: Title: Detailed Research On That Book DOI: ... Reference: Title: My Art WEbsite URL: ...
Contributor: Melissa Bar ... Reference: Title: Awesome Art Book Author: R.N. Dewy ISBN: ...
What we're would really be doing, if we did this in full, would be in
effect to store the contents of what might otherwise be an entire item in a property.
If we have that much relevant information, it might be worth creating a data item. Especially if we end up repeating that info for multiple files (e.g. engravings from the same book).
This can and should be decided on a case by case basis. Just like on Wikipedia, it makes sense to create a separate Article when a section of some more general article grows too big.
That has some attractiveness, if at a future time one wanted to promote
the 'underlying work' to have a Wikidata item in its own right -- the two structures would then match exactly.
But it would mean CommonsData having a slightly different data structure to Wikidata.
Slightly different isn't a problem, but the ability to "nest" entities and/or qualifiers is a fundamental structural incompatibility. That's not good.
...
If we're looking to support these searches and orderings, does it matter that a particular field may sometimes be on the file item, but sometimes on a Wikidata item?
Searching (or rather: querying) across both datasets at once would be nice, but that's pretty far off. First, we need decent query capabilities for the individual datasets.
I would imagine that for all files based on a specific book, the same approach would be chosen (e.g. a Wikidata Item for the Book, and MediaInfo for each file).
Note that Queries are different from Searches. Searches are ranked and potentially open-ended. Queries have a definite result set, and may be sorted. Queries will (in the future) be pre-defined and cached, and can be used on wiki pages via Lua, to create a list or table based on whatever logic you like. On that level, it would also be possible to combine information from two repositories (Wikidata and Commons), but at that point, we are talking about proper programming in Lua.
Would it matter that for one of the engravings we have two copies, so the information that we would be wanting for search and selection and ordering would be stored on a Wikidata item; whereas for the rest, with only a single copy, it would be stored on a Commons item?
It would be tricky to manage this nicely for the general case. For your specific book, you may write some specialized Lua code that deals with this.
However, I would not recommend creating a data item just because you have two files in a single case. If the relevant data is not too extensive, it's fine to duplicate it.
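To make the mixed case concrete, here is a minimal sketch of how a field could be looked up either on the file's own data or on the item the file refers to, and then used to order a set of files. It is written in Python purely for illustration (the thread anticipates Lua on-wiki); the record layout, the `derivative_of` key, the item ID `Q_PLATE_3` and the function `resolve_field` are all assumptions, not part of any existing API.

```python
def resolve_field(file_record, items, name):
    """Return a field from the file's own data, else from the item it references."""
    own = file_record.get("data", {})
    if name in own:
        return own[name]
    item_id = file_record.get("derivative_of")
    if item_id is not None:
        return items.get(item_id, {}).get(name)
    return None


# Hypothetical item store: one engraving has two copies, so its data
# lives on a shared item instead of being duplicated on each file.
items = {"Q_PLATE_3": {"plate": 3}}

files = [
    {"name": "Plate 3 (copy A).jpg", "derivative_of": "Q_PLATE_3", "data": {}},
    {"name": "Plate 3 (copy B).jpg", "derivative_of": "Q_PLATE_3", "data": {}},
    {"name": "Plate 1.jpg", "data": {"plate": 1}},
    {"name": "Plate 2.jpg", "data": {"plate": 2}},
]

# Order by plate number, regardless of where the number is stored.
# (A real implementation would also need a policy for files with no value.)
ordered = sorted(files, key=lambda f: resolve_field(f, items, "plate"))
ordered_names = [f["name"] for f in ordered]
```

The lookup hides where the value is stored, which is the property that any standard ordering of a hit-set would need.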
None of these questions are without solutions. But it does, I think, require a decisive view to be reached, as to what we propose to do.
I think there are two main parts to your questions:
a) How to model contributions without modeling all the "base works" separately. I think multiple Contributor statements with separate lists of qualifiers and source references cover this.
b) How to best integrate the information that lies partially on Wikidata, and partially on Commons. This is indeed tricky, and perhaps there is no general, one-size-fits-all solution.
One thing that may help is the planned "high level media info API", which provides license/attribution/legal information about files in a unified form, drawing from structured data both on Commons and Wikidata.
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
John,
Yes, it is related, as both could be useful for tagging Commons images. (Acanthostichus hispaniolicus SMNSDO5205-1 01.jpg https://commons.wikimedia.org/wiki/File:Acanthostichus_hispaniolicus_SMNSDO5205-1_01.jpg and Acanthostichus hispaniolicus SMNSDO5205-1 02.jpg https://commons.wikimedia.org/wiki/File:Acanthostichus_hispaniolicus_SMNSDO5205-1_02.jpg show the same ant specimen, so they should link to the same Wikidata item.)
Maintaining a large directory of small things may be quite a burden for the community, but I guess that if you donate the associated images at the same time, the benefit would be more obvious to the community, so they would more readily agree to do it :).
On Sat, Oct 25, 2014 at 2:39 PM, John Cummings <John.Cummings@wikimedia.org.uk> wrote:
I'm not sure if this quite fits here but it's related.
A few months ago I went to a meeting of natural history organisations in the UK; they were looking for a way of creating a centralised directory of specimens held in different institutions in the UK.
Wikidata seems like a possible place for this to happen: for each species there could be a record of where specimens are held. However, the number of organisations holding specimens would differ greatly from species to species, as would the types of specimen (e.g. jaw bones or a whole skeleton). I also wonder if this would include other organisations like zoos, where the specimens would be alive.
Any thoughts would be welcome.
Thanks
John
On 24 October 2014 22:09, James Heald j.heald@ucl.ac.uk wrote:
Hi Daniel, thanks for getting back to me.
A couple of quick points, while I digest in more detail what you've written:
(1) I don't think that one can say a-priori that for all files based on a specific book, the same approach would be chosen.
If you look at either of the categories I cited, you can see we have files being contributed by multiple different uploaders, from multiple different sources:
https://commons.wikimedia.org/wiki/Category:Twenty-four_Views_by_Henry_Salt_%281809%29
https://commons.wikimedia.org/wiki/Category:Pyne%27s_Royal_Residences
Some have uploaded a single file, some have uploaded multiple files. There's no reason to assume they would all adopt the same approach without strong guidance, and even then it would be questionable.
(2) It's really important to link together all the people that made a contribution, what that contribution is, when it was made, and what rights there are on it.
I don't think this can be done using qualifiers on a contributor property, because there may be more than one contributor involved.
(3) If we make it hard to move work-information easily, and as a unit, between file-data and item-data we're going to really make things difficult.
(4) It is essential to be able to order the hit-sets of searches, eg analogues of the current category view, and there are likely to be a number of different standard orderings that should be available.
If this requires coping with the fact that some of the information will be stored in file-data and some will be on Wikidata items referenced from file-data, that needs to be designed in right from the start as a basic requirement.
Cheers,
James.
On 24/10/2014 20:51, Daniel Kinzler wrote:
Am 24.10.2014 02:17, schrieb James Heald:
I think the issue I'm stuck on is: what property would the qualifier be attached to?
...
The first choice might be attaching the information to a "Creator" property.
I would prefer "Contributor", but yea, something like that.
But for the underlying works of these engravings, there are typically *two* creators, both of which are significant -- the artist, and the engraver.
You can have any number of Statements about a Property, and each of these Statements has its own set of Qualifiers (and Source References). E.g.
Contributor: Henry Foo
  Point in time: 1872
  Role: Engraver
Contributor: Melissa Bar
  Point in time: 1870
  Role: Illustrator
So instead, we might consider an "Underlying work" property, analogous to the "Work" class in the Multimedia API development, "a creation to which copyright, authorship, etc is attached", as per
https://docs.google.com/document/d/1tzwGtXRyK3o2ZEfc85RJ978znRdrf9EkqdJ0zVjmQqs/edit#heading=h.akjw1xj0kfpf
But can we then capture the whole of the work class in such a property?
No. Using "Underlying work" (or, as I would prefer to call it, "Derivative of"), the Work has to be modeled as an Entity in its own right - either a Wikidata Item or a MediaInfo entity.
There seem to me a couple of issues:
(1) What should be the value of the property? There doesn't seem to be an obvious choice (eg if one were importing from a repository or catalogue). What would be the datatype, and what should we store for this field?
It would be a reference to another Entity. Only the ID would be stored.
(2) It seems to me that we would need to enable qualifiers on qualifiers -- for example, if we represented the creator of an underlying engraving using a qualifier, we would then seem to need another qualifier to indicate whether the role was as artist, or as engraver.
See above: there is no need for this, since we can have any number of "top level" Creator/Contributor entries.
In some cases, the contributor's role may be implicit by using a more specific Property, like Painter, Director, etc.
Similarly, if there is sourcing, there are sources that might apply to one (1st level) qualifier, but not another. But normally the WD sourcing model is for a whole statement, not part of it.
They would apply to one *Statement* but not the other:
Contributor: Henry Foo
  ...
  Reference:
    Title: Detailed Research On That Book
    DOI: ...
  Reference:
    Title: My Art Website
    URL: ...
Contributor: Melissa Bar
  ...
  Reference:
    Title: Awesome Art Book
    Author: R.N. Dewy
    ISBN: ...
What we would really be doing, if we did this in full, would in effect be to store the contents of what might otherwise be an entire item in a property.
If we have that much relevant information, it might be worth creating a data item. Especially if we end up repeating that info for multiple files (e.g. engravings from the same book).
This can and should be decided on a case by case basis. Just like on Wikipedia, it makes sense to create a separate Article when a section of some more general article grows too big.
That has some attractiveness, if at a future time one wanted to promote the 'underlying work' to have a Wikidata item in its own right -- the two structures would then match exactly.
But it would mean CommonsData having a slightly different data structure to Wikidata.
Slightly different isn't a problem, but the ability to "nest" entities and/or qualifiers is a fundamental structural incompatibility. That's not good.
...
If we're looking to support these searches and orderings, does it matter that a particular field may sometimes be on the file item, but sometimes on a Wikidata item?
Searching (or rather: querying) across both datasets at once would be nice, but that's pretty far off. First, we need decent query capabilities for the individual datasets.
I would imagine that for all files based on a specific book, the same approach would be chosen (e.g. a Wikidata Item for the Book, and MediaInfo for each file).
Note that Queries are different from Searches. Searches are ranked and potentially open-ended. Queries have a definite result set, and may be sorted. Queries will (in the future) be pre-defined and cached, and can be used on wiki pages via Lua, to create a list or table based on whatever logic you like. On that level, it would also be possible to combine information from two repositories (Wikidata and Commons), but at that point, we are talking about proper programming in Lua.
Would it matter that for one of the engravings we have two copies, so the information that we would be wanting for search and selection and ordering would be stored on a Wikidata item; whereas for the rest, with only a single copy, it would be stored on a Commons item?
It would be tricky to manage this nicely for the general case. For your specific book, you may write some specialized Lua code that deals with this.
However, I would not recommend creating a data item just because you have two files in a single case. If the relevant data is not too extensive, it's fine to duplicate it.
None of these questions are without solutions. But it does, I think, require a decisive view to be reached, as to what we propose to do.
I think there are two main parts to your questions:
a) How to model contributions without modeling all the "base works" separately. I think multiple Contributor statements with separate lists of qualifiers and source references cover this.
b) How to best integrate the information that lies partially on Wikidata, and partially on Commons. This is indeed tricky, and perhaps there is no general, one-size-fits-all solution.
One thing that may help is the planned "high level media info API", which provides license/attribution/legal information about files in a unified form, drawing from structured data both on Commons and Wikidata.
-- *John Cummings - **Wikimedia UK volunteer* tweet @mrjohnc
Hello John
I'm not sure that Wikidata is the right place for this kind of information, due to its high granularity. As Zolo points out, maintaining a large directory of small things may be quite a burden for the community.
However, Wikibase is by design well suited for representing research data, since it allows for very fine-grained sourcing and annotation. Europeana's EAGLE project[1] is already using Wikibase[2] to manage diverse translations of inscriptions (e.g. [3]). Such a local Wikibase installation could still refer to Wikidata as a vocabulary, e.g. using Wikidata Q-Numbers to identify taxa.
-- daniel
[1] http://www.eagle-network.eu/
[2] http://www.eagle-network.eu/wiki/
[3] http://www.eagle-network.eu/wiki/index.php/Item:Q5102?setlang=en
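One way to picture the "Wikidata as a vocabulary" arrangement is a local specimen directory whose records carry only a Wikidata Q-number for the taxon, with everything else kept locally. The sketch below is an assumption-laden illustration in Python: the record fields, the institution names, the function `index_by_taxon` and the Q-number `Q1000001` are placeholders, and the ID is not claimed to be the real Wikidata item for any taxon.

```python
import re
from collections import defaultdict

# Q-numbers are "Q" followed by a positive integer.
QID = re.compile(r"^Q[1-9][0-9]*$")


def index_by_taxon(specimens):
    """Group locally held specimen records by the Wikidata Q-number of their taxon."""
    index = defaultdict(list)
    for record in specimens:
        taxon = record["taxon"]
        if not QID.match(taxon):
            raise ValueError(f"not a Q-number: {taxon!r}")
        index[taxon].append(record)
    return dict(index)


# Placeholder records; only the taxon points outward, to Wikidata.
specimens = [
    {"taxon": "Q1000001", "holder": "Museum A", "kind": "jaw bone"},
    {"taxon": "Q1000001", "holder": "Museum B", "kind": "whole skeleton"},
    {"taxon": "Q1000002", "holder": "Zoo C", "kind": "living animal"},
]

index = index_by_taxon(specimens)
```

The fine-grained records never need items on Wikidata itself; only the shared identifiers do.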
In my opinion "how many items will X add" is a false problem. If, for example, we moved file categories to subpages as we do on templates, we'd have 20 million new Commons pages; but the question would be: are they as accessible as they were before? Similarly, the only danger is when items' statements are not transcluded outside Wikidata. When stuff is in use on projects, as for authority codes; and when it can be edited in-place, as we all do for sitelinks and ru.wiki does for much more: then there is nothing I worry about.
Nemo
@Daniel - the further back you go, the more notable the engravings in books become (see for example the whole family of engravings and copies thereof for the 17th-century "Counts of Holland" series) and sometimes engravings from books are the source for paintings.
@Nemo - I don't follow your thinking on this one - when you say "new Commons pages", do you mean new Wikidata items based on Commons categories? I don't see a problem with that. Things that have needed a category on Wikimedia Commons are probably notable enough for Wikidata (though I can think of some non-notable categories like "1610 engravings" that would be unnecessary on Wikidata).
On Tue, Oct 14, 2014 at 8:39 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
In my opinion "how many items will X add" is a false problem. If, for example, we moved file categories to subpages as we do on templates, we'd have 20 million new Commons pages; but the question would be: are they as accessible as they were before? Similarly, the only danger is when items' statements are not transcluded outside Wikidata. When stuff is in use on projects, as for authority codes; and when it can be edited in-place, as we all do for sitelinks and ru.wiki does for much more: then there is nothing I worry about.
Nemo