Hi James,
The short answer to your long email is that during Wikimania we had a
meeting with the WMF Multimedia team - specifically Fabrice and Gilles -
and both groups expressed interest in ensuring that the GWT (and Europeana
as an external user) was part of the discussions about structured data.
Jens and Daniel from the Wikidata team in WM-De were also very
enthusiastic. Equally, the WMF analytics team (Toby) wants to talk to us
about the metrics GLAMs need.
The first specific outcome was that Dan Entous and Joris Pekel have been
invited to attend the small 'kickoff' event planned for Berlin in
September. Dan as the lead for the GWT, and Joris as the guy in Europeana
who has the experience in managing datasets to make them work in the
Europeana API.
I understand there is also going to be a larger meeting in October or
November in Amsterdam(?) which will also feature our involvement.
-Liam
Peace, love & metadata
On 15 August 2014 18:49, James Heald <j.heald(a)ucl.ac.uk> wrote:
So now Wikimania has been and gone, can we think about where we're at with
the Structured Data initiative for Commons?
In particular Liam, I think you said after one of the sessions you were
"all over this" -- it would be good to know your thoughts.
I do think, as the GW toolset community, we ought to have a lot we should
be able to offer here, because essentially we are doing big uploads from
data which is *already* structured, so
(i) we've got at least some experience already with working with data
that is at least in some form structured
(ii) we may know and be able to flag some awkward edge cases
(iii) we would like to accompany uploads with data that can be "born
structured", rather than converted later
(iv) in any case we're uploading a lot of images, which somebody is
going to have to convert to structured
(v) we may have seen (or even written) some of the gnarlier templates
on Commons, that migration will have to cope with.
It's not clear (at least not yet to me) how the Multimedia and Wikidata
teams may best want to be communicated with, but I'm including Keegan (WMF)
in cc:, who I think is the staffer with assigned community liaison
responsibility.
The biggest message to me from looking through some of the documents after
the meetings is just how much of the information is going to be stored as
part of central main Wikidata.
Essentially, if we upload an image of an object, then it is expected that
an 'item' (ie a Q-number) for that object will be added to Wikidata, which
will contain all the metadata that describes the object rather than just
the image.
The Wikidata community is already developing a very strong ontology to
describe such objects -- key resources are
https://www.wikidata.org/wiki/Wikidata:WikiProject_Visual_arts/Item_structure
https://www.wikidata.org/wiki/Wikidata:WikiProject_Books
where there are active and friendly communities involved in refining them.
We can get involved and help the process right now, by trying to identify
and fill any gaps in these ontologies, and by being enthusiastic early
adopters -- there is no reason we should not be getting involved right now,
filling in appropriate metadata on Wikidata right now each time we upload
an image to Commons -- real-world testing the current ontologies to see
what creaks.
Data specific to the image itself (rather than what it shows) will be
stored in a separate Commons Wikibase.
This will include such things as the file name, a file description,
photographer, wikicontributor name, precise geographical location etc.
Commons Wikibase is also likely to contain a tag-like "topic list" -- a
list of all the Wikidata Q-numbers that apply to the image. These I think
will be gathered by climbing up the Wikidata tree from any specified
Subject identified for the image -- so a view of Westminster Abbey might
get topics such as "Westminster; London; England; Cathedral; religious
building" etc; and games will be invented to encourage people to identify
more such topics for the best images.
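
A minimal sketch of the "climbing up the tree" idea, assuming a toy in-memory graph: the BROADER mapping and the topics_for name are hypothetical, and plain labels stand in for Q-numbers. It is not how the Multimedia or Wikidata teams plan to implement this.

```python
from collections import deque

# Hypothetical toy graph: each item maps to its broader items.
# Labels stand in for Q-numbers; none of these are real Wikidata IDs.
BROADER = {
    "Westminster Abbey": ["Westminster", "Cathedral"],
    "Westminster": ["London"],
    "London": ["England"],
    "Cathedral": ["religious building"],
}


def topics_for(subject, broader=BROADER):
    """Collect every item reachable by climbing up from the subject."""
    topics = []
    visited = {subject}
    queue = deque([subject])
    while queue:
        item = queue.popleft()
        for parent in broader.get(item, []):
            if parent not in visited:  # guards against cycles and repeats
                visited.add(parent)
                topics.append(parent)
                queue.append(parent)
    return topics


print(topics_for("Westminster Abbey"))
```

The visited set matters because a real hierarchy can reach the same broader item by several routes.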
This should allow WM to introduce a proper combinatorial search engine
based on tags for Commons; and many of the most egregious Commons
intersection categories will wither on the vine. (There is debate as to
whether Commons will end up needing *any* category pages, but I suspect it
will, because they are just so convenient to use as places for jotting down
facts -- on the other hand, it is possible one might be forced to create an
associated Commons article/gallery for that).
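
To illustrate why tag-based search could replace intersection categories, here is a rough sketch, assuming each file simply carries a set of topics. The build_index and search names, and the file names in the example, are made up for illustration.

```python
from collections import defaultdict


def build_index(file_topics):
    """Invert {filename: topics} into {topic: set of filenames}."""
    index = defaultdict(set)
    for filename, topics in file_topics.items():
        for topic in topics:
            index[topic].add(filename)
    return index


def search(index, *topics):
    """Return the files tagged with ALL of the given topics."""
    if not topics:
        return set()
    matches = [index.get(topic, set()) for topic in topics]
    return set.intersection(*matches)


# Hypothetical files and topics, for illustration only.
index = build_index({
    "Abbey_west_front.jpg": {"London", "Cathedral"},
    "Tower_Bridge.jpg": {"London", "Bridge"},
})
print(search(index, "London", "Cathedral"))
```

Any combination of topics can be queried on demand, so nobody has to create a category such as "Cathedrals in London" in advance.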
It would be nice (IMO) if there could be an interface to the topic list
through the wikisource code for the filepage -- I think this would be
well-received by the community, allow easy adaptation of existing bots,
etc. But this may be resisted as being too fragile a point of failure, as
it would mean that people making hand-edits would have to know (and get
right) the meaningless number-strings of individual Q-numbers.
Finally some very specific text data -- such as the EXIF data describing
shutter-speed etc, is likely to continue to live on the file description
page; because it's probably not something that people are primarily going
to want to search, and it may be a bit unpredictable.
Part of the immediate effort in the next few weeks is going to be to
produce clearer ideas about what information is going to live where, and in
particular what information is going to live on the Commons Wikibase, and
how it will be structured.
The good news is that much of the most complicated information will be
stored on WikiData, so can be as detailed as we like (and can be accessed
live now).
On the other hand, the design for Commons Wikibase will initially aim to
be as simple as possible, with the aim to evolve it as experience is
gained, to migrate the edge cases later.
The file description page (or something not entirely unlike it) will
continue to exist as a view bringing together all the data.
Current templates will be re-written to draw information from Wikidata.
However, this won't yet be possible until the Wikidata team has implemented
the "Arbitrary Access" feature -- the ability for a wikipage to access the
properties of an arbitrary Wikidata Q-number. What's causing the hold-up
is that if the properties of the Q-number item are edited, then all the
pages that access that Q-number need to be marked as dirty and
regenerated. That's easy if you only have one page that can access the
Q-number, but hard if arbitrary pages can access it, through a chain of
properties.
(eg: the file page for a painting Q12345 may use property Pnnn to reach its
creator Q4567, who has property Pxxx, a date of birth. If the date of birth
gets made more precise, the system has to recurse back to indicate that all
the file pages showing pictures of that creator's work need to be
regenerated. This is tough, but file-page templates won't be able to draw
on Wikidata information until it is in place).
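
The bookkeeping problem can be sketched as a reverse index from items to the pages that read them. This is only an illustration under simple assumptions (the UsageTracker class and its method names are invented here), not a description of how Wikibase actually tracks usage.

```python
from collections import defaultdict


class UsageTracker:
    """Toy model: remember which pages read which items when rendered."""

    def __init__(self):
        self.pages_using = defaultdict(set)  # item -> pages that read it
        self.items_used_by = {}              # page -> items it read
        self.dirty = set()                   # pages awaiting regeneration

    def record_render(self, page, items_accessed):
        """Call after rendering a page, with EVERY item it touched,
        including items reached through a chain of properties."""
        for item in self.items_used_by.pop(page, set()):
            self.pages_using[item].discard(page)  # drop stale usage
        self.items_used_by[page] = set(items_accessed)
        for item in items_accessed:
            self.pages_using[item].add(page)
        self.dirty.discard(page)

    def item_edited(self, item):
        """Mark every page that read this item as needing regeneration."""
        affected = set(self.pages_using.get(item, ()))
        self.dirty |= affected
        return affected


tracker = UsageTracker()
# The painting's file page reads the painting AND, via it, the creator.
tracker.record_render("File:Painting.jpg", ["Q12345", "Q4567"])
print(tracker.item_edited("Q4567"))  # the creator's date of birth changed
```

With only one page per item the reverse index is trivial. With arbitrary access, every render has to report the full chain of items it touched, which is where the cost comes from.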
It is hoped to progressively simplify the myriad different templates
used on the file pages as quickly as possible, to standardise them to draw
from the structured data stores.
Templates to display summary information about collection objects, which
will draw from Wikidata, may well be standardised so they can easily be
used on Wikipedias and other wikis -- or, to put that the other way round,
since Wikipedias and other wikis will also be developing standardised
templates to display summary object information, it should well be possible
to use the same code twice.
However, it would be good to get involved in the development of these
templates, to make sure they accurately reflect the information we
currently like to show in Commons.
(There may be some important details to get right -- for example the
Wikidata data-type for dates currently comprises a 'best' value, and an
optional numeric range (which is great for sorting). But if the catalogue
source data says eg "mid 17th century to early 18th century", do we want to
make sure that precise string is still stored? And should it still be
possible to make it visible? This needs close engagement; but probably
principally with the community-based development effort in the Wikidata
community groups.)
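
One way to picture the question is a record that keeps the catalogue's wording alongside the sortable values. This is a hypothetical structure, not the actual Wikidata date type: the CatalogueDate class and its field names are invented, and the years in the example are illustrative guesses at what the phrase might map to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogueDate:
    best_year: int                       # single 'best' value
    earliest_year: Optional[int] = None  # optional numeric range
    latest_year: Optional[int] = None
    source_text: Optional[str] = None    # exact wording from the catalogue

    def display(self) -> str:
        """Prefer the catalogue's own wording when it was kept."""
        return self.source_text if self.source_text else str(self.best_year)

    def sort_key(self) -> int:
        """Sort on the start of the range, falling back to the best value."""
        if self.earliest_year is not None:
            return self.earliest_year
        return self.best_year


# Illustrative values only: the year numbers are assumptions.
date = CatalogueDate(
    best_year=1680,
    earliest_year=1640,
    latest_year=1720,
    source_text="mid 17th century to early 18th century",
)
print(date.display())
```

Keeping source_text separate means sorting can use the numbers while the file page still shows exactly what the institution wrote.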
Already very standardised are the present Commons creator templates and
Commons institution templates. These are likely to be an early quick win.
Looking down a typical present-day filepage, that means that it is the
Source/Photographer information in the present "Artist" template, which is
currently free-form and often a freely composed pulling-together of multiple
different sources of metadata, that is likely to need the most
work to unpick.
This is also the field most commonly used for the credit link-back
templates to the originating GLAM institutions, which are obviously a key
consideration for our GLAM partners.
These templates may currently often be very institution-specific, and may
do quite complex stuff -- eg the present version of the British Library
https://commons.wikimedia.org/wiki/Template:British_Library_image
as used at eg
https://commons.wikimedia.org/wiki/File:Cuthbert_discovers_piece_of_timber_-_Life_of_St._Cuthbert_%28late_12th_C%29,_f.45v_-_BL_Yates_Thompson_MS_26.jpg
contains link-backs to a number of catalogues, each with their own
corresponding text; and as well as linking back to the information about
the underlying object (which is likely to be stored on WikiData), it will
also likely contain a link-back to the source of the original file (in this
case the specific file at BL images online), which being information
specific to the file is likely to be living on the Commons Wikibase.
The Source/Photographer field as a whole is (I think) likely to be one of
the last on the file page to be assimilated, because it can be so
sui generis, and so the present rat's nest of templates may continue to be
acceptable for some time -- though even they are likely to need
modification, as eg Photographer information moves to the Commons Wikibase.
That said, each institution is only going to need to manage its own
template.
But it probably would make sense to start an effort to think:
* what is the structured data that typically lives in these templates?
* and is there some standardisation we could start to get into the box,
even now?
Apart from anything else, something readily customisable might be much
easier for new institutions to adapt and adopt.
For the migration project as a whole, an audit of all the source templates
of this sort would be useful. That is something the MM/WD project team
could perhaps usefully encourage the community to undertake for them.
I have to admit there are lines I am not sure about, as to what gets a
Wikidata entry and what does not.
For example, when does a photograph deserve its own entry?
Perhaps a bright line is that an image of a photograph one took oneself
doesn't get an entry on Wikidata, but a photograph by Man Ray perhaps does.
What about a photograph by a photographer of more intermediate
notability? Or instead, perhaps an engraving from a book of 19th century
engravings?
It makes sense to create an identifier for the book on Wikidata; and also
the place depicted. This is often almost enough to identify the particular
image, but really one would want to store the page number, and perhaps the
scan number as well. (Since one might well have either one or the other or
both). It would probably be good to store some identifier for the set of
scans as well -- this too probably doesn't belong on Wikidata, (although
one might identify it as set number <identifier> from eg the Mechanical
Curator collection, which itself probably then *would* get a Wikidata
identifier).
So the Commons wikibase probably needs to be able to identify images as
having a sequence in a particular set, and that set as perhaps having an
identifier that links it to a collection which has a particular Q-number on
Wikidata.
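
As a sketch of what that might look like as a data model, assuming nothing about the eventual Commons Wikibase design: the ImageSet and ImageRecord classes, their field names, and the identifiers in the example are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageSet:
    set_id: str                           # identifier of the set of scans
    collection_qid: Optional[str] = None  # Q-number of the parent collection


@dataclass(frozen=True)
class ImageRecord:
    filename: str
    image_set: ImageSet
    scan_number: Optional[int] = None     # either, both, or neither of
    page_number: Optional[int] = None     # these may be known


def in_set_order(records, set_id):
    """Files belonging to one set, ordered by scan number.
    Files with no scan number are placed last."""
    members = [r for r in records if r.image_set.set_id == set_id]
    return sorted(
        members,
        key=lambda r: (r.scan_number is None, r.scan_number or 0),
    )


# Placeholder identifiers, not real ones.
scans = ImageSet(set_id="set-001", collection_qid="Q0000")
records = [
    ImageRecord("plate_b.jpg", scans, scan_number=12, page_number=10),
    ImageRecord("plate_a.jpg", scans, scan_number=3),
    ImageRecord("loose_plate.jpg", scans),
]
print([r.filename for r in in_set_order(records, "set-001")])
```

Only the collection carries a Wikidata Q-number here. The set identifier and the per-file sequence stay local to the file records, which matches the split suggested above.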
This is the kind of thinking we will particularly need to be doing over
the next few weeks -- what is the metadata that will *not* be stored on
Wikidata, so will *need* to be stored on the Commons Wikibase if it is to
continue to be accessible? That is something that we as the community need
to evolve, thinking of all the use cases we can.
I am sure that there was something else I meant to say, but this email
seems long enough already.
There's a scratchpad of some bookmarks I started keeping on a subpage of
my userpage at Wikidata that people are welcome to,
https://www.wikidata.org/wiki/User:Jheald/bookmarks
This gives a nutshell of where some different fields might be stored
https://docs.google.com/presentation/d/1x-vOUr-zveLzoIP6uJC1Sz95xwuTFNwaBqJqmtWH8Qk/edit#slide=id.g3704ec6dd_2_554
This etherpad is good, esp lines immediately after 140, and "What new
fields should be created to complement the old fields?" at 156 (actually in
the context of Upload Wizard, but it gives some ideas)
http://etherpad.wikimedia.org/p/multimedia-wikidata-catchup
There's a spreadsheet showing some of the fields they're thinking about
https://docs.google.com/spreadsheets/d/1rk05EcLZpJaqOh5wymK6teIQufH9t0xn6oDPeyJHap0/edit#gid=0
-- though I think quite a lot of what's down as living on WikiData should
really be Commons WikiBase --
and also a suggestion based on some simple use-cases:
https://docs.google.com/document/d/1C7UTB1kbaf_EisF3LmhpIQkb_ifSkB0rD8IGI9aMQhM/edit
though I think we would probably see that as *too* simple, even for a
first build, because for many of our applications
sequence-number in set
& set-identifier in collection
are probably essential quantities to have (as they probably are for the
WikiSource collection too).
Finally, this is an etherpad from the Hackathon just been, which has a lot
of useful links at the end.
http://etherpad.wikimedia.org/p/structured-data-discussion-7-august-2014
Hope this initial brain dump is of at least some use, to make it worth its
length,
All best,
James Heald. (User:Jheald).