Re: [Wikidata-l] Commons Categories again (was Re: Commons Wikibase)

1 Sep 2014

Hoi,
Wikidata is very much a "working database". Its relevance is exactly
because of this. Without the connection to the interwiki links, it would
not be the same, it would not have the coverage and it would not have the
same sized community.

Considerations about secondary use are secondary. Yes, people may use it
for their own purposes and when it fits their needs, well and good. When it
does not, that is fine too. As it is, we do have all kinds of Wiki "junk" in
there. We have disambiguation pages, list articles, templates, categories.
The challenge is to find a use for them.

When I add statements based on categories, I "document" many categories
[1]. As a result over 900 items for categories will show the result of a
query in the Reasonator. The results are what I think a category could
contain given the subject of a category. For Wikipedians they are articles
not categorised, red links and blue links.

There are several reasons why this is not (yet) a perfect fit. The most
obvious one is including articles that are not part of the selection, e.g. a
list in a category full of humans. Currently not everything can be
expressed in a way that allows Reasonator to pick things up in a query;
dates come to mind. Then there are the categories that have an "arbitrary"
set of entries.

I am not going to speculate on what kind of qualifiers Commons will come up
with. In essence when you can sort it / select it Wikidata will do a better
job for you. The "only" thing we have to do is identify the items that fit
the mold. This is something that you can often find the basis for in
existing categories.
Thanks,
     GerardM

[1]
http://ultimategerardm.blogspot.nl/2014/08/wikidata-my-workflow-enriching-w…

http://tools.wmflabs.org/wikidata-todo/autolist.html?q=CLAIM%5B31%3A4167836…

On 1 September 2014 00:42, James Heald <j.heald(a)ucl.ac.uk> wrote:

...
  Hi everybody,

 Sorry to open up an old thread again after ten days, but there were some
 things in Lydia's reply below that I wanted to come back to.

 So, first, a couple of examples of the kind of Commons Categories I had in
 mind:

 https://commons.wikimedia.org/wiki/Category:Images_released_by_British_Library_Images_Online

 https://commons.wikimedia.org/wiki/Category:Metropolitan_Improvements_%281828%29_Thomas_Hosmer_Shepherd

 Despite their names, both these cats effectively identify images from
 particular photosets on Flickr.  The first category relates to a particular
 set of images released by a particular institution on a particular date.
 The second relates to a particular set of scans from a particular edition
 of a particular book.  Both (IMO) would (and, moreover *should*) currently
 fail Wikidata:Notability.

 The book, and even the edition, might be notable. But a particular set of
 scans surely would not. Similarly, the first category is really just a
 photoset from Flickr, again something that wouldn't currently get a
 Wikidata Q-number.

 Now in the email below, Lydia effectively said: no problem, just give each
 Commons Category a Wikidata Q-number anyway.  ("Imho they should be on
 Wikidata. I fear if we introduce another layer it'll be considerably harder
 to use and maintain.")

 GerardM, in sessions at Wikimania, also argued strongly simply for putting
 everything in Wikidata.

 But I think this would be a mistake, because IMO Wikidata:Notability is a
 positive virtue, which should be defended.  It is *useful* to people that
 they can download a dump of Wikidata for their own purposes, and get
 real-world relevant items, rather than the dump being bloated with wiki
 junk.

 So in my opinion, Commons categories should generally *not* get Q-numbers
 on Wikidata (unless they pass WD:N), but should instead get items on the
 Commons Wikibase which is being created expressly for the purpose of
 holding structured data on things which really only have a commonswiki
 significance, and are not real-world notable.

 A second point relates to Magnus's issue about how much of this could be
 replaced by queries.

 Yes, if one were progressively building up a topic search on images from
 books in the 1-million image BL Mechanical Curator release, one might ask
 for books about London, then books published in a particular date range.
 But within that, the natural query to specify scans from this particular
 copy of 'Metropolitan Improvements' is the image's membership of this
 particular set -- membership of the set in itself is something that should
 be queryable, and such a query is the kind of query that, at the right
 stage, should be offerable to the user trying to refine their search.

 In fact, most current Commons categories will not be WD-notable.  But even
 for the most egregious of Commons intersection categories, IMO it will
 still be worth the Commons Wikibase tracking category membership for an
 image, not least for the ability that will give to easily present the
 category's files in different ways -- eg perhaps sorted by filename; or by
 original creation date; or by upload date; or by uploader; or by
 geographical proximity... etc.  Holding the category membership in the
 wikibase then allows people to write gadgets to sort or filter or
 re-present the category in multiple ways.  So it's useful to have the
 category as an entity that can be a target for a property.
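
 A minimal sketch of the kind of re-presentation such a gadget might do,
 assuming category membership and per-file metadata can be read as plain
 records. The FileRecord fields and the present_category helper are
 hypothetical names for illustration, not an existing Commons or Wikibase
 API, and geographical proximity is left out.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical per-file metadata a gadget might read from the wikibase."""
    filename: str
    created: date    # original creation date of the work
    uploaded: date   # date the file was uploaded
    uploader: str


# Each presentation order is just a different sort key over the same members.
SORT_KEYS = {
    "filename": lambda f: f.filename,
    "created": lambda f: f.created,
    "uploaded": lambda f: f.uploaded,
    "uploader": lambda f: f.uploader,
}


def present_category(files, order_by="filename"):
    """Return the category's files sorted by the chosen (assumed) key."""
    return sorted(files, key=SORT_KEYS[order_by])
```

 Because membership is held as data rather than as page text, adding another
 ordering would only mean adding another entry to SORT_KEYS.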

 But there are also reasons for a category to have an item in its own right
 -- because there is structured data that one may wish to associate with the
 category:  one example would be access stats to members of the category (eg
 which categories in the Mechanical Curator collection have had the most
 file views?) -- the kind of thing of great interest to GLAMs.

 Many categories also contain information defining them -- for example, for
 the book scans category, one would want a property that this category
 contained scans of the particular book (pointed to by its Q-number),
 probably a particular edition (probably a qualifier).  One might also want
 to associate linked data -- pointers to entries for the book in (possibly
 multiple) catalogues of its original host institution.

 So for all these reasons it may well be useful, as a matter of course, to
 have a container for structured information associated with each commonscat.

 This is why I think each and every category on Commons should have its own
 Commons Wikibase item, with an associated C-number.

 Queries are important, but I'd suggest they are best seen as an *addition*
 to the present category system, rather than a *replacement* for it.

 A particular way forward, it seems to me,  might be to allow categories to
 be *augmented* with specific queries -- i.e. to allow rules to be specified
 for particular categories, so that files whose structured-data topic
 information matched the rules would automatically be added to the
 categories, alongside the files already there.

 Categories, including intersection categories, would therefore effectively
 auto-update, without human intervention, to include new files if they had
 appropriate topic information.
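
 As a rough sketch of this augmentation idea, assuming a rule is simply a
 set of property/value pairs that a file's topic data must match: the
 Category class, the rule format and the property names are all
 hypothetical, chosen only to illustrate the principle.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """Hypothetical category: hand-added members plus an optional rule."""
    name: str
    manual_members: set = field(default_factory=set)
    # Assumed rule format: property name -> required value.
    rule: dict = field(default_factory=dict)


def matches(rule, topics):
    """A file matches only if a rule exists and every required pair is present."""
    return bool(rule) and all(topics.get(p) == v for p, v in rule.items())


def members(category, file_topics):
    """Union of legacy hand-categorised files and files selected by the rule."""
    by_rule = {
        name for name, topics in file_topics.items()
        if matches(category.rule, topics)
    }
    return category.manual_members | by_rule
```

 A category with no rule behaves exactly as it does today, while a category
 with a rule picks up new files as soon as their topic data matches.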

 Existing legacy categorisation information would survive, allowing the new
 augmentation approach to slowly come into play if topic information were
 initially weak.  And categories should still be specifiable by hand (or
 automatically through templates, e.g. as source categories are often
 specified through source templates) -- because this can still be the most
 efficient way to specify naturally closed sets.

 This would effectively allow a transition pathway towards categorisation /
 sets-of-interest becoming more determined by the structured data.

 One thing in particular it could allow would be a gadget to highlight
 images that were in a category directly, *not* by virtue of any rule on any
 metadata, which could then allow such images to be investigated and/or have
 their topic metadata improved.
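
 One way such a gadget could pick out those images, sketched under the
 assumption that a category's rule is a set of property/value pairs and
 that each file's topic metadata is available as a dictionary. The
 function name and data shapes are hypothetical.

```python
def direct_only(manual_members, rule, file_topics):
    """List hand-added files whose topic metadata does not satisfy the rule.

    manual_members: set of filenames placed in the category by hand.
    rule: assumed format, property name -> required value.
    file_topics: filename -> dict of that file's topic metadata.
    """
    def satisfied(topics):
        return bool(rule) and all(topics.get(p) == v for p, v in rule.items())

    # Files with no recorded topic data count as unexplained members.
    return sorted(
        name for name in manual_members
        if not satisfied(file_topics.get(name, {}))
    )
```

 The returned files are the ones worth a human look: either their metadata
 is incomplete, or the rule does not capture why they belong.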

 It's easy to mock the sometimes extraordinary depths of intersection
 categories on Commons; such intersection categories are a pain to determine
 for categorisation, are not a very good fit for retrieval, and do not match
 well how the rest of the world does things, which makes metadata import
 harder and less effective than it should be.

 But there are virtues in the category system too. There is a wealth of
 hard-won information encoded in it. And some categories do match natural
 groupings of images. The hand-curated category sets and hierarchies,
 reflecting context knowledge, will often do better than even the best
 AI-driven suggestions will ever be able to match for search refinement.

 Such an approach as I've suggested above would combine categories and
 topics in an evolutionary rather than revolutionary way.  Categories would
 not all go away -- ever -- but would continue to exist side-by-side with
 topics in a symbiotic way, that IMO would make the transition smoother and
 more likely to engage and involve the existing community, to an end-point
 that it seems to me would have additional strengths over a pure query
 system.

 I'm interested to know what other people think.

   -- James.  (User:Jheald)

 On 19/08/2014 15:27, Lydia Pintscher wrote:

  Hey :)

 On Mon, Aug 18, 2014 at 4:22 PM, James Heald <j.heald(a)ucl.ac.uk> wrote:

  Thanks Lydia!

 Something that occurs to me is that one may well want to include Commons
 categories in such a database, not just files, which presumably might be
 stored on a page like

    Info:Category:Insert random Commons category intersection here

 so that one could then ask whether a file belongs to such a category or
 not,
 and the data would all be in the database.

 So what you want is to be able to make the category one possible
 search criterion when searching for images? We don't need an entity
 type for that I think. We "just" have to build the search interface in
 a way that it can take those into account as well from where they are
 already now.
 Or is that missing something important you had in mind?

  Such categories (or sets) may well not be Wikidata notable, for example:

    Category:Pictures I took on my cellphone one midsummer morning

 so we cannot assume they have Q-numbers.

 My assumption so far was that we can assume every topic we use to tag
 images to be in Wikidata. Are there some examples currently in use on
 Commons that you think would not be covered? Because Wikidata will be
 used to tag much more than just Commons images in the future. So we
 should have a really huge vocabulary.

  But it would be nice if we could describe such properties using the
  existing
 Wikidata syntax, ie via a property Pxyz = "belongs to set", and then an
 item
 number for the set it belonged to.

 What set is this for example? Like "everything taken as part of Wiki
 Loves Monuments 2012"? Or some other kind of set?

  Since the items wouldn't be on Wikidata, it would be useful if they had a
  different namespace,  eg   C nnnnnn

 Imho they should be on Wikidata. I fear if we introduce another layer
 it'll be considerably harder to use and maintain.

 Cheers
 Lydia

 _______________________________________________
 Wikidata-l mailing list
 Wikidata-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l
