[Commons-l] All about Wikimania, future projects, licenses, etc etc

Sun Aug 12 17:56:02 UTC 2007

On 8/12/07, Brianna Laugher <brianna.laugher at gmail.com> wrote:
> Well if we had a better tool it would remove a lot of the problems we
> currently have.

Be careful not to blame the software too much. Yes, we have some
serious software gaps, but none of them would take much time to close
if they were getting active development.  After that, the remaining
gaps are all data quality, process, and manpower related.

We have 1.75 million images, 2 million including deleted images.  All
the software changes you would ask for will take fewer man-hours than a
single pass of data quality improvements over our whole collection.

> Tagging is flawed - some people put 'wiki', some put 'wikis', some put
> 'wikipedia', etc etc. And yet somehow it doesn't seem to matter. this
> is puzzling. I haven't really seen a site do
> intentionally-collaborative tagging, where the users actively try to
> have the same understanding for the same tag. no wonder we have so
> many problems with categories. ;)

What is flawed is the model of "user contributed content" which
lacks strong facilities for collaboration.   We understand
collaboration, and we have strong facilities for it.  This is why "Joe
uses 'dog', John uses 'dogs'" isn't and shouldn't be a hard problem for
us.  With collaboration we can just go fix all of it.
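
To make the "just go fix all of it" point concrete, here is a minimal
sketch of merging tag variants through a shared alias table. The table
contents and the names ALIASES, canonical and merge_tags are made up
for illustration; nothing like this exists in our software today as far
as I know.

```python
from collections import defaultdict

# Hypothetical alias table that editors would maintain together:
# each variant spelling points at the one canonical tag.
ALIASES = {"dogs": "dog", "wikis": "wiki"}


def canonical(tag, aliases=ALIASES):
    """Normalize case and whitespace, then follow the alias table."""
    tag = tag.strip().lower()
    return aliases.get(tag, tag)


def merge_tags(image_tags, aliases=ALIASES):
    """Map each canonical tag to the sorted list of images that carry it.

    image_tags: dict of image name -> iterable of raw tags.
    """
    index = defaultdict(set)
    for image, tags in image_tags.items():
        for tag in tags:
            index[canonical(tag, aliases)].add(image)
    return {tag: sorted(images) for tag, images in index.items()}
```

Once someone adds "dogs" -> "dog" to the table, every image tagged
either way is found under the one tag; the fix is made once, by anyone,
for everyone.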

> There are several advantages we have over all the competitors people
> have mentioned -- Google images, Flickr, Getty images:

Whoa, now. I only mentioned Getty as an example of what we should
aspire to search-wise. In that one regard they simply blow us out of
the water.  Their search produces useful results in ways that we can't
even hope to match today, no matter how good we make the software,
because we simply do not have the required data on each item in our
collection right now.

Of course, in every other regard we already blow them away.  Can I not
point out an area where someone else does clearly better without
getting the Commons Sales Pitch? :)

[snip]
> * attention to detailed annotation - we kill the others when it comes
> to this. especially with nature images. (I just did a search on getty
> images for 'kangaroo'. most of the images look hokey and staged.)

Different audience: they cater to commercial stock photography
(advertisements and such), while our primary customer is an
encyclopedia. But that has nothing to do with annotation.

Their annotation is fantastic.   For example, searching for "kangaroo
fighting" ... gives you only pictures of kangaroos fighting.
Searching for "kangaroo costume" asks you to clarify if you want
"traditional clothing" or "Costume (Dressing Up)".  Depending on your
choice, you either get a painting of people in what appears to be
tribal dress cooking a kangaroo, or you get pictures of people in
cheesy kangaroo costumes.

This blows away anything that we currently offer.  It's fantastically useful.

What stinks, in my view, is we're not that far from being able to have
that kind of search ourselves.  Their keyword data looks a lot like
our categories.  The keywords themselves are classified into groups
(to help people find the right keywords, but not to classify the
images), and there is keyword disambiguation data.
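
Here is a rough sketch of how search over that kind of data could
behave, going only by what their site shows from the outside. The
keyword labels, image names and the search function are all invented
for illustration; this is not Getty's actual data model or code.

```python
# Hypothetical disambiguation data: an ambiguous search term maps to
# the specific keywords it could mean.
SENSES = {
    "costume": ["Costume (Traditional Clothing)", "Costume (Dressing Up)"],
}

# Hypothetical per-image keywords, playing the role of our categories.
IMAGE_KEYWORDS = {
    "painting.jpg": {"kangaroo", "Costume (Traditional Clothing)", "cooking"},
    "mascot.jpg": {"kangaroo", "Costume (Dressing Up)"},
    "boxing.jpg": {"kangaroo", "fighting"},
}


def search(terms, choices=None, senses=SENSES, images=IMAGE_KEYWORDS):
    """Return ("clarify", {term: options}) while any term is ambiguous,
    otherwise ("results", images carrying every resolved keyword)."""
    choices = choices or {}
    resolved, pending = set(), {}
    for term in terms:
        options = senses.get(term)
        if options is None:
            resolved.add(term)              # unambiguous keyword
        elif choices.get(term) in options:
            resolved.add(choices[term])     # user already picked a sense
        else:
            pending[term] = options         # must ask the user
    if pending:
        return ("clarify", pending)
    hits = sorted(name for name, kws in images.items() if resolved <= kws)
    return ("results", hits)
```

With this toy data, search(["kangaroo", "costume"]) comes back asking
which sense of "costume" is meant, and passing
choices={"costume": "Costume (Dressing Up)"} narrows the results to the
one matching image. The search code is trivial; the hard part is having
the keywords on every image.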

The biggest difference, as far as I can tell, is that we're utterly
paranoid about "over categorization". While they apply every keyword
that is appropriate, people on Commons are constantly trying to reduce
images to a few categories, or even one.  It's nuts and it clearly
doesn't work.

A typical image in Getty's web collection will have something between
20 and 40 'keywords' assigned to it. We have an average of 2.9
categories per image (including all the license cats).
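
For anyone who wants to check a figure like that against their own
sample, the calculation is roughly the sketch below. The function name,
the is_license test and the sample data are assumptions made up for
illustration, not the query the 2.9 number came from.

```python
def average_categories(image_categories, is_license=lambda cat: False):
    """Mean number of categories per image, skipping any category for
    which is_license(cat) is true (by default nothing is skipped)."""
    if not image_categories:
        return 0.0
    total = sum(
        sum(1 for cat in cats if not is_license(cat))
        for cats in image_categories.values()
    )
    return total / len(image_categories)


# Made-up sample: image name -> list of categories.
sample = {
    "a.jpg": ["Kangaroos", "GFDL"],
    "b.jpg": ["Dogs", "Animals", "Mammals", "CC-BY-SA-2.5"],
}
license_cats = {"GFDL", "CC-BY-SA-2.5"}

with_licenses = average_categories(sample)
without_licenses = average_categories(sample, lambda cat: cat in license_cats)
```

Leaving out the license categories can only lower the average, so the
number of descriptive categories per image is smaller still.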