On 8/12/07, Brianna Laugher <brianna.laugher@gmail.com> wrote:
> Well if we had a better tool it would remove a lot of the problems we currently have.
Be careful not to blame the software too much. Yes, we have some serious software gaps, but none of them would take much time to close if they were getting active development. After that, the remaining gaps are all data quality, process, and manpower related.
We have 1.75 million images, 2 million including deleted images. All the software changes you would ask for will take fewer man-hours than a single pass of data quality improvements over our whole collection.
> Tagging is flawed - some people put 'wiki', some put 'wikis', some put 'wikipedia', etc. And yet somehow it doesn't seem to matter. This is puzzling. I haven't really seen a site do intentionally-collaborative tagging, where the users actively try to have the same understanding for the same tag. No wonder we have so many problems with categories. ;)
What is flawed is the model of "user contributed content" which lacks strong facilities for collaboration. We understand collaboration, and we have strong facilities for it. This is why "Joe uses 'dog', John uses 'dogs'" isn't and shouldn't be a hard problem for us. With collaboration we can just go fix all of it.
> There are several advantages we have over all the competitors people have mentioned -- Google images, Flickr, Getty images:
Whoa. Now. I only mentioned Getty as an example of what we should aspire to search-wise. In that one regard they simply blow us out of the water. Their search produces useful results in ways that we can't even hope to match today, no matter how good we make the software, because right now we simply do not have the required data on each item in our collection.
Of course, in every other regard we already blow them away. Can I not point out an area where someone else does clearly better without getting the Commons Sales Pitch? :)
[snip]
> - attention to detailed annotation - we kill the others when it comes to this. Especially with nature images. (I just did a search on Getty Images for 'kangaroo'. Most of the images look hokey and staged.)
Different audience: they cater to commercial stock photography (advertisements and such), while our primary customer is an encyclopedia. But that has nothing to do with annotation.
Their annotation is fantastic. For example, searching for "kangaroo fighting" ... gives you only pictures of kangaroos fighting. Searching for "kangaroo costume" asks you to clarify whether you want "traditional clothing" or "Costume (Dressing Up)". Depending on your choice, you either get a painting of people in what appears to be tribal dress cooking a kangaroo, or you get pictures of people in cheesy kangaroo costumes.
This blows away anything that we currently offer. It's fantastically useful.
What stinks, in my view, is we're not that far from being able to have that kind of search ourselves. Their keyword data looks a lot like our categories. The keywords themselves are classified into groups (to help people find the right keywords, but not to classify the images), and there is keyword disambiguation data.
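To make the idea concrete, here is a rough sketch of how keyword search with disambiguation could work on top of category-like data. Everything in it is made up for illustration: the image IDs, the keyword names, the SENSES table, and the function names are assumptions, not Getty's actual data or anything MediaWiki provides today.

```python
from collections import defaultdict

# Hypothetical disambiguation data: an ambiguous search term maps to
# the distinct keyword senses a user may have meant.
SENSES = {
    "costume": ["Traditional clothing", "Costume (dressing up)"],
}

# Hypothetical image records. Each image carries every keyword that applies,
# rather than being reduced to one or two categories.
IMAGES = {
    "img1": {"Kangaroo", "Traditional clothing", "Painting", "Cooking"},
    "img2": {"Kangaroo", "Costume (dressing up)", "Person"},
    "img3": {"Kangaroo", "Fighting"},
}


def build_index(images):
    """Build an inverted index: lowercase keyword -> set of image IDs."""
    index = defaultdict(set)
    for image_id, keywords in images.items():
        for keyword in keywords:
            index[keyword.lower()].add(image_id)
    return index


def search(terms, index, senses, choices=None):
    """Return (results, pending).

    If any term is ambiguous and has no entry in `choices`, results is empty
    and `pending` maps that term to its candidate senses, so the caller can
    ask the user to pick one. Otherwise results is the set of images that
    carry ALL resolved keywords, and pending is empty.
    """
    choices = choices or {}
    pending = {}
    resolved = []
    for term in terms:
        key = term.lower()
        if key in senses:
            if key in choices:
                resolved.append(choices[key].lower())
            else:
                pending[key] = senses[key]
        else:
            resolved.append(key)
    if pending:
        return set(), pending
    matches = [index.get(keyword, set()) for keyword in resolved]
    results = set.intersection(*matches) if matches else set()
    return results, {}
```

With this toy data, searching for "kangaroo" and "fighting" returns only img3, while "kangaroo" and "costume" comes back with a request to choose between the two senses of "costume" before any images are returned. Note that none of this needs clever software; it only works because each image has enough keywords attached.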
The biggest difference, as far as I can tell, is that we're utterly paranoid about "over categorization". While they apply all the keywords that are appropriate, people on Commons are constantly trying to reduce images to a few categories, or even just one. It's nuts and it clearly doesn't work.
A typical image in Getty's web collection will have somewhere between 20 and 40 'keywords' assigned to it. We have an average of 2.9 categories per image (including all the license cats).