http://www.dailyblogtips.com/how-to-find-free-pictures-for-your-blog/
Guess which site with lots of free images is not mentioned...?
We /seriously/ need to do more PR (viral or otherwise) for Commons!
Magnus Manske, 13/10/2011 14:18:
http://www.dailyblogtips.com/how-to-find-free-pictures-for-your-blog/
So basically, it lists three search engines and two obscure websites whose images you have to pay for (one of them with only 400k images)? Perhaps it's just an advertisement for those websites.
Guess which site with lots of free images is not mentioned...?
We /seriously/ need to do more PR (viral or otherwise) for Commons!
Sure. But it would be enough if people used Google with a license filter. Problem is, many seem to prefer all rights reserved images (including this blog post), or they don't care.
Nemo
I doubt that our licenses and Google's license filter are playing together well.
I am regularly pinged by a Google guy who is desperate to get Commons data into Google Image Search in a more systematic way. If any of you think this project is interesting I can totally get you off the ground. It will not be hard.
Commons-l mailing list Commons-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/commons-l
On Thu, Oct 13, 2011 at 3:56 PM, Neil Kandalgaonkar neilk@wikimedia.org wrote:
I am regularly pinged by a Google guy who is desperate to get Commons data into Google Image Search in a more systematic way. If any of you think this project is interesting I can totally get you off the ground. It will not be hard.
(I only hacked a few lines of MediaWiki, so I don't think I am the right person...)
What exactly needs to be done?? Can't Google just parse the licensing section to decide if the image is under CC or not??
If Google can find more of my 400+ images, then they can be used by others more often... that would certainly make me work harder to take more photos & upload more to Commons!
Rayson
================================= Grid Engine / Open Grid Scheduler http://gridscheduler.sourceforge.net
On 10/13/11 1:35 PM, Rayson Ho wrote:
On Thu, Oct 13, 2011 at 3:56 PM, Neil Kandalgaonkar neilk@wikimedia.org wrote:
What exactly needs to be done??
1) Figure out some scheme whereby the actual license is available in the database, not merely expressed in human-readable HTML. This is hard, so it's where I gave up. Timo (Krinkle) and Roan Kattouw were working on this for a bit but they were pulled off to do other things. To do this right we'd create a new namespace like License: and then connect that to some database entity. We could then connect those to existing license templates.
2) Once that's done, from a daily cronjob or something, generate Sitemaps (a summary of all our content) compatible with the Google Image Search extended syntax.
http://www.google.com/support/webmasters/bin/answer.py?answer=178636
A less organized dump from my brain here: http://www.mediawiki.org/wiki/User:NeilK/Sitemaps
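To make step 2 concrete, here is a minimal sketch of what the cron job would emit, using Google's image sitemap extension. The file names, page URLs, and license URL below are invented examples; in reality they would come from the database once step 1 is done.

```python
# Sketch: build one image-sitemap <url> entry carrying a license URL,
# per Google's image sitemap extension. Example URLs are made up.
from xml.sax.saxutils import escape


def sitemap_entry(page_url, image_url, license_url):
    """Build one <url> element with the image extension's license tag."""
    return (
        "  <url>\n"
        f"    <loc>{escape(page_url)}</loc>\n"
        "    <image:image>\n"
        f"      <image:loc>{escape(image_url)}</image:loc>\n"
        f"      <image:license>{escape(license_url)}</image:license>\n"
        "    </image:image>\n"
        "  </url>\n"
    )


def sitemap(entries):
    """Wrap entries in a <urlset> declaring the image extension namespace."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"\n'
        '        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">\n'
        + "".join(entries)
        + "</urlset>\n"
    )


xml = sitemap([sitemap_entry(
    "https://commons.wikimedia.org/wiki/File:Example.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
    "https://creativecommons.org/licenses/by-sa/3.0/",
)])
print(xml)
```

The hard part, again, is not this XML but filling in `license_url` reliably for 11 million files.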
Can't Google just parse the licensing section to decide if the image is under CC or not??
Google *can* do any number of things. However, they will probably not do any custom development work for Commons.
By 2011 standards, Commons is a relatively small image repository. Flickr has billions of images, and it's not even the most popular photo host. Facebook, although inaccessible to Google, adds several billion images to its repository *per week*.
Commons may have some of the "best of the web" images for illustration purposes, so it is a high-value thing for Google to crawl; otherwise they wouldn't have bothered pinging us at all. So yeah, it's enough for them to assign a guy to talk to me every few months or so, but not enough that they will assign developers.
Commons has no real way to communicate licenses to Google. Templates create human-readable HTML, not machine-parseable legal information. If someone edited the CC master template tomorrow to look a bit prettier, anything that was trying to parse licenses from HTML would break.
Google has a standard for us to tell them the license, in the extended Sitemap syntax for images, linked to above. That's what we should do, because it would make that information available to Google, and potentially to any other search engines that can read that standard.
If Google can find more of my 400+ images, then they can be used by others more often... that would certainly make me work harder to take more photos & upload more to Commons!
Hell yes!
Sorry, just want to clarify:
It would be easy to get images into Google Image Search.
It would be hard to get correct licenses into Google Image Search, given the current situation. We'd need to do some serious rethinking on our end.
On 13 October 2011 21:58, Neil Kandalgaonkar neilk@wikimedia.org wrote:
Commons has no real way to communicate licenses to Google. Templates create human-readable HTML, not machine-parseable legal information. If someone edited the CC master template tomorrow to look a bit prettier, anything that was trying to parse licenses from HTML would break.
Do they read microformats/RDF? Adding those to the templates wouldn't be unfeasible.
- d.
I didn't ask, but perhaps I could find out. It's an interesting idea, although to some degree it once again postpones the work of putting licenses in the database, which is necessary IMO. I mean, Commons regards correct licensing as one of its most important activities, and yet licenses aren't a real object in the system. It's very difficult to gather even basic information about how licenses are used on Commons.
Anyway as far as I can tell, microformats are dead. However, HTML5 microdata is on its way.
http://www.w3.org/TR/microdata/
A Google employee wrote that spec, but that's not a guarantee it will actually work with anything, or that Google Image Search has any idea he wrote it. ;)
On 14 October 2011 09:10, Neil Kandalgaonkar neilk@wikimedia.org wrote:
Eurgh. We don't really want to get into the RDFa/microformats/microdata holy wars... :-)
Frankly, given the movement's mission and commitments, using RDFa seems most sensible. Though I agree with Neil that having licences as a first-level object would be nice-to-have, the templates give us the ability to achieve proper machine-readable licence-tagging right now (well, technically they'd apply to the page not the image, but Google could easily code around that, or in a pinch we could have an extension that extracted the RDFa attributes and applied them to the IMG and A elements).
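For concreteness, a license template could emit something along these lines, using the ccREL conventions (this is a hypothetical fragment, not the actual markup of any Commons template):

```html
<!-- Hypothetical RDFa/ccREL output of a CC license template.
     rel="license" marks the link target as the license of the page. -->
<div class="licensetpl" xmlns:cc="http://creativecommons.org/ns#">
  This file is licensed under the
  <a rel="license"
     href="http://creativecommons.org/licenses/by-sa/3.0/">Creative
     Commons Attribution-ShareAlike 3.0</a> license.
  Attribution: <span property="cc:attributionName">Example Author</span>
</div>
```

A template edit that changed the visual presentation would leave the `rel` and `property` attributes intact, which is exactly the robustness Neil is asking for.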
Yours,
On Fri, Oct 14, 2011 at 7:10 PM, Neil Kandalgaonkar neilk@wikimedia.org wrote:
OAI and Dublin Core are not dead.
https://strategy.wikimedia.org/wiki/Proposal:Dublin_Core
;-)
-- John Vandenberg
Ryan Kaldari, 14/10/2011 19:30:
On 10/14/11 1:21 AM, John Vandenberg wrote:
OAI and Dublin Core are not dead.
We actually provide an OAI-PMH feed of new image uploads already,
Anticipating Andrea: where? ;-) Do you mean the password-restricted mysterious special page? :-)
but without having the licensing and attribution information in the database, it is of limited use.
Nemo
On 10/14/11 10:46 AM, Federico Leva (Nemo) wrote:
Anticipating Andrea: where? ;-) Do you mean the password-restricted mysterious special page? :-)
Yep, that's it. It has some fairly expensive DB queries, so access is restricted to major search engine vendors as far as I understand. I'm not really the right person to ask about it though.
Ryan Kaldari
2011/10/14 Ryan Kaldari rkaldari@wikimedia.org
You got me, Nemo. :-)
Ryan, who is the right person to ask? Everyone who is interested in having structured metadata for Commons stumbles upon that mysterious page sooner or later (at least, the few who are interested in OAI-PMH), and I think it would be good to understand it once and for all. At the very least, we could learn whether it is useful or not. As you know perfectly well, the lack of a reliable/proper system for "metadata communication" (call it as you like) is a perennial issue, and I bet many of us would like to see it on the developers' priority list (at least for Commons) :-)
Thanks for the information!
Aubrey
Andrea Zanni, 15/10/2011 20:39:
Ryan, who is the right person to ask? [...]
The surprising answer seems to be on https://meta.wikimedia.org/wiki/Wikimedia_update_feed_service (found by chance on https://meta.wikimedia.org/wiki/Help:Export !).
Nemo
On 10/13/2011 5:02 PM, David Gerard wrote:
On 13 October 2011 21:58, Neil Kandalgaonkar neilk@wikimedia.org wrote:
Commons has no real way to communicate licenses to Google. Templates create human-readable HTML, not machine-parseable legal information. If someone edited the CC master template tomorrow to look a bit prettier, anything that was trying to parse licenses from HTML would break.
I'm going to say that this is B.S. It seems everybody in business thinks it's easy to write GUI applications (where you really spend four months rewriting the requirements again and again and doing testing that never ends) and hard to write screen scrapers (where you sometimes get it to work in four minutes).
I built a rather complicated system that reads the Wiki markup and extracts a whole bunch of metadata. This system was fairly accurate but eventually reached a plateau of what it could do. It had trouble extracting licenses all the time because templates are wrapped up inside of templates which are wrapped up inside of templates and so on.
The old system often had to deal with contradictory data -- for instance, there's a certain guy who uses {self} templates on photos that came from Flickr. Nobody really noticed that there's a problem here because the HTML markup looks superficially O.K. The issue is that HTML output on Commons is tested every day, and the ability to get semantics out of the inner markup doesn't get tested. "Fifth wheel" features (microformats, etc.) are even more likely to break without being noticed since nobody actually uses them...
Later on I developed a much simpler heuristic: extract all hyperlinks from the HTML and filter for links that point to licenses. For instance,
http://commons.wikimedia.org/wiki/File:2011-03-09-fort-du-lomont-10.jpg
has a link to
http://creativecommons.org/licenses/by/3.0/deed.en
This is as easy to read as any kind of structured metadata could ever be. And it's not a "fifth wheel", it's actually visible in the HTML markup, so if it's wrong people will notice.
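Paul's heuristic can be sketched in a few lines: pull every `href` out of the rendered HTML and keep the ones pointing at a known license URL. The prefix list and the sample snippet below are illustrative, not Paul's actual code.

```python
# Sketch of the link-extraction heuristic: collect <a href> targets
# and filter for known license URL prefixes. Prefixes are illustrative.
from html.parser import HTMLParser

LICENSE_PREFIXES = (
    "http://creativecommons.org/licenses/",
    "http://creativecommons.org/publicdomain/",
    "http://www.gnu.org/copyleft/fdl.html",
)


class LicenseLinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(LICENSE_PREFIXES):
            self.licenses.append(href)


def find_licenses(html):
    """Return all license links found in a rendered file-page snippet."""
    parser = LicenseLinkExtractor()
    parser.feed(html)
    return parser.licenses


# Invented excerpt of a file description page:
snippet = ('<p>This file is licensed under the <a '
           'href="http://creativecommons.org/licenses/by/3.0/deed.en">'
           'CC BY 3.0</a> license.</p>')
print(find_licenses(snippet))
# → ['http://creativecommons.org/licenses/by/3.0/deed.en']
```

As Paul notes, the appeal is that these links are user-visible, so breakage gets noticed; the weakness is that contradictory tags (his {self}-on-Flickr example) produce multiple, conflicting matches.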
On 14 October 2011 14:17, Paul Houle paul@ontology2.com wrote:
Later on I developed a much simpler heuristic: extract all hyperlinks from the HTML and filter for links that point to licenses.
You could even do that with the API, e.g.
http://commons.wikimedia.org/w/api.php?action=query&prop=extlinks&ti...
-- [[cs:User:Mormegil | Petr Kadlec]]
On Thu, Oct 13, 2011 at 9:58 PM, Neil Kandalgaonkar neilk@wikimedia.org wrote:
Google has a standard for us to tell them the license, in the extended Sitemap syntax for images, linked to above. That's what we should do, because it would make that information available to Google, and potentially to any other search engines that can read that standard.
I have created a preliminary sitemap file for Commons on the toolserver.
I use categories to find licenses, currently CC-BY-SA, CC-BY, GFDL, and PD. This can assign 9,355,602 of our 11.3M files at least one license. (There might be multiple entries for the same file in there, though.) It's far from complete, but a reasonable start IMHO.
For those with toolserver access, the file is here (300MB gzipped): /mnt/user-store/magnus/commons.sitemap.gz
Generation took 38 minutes. Script (hereby under GFDL) is here: /home/magnus/commons_sitemap/make_sitemap.pl (utilizing /home/magnus/sql_quick )
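The category-based approach boils down to a lookup table from license categories to license URLs; a file gets every license whose category it sits in. A rough sketch (category names and URLs are illustrative; Magnus's actual Perl script may differ):

```python
# Sketch: derive license URLs from a file's categories.
# The category names and URL mapping below are illustrative examples.
CATEGORY_LICENSES = {
    "CC-BY-SA-3.0": "http://creativecommons.org/licenses/by-sa/3.0/",
    "CC-BY-3.0": "http://creativecommons.org/licenses/by/3.0/",
    "GFDL": "http://www.gnu.org/copyleft/fdl.html",
    "PD-self": "http://creativecommons.org/publicdomain/mark/1.0/",
}


def licenses_for(file_categories):
    """Return the license URLs implied by a file's categories, in order,
    deduplicated. Non-license categories are simply ignored."""
    seen = []
    for cat in file_categories:
        url = CATEGORY_LICENSES.get(cat)
        if url and url not in seen:
            seen.append(url)
    return seen


print(licenses_for(["Self-published work", "CC-BY-SA-3.0", "GFDL"]))
# → ['http://creativecommons.org/licenses/by-sa/3.0/', 'http://www.gnu.org/copyleft/fdl.html']
```

This explains both the coverage gap (files whose license categories aren't in the table get nothing) and the duplicate entries Magnus mentions (multi-licensed files match several categories).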
Magnus
Thanks for doing this Magnus. I am super busy next week, going to two conferences, but I've scheduled some time near the end of October to evaluate this & see if I can get it working in the cluster.
I am regularly pinged by a Google guy who is desperate to get Commons data into Google Image Search in a more systematic way. If any of you think this project is interesting I can totally get you off the ground. It will not be hard.
Google guys inquiring about Commons? That’s cool to know.
But I don’t recall reading anywhere on Commons about Google folks pinging the WMF about Commons indexing. Did I miss it, or was it never announced? If the latter, how can volunteer Commons hackers possibly give a hand if they’re not aware of it?
I totally understand the WMF/devs have other stuff to do/priorities. But I think such info should be communicated to the communities. Especially if they can help, as “it will not be hard”. No?
2011/10/13 Jean-Frédéric jeanfrederic.wiki@gmail.com:
Isn't this something a decent Sitemaps XML file on our side can do?
I remember that at some point the CC templates had RDF tags. [0] I can't recall why they were removed. Is this something that Google can parse?
[0] https://creativecommons.org/ns
Is this something that could be solved by implementing the OAI-PMH protocol on Commons?
I don't know how hard it would be, but it seems that exporting metadata (as a License is) is becoming more and more important...
Last time we spoke with a dev about it (Berlin Hackathon), he basically said: "We need to change mediawiki..."
Aubrey
On 10/13/11 1:43 PM, Jean-Frédéric wrote:
Did I miss it, or it was never announced? If the latter, how volunteer Commons hackers can possibly give a hand if they’re not aware of it?
http://lists.wikimedia.org/pipermail/commons-l/2011-March/005909.html
Anyway, this appears to be moot since Magnus just solved much of the problem.