Hello,
I see I've created quite a stir, but so far nothing really useful has popped up. :-(
But I did see this one from Neil:
Yes, modifying the http://stats.grok.se/ systems looks like the way to go.
To me it doesn't really seem to be, since it uses an extremely dumbed-down input which contains only page views and [unreliable] byte counters. Most probably it would require large rewrites, and a magical new data source.
What do people actually want to see from the traffic data? Do they want referrers, anonymized user trails, or what?
Are you old enough to remember stats.wikipedia.org? As far as I remember it originally ran Webalizer, then something else, then nothing. If you check a Webalizer stats page you'll see what's in it. We are using (or were using, until our nice fellow editors broke it) AWStats, which basically provides the same with more caching.
The most used and useful stats are page views (daily and hourly breakdowns are pretty useful too), referrers, visitor domain and provider stats, OS and browser stats, screen resolution stats, bot activity stats, and visitor duration and depth, among probably others.
At a brief glance I could replicate the grok.se stats easily, since they seem to be built from http://dammit.lt/wikistats/, but that data is completely useless for anything beyond page hit counts.
Is there a possibility to write code which processes the raw Squid data? Who do I have to bribe? :-/
Peter Gervai wrote:
Is there a possibility to write code which processes the raw Squid data? Who do I have to bribe? :-/
Yes it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run on our own servers with only anonymised data being passed on to the public.
http://wikitech.wikimedia.org/view/Squid_logging http://wikitech.wikimedia.org/view/Squid_log_format
-- Tim Starling
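For concreteness, a minimal sketch of such a stdin-fed aggregator might look like the following. The field position of the URL and the output format are assumptions made up for the example; the real field layout is documented on the Squid_log_format page above, and any real version would have to run on the cluster as Tim describes, emitting only anonymised aggregates.

#!/usr/bin/env python
# Minimal sketch of an aggregator that reads a Squid-style log stream on
# stdin and emits only anonymised aggregate data (per-URL hit counts).
# URL_FIELD is an assumed position; take the real one from Squid_log_format.
import sys
from collections import defaultdict

URL_FIELD = 8  # assumption: index of the requested URL in the log line

counts = defaultdict(int)
for line in sys.stdin:
    fields = line.split()
    if len(fields) <= URL_FIELD:
        continue  # skip malformed or truncated lines
    url = fields[URL_FIELD].split('?')[0]  # drop query strings; they can carry private data
    counts[url] += 1

# Only the aggregate counts leave the server; IPs and user agents never do.
for url, hits in sorted(counts.items(), key=lambda item: -item[1]):
    print("%d %s" % (hits, url))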
On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarling@wikimedia.org> wrote:
Peter Gervai wrote:
Is there a possibility to write code which processes the raw Squid data? Who do I have to bribe? :-/
Yes it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run on our own servers with only anonymised data being passed on to the public.
http://wikitech.wikimedia.org/view/Squid_logging http://wikitech.wikimedia.org/view/Squid_log_format
How much of that is really considered private? IP addresses obviously, anything else?
I'm wondering if a cheap and dirty solution (at least for the low traffic wikis) might be to write a script that simply scrubs the private information and makes the rest available for whatever applications people might want.
-Robert Rohde
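As a rough illustration of the "cheap and dirty" filter being suggested here, something like the following could scrub the obviously private fields before republication. The field indexes are invented for the example (the real layout is on wikitech), and, as the replies below point out, dropping fields this way is probably not enough on its own.

#!/usr/bin/env python
# Rough sketch of a scrubbing pass over a Squid-style log line.
# The field positions are assumptions for illustration only.
import sys

IP_FIELD, URL_FIELD, REFERRER_FIELD, UA_FIELD = 4, 8, 11, 13  # assumed layout

for line in sys.stdin:
    fields = line.rstrip("\n").split(" ")
    if len(fields) <= UA_FIELD:
        continue
    fields[IP_FIELD] = "-"                               # never republish client IPs
    fields[URL_FIELD] = fields[URL_FIELD].split("?")[0]  # query strings can contain search terms
    fields[REFERRER_FIELD] = "-"                         # referrers can identify a single visitor
    fields[UA_FIELD] = "-"                               # so can unusual user-agent strings
    print(" ".join(fields))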
On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <rarohde@gmail.com> wrote:
On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarling@wikimedia.org> wrote:
Peter Gervai wrote:
Is there a possibility to write code which processes the raw Squid data? Who do I have to bribe? :-/
Yes it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run on our own servers with only anonymised data being passed on to the public.
http://wikitech.wikimedia.org/view/Squid_logging http://wikitech.wikimedia.org/view/Squid_log_format
How much of that is really considered private? IP addresses obviously, anything else?
I'm wondering if a cheap and dirty solution (at least for the low traffic wikis) might be to write a script that simply scrubs the private information and makes the rest available for whatever applications people might want.
There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0; bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be uniquely identifying). There is even private data in titles if you don't sanitize carefully (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box). There is private data in referrers (http://rarohde.com/url_that_only_rarohde_would_have_comefrom).
Things which individually do not appear to disclose anything private can disclose private things (look at the people uniquely identified by AOL's 'anonymized' search data).
On the flip side, aggregation can take private things (e.g. user agents, IP info, referrers) and convert them into non-private data: top user agents, top referrers, highest-traffic ASNs... but it becomes potentially revealing if not done carefully: the 'top' network and user agent info for a single obscure article in a short time window may be information from only one or two users, not really an aggregation.
Things like common paths through the site should be safe so long as they are not provided with too much temporal resolution, are limited to existing articles, and are limited either to really common paths or to paths broken into two- or three-node chains with the least common of those withheld.
Generally when dealing with private data you must approach it with the same attitude that a C coder must take to avoid buffer overflows. Treat all data as hostile, assume all actions are potentially dangerous. Try to figure out how to break it, and think deviously.
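To make the aggregation caveat concrete, here is a small illustrative sketch (my own construction, not an existing tool) that only publishes a bucket once it covers a minimum number of requests and folds everything rarer into an "other" bucket; the threshold value is an arbitrary assumption.

# Illustrative threshold-based aggregation: a (page, user-agent family) bucket
# is published only if it covers enough requests; rarer combinations are folded
# together so that one or two readers of an obscure article are not exposed.
from collections import defaultdict

MIN_COUNT = 100  # arbitrary assumption for the example

def aggregate(records):
    """records: iterable of (page_title, ua_family) tuples from a time window."""
    buckets = defaultdict(int)
    for page, ua_family in records:
        buckets[(page, ua_family)] += 1

    published = {}
    folded = 0
    for key, count in buckets.items():
        if count >= MIN_COUNT:
            published[key] = count
        else:
            folded += count
    published[("(other)", "(other)")] = folded
    return published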
On Fri, Jun 5, 2009 at 9:20 PM, Gregory Maxwell <gmaxwell@gmail.com> wrote:
On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <rarohde@gmail.com> wrote: There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0; bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be uniquely identifying). There is even private data in titles if you don't sanitize carefully (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box). There is private data in referrers (http://rarohde.com/url_that_only_rarohde_would_have_comefrom).
Things which individually do not appear to disclose anything private can disclose private things (look at the people uniquely identified by AOL's 'anonymized' search data).
On the flip side, aggregation can take private things (e.g. user agents, IP info, referrers) and convert them into non-private data: top user agents, top referrers, highest-traffic ASNs... but it becomes potentially revealing if not done carefully: the 'top' network and user agent info for a single obscure article in a short time window may be information from only one or two users, not really an aggregation.
Things like common paths through the site should be safe so long as they are not provided with too much temporal resolution, are limited to existing articles, and are limited either to really common paths or to paths broken into two- or three-node chains with the least common of those withheld.
Generally when dealing with private data you must approach it with the same attitude that a C coder must take to avoid buffer overflows. Treat all data as hostile, assume all actions are potentially dangerous. Try to figure out how to break it, and think deviously.
On reflection I agree with you, though I think the biggest problem would actually be a case you didn't mention. If one provided timing and page view information, then one can almost certainly single out individual users by correlating the view timing with edit histories.
Okay, so no stripped logs. The next question becomes what is the right way to aggregate. We can A) reinvent the wheel, or B) adapt a pre-existing log analyzer in a mode to produce clean aggregate data. While I respect the work of Zachte and others, this might be a case where B is a better near-term solution.
Looking at http://stats.wikipedia.hu/cgi-bin/awstats.pl (the page that started this mess), his AWStats config already suppresses IP info and aggregates everything into groups from which it is very hard to identify anything personal. (There is still a small risk in allowing users to drill down to pages / requests that are almost never made, but perhaps that could be turned off.) AWStats has native support for Squid logs and is open source.
This is not necessarily the only option, but I suspect that if we gave it some thought it would be possible to find an off-the-shelf tool that would be good enough to support many wikis and configurable enough to satisfy even the GMaxwells of the world ;-). huwiki is actually the 20th largest wiki (by number of edits), so if it worked for them, then a tool like AWStats could probably work for most of the projects (which are not EN).
-Robert Rohde
Some articles are always very seldom referred to, and those can be used to uniquely identify a machine. Then there are all those who do something that goes into public logs. The latter are very difficult to obfuscate, but the former can be solved by setting a time frame long enough that sufficient other traffic falls within the same window. Unfortunately this time frame is pretty long for some articles; from some tests it seems to be weeks on Norsk (bokmål) Wikipedia. John
Robert Rohde wrote:
On Fri, Jun 5, 2009 at 9:20 PM, Gregory Maxwell <gmaxwell@gmail.com> wrote:
On Fri, Jun 5, 2009 at 10:13 PM, Robert Rohde <rarohde@gmail.com> wrote: There is a lot of private data in user agents ("MSIE 4.123; WINNT 4.0; bouncing_ferret_toolbar_1.23 drunken_monkey_downloader_2.34" may be uniquely identifying). There is even private data in titles if you don't sanitize carefully (/wiki/search?lookup=From%20rarohde%20To%20Gmaxwell%20OMG%20secret%20stuff%20lemme%20accidently%20paste%20it%20into%20the%20search%20box). There is private data in referrers (http://rarohde.com/url_that_only_rarohde_would_have_comefrom).
Things which individually do not appear to disclose anything private can disclose private things (look at the people uniquely identified by AOL's 'anonymized' search data).
On the flip side, aggregation can take private things (e.g. user agents, IP info, referrers) and convert them into non-private data: top user agents, top referrers, highest-traffic ASNs... but it becomes potentially revealing if not done carefully: the 'top' network and user agent info for a single obscure article in a short time window may be information from only one or two users, not really an aggregation.
Things like common paths through the site should be safe so long as they are not provided with too much temporal resolution, are limited to existing articles, and are limited either to really common paths or to paths broken into two- or three-node chains with the least common of those withheld.
Generally when dealing with private data you must approach it with the same attitude that a C coder must take to avoid buffer overflows. Treat all data as hostile, assume all actions are potentially dangerous. Try to figure out how to break it, and think deviously.
On reflection I agree with you, though I think the biggest problem would actually be a case you didn't mention. If one provided timing and page view information, then one can almost certainly single out individual users by correlating the view timing with edit histories.
Okay, so no stripped logs. The next question becomes what is the right way to aggregate. We can A) reinvent the wheel, or B) adapt a pre-existing log analyzer in a mode to produce clean aggregate data. While I respect the work of Zachte and others, this might be a case where B is a better near-term solution.
Looking at http://stats.wikipedia.hu/cgi-bin/awstats.pl (the page that started this mess), his AWStats config already suppresses IP info and aggregates everything into groups from which it is very hard to identify anything personal. (There is still a small risk in allowing users to drill down to pages / requests that are almost never made, but perhaps that could be turned off.) AWStats has native support for Squid logs and is open source.
This is not necessarily the only option, but I suspect that if we gave it some thought it would be possible to find an off-the-shelf tool that would be good enough to support many wikis and configurable enough to satisfy even the GMaxwells of the world ;-). huwiki is actually the 20th largest wiki (by number of edits), so if it worked for them, then a tool like AWStats could probably work for most of the projects (which are not EN).
-Robert Rohde
Scrubbing log files to make the data private is hard work. You'd be impressed by what researchers have been able to do: taking purportedly anonymous data and using it to identify users en masse by correlating it with publicly available data from other sites such as Amazon, Facebook and Netflix. Make no mistake: if you don't do it carefully you will become the target of, in the best of cases, an academic researcher who wants to prove that you don't understand statistics.
On Fri, Jun 5, 2009 at 8:13 PM, Robert Rohde rarohde@gmail.com wrote:
On Fri, Jun 5, 2009 at 6:38 PM, Tim Starling <tstarling@wikimedia.org> wrote:
Peter Gervai wrote:
Is there a possibility to write code which processes the raw Squid data? Who do I have to bribe? :-/
Yes it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run on our own servers with only anonymised data being passed on to the public.
http://wikitech.wikimedia.org/view/Squid_logging http://wikitech.wikimedia.org/view/Squid_log_format
How much of that is really considered private? IP addresses obviously, anything else?
I'm wondering if a cheap and dirty solution (at least for the low traffic wikis) might be to write a script that simply scrubs the private information and makes the rest available for whatever applications people might want.
-Robert Rohde
If someone wants to work on this I have some ideas to make something useful out of this log, but I'm a bit short on time. Basically it's two ideas that are really useful; one is to figure out which articles are most interesting to show in a portal, and the other is how to detect articles with missing links between them. John
Tim Starling wrote:
Peter Gervai wrote:
Is there a possibility to write code which processes the raw Squid data? Who do I have to bribe? :-/
Yes it's possible. You just need to write a script that accepts a log stream on stdin and builds the aggregate data from it. If you want access to IP addresses, it needs to run on our own servers with only anonymised data being passed on to the public.
http://wikitech.wikimedia.org/view/Squid_logging http://wikitech.wikimedia.org/view/Squid_log_format
-- Tim Starling
John at Darkstar wrote:
If someone wants to work on this I have some ideas to make something useful out of this log, but I'm a bit short on time. Basically it's two ideas that are really useful; one is to figure out which articles are most interesting to show in a portal, and the other is how to detect articles with missing links between them. John
How are you planning to detect articles which 'should have links' between them?
I tried to convince myself to stay out of this thread, but this was somewhat interesting. ;)
I'm not quite sure this will work out for every case, but my rough idea is like this:
Imagine a user trying to get an answer to some kind of problem. He searches with Google and lands on the most obvious article about (for example) breweries, even though he really wants to know something about a specific beer (Groelch or Mac or whatever). He can't find anything about it, so he makes an additional search (on the page, hopefully), gets a result list, reads through a lot of articles and then finally finds what he is searching for. Then he leaves.
Now, imagine a function that pushes newly visited pages onto a small page list, and a function popping that list each time the search result page is visited. The page list is stored in a cookie. This small page list is then reported to a special logging server by an AJAX request. It can't just piggyback, as the final page usually will not lead to a new request: the user simply leaves.
Later, a lot of such page lists can be analyzed and compared to the known link structure. If a pair of pages consistently emerges in the log without having a parent-child relation, then you know a link is missing.
Some guesstimates say that you need more than 100 page views before something like this can detect obvious missing links. For Norwegian (bokmål) Wikipedia that is about 2-3 months of statistics for half the article base, but note that the accumulated stats would be rectified by the page redirect information from the database, as a link is very seldom dropped; it is usually added.
Well, something like that. I was wondering about running a test case, but given some previous discussion I concluded that I wouldn't get a go-ahead on this.
It is also possible to analyze the article relations where the user goes back to Google's result list, but that is somewhat more involved.
John
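A rough sketch of the offline analysis step described above: count page pairs that co-occur in the reported trails and flag the frequent pairs that have no link in either direction. The input formats and the minimum-support threshold are assumptions made for the example, not an existing data source.

# Rough sketch: flag page pairs that often co-occur in user trails but are not
# linked in either direction. Trail and link formats are assumed for the example.
from collections import defaultdict
from itertools import combinations

MIN_SUPPORT = 100  # matches the guesstimate that well over 100 observations are needed

def missing_link_candidates(trails, links):
    """trails: iterable of lists of page titles (one list per reported visit);
    links: set of (from_title, to_title) pairs, e.g. from the pagelinks table."""
    pair_counts = defaultdict(int)
    for trail in trails:
        for a, b in combinations(sorted(set(trail)), 2):
            pair_counts[(a, b)] += 1

    for (a, b), count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
        if count < MIN_SUPPORT:
            break  # everything rarer is too noisy (and too identifying) to report
        if (a, b) not in links and (b, a) not in links:
            yield a, b, count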
Platonides wrote:
John at Darkstar wrote:
If someone wants to work on this I have some ideas to make something useful out of this log, but I'm a bit short on time. Basically it's two ideas that are really useful; one is to figure out which articles are most interesting to show in a portal, and the other is how to detect articles with missing links between them. John
How are you planning to detect articles which 'should have links' between them?
I had to run and missed a couple of important items. One is that you can calculate the likelihood that a link is missing. (It's similar to Google's PageRank.) If the likelihood turns out to be too small you simply don't report anything. You can also skip reporting if you don't have any intervening search or search result. You can also analyze the link structure at the log server and skip logging of uninteresting items, but even more interesting, this can be done client side if sufficient information is embedded in the pages. For this, note that a high likelihood of a missing link implies few existing inbound links, and then they can simply be embedded on the page itself.
Analyzing a dumb log would be very costly indeed.
John
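Purely as a guess at what the likelihood and cut-off mentioned above could look like (not the actual formula), a score could weight co-occurrence by how poorly linked the target page already is, and stay silent below a threshold:

# Guessed-at scoring for a candidate missing link: frequent co-occurrence in
# trails counts for more when the target page has few existing inbound links.
# The formula and the threshold are assumptions, not the author's actual method.
def missing_link_score(cooccurrence_count, inbound_link_count):
    return cooccurrence_count / float(1 + inbound_link_count)

def should_report(cooccurrence_count, inbound_link_count, threshold=5.0):
    # If the likelihood turns out to be too small, simply don't report anything.
    return missing_link_score(cooccurrence_count, inbound_link_count) >= threshold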
John at Darkstar wrote:
I tried to convince myself to stay out of this thread, but this was somewhat interesting. ;)
I'm not quite sure this will work out for every case, but my rough idea is like this:
Imagine a user trying to get an answer to some kind of problem. He searches with Google and lands on the most obvious article about (for example) breweries, even though he really wants to know something about a specific beer (Groelch or Mac or whatever). He can't find anything about it, so he makes an additional search (on the page, hopefully), gets a result list, reads through a lot of articles and then finally finds what he is searching for. Then he leaves.
Now, imagine a function that pushes newly visited pages onto a small page list, and a function popping that list each time the search result page is visited. The page list is stored in a cookie. This small page list is then reported to a special logging server by an AJAX request. It can't just piggyback, as the final page usually will not lead to a new request: the user simply leaves.
Later, a lot of such page lists can be analyzed and compared to the known link structure. If a pair of pages consistently emerges in the log without having a parent-child relation, then you know a link is missing.
Some guesstimates say that you need more than 100 page views before something like this can detect obvious missing links. For Norwegian (bokmål) Wikipedia that is about 2-3 months of statistics for half the article base, but note that the accumulated stats would be rectified by the page redirect information from the database, as a link is very seldom dropped; it is usually added.
Well, something like that. I was wondering about running a test case, but given some previous discussion I concluded that I wouldn't get a go-ahead on this.
It is also possible to analyze the article relations where the user goes back to Google's result list, but that is somewhat more involved.
John
Platonides wrote:
John at Darkstar wrote:
If someone wants to work on this I have some ideas to make something useful out of this log, but I'm a bit short on time. Basically it's two ideas that are really useful; one is to figure out which articles are most interesting to show in a portal, and the other is how to detect articles with missing links between them. John
How are you planning to detect articles which 'should have links' between them?
Peter Gervai wrote:
Hello,
I see I've created quite a stir, but so far nothing really useful has popped up. :-(
But I did see this one from Neil:
Yes, modifying the http://stats.grok.se/ systems looks like the way to go.
To me it doesn't really seem to be, since it uses an extremely dumbed-down input which contains only page views and [unreliable] byte counters. Most probably it would require large rewrites, and a magical new data source.
What do people actually want to see from the traffic data? Do they want referrers, anonymized user trails, or what?
Are you old enough to remember stats.wikipedia.org? As far as I remember it originally ran Webalizer, then something else, then nothing. If you check a Webalizer stats page you'll see what's in it. We are using (or were using, until our nice fellow editors broke it) AWStats, which basically provides the same with more caching.
The most used and useful stats are page views (daily and hourly breakdowns are pretty useful too), referrers, visitor domain and provider stats, OS and browser stats, screen resolution stats, bot activity stats, and visitor duration and depth, among probably others.
At a brief glance I could replicate the grok.se stats easily, since they seem to be built from http://dammit.lt/wikistats/, but that data is completely useless for anything beyond page hit counts.
Is there a possibility to write code which processes the raw Squid data? Who do I have to bribe? :-/
We do have http://stats.wikimedia.org/ which includes things like http://stats.wikimedia.org/EN/VisitorsSampledLogOrigins.htm
Alex wrote:
We do have http://stats.wikimedia.org/ which includes things like http://stats.wikimedia.org/EN/VisitorsSampledLogOrigins.htm
I see the site www.musicistheheartofoursoul.com pretty high on that list. Looking at the page, they include many images from Wikimedia servers, hotlinking them without a link to the image page.
Moreover, they aren't even free images but Fair Use ones uploaded on enwiki.
Shouldn't we politely ask them to make a local copy?
Two things:
1. There isn't a good way to get an image dump
2. Allowing hotlinking seems to fit nicely within the WMF mission.
On Sat, Jun 6, 2009 at 6:24 PM, Platonides Platonides@gmail.com wrote:
Alex wrote:
We do have http://stats.wikimedia.org/ which includes things like http://stats.wikimedia.org/EN/VisitorsSampledLogOrigins.htm
I see the site www.musicistheheartofoursoul.com pretty high on that list. Looking at the page, they include many images from Wikimedia servers, hotlinking them without a link to the image page.
Moreover, they aren't even free images but Fair Use ones uploaded on enwiki.
Shouldn't we politely ask them to make a local copy?
2009/6/7 Brian Brian.Mingus@colorado.edu:
Two things
- There isn't a good way to get an image dump
- Allowing hotlinking seems to fit nicely within the WMF mission.
Hotlinking isn't generally allowed, but using Commons as a remote repository on your own MediaWiki is.
- d.
What do you mean it's not allowed? It works. There is only one way to disallow it!
On Sun, Jun 7, 2009 at 1:11 AM, David Gerard dgerard@gmail.com wrote:
2009/6/7 Brian Brian.Mingus@colorado.edu:
Two things
- There isn't a good way to get an image dump
- Allowing hotlinking seems to fit nicely within the WMF mission.
Hotlinking isn't generally allowed, but using Commons as a remote repository on your own MediaWiki is.
- d.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Sun, Jun 7, 2009 at 3:11 AM, David Gerard <dgerard@gmail.com> wrote:
Hotlinking isn't generally allowed
Who says? Mirroring all pages isn't allowed, but I've never heard that we have a problem with hotlinking images, as long as the bandwidth use is reasonable.
On Sun, Jun 7, 2009 at 6:00 AM, John at Darkstar <vacuum@jeb.no> wrote:
The person hotlinking can have a reasonable fair use claim, but the site serving a fair use image for someone else would most likely be in serious trouble defending its position.
It's not technically possible for us to stop people from hotlinking images without possible adverse side effects on our own users. What specific anti-hotlinking technique do you propose? Just talking to specific sites that rate high enough on the charts?
If you have some reasoning that this is not the case it would be interesting, as the fair use images from English Wikipedia can be moved to Commons if this is correct.
That doesn't make sense. Fair use images are prohibited on Commons because they're not in line with Commons' mission. What relevance does that have to hotlinking?
Hotlinking fair use images is something that should not be possible. John
Brian wrote:
Two things
There isn't a good way to get an image dump
Allowing hotlinking seems to fit nicely within the WMF mission.
On Sat, Jun 6, 2009 at 6:24 PM, Platonides Platonides@gmail.com wrote:
Alex wrote:
We do have http://stats.wikimedia.org/ which includes things like http://stats.wikimedia.org/EN/VisitorsSampledLogOrigins.htm
I see the site www.musicistheheartofoursoul.com pretty high on that list. Looking at the page, they include many images from Wikimedia servers, hotlinking them without a link to the image page.
Moreover, they aren't even free images but Fair Use ones uploaded on enwiki.
Shouldn't we politely ask them to make a local copy?
What on earth are you talking about?
On Sun, Jun 7, 2009 at 1:12 AM, John at Darkstar vacuum@jeb.no wrote:
Hotlinking fair use images is something that should not be possible. John
Brian wrote:
Two things
There isn't a good way to get an image dump
Allowing hotlinking seems to fit nicely within the WMF mission.
On Sat, Jun 6, 2009 at 6:24 PM, Platonides Platonides@gmail.com wrote:
Alex wrote:
We do have http://stats.wikimedia.org/ which includes things like http://stats.wikimedia.org/EN/VisitorsSampledLogOrigins.htm
I see the site www.musicistheheartofoursoul.com pretty high on that list. Looking at the page, they include many images from Wikimedia servers, hotlinking them without a link to the image page.
Moreover, they aren't even free images but Fair Use ones uploaded on enwiki.
Shouldn't we politely ask them to make a local copy?
Regarding Platonides' comment on fair use images: you say in point 2 that hotlinking is something that fits nicely within the WMF mission; I say it should not be possible to hotlink fair use images.
How would you argue that serving _fair_use_images_ for someone else is within the WMF mission? How would you argue that such use of the images does not violate the copyright owners' rights to the images?
John
Brian wrote:
What on earth are you talking about?
On Sun, Jun 7, 2009 at 1:12 AM, John at Darkstar vacuum@jeb.no wrote:
Hotlinking fair use images is something that should not be possible. John
Brian wrote:
Two things
There isn't a good way to get an image dump
Allowing hotlinking seems to fit nicely within the WMF mission.
On Sat, Jun 6, 2009 at 6:24 PM, Platonides Platonides@gmail.com wrote:
Alex wrote:
We do have http://stats.wikimedia.org/ which includes things like http://stats.wikimedia.org/EN/VisitorsSampledLogOrigins.htm
I see the site www.musicistheheartofoursoul.com pretty high on that list. Looking at the page, they include many images from Wikimedia servers, hotlinking them without a link to the image page.
Moreover, they aren't even free images but Fair Use ones uploaded on enwiki.
Shouldn't we politely ask them to make a local copy?
On Sun, Jun 7, 2009 at 1:37 AM, John at Darkstar <vacuum@jeb.no> wrote:
Regarding Platonides' comment on fair use images: you say in point 2 that hotlinking is something that fits nicely within the WMF mission; I say it should not be possible to hotlink fair use images.
How would you argue that serving _fair_use_images_ for someone else is within the WMF mission? How would you argue that such use of the images does not violate the copyright owners' rights to the images?
At the risk of stating the obvious, the person hotlinking the image could also have an entirely reasonable fair use claim.
-Robert Rohde
The person hotlinking can have a reasonable fair use claim, but the site serving a fair use image for someone else would most likely be in serious trouble defending its position. If you have some reasoning that this is not the case it would be interesting, as the fair use images from English Wikipedia can be moved to Commons if this is correct.
Robert Rohde wrote:
On Sun, Jun 7, 2009 at 1:37 AM, John at Darkstar <vacuum@jeb.no> wrote:
Regarding Platonides' comment on fair use images: you say in point 2 that hotlinking is something that fits nicely within the WMF mission; I say it should not be possible to hotlink fair use images.
How would you argue that serving _fair_use_images_ for someone else is within the WMF mission? How would you argue that such use of the images does not violate the copyright owners' rights to the images?
At the risk of stating the obvious, the person hotlinking the image could also have an entirely reasonable fair use claim.
-Robert Rohde
Hoi, Is this discussion about policy relevant to this mailing list? Thanks, GerardM
2009/6/7 John at Darkstar vacuum@jeb.no
The person hotlinking can have a reasonable fair use claim, but the site serving a fair use image for someone else would most likely be in serious trouble defending its position. If you have some reasoning that this is not the case it would be interesting, as the fair use images from English Wikipedia can be moved to Commons if this is correct.
Robert Rohde wrote:
On Sun, Jun 7, 2009 at 1:37 AM, John at Darkstar <vacuum@jeb.no> wrote:
Regarding Platonides' comment on fair use images: you say in point 2 that hotlinking is something that fits nicely within the WMF mission; I say it should not be possible to hotlink fair use images.
How would you argue that serving _fair_use_images_ for someone else is within the WMF mission? How would you argue that such use of the images does not violate the copyright owners' rights to the images?
At the risk of stating the obvious, the person hotlinking the image could also have an entirely reasonable fair use claim.
-Robert Rohde
2009/6/7 Gerard Meijssen gerard.meijssen@gmail.com:
Is this discussion about policy relevant to this mailing list?
Somewhat:
If we officially don't like hotlinking, is it reasonable to disable hotlinking from Wikimedia sites? If so, can it be done without breaking remote file repo use of Commons?
- d.
I propose that we include a link back to the image description page in the EXIF data so that the images are automatically CC-BY-SA attributed when hotlinked.
This whole thread is silly to me: someone is worried about their statistics getting messed up and has thrown our mission out the window.
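If anyone wanted to prototype that, a sketch along these lines could stamp the description-page URL into the standard EXIF Copyright tag at scaling time. Shelling out to exiftool is just one possible mechanism, and the URL scheme, the wording and the hook point are assumptions for illustration.

# Sketch: write an attribution notice pointing at the image description page
# into the EXIF Copyright tag, here by shelling out to exiftool. The wording,
# the URL scheme and where this gets hooked in are assumptions.
import subprocess

def stamp_attribution(path, filename):
    description_url = "http://commons.wikimedia.org/wiki/File:" + filename
    notice = "CC-BY-SA; authors and license at " + description_url
    subprocess.check_call([
        "exiftool",
        "-overwrite_original",   # don't leave *_original backup copies behind
        "-Copyright=" + notice,  # standard EXIF Copyright tag
        path,
    ])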
On Sun, Jun 7, 2009 at 9:32 AM, David Gerard dgerard@gmail.com wrote:
2009/6/7 Gerard Meijssen gerard.meijssen@gmail.com:
Is this discussion about policy relevant to this mailing list?
Somewhat:
If we officially don't like hotlinking, is it reasonable to disable hotlinking from Wikimedia sites? If so, can it be done without breaking remote file repo use of Commons?
- d.
A mission which specifically does not include:
* Policing all use of our content
On Sun, Jun 7, 2009 at 9:42 AM, Brian Brian.Mingus@colorado.edu wrote:
I propose that we include a link back to the image description page in the EXIF data so that the images are automatically CC-BY-SA attributed when hotlinked.
This whole thread is silly to me: someone is worried about their statistics getting messed up and has thrown our mission out the window.
On Sun, Jun 7, 2009 at 9:32 AM, David Gerard dgerard@gmail.com wrote:
2009/6/7 Gerard Meijssen gerard.meijssen@gmail.com:
Is this discussion about policy relevant to this mailing list?
Somewhat:
If we officially don't like hotlinking, is it reasonable to disable hotlinking from Wikimedia sites? If so, can it be done without breaking remote file repo use of Commons?
- d.