On Wed, Jul 10, 2013 at 9:54 PM, Noneof MicrosoftsBusiness phonenumberofthebeast@hotmail.com wrote:

We've been working on tracking down the top 25 articles for each week, but as you can see at http://en.wikipedia.org/wiki/Wikipedia:5000, it requires determining which rankings are due to actual human views and which are due to bots, and recently the bots have been having a field day. I've been asked by the creator of the list to ask you for help and/or advice on how to use analytics to separate human from non-human views. Please let me know if there's anything that can be done. Thanks
I think at this point that would either require a change to the format of the Domas (anonymized) stats, or an NDA and maybe some other approvals. (Or Kraken! But rumor has it that's not yet ready for the general public.)
-Jeremy
We have some bot information from wikistats here: http://stats.wikimedia.org/#bots. I don't think it's particularly actionable for what you are doing, but it might be interesting directionally.
-Toby
Hmm, this one really has me stumped: http://stats.grok.se/en/latest90/Yahoo! That is not a wikibump, but some sort of structural thing. The only thing I can think of is that some sort of popular band, manga character, or porn queen in China has been named Yahoo!

Jane
Or someone (e.g. Yahoo!) has linked it from some prominent webpage (but only in English? other languages seem not affected), or some stockholder (e.g. of Yahoo!) is running simple "crawlers" to skew the pageview stats and make them appear flat, so that nobody can make stock value forecasts using them.
Nemo
Hi,
Yup, I also think this is a weird anomaly… the causes can be very strange though, as we found out back in 2010-11 when the views to the "initial" page were suddenly very skewed: http://infodisiac.com/blog/2010/11/page-views-anomaly-in-october-resolved/#c... Domas was able to find the cause by sampling some of the requests and noticing that all of them had the same referrer. It turned out to be an online ads page with an error in its HTML that tried to load the page as a background image.
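For that kind of investigation, the diagnostic amounts to tallying referrers over a sample of requests for the affected page; one dominant referrer points at the culprit. A minimal sketch, assuming hypothetical (url_path, referer) records rather than the actual log format:

from collections import Counter

# Tally referrers over a sample of requests for one page; a single
# dominant referrer suggests a hotlink or broken embed.
# The (url_path, referer) record format is an assumption.
def top_referrers(sampled_requests, page="/wiki/Yahoo!", n=5):
    referers = Counter(ref for url, ref in sampled_requests if url == page)
    return referers.most_common(n)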
It's hard to analyze / clean this stuff while maintaining privacy. I remember a survey sent out to several (linked) open data researchers a while ago asking how the Wikimedia Foundation could provide better stats. My reply was something along these lines: provide more stats with every line in the hourly pageview stats:
- how many different IP addresses cause the accesses (better: how many accesses per IP address (avg + stddev))
- how many different referrers cause the accesses (better: how many accesses per referrer (avg + stddev))
- how many accesses come from Wikimedia IPs (Toolserver, some bots)
- IP address count for the top 5 (or 10) originating countries (get some geolocation in)
I think the first three aren't really computationally expensive, but they would really improve our ability to clean the view stats. The fourth has always been on my wishlist, but would require some more work for reverse IP->geolocation lookup.
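To make the idea concrete, here is a rough sketch of how those extra per-article fields could be computed, assuming hypothetical raw request records of the form (article, ip, referer); the real logs are of course not public and use a different schema:

from collections import Counter, defaultdict
from statistics import mean, pstdev

# Hypothetical aggregation over one hour of raw request records.
# Input format (article, ip, referer) and output field names are
# assumptions, not the actual Wikimedia log schema.
def aggregate(requests):
    ips_per_article = defaultdict(Counter)   # article -> Counter of client IPs
    refs_per_article = defaultdict(Counter)  # article -> Counter of referrers
    for article, ip, referer in requests:
        ips_per_article[article][ip] += 1
        refs_per_article[article][referer] += 1

    for article, ip_counts in ips_per_article.items():
        per_ip = list(ip_counts.values())
        yield {
            "article": article,
            "views": sum(per_ip),
            "unique_ips": len(ip_counts),
            "views_per_ip_avg": mean(per_ip),
            "views_per_ip_stddev": pstdev(per_ip),
            "unique_referrers": len(refs_per_article[article]),
        }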
Cheers, Jörn
You're right, it would be extremely helpful to know "how many different IP addresses" cause the accesses. The last three items, though definitely desirable, are less important. For reporting you could just filter by some ratio of unique IPs vs. page views (i.e. only include an article in your Top 25 report when at least half of its page views are from unique IPs).
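If per-article aggregates like the ones Jörn sketched were ever published, the filter would be a few lines along these lines (the 0.5 threshold is just the example ratio above, not an established rule):

# Hypothetical filter over aggregates like the ones sketched earlier:
# keep an article in the Top 25 only if at least half of its views
# come from distinct IP addresses.
def top_25(article_stats, min_unique_ratio=0.5):
    kept = [a for a in article_stats
            if a["views"] > 0 and a["unique_ips"] / a["views"] >= min_unique_ratio]
    kept.sort(key=lambda a: a["views"], reverse=True)
    return kept[:25]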
Is there a way to do this? Forgive me, I'm not exactly computer-illiterate, but this under-the-hood stuff is not something I'm familiar with.
Unfortunately, no. See Jörn's mail, where he says he has requested this information along with page views but hasn't received it yet (probably because of Wikipedia's privacy policy). If "Domas was able to find the cause by sampling some of the requests", that pretty much means Domas couldn't get the info any other way, and if Domas can't, then I don't think anyone else can either. You can always try mailing Erik Zachte (of the infodisiac stats website) for his opinion, though.
On 07/11/2013 06:08 AM, Jörn Hees wrote:
The fourth has always been on my wishlist, but would require some more work for reverse IP->geolocation lookup.
I don't know the details, but we have a service for this (it's used for https://bits.wikimedia.org/geoiplookup).
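For reference, a minimal sketch of querying that endpoint; the response format isn't documented here, so this just prints the raw body rather than attempting to parse it:

import urllib.request

# Fetch the geoiplookup service mentioned above and print the raw
# response; no parsing is attempted since the payload format is not
# documented in this thread.
with urllib.request.urlopen("https://bits.wikimedia.org/geoiplookup") as resp:
    print(resp.read().decode("utf-8", errors="replace"))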
Matt Flaschen
Jeremy,
Some background:
So we are talking about search engine crawlers here, right?
Here are the most active crawlers:
http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm
For Google there is a special page with more depth:
http://stats.wikimedia.org/wikimedia/squids/SquidReportGoogle.htm
It's been a long-standing request to filter crawler data from page views.
We almost did it a year ago, and planned to have two sets of counts in Domas' files (one with crawlers included, one without).
I'm not sure what got in the way. Diederik can tell you more about that, and the current status.
It would cut our page views by about 20%.
The test we planned to implement is pretty simple: check the 'user agent' field for 'crawler', 'spider', 'bot', or 'http', and reject the request if any of them occurs.
Note the user agent string is completely unregulated, but an informal convention is to include a URL only in crawler requests.
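A minimal sketch of that test; the marker list is exactly the four substrings named above, while the lower-casing and the example user-agent strings are assumptions:

# Reject a request if its user agent contains any of the four markers
# Erik names. Case-insensitive matching is an assumption.
CRAWLER_MARKERS = ("crawler", "spider", "bot", "http")

def looks_like_crawler(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)

# "http" also catches agents that embed a URL, per the informal
# convention that crawlers identify themselves with a link.
assert looks_like_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)")
assert not looks_like_crawler("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36")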
BTW, crawler 'bots' are not to be confused with MediaWiki bots:
http://stats.wikimedia.org/EN/BotActivityMatrixEdits.htm
http://stats.wikimedia.org/EN/BotActivityMatrixCreates.htm
Erik Zachte
Erik,
When trying to identify non-human traffic, it can also be helpful to exclude any session that begins with (or contains) a request for /robots.txt. Do you know if a mechanism exists that could flag such sessions?
Even though many crawlers and spiders do not respect the contents of the file, they almost always request it. This can help exclude non-human page requests where the user agent string has been set to something that does not contain a bot/crawler URL.
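A rough sketch of that idea, using the client IP as a stand-in for a session (an assumption made for brevity; the real request logs aren't public):

# Treat any client that ever fetched /robots.txt as a probable crawler
# and drop its page requests. Using the client IP as the "session" key
# is an assumption, not how a production sessionizer would work.
def split_by_robots_txt(requests):
    requests = list(requests)  # (client_ip, url_path) tuples
    robots_clients = {ip for ip, path in requests if path == "/robots.txt"}
    human = [(ip, path) for ip, path in requests if ip not in robots_clients]
    flagged = [(ip, path) for ip, path in requests if ip in robots_clients]
    return human, flagged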
--Michael