On Fri, Dec 28, 2012 at 10:24 AM, John Vandenberg jayvdb@gmail.com wrote:
Is favicon only in the Chinese Wikipedia top 100?
It seems so, and is odd if the problem is a web browser bug.
John Vandenberg. Sent from Galaxy Note
On Dec 28, 2012 4:07 PM, "Johan Gunnarsson" johan.gunnarsson@gmail.com wrote:
On Fri, Dec 28, 2012 at 5:33 AM, John Vandenberg jayvdb@gmail.com wrote:
Hi Johan,
Thank you for the lovely data at
https://toolserver.org/~johang/2012.html
I posted that link to my Facebook (link below if you want to join in there) and to a few language-specific Facebook groups. Some concerns have been raised about the results, which I'll list below.
These lists are getting some traction in the press, so it would be good to understand them better.
http://guardian.co.uk/technology/blog/2012/dec/27/wikipedia-most-viewed
Cool, cool.
Why is [[zh:Favicon]] #2?
The data doesn't appear to support that:
http://stats.grok.se/zh/201201/Favicon http://stats.grok.se/zh/latest90/Favicon
My post-processing filtering follows redirects to find the "true" title. In this case the page Favicon.ico redirects to Favicon. This is probably due to broken browsers trying to load the icon.
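To illustrate why the per-title stats and the top list can disagree, here is a minimal sketch of redirect-following aggregation. It is not the actual post-processing code; the function names, the redirect table, and the view counts are all made up for illustration.

```python
from collections import defaultdict


def resolve_redirect(title, redirects, max_hops=10):
    """Follow redirects until reaching a title that is not itself a redirect.

    `redirects` is a hypothetical mapping of redirect title -> target title.
    Loops and overly long chains are cut off instead of spinning forever.
    """
    seen = {title}
    for _ in range(max_hops):
        target = redirects.get(title)
        if target is None or target in seen:
            break
        seen.add(target)
        title = target
    return title


def aggregate_views(raw_counts, redirects):
    """Sum raw per-title view counts under each title's resolved target."""
    totals = defaultdict(int)
    for title, views in raw_counts.items():
        totals[resolve_redirect(title, redirects)] += views
    return dict(totals)


# Illustrative numbers only, not real traffic data.
redirects = {"Favicon.ico": "Favicon"}
raw_counts = {"Favicon.ico": 9000, "Favicon": 100, "Zipper": 50}
totals = aggregate_views(raw_counts, redirects)
print(totals)
```

With counts like these, a per-title lookup for "Favicon" would show only its own direct hits, while the aggregated list credits it with the Favicon.ico requests as well.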
Number 1 in French is a plant native to Asia. The stats for December disagree:
https://en.wikipedia.org/wiki/Ilex_crenata http://stats.grok.se/fr/201212/Houx_cr%C3%A9nel%C3%A9
The French Wikipedia's Ilex_crenata redirects to Houx_crénelé.
Ilex_crenata had huge traffic in April: http://stats.grok.se/fr/201204/Ilex_crenata
There are a bunch of spikes like this. I can't really explain it. I talked to Domas Mituzas (the maintainer of the original dumps I use) yesterday and he suggested it might be bots going crazy for whatever reason. I'd love to filter all these false positives, but haven't been able to come up with an easy way to do it.
Filtering might be possible with access to logs that include the user-agent string, but that would probably inflate the dataset size even more; it's already past a terabyte. However, that could probably be solved by sampling, for example, 1/100 of the entries.
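As a rough sketch of the sampling idea, the snippet below keeps each log entry with probability 1/rate and scales sampled counts back up. The function names and the seeded random generator are assumptions for illustration; nothing here reflects how the real logs are stored or processed.

```python
import random


def sample_entries(entries, rate=100, seed=0):
    """Keep each entry independently with probability 1/rate.

    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    threshold = 1.0 / rate
    return [entry for entry in entries if rng.random() < threshold]


def estimate_total(sample_count, rate=100):
    """Scale a count observed in the sample back to an estimated full count."""
    return sample_count * rate


# Hypothetical log lines, only to show the call pattern.
entries = [f"request {i}" for i in range(10000)]
sample = sample_entries(entries, rate=100, seed=0)
print(len(sample), estimate_total(len(sample), rate=100))
```

The estimate is noisy for rarely requested pages, but for the high-traffic pages that end up in a top list the sampling error should be comparatively small.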
Comments and ideas are welcome!
Number 1 in German is Cul de sac. This is odd, but matches the stats http://stats.grok.se/de/201207/Sackgasse
Right. This one is funny. It has huge traffic on weekdays only. Deserted on weekends.
This has been noted on the dewiki village pump before. The most interesting guess there, by Benutzer:YMS (https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Sackgasse_als_Top_Artikel_.3F.21): there might be web filtering software installed on workplace PCs in companies that redirects all prohibited URLs to the German Wikipedia article on cul-de-sac. This would explain the weekly pattern, and also the December figures at http://stats.grok.se/de/201112/Sackgasse (December 25-26 are holidays in Germany, and many employees take the rest of the year off).
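One way to make the weekday-only pattern measurable is to compare mean weekday views with mean weekend views. This is only a sketch under assumed inputs: the function name is hypothetical and the daily counts are invented, not taken from stats.grok.se.

```python
from datetime import date


def weekday_weekend_ratio(daily_views):
    """Return mean weekday views divided by mean weekend views.

    `daily_views` maps datetime.date -> view count. A ratio far above 1
    suggests traffic tied to working days, as in the workplace-filter guess.
    """
    weekday = [v for d, v in daily_views.items() if d.weekday() < 5]
    weekend = [v for d, v in daily_views.items() if d.weekday() >= 5]
    if not weekday or not weekend:
        raise ValueError("need at least one weekday and one weekend day")
    weekday_mean = sum(weekday) / len(weekday)
    weekend_mean = sum(weekend) / len(weekend)
    if weekend_mean == 0:
        return float("inf")
    return weekday_mean / weekend_mean


# Invented counts for one week; 2012-07-02 was a Monday.
daily_views = {date(2012, 7, day): 1000 for day in range(2, 7)}
daily_views[date(2012, 7, 7)] = 10  # Saturday
daily_views[date(2012, 7, 8)] = 10  # Sunday
ratio = weekday_weekend_ratio(daily_views)
print(ratio)
```

A page with organic readership would normally show a ratio close to 1, so a very large value could serve as a flag for manual review rather than as proof of automated traffic.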
Number 1 in Dutch is a Chinese mountain. The stats for December disagree
July/August agree: http://stats.grok.se/nl/201208/Hua_Shan
Number 4 in Hebrew is zipper. The stats for December disagree http://stats.grok.se/he/201212/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F
April agrees: http://stats.grok.se/he/201204/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F
Number 2 in Spanish is '@'. This is odd, but matches the stats http://stats.grok.se/es/201212/Arroba_%28s%C3%ADmbolo%29
-- John Vandenberg https://www.facebook.com/johnmark.vandenberg
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l