Hi, I was goofing around with the Wikipedia page count dumps and noticed some strange anomalies.
For example, the page "Double-entry_bookkeeping_system" had 55921 page views in pagecounts-20150306-070000.gz, whereas it had only 54 views in pagecounts-20150306-100000.gz (3 hours later).
Is there a bug in the page counting system? How likely is it to have a sharp peak of interest in Double-entry_bookkeeping_system?
Best regards Roni Wiener, Keotic
It's more likely that it's just an attack by automata, rather than a sharp peak of genuine interest. Since 20150306 is within the last 30 days I can look and check, and will do so now.
On 8 March 2015 at 15:18, Roni Wiener roni.wiener@keotic.com wrote:
Hi
I was goofing around with the Wikipedia page counts dumps and noticed some strange anomalies.
For example:
The page “Double-entry_bookkeeping_system” had 55921 page views on pagecounts-20150306-070000.gz
Where it only had 54 views on pagecounts-20150306-100000.gz (3 hours later).
Is there a bug in the page counting system? How likely is it to have a sharp peak of interest in Double-entry_bookkeeping_system?
Best regards
Roni Wiener, Keotic
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Well, the raw Double-entry_bookkeeping_system title only has 14k views in that hour, so I have to assume that the remaining (55k-14k) views are coming from some oddly localised URI. Not sanitising input is... one of the many things we should fix.
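For anyone who wants to check this sort of thing against the dumps themselves, here is a rough sketch of listing every raw title in an hourly file that collapses to the same page. It assumes each row is whitespace-separated as project, title, view count, bytes; the normalisation rule (percent-decoding, spaces to underscores) and the function names are illustrative guesses, not the rules the pageview pipeline applies.

```python
import gzip
from collections import defaultdict
from urllib.parse import unquote


def normalise_title(raw_title):
    """Percent-decode a title and use underscores for spaces (illustrative rule only)."""
    return unquote(raw_title).replace(" ", "_")


def variants_for(lines, project, target):
    """Return {raw_title: views} for every raw title that normalises to `target`."""
    found = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip rows that do not match the assumed 4-column layout
        proj, raw_title, views, _size = parts
        if proj != project or not views.isdigit():
            continue
        if normalise_title(raw_title) == target:
            found[raw_title] += int(views)
    return dict(found)


def variants_in_dump(path, project, target):
    """Same as variants_for, reading rows from a gzipped hourly dump file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return variants_for(handle, project, target)
```

Calling `variants_in_dump("pagecounts-20150306-070000.gz", "en", "Double-entry_bookkeeping_system")` on a locally downloaded file would then show how the hour's total splits across the raw title and any encoded variants.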
But, I would warn you that this is likely automata. Some things I have seen that would explain it:
1. Live mirrors. Spammers (largely operating off Wordpress instances) steal our content because it looks like legit text and fools particularly stupid spam/SEO filters. They normally do this through live mirroring, so we get all the random hits from people who click through on their emails.
2. Automata. There is not, to my knowledge, any automata filtering performed on the pageviews data at the moment. I had hoped it would be my next priority after the pageviews definition itself, and I hope whoever is tasked with picking up improving our pageviews infrastructure works on it. Analytics can do very simple things to make this better; time will tell whether making things better on this front is actually a priority.
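As an illustration of how simple a first-pass check can be, here is a minimal sketch that flags hours whose count sits far above the median for the same page. It is not anything Analytics runs; the function name and the threshold factor are arbitrary assumptions, and a flag only says "look closer", since it cannot tell automata from a genuine burst of interest.

```python
from statistics import median


def flag_spikes(hourly_views, factor=20):
    """Return the hours whose view count exceeds `factor` times the median hour.

    `hourly_views` maps an hour label to a view count for a single page.
    The default factor is an arbitrary illustrative threshold.
    """
    if not hourly_views:
        return []
    baseline = median(hourly_views.values())
    if baseline <= 0:
        return []
    return [
        hour
        for hour, views in sorted(hourly_views.items())
        if views > factor * baseline
    ]


# The 07:00 and 10:00 counts are the ones reported in this thread;
# the 08:00 and 09:00 counts are made up to fill out the example.
example = {"07": 55921, "08": 60, "09": 58, "10": 54}
suspicious_hours = flag_spikes(example)
```

With the numbers above only the 07:00 hour is flagged, which matches the anomaly that started this thread.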
On 9 March 2015 at 11:59, Oliver Keyes okeyes@wikimedia.org wrote:
It's more likely that it's just an attack by automata, rather than a sharp peak of genuine interest. Since 20150306 is within the last 30 days I can look and check, and will do so now.
-- Oliver Keyes Research Analyst Wikimedia Foundation