Hey all,
After several weeks of work to switch all the scripts over and backfill, all the Discovery dashboards now have the ability to filter crawlers and automated software out from graphs where that is relevant. You should notice a simple checkbox on, for example, the Zero Results Rate data or Wikidata Query Service traffic.
While a bit of backfilling is still waiting on the servers syncing up, this work is essentially complete, and provides another way to look at data on how people are using search (and who those people are). It was a heck of a lot of work, by both myself and Mikhail, but it's hopefully valuable :).
For Discovery Analytics,
This is awesome. Roughly, by eye, it looks like automata are about 2% of ZRR overall and 5% of ZRR for fulltext search, which was around 15% before the holidays (and lower over the holidays—during The Time of Unreliable User Behavior).
Is there a write up for this project? I know it had to be a ton of work, and I'm curious about the details (possibly more so than most).
Do you think you got most of them? Or was the result high-precision but not exhaustive?
Thanks for working on this!
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Mon, Jan 4, 2016 at 1:29 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey all,
After several weeks of work to switch all the scripts over and backfill, all the Discovery dashboards now have the ability to filter crawlers and automated software out from graphs where that is relevant. You should notice a simple checkbox on, for example, the Zero Results Rate data or Wikidata Query Service traffic.
While a bit of backfilling is still waiting on the servers syncing up, this work is essentially complete, and provides another way to look at data on how people are using search (and who those people are). It was a heck of a lot of work, by both myself and Mikhail, but it's hopefully valuable :).
For Discovery Analytics,
-- Oliver Keyes Count Logula Wikimedia Foundation
discovery mailing list discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery
(Links: the dashboards live at http://discovery.wmflabs.org/ and an example of automata filtering can be seen at http://discovery.wmflabs.org/metrics/#failure_rate !)
That is, 2% and 5% lower? You're looking at percentages so where the lines vary between checkbox options it'll be different proportions. Unless there's a graph I'm missing :D
On 4 January 2016 at 13:45, Trey Jones tjones@wikimedia.org wrote:
This is awesome. Roughly, by eye, it looks like automata are about 2% of ZRR overall and 5% of ZRR for fulltext search, which was around 15% before the holidays (and lower over the holidays—during The Time of Unreliable User Behavior).
Is there a write up for this project? I know it had to be a ton of work, and I'm curious about the details (possibly more so than most).
Do you think you got most of them? Or was the result high-precision but not exhaustive?
Thanks for working on this!
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
That is, 2% and 5% lower?
Pretty much. I was thinking that the ZRR goes down by about 2% when you exclude automata, so, more precisely, zero-result automata queries make up 2% of all queries.
Anyway, cool stuff.
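To make the arithmetic above concrete, here is a minimal sketch using made-up counts. None of these numbers come from the dashboards, and `zero_results_rate` and the variable names are hypothetical. Under these assumptions, the drop in ZRR from excluding automata works out to the automata share of traffic times the gap between automata ZRR and human ZRR. So it is related to, but not generally the same as, the share of all queries that are zero-result automata queries.

```python
def zero_results_rate(zero_result_queries: int, total_queries: int) -> float:
    """Fraction of queries that returned no results."""
    return zero_result_queries / total_queries


# Hypothetical daily counts, chosen only to illustrate the arithmetic.
human_total, human_zero = 900_000, 180_000        # human ZRR = 20%
automata_total, automata_zero = 100_000, 40_000   # automata ZRR = 40%

all_total = human_total + automata_total
zrr_all = zero_results_rate(human_zero + automata_zero, all_total)
zrr_humans = zero_results_rate(human_zero, human_total)

# How far the ZRR line moves (in percentage points) when automata are excluded.
drop_in_points = (zrr_all - zrr_humans) * 100

# Share of ALL queries that are zero-result queries from automata.
automata_zero_share = automata_zero / all_total * 100

print(f"ZRR, all traffic:           {zrr_all:.1%}")
print(f"ZRR, automata excluded:     {zrr_humans:.1%}")
print(f"Drop in percentage points:  {drop_in_points:.1f}")
print(f"Zero-result automata share: {automata_zero_share:.1f}%")
```

With these invented counts the line drops by 2 points while zero-result automata queries are 4% of all queries; the two figures only coincide when human ZRR is close to zero.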
On Mon, Jan 4, 2016 at 2:53 PM, Oliver Keyes okeyes@wikimedia.org wrote:
(Links: the dashboards live at http://discovery.wmflabs.org/ and an example of automata filtering can be seen at http://discovery.wmflabs.org/metrics/#failure_rate !)
That is, 2% and 5% lower? You're looking at percentages so where the lines vary between checkbox options it'll be different proportions. Unless there's a graph I'm missing :D
What is with the weekly cycle (exactly weekly?) where there is a 4% difference in the success rate over half a week, EVERY WEEK!
With the number of searches done on the site, it seems like an aberration that each Sunday is a more accurate search day!?! Analytical gremlins of data capture, or do not even bots work Sundays?
The other way around, I think; only bots work Sundays. We know a lot of the search queries that don't work /shouldn't/ work: they're producing no results because they're nonsense, or spam, or someone being silly through the API. Normal human traffic rises on a Monday to peak on a Tuesday, and begins to drop down again towards the end of the week and weekend. What this means is that the proportion of traffic coming from non-humans is greater on the weekends (because fewer people are browsing) and that increases the impact of automata on the zero results rate for those days.
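The mechanism described above (a roughly constant volume of automated traffic weighing more heavily when human traffic falls) can be sketched as a weighted average. This is only an illustration: the traffic volumes and per-group rates are invented, and `overall_zrr` is a hypothetical helper, not something from the dashboard code.

```python
def overall_zrr(human_queries: int, bot_queries: int,
                human_zrr: float, bot_zrr: float) -> float:
    """Overall zero results rate as a traffic-weighted average of two groups."""
    zero_results = human_queries * human_zrr + bot_queries * bot_zrr
    return zero_results / (human_queries + bot_queries)


# Assumed values: bot volume and both per-group rates stay fixed all week;
# only the amount of human traffic changes.
BOT_QUERIES = 200_000
HUMAN_ZRR, BOT_ZRR = 0.15, 0.50

tuesday = overall_zrr(1_000_000, BOT_QUERIES, HUMAN_ZRR, BOT_ZRR)
sunday = overall_zrr(600_000, BOT_QUERIES, HUMAN_ZRR, BOT_ZRR)

print(f"Tuesday overall ZRR: {tuesday:.1%}")
print(f"Sunday overall ZRR:  {sunday:.1%}")
```

Under these assumptions the overall rate is higher on the low-traffic day even though neither group's own rate changed, which is the direction this explanation predicts.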
On 4 January 2016 at 23:28, billinghurst billinghurstwiki@gmail.com wrote:
What is with the weekly cycle (exactly weekly?) where there is a 4% difference in the success rate over half a week, EVERY WEEK!
With the number of searches done on the site, it seems like an aberration that each Sunday is a more accurate search day!?! Analytical gremlins of data capture, or do not even bots work Sundays?
As Billinghurst was kind enough to point out, I got that the wrong way round. Serves me right for replying to emails at 6am!
The weekly effect is definitely seasonal, which supports the idea that it's not artificial. As to what causes it, either (1) humans are fallible and "regular" automata are less so (irregular automata, who knows?), or (2) as Nemo suggests, people are searching for different things, which would be fascinating to analyse if we could :/
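One rough way to look for a weekly pattern like this is to average the daily ZRR by day of the week. The sketch below is an assumption-laden illustration: the series is synthetic (Sundays are set 4 points lower on purpose), and `mean_by_weekday` is a hypothetical helper rather than anything from the dashboards.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_by_weekday(daily_values: dict) -> dict:
    """Average a {date: value} series separately for each day of the week."""
    buckets = defaultdict(list)
    for day, value in daily_values.items():
        buckets[WEEKDAYS[day.weekday()]].append(value)
    return {name: mean(values) for name, values in buckets.items()}


# Synthetic four weeks of daily ZRR: 20% on most days, 16% on Sundays.
start = date(2015, 11, 2)  # a Monday
series = {}
for offset in range(28):
    day = start + timedelta(days=offset)
    series[day] = 0.16 if day.weekday() == 6 else 0.20

averages = mean_by_weekday(series)
for name in WEEKDAYS:
    print(f"{name:<9} {averages[name]:.1%}")
```

A consistent gap for one weekday across many weeks would be in line with a seasonal effect, though it says nothing by itself about the cause.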
On 5 January 2016 at 06:04, Oliver Keyes okeyes@wikimedia.org wrote:
The other way around, I think; only bots work Sundays. We know a lot of the search queries that don't work /shouldn't/ work: they're producing no results because they're nonsense, or spam, or someone being silly through the API. Normal human traffic rises on a Monday to peak on a Tuesday, and begins to drop down again towards the end of the week and weekend. What this means is that the proportion of traffic coming from non-humans is greater on the weekends (because fewer people are browsing) and that increases the impact of automata on the zero results rate for those days.
billinghurst, 05/01/2016 05:28:
With the number of searches done on the site, it seems like an aberration that each Sunday is a more accurate search day!?!
It might be legitimate seasonality too; many interpretations are possible. Maybe on Sundays people make more predictable and boring searches, such as for TV series and blockbusters, which always give a search result. :)
Nemo