Hi,
I just noticed someone ran a query from 2012 to 2013 as a timeseries by hour. This... creates a *lot* of data. For the cohort they used, it's about 1.8 million data points. Should we cap report sizes somehow? It doesn't pose any immediate danger beyond taking up a lot of resources and computation time, as well as I/O time spent logging the results (the log is currently acting as a rudimentary backup - perhaps that's ill-conceived).
In this case it looks like it was probably a mistake, so one idea is to warn the user that they are about to generate a lot of data and ask them to confirm.
Thoughts?
Dan
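For a rough sense of the scale, here's a back-of-envelope sketch; the cohort size and metric count below are made-up, illustrative numbers, not the actual ones from that run:

# Rough sizing of an hourly timeseries over a year (illustrative numbers only).
from datetime import datetime

start, end = datetime(2012, 1, 1), datetime(2013, 1, 1)
hours = int((end - start).total_seconds() // 3600)  # 8784 hourly buckets in 2012 (leap year)

cohort_size = 200       # hypothetical cohort size
metrics_per_user = 1    # hypothetical: one metric value per user per bucket
print(hours * cohort_size * metrics_per_user)       # ~1.76 million data points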
Good suggestion from Steven:
No hourly reports longer than a month, no daily reports longer than a year. Does that seem fair?
Dan
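A minimal sketch of how that cap could look as a pre-run validation; the function and names below are hypothetical, not actual Wikimetrics code:

from datetime import timedelta

# Hypothetical caps matching the suggestion above.
MAX_SPAN = {
    'hour': timedelta(days=31),   # hourly reports: about a month at most
    'day':  timedelta(days=366),  # daily reports: about a year at most
}

def report_too_large(start, end, granularity):
    # True if the requested timeseries exceeds the cap for its granularity.
    cap = MAX_SPAN.get(granularity)
    return cap is not None and (end - start) > cap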
So, assuming that user wasn’t me <kidding>… how about some kind of throttling for non-WMF users?
The limits sound fair anyway, but I see external researchers (and even community members interested in historical data) using this tool to collect very long data series.
Dario
I think that use case is out of scope for Wikimetrics. It's getting dangerously close to using Wikimetrics as a general data platform or service, rather than sticking to getting human-readable results for standardized metrics. It's okay to go back months or years in time, but not simultaneously at a level of detail that isn't interpretable without further heavy processing of the results.
-- Steven Walling, Product Manager
That’s correct; the original plan was to build an API.
And that’s why we need throttling anyway.
Dario
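If throttling does get picked up, a per-user sliding window would probably be enough; this is only a sketch with made-up limits and names, not anything that exists in Wikimetrics today:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # hypothetical: look at the last hour
MAX_REPORTS = 10        # hypothetical per-user limit within that window

_recent_runs = defaultdict(deque)   # user_id -> timestamps of recent report runs

def allow_report(user_id, now=None):
    # Hypothetical sliding-window throttle: at most MAX_REPORTS per user per window.
    now = now if now is not None else time.time()
    runs = _recent_runs[user_id]
    while runs and now - runs[0] > WINDOW_SECONDS:
        runs.popleft()
    if len(runs) >= MAX_REPORTS:
        return False
    runs.append(now)
    return True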
Well, Dario, it was actually someone at WMF. But I don't think that should matter much. Let's do this as a compromise:
If someone runs an hourly report longer than a month or a daily report longer than a year, we give them a warning telling them what's going to happen. If they say OK, we have to assume they know what they're doing and really need the data.
I know I accidentally ran a really long query once, so we'd at least guard against that. Like I said, though, even that crazy long query last night didn't cause any huge problems; it just used up a bit of memory and slowed access to the wikimetrics server for a few hours. There are a couple of simple monitoring, tracing, and backup improvements I could make to alleviate that as well, so if it keeps happening despite the warning, I'll just do that.
Dan
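To make the warn-and-confirm step concrete, here's one possible shape for it; the names, thresholds, and the confirmed flag are all hypothetical, not actual Wikimetrics code:

from datetime import timedelta

# Hypothetical thresholds mirroring the limits discussed above.
WARN_SPAN = {'hour': timedelta(days=31), 'day': timedelta(days=366)}

def check_report_request(start, end, granularity, confirmed=False):
    # Hypothetical pre-run check: large requests only proceed once the user has confirmed.
    cap = WARN_SPAN.get(granularity)
    if cap is not None and (end - start) > cap and not confirmed:
        # The UI would show this message and re-submit with confirmed=True if the user says OK.
        return {'warning': 'This will generate a very large report. Are you sure you want to run it?'}
    return {'ok': True}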
Dan,
I think the warning is important and would be useful for preventing this type of query from being run by mistake. I have seen this almost happen, and with the rate at which Sarah and our interns have been pulling data, I know I have heard them wince at choosing the wrong command at times. Anyway, I support your idea to institute a warning.
Thanks,
Jaime