On Tue, Jul 23, 2019 at 1:23 PM Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
Would this make the following feasible: my 2-hour query finally returning results into my 1 GB csv.zip file?
Not sure about 2 hours, as again it'd be a service that would be open to a wide community, and time is the most limited resource of all: once a 2-hour query is running, the resource serving it is consumed for 2 hours and not available to anybody else. Even with batching, we only have 24 hours per day, in which we'd be able to run only 12 such queries (well, parallelism exists, but let's not complicate it too much for the sake of example), and then the 13th person would have to wait the whole day for their query to even be run. Without some limit you'd have to book it months in advance like a posh restaurant :) Of course, it's a matter of the resources available and the demand for such queries, so we'd have to see what the precise limit is when we get there. Maybe there aren't 13 people who want to run such queries and we'd be ok.
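The arithmetic in the paragraph above can be sketched in a few lines of Python. This is only an illustration of the example as stated: the function names and the `workers` parameter are hypothetical, and a single worker with no parallelism is assumed by default, as in the text.

```python
def queries_per_day(query_hours: float, workers: int = 1) -> int:
    """Number of back-to-back queries of the given length that fit in 24 hours.

    `workers` is a hypothetical knob for parallel slots; the example in the
    thread deliberately ignores parallelism, so it defaults to 1.
    """
    return int(24 // query_hours) * workers


def wait_days(position: int, query_hours: float, workers: int = 1) -> int:
    """Whole days the requester at queue `position` (1-based) waits before starting."""
    per_day = queries_per_day(query_hours, workers)
    return (position - 1) // per_day


print(queries_per_day(2))  # 12 two-hour queries fit in one day
print(wait_days(12, 2))    # 0: the 12th requester still runs on day one
print(wait_days(13, 2))    # 1: the 13th requester waits a full day
```

Under these assumptions, any time limit shorter than 2 hours raises the daily capacity proportionally, which is why the limit would depend on actual demand.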
Was thinking the same thing. I wouldn't actually write a 2-hour query, just saying. In practice, I'd spend a day or two downloading the data dump.
Also, with live updates, long queries create other technical challenges (if a query is running for 2 hours, the database basically has to keep the snapshot it runs on for 2 hours, which may make it much less efficient). We could of course have a database without live updates, but updating it would then be a bit tricky, as loading a full dump takes a week now and catching up on that week takes even more time (hello, Achilles, hello, Tortoise). We're working on improving those, but for now 2-hour queries may be poorly compatible with both the resources we have and the model we have. Shorter queries, though, may definitely be possible; we'd need to find the boundary that is safe given the current resources.
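The catch-up problem mentioned above can be illustrated with a toy model. Everything here is an assumption made for illustration: `replay_speed` is a hypothetical measure of how many days' worth of updates the database can apply per wall-clock day, and the numbers are not measurements of the actual service. Because new updates keep arriving while the backlog is replayed, the backlog only shrinks at `replay_speed - 1` days per day.

```python
def catch_up_days(backlog_days: float, replay_speed: float) -> float:
    """Wall-clock days needed to clear a backlog while new updates keep arriving.

    backlog_days: days of updates accumulated while the dump was loading.
    replay_speed: hypothetical rate, in days of updates applied per day.
    Returns infinity when replay is no faster than the incoming update stream.
    """
    if replay_speed <= 1.0:
        return float("inf")
    return backlog_days / (replay_speed - 1.0)


print(catch_up_days(7, 2.0))  # 7.0: twice real-time clears a week in a week
print(catch_up_days(7, 1.5))  # 14.0: slower replay takes longer than the load itself
print(catch_up_days(7, 1.0))  # inf: the backlog never shrinks
```

In this model, catching up takes longer than the week-long load whenever the replay speed is below twice real-time, which is consistent with the situation described in the message.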
Yep, agreed. It's a balancing act; we do the same in the enterprise, where even extremely large companies still have budgets. But the CEO and his reports come first, yah? :)
Thanks for the explanations, Stas; they confirm my assumptions. Let's continue to focus on the 80% of common user queries, and for the other 20%, like my special cases, point users to the data dumps and say "roll your own, kid, and have fun while doing it!"