Amir, in case you hadn't seen it, your memory is correct. This was considered in the past. See https://phabricator.wikimedia.org/T215858#6631859.
On Tue, Nov 17, 2020 at 2:47 PM Amir Sarabadani ladsgroup@gmail.com wrote:
Hello,

Actually, Jaime's email gave me an idea. Why not have a separate, actual data lake? Like a Hadoop cluster; it could even take its data from the analytics cluster (after being sanitized, of course). I remember there were some discussions about having a Hadoop or Presto cluster in WM Cloud.
Has this been considered?
Thanks.
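To illustrate the appeal: on a Presto-style cluster, a cross-wiki join would just be an ordinary join again. Here is a rough, entirely hypothetical sketch; the host, catalog, and schema names are made up (no such cluster exists today), and it assumes the presto-python-client package:

# Hypothetical sketch: a cross-wiki join on an imagined Presto cluster.
# The coordinator host, catalog, and schema names below do not exist;
# they only illustrate the shape of the idea.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.svc.wmcloud.org",  # made-up coordinator hostname
    port=8080,
    user="my-tool",
    catalog="wikis",  # imagined catalog of sanitized wiki replicas
    schema="enwiki",
)
cur = conn.cursor()
# On a data lake, enwiki-vs-Commons shadowing is a plain two-table join.
cur.execute("""
    SELECT e.img_name
    FROM wikis.enwiki.image e
    JOIN wikis.commonswiki.image c
      ON e.img_name = c.img_name
""")
for (name,) in cur.fetchall():
    print(name)

The client library isn't the point; the point is that the join runs server-side on one system instead of being emulated across separate replicas.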
On Tue, Nov 17, 2020 at 8:05 PM Brooke Storm bstorm@wikimedia.org wrote:
ACN: Thanks! We’ve created a ticket for that one to help collaborate and surface the process here: https://phabricator.wikimedia.org/T267992 Anybody working on that, please add info there.
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm@wikimedia.org
IRC: bstorm
On Nov 17, 2020, at 12:01 PM, AntiCompositeNumber <anticompositenumber@gmail.com> wrote:
I took a look at converting the query used for GreenC Bot's Job 10, which tracks enwiki files that "shadow" a different file on Commons. It is currently run daily, and the query executes in about 60-90 seconds.

I tried three methods to recreate that query without a SQL cross-database join. The naive method of "just give me all the files" didn't work because it timed out somewhere. The paginated version of that query was on track to take over 5 hours to complete. A similar method that emulates a subquery instead of a join was projected to take about 6 hours. Both stopped early because I got bored of watching them and PAWS doesn't work unattended. I also wasn't able to properly test them because people kept fixing the shadowed files before the script got to them.

The code is at <https://public.paws.wmcloud.org/User:AntiCompositeBot/ShadowsCommonsQuery.ip...>.
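For anyone curious, the subquery-emulation method looks roughly like this. It's a simplified sketch, not the actual notebook code: the hostnames follow the new analytics replica naming, the batch size is arbitrary, and it assumes replica credentials in ~/.my.cnf.

# Simplified sketch of emulating the old cross-database join by
# batching IN (...) probes against Commons. Not the notebook code;
# hostnames and batch size are illustrative.
import pymysql

def connect(host, db):
    return pymysql.connect(
        host=host,
        database=db,
        read_default_file="~/.my.cnf",  # standard replica credentials
        charset="utf8mb4",
    )

def shadowed_files(batch_size=1000):
    enwiki = connect("enwiki.analytics.db.svc.wikimedia.cloud", "enwiki_p")
    commons = connect("commonswiki.analytics.db.svc.wikimedia.cloud",
                      "commonswiki_p")

    # Step 1: fetch every local enwiki file name. The old server-side
    # join never had to materialize this list on the client.
    with enwiki.cursor() as cur:
        cur.execute("SELECT img_name FROM image")
        local_names = [row[0] for row in cur.fetchall()]

    # Step 2: probe Commons in batches; each IN (...) batch stands in
    # for what used to be one piece of the join.
    shadows = []
    with commons.cursor() as cur:
        for i in range(0, len(local_names), batch_size):
            batch = local_names[i:i + batch_size]
            placeholders = ",".join(["%s"] * len(batch))
            cur.execute(
                "SELECT img_name FROM image WHERE img_name IN (%s)"
                % placeholders,
                batch,
            )
            shadows.extend(row[0] for row in cur.fetchall())
    return shadows

Even batched, every probe is a separate client round trip, which is where the hours go.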
ACN
On Tue, Nov 17, 2020 at 1:02 PM Maarten Dammers maarten@mdammers.nl wrote:
Hi Joaquin,
On 16-11-2020 21:42, Joaquin Oltra Hernandez wrote:
Hi Maarten,
I believe this work started many years ago, was paused, and was recently restarted because of the stability and performance problems of the last few years.
You do realize the current setup was announced as new 3 years ago? See https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_server... .
I'm sorry about the extra work this will cause; I hope the improved stability and performance will make it worth it for you, and that you will reconsider and migrate your code to work on the new architecture (or reach out for specific help if you need it).
No, saying sorry won't make it right and no, it won't make it worth it for me. If I want very stable access to a single wiki, I'll use the API of that wiki.
--
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation
It currently doesn't really feel to me that you're advocating for the developers; it feels more like you're the unlucky person having to sell the bad WMF management decisions to the angry developers.
Maarten
-- Amir (he/him)
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud