Hello again! Thinking about this more, I'm wondering if it makes sense to have a tool to assist with parsing the dblists at noc.wikimedia.org. I know the official recommendation is not to connect to slices directly, but the issue is how to work locally. I alone maintain many tools that are capable of connecting to any database. I have a single bash alias I use to set up my SSH tunnel, and when I start a new tool I just give it 127.0.0.1 as the host and 4711 as the port number. Easy peasy. I can't imagine trying to instruct a newbie on how to contribute to tool Foo (which requires a tunnel to enwiki on port 1234), and tool Bar (tunnel to frwiki on port 5678), etc., etc. Perhaps it's best to establish a standard system for developers working locally? For the truly "global" tools like I talked about before, we have to use slices, and though they may not change much, it's a lot of work to check the dblists manually.
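
For concreteness, once the tunnel is up the tool just connects to the forwarded port like any local MySQL server. A minimal sketch (pymysql is only an example driver, and ~/replica.my.cnf is assumed to hold the usual replica credentials):

    import pymysql

    # Connect through the local SSH tunnel; 127.0.0.1:4711 is whatever the
    # tunnel forwards (enwiki's slice in my case).
    conn = pymysql.connect(
        host="127.0.0.1",
        port=4711,
        database="enwiki_p",
        read_default_file="~/replica.my.cnf",
    )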

So, my thought is that this tool could do two things:
1) A webservice with a form where you enter your username, a local MySQL port, and whether you want to use the analytics or web replicas. After submitting, it prints the necessary command, something like:
        ssh -L 4711:s1.web.db.svc.eqiad.wmflabs:3306 -L 4712:s2.web.db.svc.eqiad.wmflabs:3306 … username@login.toolforge.org
2) A public API that tools can use to get the slice for a given database name.
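
To make #2 concrete, here's a rough sketch of what a consuming tool might do. The endpoint URL and response shape are purely hypothetical placeholders:

    import requests

    # Hypothetical endpoint and response shape; whatever the tool ends up
    # exposing, the idea is: database name in, slice out.
    API = "https://example.toolforge.org/api/db/{dbname}"

    def get_slice(dbname: str) -> str:
        """Return the slice (e.g. 's1') that hosts the given wiki database."""
        resp = requests.get(API.format(dbname=dbname), timeout=10)
        resp.raise_for_status()
        return resp.json()["slice"]

    print(get_slice("enwiki"))  # e.g. "s1"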

For both, it goes by the dblists at noc.wikimedia.org (with some caching to improve response time).
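
For what it's worth, the lookup itself is simple enough. A naive sketch with caching, assuming the per-section lists live at noc.wikimedia.org/conf/dblists/s1.dblist through s8.dblist:

    import time
    import requests

    DBLIST_URL = "https://noc.wikimedia.org/conf/dblists/{section}.dblist"
    SECTIONS = [f"s{i}" for i in range(1, 9)]  # s1..s8
    CACHE_TTL = 3600  # seconds

    _cache = {"fetched": 0.0, "mapping": {}}

    def db_to_slice(dbname: str) -> str:
        """Map a wiki database name (e.g. 'enwiki') to its section, using
        the dblists published at noc.wikimedia.org, cached for CACHE_TTL."""
        now = time.time()
        if now - _cache["fetched"] > CACHE_TTL:
            mapping = {}
            for section in SECTIONS:
                text = requests.get(DBLIST_URL.format(section=section), timeout=10).text
                for line in text.splitlines():
                    line = line.strip()
                    if line and not line.startswith("#"):
                        mapping[line] = section
            _cache.update(fetched=now, mapping=mapping)
        return _cache["mapping"][dbname]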

So in the README for my tool, I tell the developer to go to the above webservice to get the command they should use to set up the local SSH tunnel. The README could even link to a pre-filled form to ensure the port numbers align with what that tool expects. This way, the developer doesn't even need to add port numbers and whatnot to a .env file or what have you, since the tool goes by what the webservice outputs (though you could provide a means to override this, in the event the developer has other things running on those ports). Hopefully what I'm saying makes sense.
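
(To sketch the override, the tool could just read its defaults from the environment; the variable names here are only illustrative:)

    import os

    # Defaults match what the pre-filled form would hand out for this tool;
    # FOO_DB_HOST / FOO_ENWIKI_PORT are illustrative names only.
    DB_HOST = os.environ.get("FOO_DB_HOST", "127.0.0.1")
    ENWIKI_PORT = int(os.environ.get("FOO_ENWIKI_PORT", "1234"))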

Is this a stupid idea? I might go ahead and build a tool for the #2 use case, at least, because right now I would have to reinvent the wheel for at least three "global" tools that I maintain. We could also consider adding this logic to libraries such as ToolforgeBundle, which is for PHP/Symfony apps running on Cloud Services.

~ MA

On Mon, Nov 23, 2020 at 10:53 AM Nicholas Skaggs <nskaggs@wikimedia.org> wrote:
Amir, in case you hadn't seen it, your memory is correct. This was considered in the past. See https://phabricator.wikimedia.org/T215858#6631859.

On Tue, Nov 17, 2020 at 2:47 PM Amir Sarabadani <ladsgroup@gmail.com> wrote:
Hello,
Actually, Jaime's email gave me an idea. Why not have a separate, actual data lake? Like a Hadoop cluster; it could even take the data from the analytics cluster (after being sanitized, of course). I remember there were some discussions about having a Hadoop or Presto cluster in WM Cloud.

Has this been considered?

Thanks.

On Tue, Nov 17, 2020 at 8:05 PM Brooke Storm <bstorm@wikimedia.org> wrote:
ACN: Thanks! We’ve created a ticket for that one to help collaborate and surface the process here: https://phabricator.wikimedia.org/T267992
Anybody working on that, please add info there.

Brooke Storm
Staff SRE
Wikimedia Cloud Services
IRC: bstorm

On Nov 17, 2020, at 12:01 PM, AntiCompositeNumber <anticompositenumber@gmail.com> wrote:

I took a look at converting the query used for GreenC Bot's Job 10,
which tracks enwiki files that "shadow" a different file on Commons.
It is currently run daily, and the query executes in about 60-90
seconds. I tried three methods to recreate that query without a SQL
cross-database join. The naive method of "just give me all the files"
didn't work because it timed out somewhere. The paginated version of
that query was on track to take over 5 hours to complete. A similar
method that emulates a subquery instead of a join was projected to
take about 6 hours. Both stopped early because I got bored of watching
them and PAWS doesn't work unattended. I also wasn't able to properly
test them because people kept fixing the shadowed files before the
script got to them. The code is at
<https://public.paws.wmcloud.org/User:AntiCompositeBot/ShadowsCommonsQuery.ipynb>.
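
Roughly, the join emulation boils down to paging through enwiki file names and checking each batch against Commons over a second connection. A simplified sketch of the idea (not the actual notebook code; hostnames and driver are just placeholders):

    import pymysql

    # Placeholder connection details for the two replica hosts.
    enwiki = pymysql.connect(host="enwiki.analytics.db.svc.wikimedia.cloud",
                             database="enwiki_p",
                             read_default_file="~/replica.my.cnf")
    commons = pymysql.connect(host="commonswiki.analytics.db.svc.wikimedia.cloud",
                              database="commonswiki_p",
                              read_default_file="~/replica.my.cnf")

    BATCH = 500
    last = ""
    shadows = []
    while True:
        with enwiki.cursor() as cur:
            # Page through local file names on enwiki.
            cur.execute(
                "SELECT img_name FROM image WHERE img_name > %s "
                "ORDER BY img_name LIMIT %s", (last, BATCH))
            names = [row[0] for row in cur.fetchall()]
        if not names:
            break
        last = names[-1]
        with commons.cursor() as cur:
            # Emulate the join: which of these names also exist on Commons?
            fmt = ",".join(["%s"] * len(names))
            cur.execute(f"SELECT img_name FROM image WHERE img_name IN ({fmt})",
                        names)
            shadows.extend(row[0] for row in cur.fetchall())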

ACN

On Tue, Nov 17, 2020 at 1:02 PM Maarten Dammers <maarten@mdammers.nl> wrote:

Hi Joaquin,

On 16-11-2020 21:42, Joaquin Oltra Hernandez wrote:

Hi Maarten,

I believe this work started many years ago, was paused, and was recently restarted because of the stability and performance problems of the last few years.

You do realize the current setup was announced as new 3 years ago? See https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/ .

I'm sorry about the extra work this will cause, I hope the improved stability and performance will make it worth it for you, and that you will reconsider and migrate your code to work on the new architecture (or reach out for specific help if you need it).

No, saying sorry won't make it right and no, it won't make it worth it for me. If I want very stable access to a single wiki, I'll use the API of that wiki.

--
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation

It currently doesn't really feel to me like you're advocating for the developers; it feels more like you're the unlucky person having to sell the bad WMF management decisions to the angry developers.

Maarten

--
Amir (he/him)

--
Nicholas Skaggs
Engineering Manager, Cloud Services
Wikimedia Foundation
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud