Hi Cristina,

Happy to see you here :) Just to add to Jaime's answer, here is an example of a Python-based app in Toolforge.

Hope this helps,
Best,
Diego

On Fri, Sep 17, 2021 at 3:12 PM Jaime Crespo <jcrespo@wikimedia.org> wrote:
On Fri, Sep 17, 2021 at 3:03 PM Cristina Gava via Analytics <analytics@lists.wikimedia.org> wrote:
Hi Jaime,

Thank you so much for the thorough reply :) All the references are super useful and I'll go through them now. I'll start with Toolforge, since it seems there is consensus on it being the most appropriate tool, and leave the dumps for later if needed.
I'll keep you posted.

It will depend a lot on the type of research needed. For example (to play devil's advocate with a simple case), if you wanted to count the total number of words written in Wikipedia and observe their frequency (meaning reading all edits in the history), dumps would be a much better option, as wikireplicas only have access to metadata, not the actual content. On top of that, reading all edits sequentially will be much faster from a downloaded bundle, while access to the live MariaDB database is faster for small portions with specific conditions or small to medium ranges.
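
To make the word-count case a bit more concrete, here is a minimal sketch of reading a downloaded dump sequentially in Python. It assumes a bz2-compressed MediaWiki XML export in which each revision keeps its wikitext in a <text> element. The names count_words and count_words_in_dump, the word regex, and the tiny SAMPLE document are my own illustrations, not part of any Wikimedia tooling, and splitting on \w+ is only a rough stand-in for real tokenization of wikitext.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Rough tokenizer (an assumption): any run of word characters is a "word".
WORD_RE = re.compile(r"\w+")


def count_words(stream):
    """Stream a MediaWiki XML export and count words in every <text> element."""
    counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Tags carry an XML namespace prefix like "{...}text"; strip it.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "text" and elem.text:
            counts.update(w.lower() for w in WORD_RE.findall(elem.text))
        elif tag in ("revision", "page"):
            # Drop content we have already counted so memory stays bounded.
            elem.clear()
    return counts


def count_words_in_dump(path):
    """Same as count_words, reading from a .bz2 dump file (path is hypothetical)."""
    with bz2.open(path, "rb") as f:
        return count_words(f)


# Tiny made-up export, only to show the shape of the input.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><id>1</id><text>Hello world</text></revision>
    <revision><id>2</id><text>Hello again world</text></revision>
  </page>
</mediawiki>"""

sample_counts = count_words(io.BytesIO(SAMPLE))
print(sample_counts.most_common(3))
```

The file is read once from start to end and never loaded whole, which is the sequential access pattern where a local dump tends to beat querying a live database.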

I think starting with wikireplicas and moving to the dumps later if they don't work for you is a totally reasonable decision in general, as it requires less investment in your local setup.

--
Jaime Crespo
<http://wikimedia.org>