Not that I disagree with Jaime's points, but I want to toot the PAWS horn a bit here. As I understand it, DB access in PAWS these days is basically the same as in Toolforge, and the limit for keeping your server up without accessing it is 24 hours. So it can be used beyond testing for a lot of use cases.
Chico Venancio
On Fri, 17 Sep 2021 at 03:55, Jaime Crespo jcrespo@wikimedia.org wrote:
My preferred choice now would be to query the specific tables in the
Wikipedia database, in the same way this is done through the Quarry tool
I would like to build a standalone script, without having to depend on a
user interface like Quarry
I am not interested in accessing sensitive or private data.
Based on your needs, it seems to me that what you want is direct database access to the wikireplicas.[0] As you said, Toolforge (and Cloud Services in general) will grant you that: direct access to query the database. It is intended for the general public and generally does not require private data access.
What is the difference between Toolforge and PAWS? PAWS is a similar idea to Quarry: a friendlier interface to a subset of Cloud Services, including the wikireplicas. You won't need to set up an account or software (other than a regular MediaWiki account and a browser), and it is the best way to share snippets of code and a workflow very quickly, with very little setup. If you are familiar with Jupyter notebooks, PAWS is just an installation of that. Here is an example notebook: [9]
However, it is not suitable for heavy querying or for recurrent automated actions/full scripts that run on their own (you would go from being limited by Quarry to being limited by PAWS), which is why in your case I would suggest going through the longer process of getting Toolforge access. You can still test PAWS very easily [8] and decide for yourself whether a notebook is enough for you or you need standalone scripting.
The wikireplicas are a shared environment, so you will share resources with the rest of the users: you will not get dedicated resources, and you will get rate-limited if your querying is so heavy that it prevents other users from using the service. That is why some people prefer to download the dumps and analyze them on their local computers, where they can go as fast as they want.
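One way to stay friendly on a shared server is to read a table in small primary-key-ordered batches with a pause in between, instead of one huge SELECT. Below is a minimal sketch of that idea, not an official recommendation. The function name iter_batches, its parameters and the demo table are all made up for illustration, and the demo runs against an in-memory SQLite database so it can be tried anywhere. With a MySQL/MariaDB driver such as pymysql you would pass placeholder="%s".

```python
import sqlite3
import time


def iter_batches(conn, table, key, columns, batch_size=1000, pause=0.0, placeholder="?"):
    """Yield rows of `table` in `key` order, fetching `batch_size` rows per query.

    `conn` is any DB-API connection. `table`, `key` and `columns` are put into
    the SQL text as-is, so only pass trusted, hard-coded names.
    """
    select = f"SELECT {key}, {columns} FROM {table}"
    limit = f"ORDER BY {key} LIMIT {int(batch_size)}"
    last_key = None
    while True:
        cur = conn.cursor()
        if last_key is None:
            cur.execute(f"{select} {limit}")
        else:
            # Resume after the last key seen, so each query only touches one slice.
            cur.execute(f"{select} WHERE {key} > {placeholder} {limit}", (last_key,))
        rows = cur.fetchall()
        cur.close()
        if not rows:
            return
        yield from rows
        last_key = rows[-1][0]
        time.sleep(pause)  # give other users of the shared server some room


def demo():
    """Run the batching against a tiny in-memory table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
    conn.executemany(
        "INSERT INTO page VALUES (?, ?)",
        [(i, f"Title_{i}") for i in range(1, 8)],
    )
    return list(iter_batches(conn, "page", "page_id", "page_title", batch_size=3))
```

Calling demo() walks the seven demo rows in three queries of at most three rows each. On the real replicas you would pick a larger batch size and a non-zero pause, both of which are judgment calls rather than documented limits.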
As it is a real-time copy of the production databases, it uses MariaDB, which means analytics-style queries are possible, although the setup is not optimized for them. E.g. calculating the total number of revisions will require reading the entire revision table! But since you said you want Quarry but with scripting, this is the best alternative: Quarry uses the wikireplicas! :-)
To get a Toolforge account, which is part of Wikimedia Cloud Services, you will need to:
- Create a Wikimedia developer account: [1]
- Create an SSH key: [2]
- Request access: [3]
- Create a new tool: [4]
- Access your tool and start developing: [5]
From this point, what you do will depend on your chosen scripting language, but there is documentation on Wikitech [6] and plenty of support options from its users [7]. I recommend contacting other users on IRC or the mailing list if you get stuck. Either way, you will be able to query the database directly with your custom queries!
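To give an idea of what a standalone script could look like once you are on Toolforge, here is a rough sketch in Python. It rests on several assumptions that you should check against [0]: that the replicas are reachable at hostnames of the form <wiki>.analytics.db.svc.wikimedia.cloud, that the public database is named <wiki>_p, that your credentials live in ~/replica.my.cnf, and that the third-party pymysql driver is installed. The function names are mine, not part of any Toolforge API.

```python
import os


def replica_host(wiki, cluster="analytics"):
    """Build a replica hostname for a wiki such as 'enwiki'.

    The naming scheme is an assumption taken from the Toolforge database
    documentation [0]; verify it there before relying on it.
    """
    if cluster not in ("analytics", "web"):
        raise ValueError("cluster must be 'analytics' or 'web'")
    return f"{wiki}.{cluster}.db.svc.wikimedia.cloud"


def connect(wiki):
    """Open a connection to the public replica database of `wiki`."""
    # Imported lazily so replica_host() works without the driver installed.
    import pymysql

    return pymysql.connect(
        host=replica_host(wiki),
        database=f"{wiki}_p",  # assumed naming of the public views
        read_default_file=os.path.expanduser("~/replica.my.cnf"),  # assumed path
        charset="utf8mb4",
    )


def newest_articles(conn, limit=5):
    """Return (page_id, page_title) for the newest main-namespace pages."""
    with conn.cursor() as cur:
        # Ordering by the primary key with a LIMIT keeps the query cheap.
        cur.execute(
            "SELECT page_id, page_title FROM page "
            "WHERE page_namespace = 0 ORDER BY page_id DESC LIMIT %s",
            (limit,),
        )
        return cur.fetchall()


def main():
    conn = connect("enwiki")
    try:
        for page_id, title in newest_articles(conn):
            # Titles may come back as bytes, so decode them if needed.
            if isinstance(title, bytes):
                title = title.decode("utf-8")
            print(page_id, title)
    finally:
        conn.close()
```

Calling main() from a Toolforge shell should print a handful of page ids and titles if the assumptions above hold. The same functions can be reused from a cron-style job, which is the kind of unattended use that Quarry does not offer.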
Hope this is useful,
Jaime
[0] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database
[1] https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_developer_accoun...
[2] https://www.mediawiki.org/wiki/Gerrit/Tutorial#Generate_a_new_SSH_key
[3] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart#Getting_star...
[4] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Tool_Accounts#Create_to...
[5] https://wikitech.wikimedia.org/wiki/Help:Access_to_Toolforge_instances_with_...
[6] https://wikitech.wikimedia.org/wiki/Portal:Toolforge
[7] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart#Communicatio...
[8] https://wikitech.wikimedia.org/wiki/PAWS
[9] https://public.paws.wmcloud.org/User:JHernandez_(WMF)/Accessing%20Wikireplic...
On Fri, Sep 17, 2021 at 1:08 AM Cristina Gava via Analytics <analytics@lists.wikimedia.org> wrote:
[1] https://meta.wikimedia.org/wiki/Research:Data
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
_______________________________________________
Analytics mailing list -- analytics@lists.wikimedia.org
To unsubscribe send an email to analytics-leave@lists.wikimedia.org
--
Jaime Crespo
http://wikimedia.org