Not that I disagree with Jaime's points, but I want to toot the PAWS horn a
bit here. These days, my understanding is that database access in PAWS is
basically the same as in Toolforge, and the limit for keeping your server up
without activity is 24 hours. So it can be used beyond testing for a lot of
use cases.
Chico Venancio
On Fri, Sep 17, 2021 at 03:55, Jaime Crespo <jcrespo(a)wikimedia.org>
wrote:
> My preferred choice now would be to query the specific tables in the
> Wikipedia database, in the same way this is done through the Quarry tool
> I would like to build a standalone script, without having to depend on a
> user interface like Quarry
> I am not interested in accessing sensitive or private data.
Based on your needs, it seems to me that what you want is direct database
access to the wikireplicas.[0] As you said, Toolforge (Cloud services in
general) will grant you that: direct access to query the database. It is
intended for the general public and generally does not involve access to
private data.
What is the difference between Toolforge and PAWS? PAWS is a similar idea
to Quarry: a friendlier interface to a subset of Cloud services, including
the wikireplicas. You won't need to set up an account or software (other
than a regular MediaWiki account and browser access), and it is the best
way to share snippets of code and a workflow very quickly. If you are
familiar with Jupyter notebooks, PAWS is just a hosted installation of
Jupyter. Here is an example notebook: [9]
However, it is not suitable for heavy querying or for recurrent automated
actions/full scripts that do things on their own (you would go from being
limited by Quarry to being limited by PAWS), so in your case I would
suggest going through the longer process of getting Toolforge access.
You can still test PAWS very easily[8] and decide for yourself whether a
notebook is enough for you, or whether you need standalone scripting.
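To make the comparison concrete, here is a minimal sketch of what a PAWS notebook cell querying the wikireplicas could look like. It assumes the pymysql driver and a ~/.my.cnf credentials file are available in the notebook environment; the `fetch_sample_titles` helper name is mine, not part of PAWS:

```python
import os

# Wiki-replica naming conventions: the host is
# <wikidb>.analytics.db.svc.wikimedia.cloud and the replica database
# adds a "_p" suffix to the production database name.
WIKIDB = "enwiki"
REPLICA_HOST = f"{WIKIDB}.analytics.db.svc.wikimedia.cloud"
REPLICA_DB = f"{WIKIDB}_p"


def fetch_sample_titles(limit: int = 5):
    """Fetch a few main-namespace page titles (works only inside PAWS)."""
    import pymysql  # imported lazily; available in PAWS notebooks

    conn = pymysql.connect(
        host=REPLICA_HOST,
        database=REPLICA_DB,
        # PAWS provides replica credentials in a .my.cnf-style file.
        read_default_file=os.path.expanduser("~/.my.cnf"),
        charset="utf8mb4",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT page_title FROM page WHERE page_namespace = 0 LIMIT %s",
                (limit,),
            )
            # page_title is stored as binary; decode it for display.
            return [row[0].decode("utf-8") for row in cur.fetchall()]
    finally:
        conn.close()
```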
The wikireplicas are a shared environment, so you will share resources
with the rest of the users: you will not get dedicated resources, and you
will get rate-limited if your querying is so heavy that it prevents other
users from also using the service. That is why, in some cases, people
prefer to download the dumps and analyze them on their local computers,
which they can do as fast as their hardware allows.
The wikireplicas are a near-real-time copy of the production databases
running on MariaDB, which means analytics-like queries are possible,
although the databases are not optimized for them. E.g. calculating the
total number of revisions will require reading the entire revision table!
But since you said you would want Quarry with scripting, this is the best
alternative: Quarry itself uses the wikireplicas! :-)
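To illustrate the point about analytic queries, here is my own sketch of the contrast (table and column names are from the public MediaWiki schema): counting all revisions scans the whole table, while a query restricted by an indexed column such as rev_page only touches one page's rows.

```python
# Full scan: MariaDB has to read every row of the revision table, so on a
# large wiki this is exactly the kind of query that hits rate limits.
COUNT_ALL_REVISIONS = "SELECT COUNT(*) FROM revision"

# Indexed lookup: rev_page is indexed, so counting the revisions of a
# single page is cheap by comparison.
COUNT_PAGE_REVISIONS = (
    "SELECT COUNT(*) FROM revision WHERE rev_page = %(page_id)s"
)
```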
To get a toolforge account, which is part of the Wikimedia Cloud Services,
you will need to:
1. Create a Wikimedia developer account: [1]
2. Create an ssh key [2]
3. Request access [3]
4. Create a new tool [4]
5. Access your tool and start developing [5]
From this point on, what you do will depend on your chosen scripting
language, but there is documentation on Wikitech [6] and a lot of support
options from its users [7] (I recommend contacting other users on IRC or
the mailing list if you get stuck), and you will be able to query the
database directly with your custom queries!
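As a sketch of that final step (assuming pymysql and the ~/replica.my.cnf credentials file that Toolforge generates for each tool; the helper names are mine), a standalone script on Toolforge could query the replicas like this:

```python
import os


def replica_connection_kwargs(wikidb: str = "enwiki") -> dict:
    """Build pymysql connection settings for a wiki-replica database.

    Host names follow <wikidb>.analytics.db.svc.wikimedia.cloud, the
    replica database adds a "_p" suffix, and Toolforge stores the tool's
    credentials in ~/replica.my.cnf, which pymysql can read directly.
    """
    return {
        "host": f"{wikidb}.analytics.db.svc.wikimedia.cloud",
        "database": f"{wikidb}_p",
        "read_default_file": os.path.expanduser("~/replica.my.cnf"),
        "charset": "utf8mb4",
    }


def run_query(sql: str, params=None, wikidb: str = "enwiki"):
    """Run one read-only query and return all rows (Toolforge only)."""
    import pymysql  # imported lazily; installable with pip on Toolforge

    conn = pymysql.connect(**replica_connection_kwargs(wikidb))
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()
```

From a tool account, something like run_query("SELECT COUNT(*) FROM revision WHERE rev_page = %s", (page_id,)) would then give you the edit count of a single page.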
Hope this is useful,
--
Jaime
[0] <https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database>
[1] <https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_developer_accou…>
[2] <https://www.mediawiki.org/wiki/Gerrit/Tutorial#Generate_a_new_SSH_key>
[3] <https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart#Getting_sta…>
[4] <https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Tool_Accounts#Create_t…>
[5] <https://wikitech.wikimedia.org/wiki/Help:Access_to_Toolforge_instances_with…>
[6] <https://wikitech.wikimedia.org/wiki/Portal:Toolforge>
[7] <https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart#Communicati…>
[8] <https://wikitech.wikimedia.org/wiki/PAWS>
[9] <https://public.paws.wmcloud.org/User:JHernandez_(WMF)/Accessing%20Wikirepli…>
On Fri, Sep 17, 2021 at 1:08 AM Cristina Gava via Analytics
<analytics(a)lists.wikimedia.org> wrote:
[1] https://meta.wikimedia.org/wiki/Research:Data
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
_______________________________________________
Analytics mailing list -- analytics(a)lists.wikimedia.org
To unsubscribe send an email to analytics-leave(a)lists.wikimedia.org
--
Jaime Crespo
<http://wikimedia.org>