TL;DR: I would like to access Wikipedia articles' metadata (such as number of edits, pageviews, etc.). I need to access a large volume of instances in order to train and maintain an online classifier, and the API does not seem sustainable for that. I was wondering which tool is the most appropriate for this task.
Hello everyone,
This is my first time interacting on this mailing list, so I will be happy to receive feedback on how to better interact with the community :) I cross-posted this message to Wiki-research-l as well.
I am trying to access Wikipedia metadata in a streaming and time/resource-sustainable manner. By metadata I mean many of the items that can be found in the statistics of a wiki article, such as edits, the list of editors, page views, etc. I would like to do this for an online-classifier type of setup: retrieve the data from a large number of wiki pages at regular intervals and use it as input for predictions.

I tried using the wiki API; however, it is time and resource expensive, both for me and for Wikipedia.
My preferred choice right now would be to query the specific tables in the Wikipedia database, in the same way this is done through the Quarry tool. The problem with Quarry is that I would like to build a standalone script, without having to depend on a user interface like Quarry. Do you think this is possible? I am still fairly new to all of this and I don't know exactly what the best direction is. I saw [1] that I could access the Wiki Replicas both through Toolforge and through PAWS, but I didn't understand which one would serve me better; could I ask you for some feedback?

Also, as far as I understood [2], directly accessing the DB through Hive is too technical for what I need, right? Especially because it seems that I would need an account with production shell access, and I honestly don't think I would be granted one. Also, I am not interested in accessing sensitive or private data.

The last resort would be parsing the analytics dumps; however, this seems a less organic way of retrieving and cleaning the data. It would also be strongly decentralised and tied to a physical machine, unless I upload the processed data somewhere every time.

Sorry for the long message, but I thought it was better to give you a clearer picture (hoping this is clear enough). Even a small hint would be highly appreciated.
Best, Cristina

[1] https://meta.wikimedia.org/wiki/Research:Data
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
> My preferred choice right now would be to query the specific tables in the Wikipedia database, in the same way this is done through the Quarry tool.

> I would like to build a standalone script, without having to depend on a user interface like Quarry.

> I am not interested in accessing sensitive or private data.
Based on your needs, it seems to me that what you want is direct database access to the Wiki Replicas. [0] As you said, Toolforge (Cloud Services in general) will grant you exactly that: direct access to query the database. It is intended for the general public and generally does not require private data access.
What is the difference between Toolforge and PAWS? PAWS is a similar idea to Quarry: a friendlier interface to a subset of Cloud services, including the Wiki Replicas. You won't need to set up an account or any software (other than a regular MediaWiki account and browser access), and it is the best way to share snippets of code and a workflow very quickly, with very little setup. If you are familiar with Jupyter notebooks, it is just a hosted installation of those; here is an example notebook: [9]. However, it is not suitable for heavy querying or for recurring automated actions/full scripts that run on their own (you would go from being limited by Quarry to being limited by PAWS), so in your case I would suggest going through the longer process of getting Toolforge access. You can still test PAWS very easily [8] and decide for yourself whether a notebook is enough for you or whether you need standalone scripting.
The Wiki Replicas are a shared environment, so you will share resources with the rest of the users: you will not get dedicated resources, and you will be rate-limited if your querying is heavy enough to prevent other users from also using them. That is why, in some cases, people prefer to download the dumps and analyze them on their local computers, which they can do as fast as they want.

As they are a real-time copy of the production databases, they run MariaDB, which means it is possible to run analytics-like queries, although the setup is not optimized for them. E.g. calculating the total number of revisions will require reading the entire revision table! But since you said you want Quarry with scripting, this is the best alternative: Quarry itself uses the Wiki Replicas! :-)
To get a Toolforge account, which is part of the Wikimedia Cloud Services, you will need to:
1. Create a Wikimedia developer account [1]
2. Create an SSH key [2]
3. Request access [3]
4. Create a new tool [4]
5. Access your tool and start developing [5]
From this point on, what you do will depend on your chosen scripting language, but there is documentation at Wikitech [6] and plenty of support options from its users [7] (I recommend asking on IRC or the mailing list if you get stuck), and you will be able to run your own custom queries directly against the database!
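To give an idea of what that standalone script could look like, here is a minimal sketch (not an official example): it assumes Python with the pymysql library, the standard ~/replica.my.cnf credentials file that Toolforge creates for each tool, and the enwiki analytics replica hostname; the page title in the query is just a placeholder.

# Minimal sketch: one metadata query against the Wiki Replicas from Toolforge.
# Assumes the ~/replica.my.cnf credentials file provided to every tool and the
# enwiki analytics replica host; adjust host/database for other wikis.
import configparser
import os

import pymysql

config = configparser.ConfigParser()
config.read(os.path.expanduser("~/replica.my.cnf"))

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    user=config["client"]["user"].strip("'"),
    password=config["client"]["password"].strip("'"),
)

# Example metadata query: number of revisions and distinct editors of one page
# ("Rome" is just a placeholder title in the main namespace).
query = """
    SELECT COUNT(*) AS edits, COUNT(DISTINCT rev_actor) AS editors
    FROM revision
    JOIN page ON rev_page = page_id
    WHERE page_namespace = 0 AND page_title = %s
"""

with conn.cursor() as cur:
    cur.execute(query, ("Rome",))
    edits, editors = cur.fetchone()
    print(f"edits: {edits}, distinct editors: {editors}")

conn.close()

From there, looping over a list of page titles (or batching them into a single query) is straightforward, and the same connection pattern can be reused from a scheduled job running on Toolforge.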
Hope this is useful, -- Jaime
[0] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database
[1] https://wikitech.wikimedia.org/wiki/Help:Create_a_Wikimedia_developer_accoun...
[2] https://www.mediawiki.org/wiki/Gerrit/Tutorial#Generate_a_new_SSH_key
[3] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart#Getting_star...
[4] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Tool_Accounts#Create_to...
[5] https://wikitech.wikimedia.org/wiki/Help:Access_to_Toolforge_instances_with_...
[6] https://wikitech.wikimedia.org/wiki/Portal:Toolforge
[7] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart#Communicatio...
[8] https://wikitech.wikimedia.org/wiki/PAWS
[9] https://public.paws.wmcloud.org/User:JHernandez_(WMF)/Accessing%20Wikireplic...
Hi Jaime,
Thank you so much for the thorough reply :) All the references are super useful and I'll go through them now. I'll start with Toolforge, since it seems there is consensus on it being the most appropriate tool, and leave the dumps for later if needed. I'll keep you posted.
Cristina
It will depend a lot on the type of research needed. For example (to play devil's advocate with a simple case), if you wanted to count the total number of words ever written on Wikipedia and observe their frequency (meaning reading all edits in history), the dumps would be a much better option, as the Wiki Replicas only contain metadata, not the actual content. On top of that, reading all edits sequentially will be much faster from a downloaded bundle, while the live MariaDB database is faster for small slices with specific conditions, or small to medium ranges.

I think starting with the Wiki Replicas and moving to the dumps later if they turn out not to work for you is a totally reasonable decision in general, as it will require less investment in your local setup.

-- Jaime Crespo
Hi Cristina,
Happy to see you here :) Just to add on top of Jaime's answer, here is an example of a Python-based (Flask) app on Toolforge: https://wikitech.wikimedia.org/wiki/Help:Toolforge/My_first_Flask_OAuth_tool
Hope this helps, Best, Diego
Hi Diego,
Happy to meet you here too :) Awesome, thanks a lot, I will definitely look into that.
Best, Cristina
Hi Cristina, have you had a chance to read https://dumps.wikimedia.org/other/mediawiki_history/readme.html more closely? It sounds a lot like what you might need. We're consolidating all the confusing pageview dumps into a single one as well:
https://dumps.wikimedia.org/other/pageview_complete/readme.html
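In case you end up going down the dumps route, a rough sketch of how a script could stream one of the mediawiki_history TSV files (without loading it all into memory) is below. The file name is a placeholder, and the column positions are an assumption based on the readme's field list (wiki_db, event_entity, event_type, event_timestamp, ...), so please verify them against the schema documented next to the files.

# Rough sketch: stream one mediawiki_history TSV dump and count revision
# creations per month. Field positions are assumed from the readme's field
# list; verify them before relying on the numbers.
import bz2
import csv
from collections import Counter

csv.field_size_limit(10**7)  # edit comments and other text fields can be long

DUMP_FILE = "2021-08.enwiki.all-time.tsv.bz2"  # placeholder file name

COL_EVENT_ENTITY = 1     # "revision", "page" or "user"   (assumed position)
COL_EVENT_TYPE = 2       # "create", "delete", ...        (assumed position)
COL_EVENT_TIMESTAMP = 3  # e.g. "2021-08-31 23:59:59"     (assumed position)

revisions_per_month = Counter()

with bz2.open(DUMP_FILE, mode="rt", newline="") as handle:
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) <= COL_EVENT_TIMESTAMP:
            continue  # skip malformed/short lines
        if row[COL_EVENT_ENTITY] == "revision" and row[COL_EVENT_TYPE] == "create":
            revisions_per_month[row[COL_EVENT_TIMESTAMP][:7]] += 1  # YYYY-MM

for month, count in sorted(revisions_per_month.items()):
    print(month, count)

The same loop is where you would pull out whichever per-page fields your classifier needs instead of just counting.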
Let us know if you have any questions.
Hi Dan,
Thanks a lot. I think I bumped into that link at some point and then wasn't able to find it again. There is one point that is not entirely clear to me:
"Thus, note that incremental downloads of these dumps may generate inconsistent data. Consider using EventStreams for real time updates on MediaWiki changes (API docs)."
I am planning to retrieve updated versions of the metadata regularly, so I guess I have to use EventStreams to access the recent changes? AFAIU the recent changes come from the RecentChanges table [1]. So what would be a proper sequence of actions? For example:

1. Download the mediawiki_history dump once and parse it
2. For every new update of my data pool, access recent changes through EventStreams as per [2]

Did I understand this correctly?

Last thing: in the pageview archive there are three types of file: automated, spider and user. Am I right in understanding that "user" relates to pageviews made by real people, while "automated" and "spider" come from programs (not sure about the difference between the two)?
Cristina
[1] https://www.mediawiki.org/wiki/Manual:Recentchanges_table [2] https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams
Hi Cristina!
Regarding the question:

> Last thing: in the pageview archive there are three types of file: automated, spider and user. Am I right in understanding that "user" relates to pageviews made by real people, while "automated" and "spider" come from programs (not sure about the difference between the two)?
Yes, "user" relates to pageviews operated by real people. "Spider" pageviews are those generated by self-declared bots, the ones that are identified as such in their UserAgent header (for instance web crawlers). "Automated" pageviews are those generated by bots that are not identified as such. They are labelled separately because we use different methods for labelling them: the spider pageviews are identified by parsing the UserAgent string, and the automated ones are identified with request pattern heuristics.
Hope this helps!
"Thus, note that incremental downloads of these dumps may generate inconsistent data. Consider using EventStreams for real time updates on MediaWiki changes (API docs)."
I can see how that's confusing. I'll try to re-word it and then answer your other questions below. So this is basically saying that if you want to process the whole history every month, this dataset will work ok. But if you plan on doing something like:
* It's 2021-09 and you download the whole dump and process it
* In 2021-10 the new dump comes out, you download it but only process the events with timestamps between 2021-09 and 2021-10
That won't work, because some of the updates might be made to historical records with timestamps long before 2021-09. That little quirk is what allows us to add high-value fields like "time between this revision and the next revision" or "is this revision deleted at some point in the future".
> I am planning to retrieve updated versions of the metadata regularly, so I guess I have to use EventStreams to access the recent changes? AFAIU the recent changes come from the RecentChanges table [1]. So what would be a proper sequence of actions? For example:
> 1. Download the mediawiki_history dump once and parse it
> 2. For every new update of my data pool, access recent changes through EventStreams as per [2]
> Did I understand this correctly?
This would work, but it would indeed be a bit more complicated. If you absolutely need data every minute, hour, or day, then this would be one choice. One downside is that it would be hard to compute some of the fields we provide in the full dump, so if you can wait a month for the refreshed dump, that's better. The options Jaime gave might also work better; it depends on your requirements and what you're comfortable with. In the long term we hope to release a version of this dataset that is updated more frequently, hopefully daily (but that is more than a year away).
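If you do go with EventStreams for the incremental part, a minimal consumer sketch could look like the following (using the sseclient package, roughly along the lines of the example on the wikitech EventStreams page; the enwiki/edit filter and the print are just an illustration of where your update logic would go).

# Minimal sketch: follow the recentchange stream and react to enwiki edits.
# In a real pipeline you would update the pages touched since the last
# refresh instead of printing them.
import json

from sseclient import SSEClient as EventSource

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

for event in EventSource(STREAM_URL):
    if event.event != "message" or not event.data:
        continue
    try:
        change = json.loads(event.data)
    except ValueError:
        continue  # skip keep-alive or partial messages
    if change.get("wiki") == "enwiki" and change.get("type") == "edit":
        print(change["timestamp"], change["title"], change["user"])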
> Last thing: in the pageview archive there are three types of file: automated, spider and user. Am I right in understanding that "user" relates to pageviews made by real people, while "automated" and "spider" come from programs (not sure about the difference between the two)?
Yes, "user" is our best heuristic guess at the part of our traffic that is initiated by humans (or smart members of other species :)). "Automated" is our guess at bot traffic that doesn't identify itself, and "spider" is traffic that does identify itself (such as the Google crawler bot).
Ooooh I see, that makes a lot of sense! The current idea would be to have a monthly refresh of our data pool, so perhaps this eases the requirements a bit.

As I said, we were a bit reluctant to dive straight into the dumps, because it felt less portable. But after discussing with you all I now have a better idea of the picture and of which dumps I should consider, so I will weigh this option more as well.

I'll let you know; most probably I will come back soon with other questions :)
Cristina
I see. Honestly, at the moment I can't predict whether my needs are reasonable or too heavy for the Wiki Replicas, so yes, I guess following this order is sensible.
Not that I disagree with Jaime's points, but I want to toot the PAWS horn a bit here. These days, my understanding is that DB access in PAWS is basically the same as in Toolforge, and the limit for keeping your server up without it being accessed is 24 hours. So it can be used beyond testing for a lot of use cases.
Chico Venancio