Hello everyone,
This is my first time posting to this mailing list, so I'd be happy to receive feedback on how best to interact with the community :)
I am trying to access Wikipedia metadata in a streaming, time- and resource-sustainable manner. By metadata I mean many of the items that appear in the statistics of a wiki article, such as edits, the list of editors, page views, etc. I would like to use this for an online-classifier type of structure: retrieve the data for a large number of wiki pages at regular intervals and use it as input for predictions.
I tried the MediaWiki API, but it is expensive in time and resources, both for me and for Wikipedia.
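To give an idea of what I mean, the API route looks roughly like this (a minimal sketch of standard action-API usage, not my exact code):

    # Sketch: fetch basic metadata (length, last-touched time) for a
    # batch of titles with a single action-API request.
    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"

    def fetch_page_info(titles):
        params = {
            "action": "query",
            "prop": "info",
            "titles": "|".join(titles),  # up to ~50 titles per request
            "format": "json",
            "formatversion": "2",
        }
        resp = requests.get(API_URL, params=params,
                            headers={"User-Agent": "metadata-poller/0.1 (example)"})
        resp.raise_for_status()
        return resp.json()["query"]["pages"]

    for page in fetch_page_info(["Wikipedia", "Douglas Adams"]):
        print(page["title"], page["length"], page["touched"])

Even with batching, polling many thousands of pages every few minutes multiplies into a lot of requests, which is exactly the cost I would like to avoid.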
My preferred option now would be to query the relevant tables in the Wikipedia database directly, in the same way this is done through the Quarry tool. The problem with Quarry is that I would like to build a standalone script, without depending on a user interface like Quarry. Do you think this is possible? I am still fairly new to all of this and I don't know which direction is best. I saw at [1] that I could access the wiki replicas through both Toolforge and PAWS, but I didn't understand which one would serve me better; could I ask you for some feedback?
Also, as far as I understood from [2], directly accessing the database through Hive is too technical for what I need, right? Especially because it seems I would need an account with production shell access, and I honestly don't think I would be granted that. In any case, I am not interested in accessing sensitive or private data.
My last resort is parsing the analytics dumps, but this seems a less streamlined way of retrieving and cleaning the data. It would also be strongly decentralised and dependent on a physical machine, unless I upload the cleaned data somewhere every time.
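To clarify the dump option: from the documented file format, streaming one hourly pageviews file would look roughly like this (a sketch; the file name is just an example hour, and the "<wiki> <title> <views> <bytes>" line format is described at https://dumps.wikimedia.org/other/pageviews/):

    # Sketch: stream one hourly pageviews dump and pick out a few pages,
    # without writing the file to disk.
    import gzip
    import requests

    URL = ("https://dumps.wikimedia.org/other/pageviews/"
           "2021/2021-09/pageviews-20210916-140000.gz")  # example hour

    resp = requests.get(URL, stream=True)
    resp.raise_for_status()
    with gzip.GzipFile(fileobj=resp.raw) as fh:
        for raw in fh:
            parts = raw.decode("utf-8", "replace").rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # skip occasional malformed lines
            wiki, title, views, _size = parts
            if wiki == "en" and title == "Douglas_Adams":
                print(title, views)

So it is doable, but I would have to re-download and re-parse files like this at every interval and for every metric, which is what feels less streamlined.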
Sorry for the long message, but I thought it was better to give you a clear picture (I hope this is clear enough). Even a hint would be highly appreciated.
Best, Cristina
[1] https://meta.wikimedia.org/wiki/Research:Data
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
Hi Cristina,
I'd recommend Toolforge, which I use to run regular queries that power some of my bot tools. For an example of a Python script I run there to query info and FTP it somewhere I can easily access, see: https://bitbucket.org/mikepeel/wikicode/src/master/query_enwp_articles_no_wi...
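In outline, the replica-query pattern is just a normal SQL connection, something like this (a rough sketch rather than my exact script; the host name and the ~/replica.my.cnf credentials file are the standard Toolforge conventions):

    # Sketch: query the enwiki replica from a Toolforge account.
    import pymysql

    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.wikimedia.cloud",
        database="enwiki_p",  # "_p" marks the public replica views
        read_default_file="~/replica.my.cnf",  # provided by Toolforge
    )
    try:
        with conn.cursor() as cur:
            # Example: total number of revisions of one article.
            cur.execute(
                """
                SELECT COUNT(*)
                FROM revision
                JOIN page ON rev_page = page_id
                WHERE page_namespace = 0 AND page_title = %s
                """,
                ("Douglas_Adams",),  # DB titles use underscores
            )
            print("edit count:", cur.fetchone()[0])
    finally:
        conn.close()

Any SQL you can run in Quarry can be run this way from a scheduled job, so there is no dependence on the Quarry interface.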
Thanks, Mike
Mike's suggestion is good. You would likely get better responses by asking this question on the Wikimedia developers' list, so I am forwarding it there.
Risker