Hi:
I'm completely new to analytics in Wikimedia.
We are working with a heritage institution on a GLAM project, and they are interested in access statistics for the resources they have released on Wikimedia. I think I understand the pageviews concept and how to use it but, as far as I can tell, it's not possible to get details such as article pageviews per country. Is this correct? If so, what would be the way to get (or process) the information to produce that data?
Also, I'm reading about the resulting format[1] but I can't find the related logs.
Any suggestions? Thanks.
[1] https://meta.wikimedia.org/wiki/Research:Page_view#Resulting_format
Hi Ismael, responses inline:
On Tue, Dec 20, 2022 at 1:05 PM Ismael Olea ismael@olea.org wrote:
I'm completely new to analytics in Wikimedia.
Welcome! :)
We are working with a heritage institution in a GLAM project and they are
interested in access statistics for the resources they have released in Wikimedia.
Wikimedia CH and Wikimedia Israel have worked on some dashboards showing GLAM statistics. You may find their projects https://glamwikidashboard.org/ interesting. We are currently working https://phabricator.wikimedia.org/T325065 with Wikimedia Israel to move their dashboard to our cloud infrastructure and eventually update our APIs to better serve them with the data they need. Until then, it may be interesting to see what statistics they've focused on and how they get them from the publicly available data we already provide. You can see all this in their source code: https://github.com/yonathan06/cassandra-GLAM-tools
I think I understand the pageviews concept and how to use it but, as far as I can tell, it's not possible to get details such as article pageviews per country. Is this correct?
We have an ongoing project to release per-country per-article pageview information. It's hard for privacy reasons, and we are building a privacy system that takes all that into account. For now, we have pageviews by country at a high level https://stats.wikimedia.org/#/all-projects/reading/page-views-by-country and most viewed articles https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Most_viewed_articles_per_country by country. I'm linking to different parts of our data ecosystem so you can get familiar with it.
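As a rough illustration of the "most viewed articles per country" data linked above, the sketch below builds a request URL in Python. The endpoint path follows the AQS Pageviews documentation as I understand it and should be checked against that page; the helper name `top_per_country_url` and the example country and date are made up for illustration.

```python
from urllib.parse import quote

# Base path of the public Wikimedia REST API (assumption: taken from the
# AQS Pageviews documentation linked above; verify before relying on it).
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"


def top_per_country_url(country: str, year: int, month: int, day: int,
                        access: str = "all-access") -> str:
    """Build the URL listing the most viewed articles for one country and day.

    `country` is a two-letter ISO 3166-1 code such as "ES".
    """
    return (f"{BASE}/top-per-country/{quote(country)}/{access}/"
            f"{year:04d}/{month:02d}/{day:02d}")


# Hypothetical example: Spain, 1 December 2022.
url = top_per_country_url("ES", 2022, 12, 1)
print(url)
```

Note that this only lists the top articles in a country; it does not give per-country counts for an arbitrary article, which is the gap discussed in this thread.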
If so, what should be the way to get (or process) the information to produce the data?
The only way is to help with the ongoing (and complex) differential privacy work https://phabricator.wikimedia.org/T307245
Also, I'm reading about the resulting format[1] but I can't find the
related logs.
Any suggestions? Thanks.
[1] https://meta.wikimedia.org/wiki/Research:Page_view#Resulting_format
I can see how that can be misleading. For GLAMs, usually you would want to download media request statistics https://dumps.wikimedia.org/other/mediacounts/daily/, as the glamwikidashboard I mentioned above does. (They are currently working on getting as much as they can from the media requests api https://wikitech.wikimedia.org/wiki/Analytics/AQS/Mediarequests instead). If you are indeed interested in pageviews, the definition you linked to talks about the data internally available. Can I ask you to elaborate a bit more on why you need per-country data?
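For anyone who wants to try the media requests API mentioned above, here is a minimal sketch of building a per-file request URL in Python. The path layout (referer, agent, encoded file path, granularity, start, end) is my reading of the Mediarequests documentation and should be verified there; the function name `per_file_url` and the example file are illustrative only.

```python
from urllib.parse import quote

# Assumption: base path as described on the AQS Mediarequests wiki page.
BASE = "https://wikimedia.org/api/rest_v1/metrics/mediarequests"


def per_file_url(file_path: str, start: str, end: str,
                 referer: str = "all-referers",
                 agent: str = "all-agents",
                 granularity: str = "daily") -> str:
    """Build the URL for request counts of a single media file.

    `file_path` is the upload path of the file, for example
    "/wikipedia/commons/4/4a/Commons-logo.svg". It must be fully
    percent-encoded, including the slashes, hence safe="".
    `start` and `end` are dates in YYYYMMDD form.
    """
    encoded = quote(file_path, safe="")
    return f"{BASE}/per-file/{referer}/{agent}/{encoded}/{granularity}/{start}/{end}"


# Illustrative example file and date range.
url = per_file_url("/wikipedia/commons/4/4a/Commons-logo.svg",
                   "20221201", "20221231")
print(url)
```

The daily mediacounts dumps cover all files at once, so the API is mostly convenient when only a handful of files are of interest.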
On Tue, Dec 20, 2022 at 7:28 PM Dan Andreescu dandreescu@wikimedia.org wrote:
Hi Ismael, responses inline:
Wikimedia CH and Wikimedia Israel have worked on some dashboards showing GLAM statistics. You may find their projects https://glamwikidashboard.org/ interesting.
I'm well aware of it and love the application. Yonatan was kind enough to add this institution too: https://glamwikidashboard.org/IANDPH
We are currently working https://phabricator.wikimedia.org/T325065 with Wikimedia Israel to move their dashboard to our cloud infrastructure
This is awesome. Congrats and thanks.
If so, what should be the way to get (or process) the information to
produce the data?
The only way is to help with the ongoing (and complex) differential privacy work https://phabricator.wikimedia.org/T307245
I have a systems background, but this is probably outside my skills. How could I help?
[1] https://meta.wikimedia.org/wiki/Research:Page_view#Resulting_format
If you are indeed interested in pageviews, the definition you linked to talks about the data internally available.
Oh!
Can I ask you to elaborate a bit more on why you need per-country data?
Well, first, I've been looking for the most useful tools and sources available (and found many of them very interesting[1]). Second, in this specific case we are running a pilot project in which some academic project results have been published as Wikivoyage itineraries (3 in EN and 3 in ES). These are the articles we are interested in tracking now.
About the rationale: one of the bigger drivers nowadays is the well-known link between heritage, tourism and sustainability (for example, the Sustainable Development Goals), so there is a trend to analyze this context better in order to study and plan. Tourist destinations usually have very well-defined countries of origin. The better you know the origin, the better you can plan. There should also be another positive impact on Wikimedia: new incentives for institutions to create or translate articles into the relevant languages, always restricted to the heritage domain. Here in Spain, tourism is one of the main economic sectors, and anything providing intelligence would help with better planning and conservation.
Also, we have identified a new potential activity area: intelligence analysis of trends in heritage (interest of the public, changes in institutional focus, new relevant practices, etc.), not only for Spanish heritage but worldwide. This is also a scientific institution, and it would find it very useful to collect the most precise traces available (with absolute respect for users' privacy) to look for signals it could use to refocus/prioritize its institutional goals.
So, this is it.
[1] https://toolhub.wikimedia.org/lists/277
The only way is to help with the ongoing (and complex) differential
privacy work https://phabricator.wikimedia.org/T307245
I have a systems background, but this is probably outside my skills. How could I help?
Hm, it's some tricky programming work. I'm not 100% sure of the latest status or opportunities to get involved, but I'm cc-ing Hal Triedman to see if he has thoughts. (Hal, see the archive: https://lists.wikimedia.org/hyperkitty/list/analytics@lists.wikimedia.org/thread/IKL3WOQ2UY7IMMCUTV7EYGT6PFVFLVCA/ )
[1] https://meta.wikimedia.org/wiki/Research:Page_view#Resulting_format
If you are indeed interested in pageviews, the definition you linked to talks about the data internally available.
Oh!
Can I ask you to elaborate a bit more on why you need per-country data?
Well, first, I've been looking for the most useful tools and sources available (and found many of them very interesting[1]). Second, in this specific case we are running a pilot project in which some academic project results have been published as Wikivoyage itineraries (3 in EN and 3 in ES). These are the articles we are interested in tracking now.
About the rationale: one of the bigger drivers nowadays is the well-known link between heritage, tourism and sustainability (for example, the Sustainable Development Goals), so there is a trend to analyze this context better in order to study and plan. Tourist destinations usually have very well-defined countries of origin. The better you know the origin, the better you can plan. There should also be another positive impact on Wikimedia: new incentives for institutions to create or translate articles into the relevant languages, always restricted to the heritage domain. Here in Spain, tourism is one of the main economic sectors, and anything providing intelligence would help with better planning and conservation.
Also, we have identified a new potential activity area: intelligence analysis of trends in heritage (interest of the public, changes in institutional focus, new relevant practices, etc.), not only for Spanish heritage but worldwide. This is also a scientific institution, and it would find it very useful to collect the most precise traces available (with absolute respect for users' privacy) to look for signals it could use to refocus/prioritize its institutional goals.
So, this is it.
This is indeed a very interesting use case and a chance for this data to be very helpful. Unfortunately, to my naive eyes, this granularity of data also carries a big privacy cost. The only way to get to it would be a research collaboration, but there are *lots* of requests for those and not enough researchers to help facilitate. I'm honestly not sure there's an easy way around this... but I'll keep thinking about it, and I know it'll be useful for Hal to see this kind of request and add it to his back burner. Thanks for detailing!
Hi all,
Ismael, thanks so much for reaching out about this. Unfortunately, I think Dan is right when he says that the granularity of data carries a big privacy cost. We're working hard to lower the release threshold of daily unique visitors per country from 1000 to 90, but it seems like these Wikivoyage itineraries are likely to have fewer than 90 daily unique visitors in most countries. Either way, we're hoping to start releasing daily pageviews by country in January, so you should check to see if your pages are in the dataset once that release is live.
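To make the effect of a release threshold concrete, here is a toy Python sketch. It is not the differential privacy system being built (that also involves added noise and other safeguards); it only shows how a minimum of 90 daily unique visitors per country would hide low-traffic rows. The counts are invented, and treating the threshold as inclusive is an assumption.

```python
RELEASE_THRESHOLD = 90  # target value mentioned above; the current one is 1000


def releasable(counts: dict[str, int],
               threshold: int = RELEASE_THRESHOLD) -> dict[str, int]:
    """Keep only the countries whose daily unique visitors reach the threshold.

    Assumption: a count equal to the threshold is released (>=).
    """
    return {country: n for country, n in counts.items() if n >= threshold}


# Hypothetical daily unique visitors for one article, by country.
daily_visitors = {"ES": 140, "FR": 95, "DE": 40, "IT": 12}

print(releasable(daily_visitors))        # ES and FR survive a threshold of 90
print(releasable(daily_visitors, 1000))  # nothing survives a threshold of 1000
```

For a low-traffic page such as a single itinerary, most country rows would fall below either threshold, which is the concern raised above.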
If you want unfettered access to the data (split up by country), you should pursue a research partnership. Besides that, you can likely use some existing tools (like the Pageviews API https://wikimedia.org/api/rest_v1/#/Pageviews%20data/get_metrics_pageviews_top__project___access___year___month___day_ or pageviews.wmcloud.org) to get a sense of the data. I'll be sure to reach back out once the differentially-private data is released so that you might be able to check on the relevant pages!
Thanks again for reaching out :)
Hal
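A minimal Python sketch of using the Pageviews API for a small set of articles, such as the six itineraries, could look like the following. The per-article path layout follows the public REST API; the article title, the User-Agent string and the helper names are placeholders, and the sample response only mimics the `items`/`views` shape the API returns. No per-country split is available from this endpoint.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"


def per_article_url(project: str, article: str, start: str, end: str,
                    access: str = "all-access", agent: str = "user",
                    granularity: str = "daily") -> str:
    """Build the per-article pageviews URL (dates in YYYYMMDD form)."""
    # Titles use underscores instead of spaces and must be percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return (f"{BASE}/per-article/{project}/{access}/{agent}/"
            f"{title}/{granularity}/{start}/{end}")


def total_views(response: dict) -> int:
    """Sum the `views` field over all items of an API response."""
    return sum(item["views"] for item in response.get("items", []))


def fetch(url: str) -> dict:
    """Download and decode one API response (needs network access)."""
    # Placeholder User-Agent: replace with your own tool name and contact.
    req = Request(url, headers={"User-Agent": "glam-stats-example/0.1 (you@example.org)"})
    with urlopen(req) as resp:
        return json.load(resp)


# Placeholder title; substitute the real itinerary names.
url = per_article_url("es.wikivoyage.org", "Example itinerary",
                      "20221201", "20221231")
print(url)

# Offline sample with the same shape as a real response.
sample = {"items": [{"timestamp": "2022120100", "views": 3},
                    {"timestamp": "2022120200", "views": 5}]}
print(total_views(sample))
```

Calling `total_views(fetch(url))` would give the monthly total for a real title; comparing the EN and ES versions of each itinerary gives a per-language view of interest.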
Either way, we're hoping to start releasing daily pageviews by country in January, so you should check to see if your pages are in the dataset once that release is live.
This is really interesting. Where can I subscribe to stay up to date? A Phabricator task? A wiki page? When the service is ready, how old will the published data be? I mean, do you plan to publish data from past years/months, or just start from the service rollout? I guess the answer will be the latter.
If you want unfettered access to the data (split up by country), you should pursue a research partnership.
This is very interesting also. I'll read about it, but I understand the most practical approach would be the new pageviews thing.
Thanks to both Hal and Dan for the precise and helpful answers.
On 21/12/22 19:55, Ismael Olea wrote:
About the rationale: one of the bigger drivers nowadays is the well-known link between heritage, tourism and sustainability (for example, the Sustainable Development Goals), so there is a trend to analyze this context better in order to study and plan. Tourist destinations usually have very well-defined countries of origin.
That's also an example where correlation between language and country of origin has been useful! I'd love to see a replication of Hinnosaar's "Wikipedia matters" but with GLAM contributions and across different languages. http://marit.hinnosaar.net/wikipediamatters.pdf
Federico
Hi all, to clarify:
Cassandra is a Wikimedia CH project. Wikimedia CH has worked on it for more than 5 years and has spent a considerable budget on it.
The project is open and anyone can install it.
Glamwikidashboard is a fork of Cassandra.
However, this means that new improvements will be released ONLY in the Cassandra repository.
Cassandra is quite similar to a data warehouse, which means that its resource requirements grow considerably.
So what you see of Cassandra is only the tip of the iceberg. What matters more is the architecture.
It's important for me to know that WM IL is moving to a cloud, because I told them in advance that a small server would not be sufficient, but I am quite sure that a cloud will work less well.
Cassandra is not only software; it is an architecture based on virtual servers, with SSD- and RAM-based storage to speed up performance.
In addition, Wikimedia CH offers GLAMs a service-based solution, supporting them in case of problems.
This is why Glamwikidashboard is a part of Cassandra whose user interface has been reworked, but it is a fork and it is not Cassandra.
It's like having the body of a Ferrari, but not the engine or the service of a Ferrari.
Kind regards
On Fri, Dec 23, 2022 at 12:26 PM Ilario Valdelli valdelli@gmail.com wrote:
Hi all, to clarify:
Thanks for explaining.
What's your opinion about these related API changes? https://phabricator.wikimedia.org/T321702
We have ruled out using the API from the start.
The only solution is to replicate the data to get better performance. The software has put a lot of development into features for exploring the historical evolution of the data, and this meant a different architecture.
As you probably know, the difference between a database and a data warehouse is precisely the introduction of a third dimension (time), which transforms everything into a cube. This means that the data grows a lot.
In addition, precisely for that reason, the data warehouse CAN replicate data to improve performance. This replication and redundancy make the database grow more and more.
We spent two years preparing the backend precisely to have a local replica of the data, which would also have been helpful for other applications.
As you know, Wikimedia CH also developed the Map Service together with Cassandra.
Kind regards
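The point about the time dimension can be illustrated with a small Python sketch. It is a generic illustration, not Cassandra's actual schema: the file names are invented and `snapshot_rows` is a hypothetical helper. It only shows how keeping one row per file per day multiplies the row count compared with keeping the current state.

```python
from datetime import date, timedelta


def snapshot_rows(files: list[str], start: date, days: int) -> list[tuple[str, date]]:
    """Return one (file, day) key per file per day.

    A plain table of current values needs len(files) rows; keeping the
    history needs len(files) * days rows.
    """
    return [(name, start + timedelta(days=offset))
            for name in files
            for offset in range(days)]


files = ["A.jpg", "B.jpg", "C.jpg"]  # invented file names
rows = snapshot_rows(files, date(2022, 1, 1), 365)

print(len(files))  # rows needed for the current state only
print(len(rows))   # rows needed for one year of daily history
```

Replicated or pre-aggregated copies of the same data, kept to speed up queries, add to this growth.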
On 20/12/22 20:03, Ismael Olea wrote:
We are working with a heritage institution on a GLAM project, and they are interested in access statistics for the resources they have released on Wikimedia. I think I understand the pageviews concept and how to use it but, as far as I can tell, it's not possible to get details such as article pageviews per country.
Depending on what you're interested in, it might be a sufficiently good approximation to look at usage by language.
The short case study about BEIC https://doi.org/10.4403/jlis.it-12481 can give some ideas for statistics to track. See also
- https://commons.wikimedia.org/wiki/Commons:BEIC (brief overview)
- https://it.wikipedia.org/wiki/Progetto:GLAM/BEIC/2015-07#Sommario (analysis in Italian)
If you know the totals for the downloads from Commons, and you get some idea of the distribution by looking at the usage by language or other sources, that might be enough. There's always a certain level of uncertainty, so the exact absolute numbers are rarely that telling. BEIC for example was interested in the (order of magnitude of the) totals and it was useful to know the approximate share of traffic/interest from outside Italy (was it 1, 10 or 99 %?) and how much was due to ongoing "external" interest.
Note that in the mediacounts you can get additional hints by checking the share of requests coming from typical visits on Wikipedia (default thumbnail sizes), visits on Commons or downloads (raw files) and hotlinks.
Federico
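The language-based approximation described above comes down to simple arithmetic, sketched here in Python. The view counts are invented, and using the Italian-language share as a stand-in for "traffic from Italy" is exactly the approximation being suggested, not a measured per-country figure.

```python
def shares(views_by_language: dict[str, int]) -> dict[str, float]:
    """Return each language's percentage of the total views."""
    total = sum(views_by_language.values())
    if total == 0:
        return {lang: 0.0 for lang in views_by_language}
    return {lang: round(100 * n / total, 1)
            for lang, n in views_by_language.items()}


# Invented monthly totals per language edition.
views = {"it": 700, "en": 200, "es": 100}

by_language = shares(views)
# Rough proxy for interest from outside Italy: everything that is not Italian.
outside_italian = round(100 - by_language["it"], 1)

print(by_language)
print(outside_italian)
```

As noted above, the order of magnitude (is it 1, 10 or 99 %?) is usually more informative than the exact figure.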
I'm going to check your tips carefully. Thanks a lot.