Hey there!
I just wrote a script that fetches data from the AQS new pages endpoint https://wikimedia.org/api/rest_v1/#!/Edited_pages_data/get_metrics_edited_pages_new_project_editor_type_page_type_granularity_start_end in order to prepare our monthly health metrics (T199459 https://phabricator.wikimedia.org/T199459).
However, it seems like that endpoint doesn't yet have monthly data for September. For example, a query for Commons with a start of July 1 and an end of October 1 https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/commons.wikimedia.org/all-editor-types/content/monthly/20180701/20181001 returns only data for July and August. What's the schedule for updating this data?
To be honest, I feel pretty frustrated by this. Wikistats 1 generates data on content pages with a delay of 10-15 days after the end of the month, which has made it difficult for us to provide timely metrics to executives and the board. I had assumed (to a degree that I didn't even check) that by switching to this API, we would instead only have to deal with the delay in generating the mediawiki_history snapshot (5-7 days after the end of the month). But that doesn't seem to be the case.
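For anyone who wants to reproduce the check described above, here is a minimal sketch in Python. It rebuilds the Commons query URL from the example and reports which months are absent from a response. The helper names are hypothetical, and two things are assumptions rather than documented facts: that the response has the shape `{"items": [{"results": [{"timestamp": ...}]}]}` with timestamps beginning `YYYY-MM`, and that the end date is exclusive for monthly granularity.

```python
from datetime import date

# Base path taken from the example URL in this thread.
BASE = "https://wikimedia.org/api/rest_v1/metrics/edited-pages/new"


def build_url(project, start, end, editor_type="all-editor-types",
              page_type="content", granularity="monthly"):
    """Build a new-pages query URL in the same form as the example above."""
    return "/".join([
        BASE, project, editor_type, page_type, granularity,
        start.strftime("%Y%m%d"), end.strftime("%Y%m%d"),
    ])


def expected_months(start, end):
    """List 'YYYY-MM' strings from start's month up to, but not including,
    end's month (assumption: the end date is exclusive)."""
    months = []
    year, month = start.year, start.month
    while (year, month) < (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def missing_months(payload, start, end):
    """Return the expected months that have no entry in the response.

    Assumption: payload looks like
    {"items": [{"results": [{"timestamp": "2018-07-01T00:00:00.000Z"}]}]}.
    """
    present = {
        result["timestamp"][:7]
        for item in payload.get("items", [])
        for result in item.get("results", [])
    }
    return [m for m in expected_months(start, end) if m not in present]
```

With a decoded JSON response that contains only July and August entries, `missing_months(payload, date(2018, 7, 1), date(2018, 10, 1))` would return `["2018-09"]`, which is the situation described above. Fetching the URL itself is left out so the sketch stays independent of any particular HTTP client.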
It should be updated soon; the jobs have all finished successfully. But currently we do expect this kind of lag, and I'll explain why.
When we started, we were sqooping at the beginning of the month, and the processing takes something like 4 days total, most of it sqooping. But this put too much load on the database servers too close to the beginning of the month, when a bunch of other stuff is running. So we had to move it back to the 5th of the month [1]. Add 4 days onto that and we end up finishing around the 9th of the month. We don't like this at all, and we're trying to figure out a better way to import the data incrementally so we can just start processing when we have all of it. It's unfortunate, but we couldn't foresee the infrastructure limitation; too much was up in the air about even where we would sqoop from when we started this work. Joseph and I have a weekly meeting to discuss moving towards a more incremental approach, and this task is the parent task to watch for now: https://phabricator.wikimedia.org/T193650 (priority is low because we have too many other commitments, but it's something I'd love to see before we call Wikistats 2 "production" quality).
[1] https://github.com/wikimedia/puppet/blob/28b78985d3612a6e19720be1fe8eef5f0df...
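The timeline above is simple date arithmetic, sketched here in Python for anyone scheduling a report around it. The start day (the 5th) and the roughly 4 days of processing come from this message. The function name is hypothetical, and the result is only an estimate, not a guaranteed publication date.

```python
from datetime import date, timedelta

SQOOP_START_DAY = 5   # import starts on the 5th of the following month
PROCESSING_DAYS = 4   # "something like 4 days total", an approximation


def estimated_snapshot_ready(year, month):
    """Estimate when the snapshot covering (year, month) finishes.

    The import runs in the month after the data month, so roll the
    calendar forward by one month before adding the processing time.
    """
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    start = date(next_year, next_month, SQOOP_START_DAY)
    return start + timedelta(days=PROCESSING_DAYS)
```

For September 2018 this gives October 9, 2018, matching the "around the 9th" estimate; loading into the API may add some further delay on top of that.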
On Wed, Oct 10, 2018 at 10:00 PM Neil Patel Quinn nquinn@wikimedia.org wrote:
Hey there!
I just wrote a script that fetches data from the AQS new pages endpoint https://wikimedia.org/api/rest_v1/#!/Edited_pages_data/get_metrics_edited_pages_new_project_editor_type_page_type_granularity_start_end in order to prepare our monthly health metrics (T199459 https://phabricator.wikimedia.org/T199459).
However, it seems like that endpoint doesn't yet have monthly data for September. For example, a query for Commons with a start of July 1 and an end of October 1 https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/commons.wikimedia.org/all-editor-types/content/monthly/20180701/20181001 returns only data for July and August. What's the schedule for updating this data?
To be honest, I feel pretty frustrated by this. Wikistats 1 generates data on content pages with a delay of 10-15 days after the end of the month, which has made it difficult for us to provide timely metrics to executives and the board. I had assumed (to a degree that I didn't even check) that by switching to this API, we would instead only have to deal with the delay in generating the mediawiki_history snapshot (5-7 days after the end of the month). But that doesn't seem to be the case.
--
Neil Patel Quinn https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF (he/him/his)
product analyst, Wikimedia Foundation
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
"Wikistats 1 generates data on content pages with a delay of 10-15 days after the end of the month"
This is true for full snapshots (for the reasons we have discussed before and that Dan has described on this thread). You can expect data to be available on the API soon after the 10th, but it is unlikely to be there before the 10th, as we do not start the process until the 5th.
Now, data, as you know, is streamed in real time, every second. So it is only the full reconstruction of events, the full snapshot, that takes several days to build. Have you looked into using the real-time events when the next month's snapshot is not yet available?
On Wed, Oct 10, 2018 at 7:48 PM Dan Andreescu dandreescu@wikimedia.org wrote:
It should be updated soon; the jobs have all finished successfully. But currently we do expect this kind of lag, and I'll explain why.
When we started, we were sqooping at the beginning of the month, and the processing takes something like 4 days total, most of it sqooping. But this put too much load on the database servers too close to the beginning of the month, when a bunch of other stuff is running. So we had to move it back to the 5th of the month [1]. Add 4 days onto that and we end up finishing around the 9th of the month. We don't like this at all, and we're trying to figure out a better way to import the data incrementally so we can just start processing when we have all of it. It's unfortunate, but we couldn't foresee the infrastructure limitation; too much was up in the air about even where we would sqoop from when we started this work. Joseph and I have a weekly meeting to discuss moving towards a more incremental approach, and this task is the parent task to watch for now: https://phabricator.wikimedia.org/T193650 (priority is low because we have too many other commitments, but it's something I'd love to see before we call Wikistats 2 "production" quality).
[1] https://github.com/wikimedia/puppet/blob/28b78985d3612a6e19720be1fe8eef5f0df...
Data has been updated this morning (CEST).
On Thu, Oct 11, 2018 at 5:13 AM Nuria Ruiz nuria@wikimedia.org wrote:
"Wikistats 1 generates data on content pages with a delay of 10-15 days after the end of the month"
This is true for full snapshots (for the reasons we have discussed before and that Dan has described on this thread). You can expect data to be available on the API soon after the 10th, but it is unlikely to be there before the 10th, as we do not start the process until the 5th.
Now, data, as you know, is streamed in real time, every second. So it is only the full reconstruction of events, the full snapshot, that takes several days to build. Have you looked into using the real-time events when the next month's snapshot is not yet available?