Hello! I have some questions about the latency of some Wikipedia REST endpoints from
https://wikimedia.org/api/rest_v1
I see that I can get very recent pageviews data, e.g.
https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/all-agents/hourly/2018032100/2018032300
accessed now, on 2018/03/22 at 02:49 UTC, gives me hourly pageview counts on the English Wikipedia through timestamp "2018032200", so with about 4 hours of latency, very nice!
In contrast, asking for the daily number of edits via
https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/en.wikipedia/all-editor-types/all-page-types/daily/20180225/20180321
only gives me data up to the end of February, with no March data. Does this mean the daily datasets are generated only once a month? How might I get access to more recent daily data from the "rest_v1/metrics/edits" endpoints?
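In case it's useful, here is roughly how I'm checking (a minimal sketch using Python's requests library; the URLs are the two endpoints above, and the User-Agent string is just a placeholder I made up):

import requests

BASE = "https://wikimedia.org/api/rest_v1/metrics"
HEADERS = {"User-Agent": "wiki-latency-check (placeholder contact)"}  # placeholder UA

# Hourly pageviews: as of 2018-03-22 ~02:49 UTC this already returns a
# bucket for "2018032200".
pageviews_url = (
    BASE + "/pageviews/aggregate/en.wikipedia/all-access/all-agents"
    "/hourly/2018032100/2018032300"
)
print(requests.get(pageviews_url, headers=HEADERS).json())

# Daily edits: the same kind of request, but the response stops at the
# end of February.
edits_url = (
    BASE + "/edits/aggregate/en.wikipedia/all-editor-types/all-page-types"
    "/daily/20180225/20180321"
)
print(requests.get(edits_url, headers=HEADERS).json())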
Thanks!
Hello Ahmed, nice to meet you!
As a data analyst who constantly works with the edit data, I would love to have it updated daily too. But there are serious infrastructural limitations that make that very difficult.
Both the edit data and pageview data that you're talking about come from the Hadoop-based Analytics Data Lake https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake. However, because of limitations in the underlying MediaWiki application databases https://www.mediawiki.org/wiki/Manual:Database_layout that Hive pulls edit data from, the data requires some complex reconstruction and denormalization https://wikitech.wikimedia.org/wiki/Analytics/Systems/Data_Lake/Edits/Pipeline that takes several days to a week. This mostly affects the historical data, but the reconstruction currently has to be done for all history at once because historical data sometimes changes long after the fact in the MediaWiki databases. So the entire dataset is regenerated every month, which would be impossible to do daily.
I'm sure there are strategies that could ultimately fix these problems, but I'm also sure that they would take great effort to implement, so unfortunately that's unlikely to happen anytime soon.
In the meantime, you may be able to work around these issues by using the public replicas of the application databases https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connecting_to_the_database_replicas. Unlike with the API, you'd have to do the computation yourself, but the data is updated in (near) real time. Quarry https://meta.wikimedia.org/wiki/Research:Quarry is an excellent, easy-to-use tool for running SQL queries on those replicas.
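For example, counting yesterday's edits on English Wikipedia straight from the replica's revision table could look something like the sketch below (Python plus pymysql; the host name and the ~/replica.my.cnf credentials file are my reading of the conventions on the Help page above, so double-check them there, and with Quarry you can skip the connection code and just paste in the SQL):

import os

import pymysql

# Connection details per Help:Toolforge/Database (run from a Toolforge
# account); adjust the host name if the Help page says otherwise.
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.eqiad.wmflabs",
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    charset="utf8mb4",
)

# Count edits on English Wikipedia for one recent day (2018-03-21).
sql = """
SELECT COUNT(*)
FROM revision
WHERE rev_timestamp BETWEEN '20180321000000' AND '20180321235959'
"""

with conn.cursor() as cur:
    cur.execute(sql)
    print(cur.fetchone()[0])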
I'm not an expert on the Data Lake, but I'm pretty sure this is broadly accurate. Corrections from the Analytics team welcome :)
On 22 March 2018 at 13:41, Neil Patel Quinn nquinn@wikimedia.org wrote:
Both the edit data and pageview data that you're talking about come from the Hadoop-based Analytics Data Lake https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake. However, because of limitations in the underlying MediaWiki application databases https://www.mediawiki.org/wiki/Manual:Database_layout *that Hive pulls edit data from*, the data requires some complex reconstruction and denormalization https://wikitech.wikimedia.org/wiki/Analytics/Systems/Data_Lake/Edits/Pipeline that takes several days to a week.
Sorry, I garbled that a little. It's more correct to say: "because of limitations in the underlying MediaWiki application databases *that are the source of the edit data*, the data requires..."
Really good summary of the situation, Neil. I'm bookmarking this and will re-use it when people ask :)
Neil, thank you so much for your insightful comments!
I was able to use Quarry to get the number of edits on English Wikipedia yesterday, so I can indeed get recent data from it—hooray!!!
I also used it to cross-check against the REST API for February 28th:
https://quarry.wmflabs.org/query/25783
and I see that Quarry reports 168668 while the REST API reports 169754 edits for the same period (less than 1% error). I'll do some digging to see whether the difference comes from the denormalization you mentioned or from something else.
Maybe one more question:
the data requires some complex reconstruction and denormalization that takes several days to a week. This mostly affects the historical data, but the reconstruction currently has to be done for all history at once because historical data sometimes changes long after the fact in the MediaWiki databases. So the entire dataset is regenerated every month, which would be impossible to do daily.
A Wikipedian (hi Yurik!) guessed that this full scan over the data is needed because Wikipedia admins have the authority to make changes to the history of an article (e.g., if a rogue editor posted copyrighted information that shouldn't ever be visible in the changelog). If articles' changelogs were append-only, then the operation could pick up where it left off, rather than starting from scratch, but this isn't the case, so a full scan is needed. Is this a good understanding?
Again, many thanks!
Ahmed
PS. In case of an overabundance of curiosity, my little project is at https://github.com/fasiha/wikiatrisk :)
On 23 March 2018 at 07:02, Ahmed Fasih wuzzyview@gmail.com wrote:
Neil, thank you so much for your insightful comments!
No problem. It's always a good feeling when you know the answer to someone else's question :)
I was able to use Quarry to get the number of edits on English Wikipedia yesterday, so I can indeed get recent data from it—hooray!!!
I also used it to cross-check against the REST API for February 28th:
https://quarry.wmflabs.org/query/25783
and I see that Quarry reports 168668 while the REST API reports 169754 edits for the same period (less than 1% error). I'll do some digging to see whether the difference comes from the denormalization you mentioned or from something else.
The first thing to consider is that when a Wikipedia page is deleted, all the corresponding rows from the revision table are moved to a separate archive table https://www.mediawiki.org/wiki/Manual:Archive_table (probably for reasons that made much more sense years ago). However, in the Data Lake and therefore the REST API, there's no such separation.
This query is one way to get a combined count: https://quarry.wmflabs.org/query/25794
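I'd guess the gist of it is something along these lines (a sketch only, so the linked query may be written differently; the connection details are my reading of Help:Toolforge/Database, and the SQL itself could just as well be pasted into Quarry):

import os

import pymysql

# Replica connection per Help:Toolforge/Database; adjust host/credentials
# as needed.
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.eqiad.wmflabs",
    database="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    charset="utf8mb4",
)

# Live revisions plus revisions of deleted pages (the archive table) for
# 2018-02-28, which is closer to what the Data Lake counts.
sql = """
SELECT
  (SELECT COUNT(*) FROM revision
    WHERE rev_timestamp BETWEEN '20180228000000' AND '20180228235959')
  +
  (SELECT COUNT(*) FROM archive
    WHERE ar_timestamp BETWEEN '20180228000000' AND '20180228235959')
  AS edits_including_deleted
"""

with conn.cursor() as cur:
    cur.execute(sql)
    print(cur.fetchone()[0])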
However, combining the two tables yields 171,346 edits, which makes the Data Lake count about 1% *lower* than the application database count.
At the moment, I can't think of a good reason for that, but I'm sure others on this list know.
Hi Ahmed and Neil, Super interesting project you have, Ahmed :) Thanks, Neil, for the very precise answer you gave to Ahmed's question!
Some comments about number disparity below:
and I see that Quarry reports 168668 while the REST API reports 169754 edits for the same period (less than 1% error).
Those two metrics (Quarry and the API) refer to the exact same dataset: revisions from any user type on any page type, on enwiki, for the day 2018-02-28.
The first thing to consider is that when a Wikipedia page is deleted, all the corresponding rows from the revision table are moved to a separate archive table https://www.mediawiki.org/wiki/Manual:Archive_table (probably for reasons that made much more sense years ago). However, in the Data Lake and therefore the REST API, there's no such separation.
This query is one way to get a combined count: https://quarry.wmflabs.org/query/25794
However, combining the two tables yields 171,346 edits, which makes the Data Lake count about 1% *lower* than the application database count.
When counting revisions including deleted ones in the Data Lake, we end up with the exact same number found by the Quarry query: 171,346.
Now, about the difference between Quarry and the API on revisions without deletes: it is mostly due to recently deleted data (there is still a difference of 126 revisions that I don't understand: https://quarry.wmflabs.org/query/25796). Cheers! Joseph
Thank you Joseph and Neil, that is so helpful!
Is it possible the 126-edit discrepancy for February 28th will be corrected the next time the data regeneration/denormalization is run, at the end of the month, to generate the daily data for the REST API? I ask not because 126 edits (<0.1% error!) is that important, but just to better understand this cool dataset's caveats. I'll double-check this in a few days, when the data is re-analyzed to generate March's daily data on the REST API.
Also, did the guess about why generating the daily data for the REST API requires a full scan of all the historical data make sense? I'd been given the hypothesis that, because in exceptional situations Wikipedia admins can edit the changelog to completely purge an edit (e.g., of copyrighted information), the number of edits for some date in the past might change. So to account for this possibility, the simplest thing to do is roll through all the data again. (I ask this because today we have a lot of interest in append-only logs, like in Dat, Secure Scuttlebutt, and of course blockchains—systems where information cannot be repudiated after it's published. If Wikipedia rejects append-only logs and allows official history to be changed, per this hypothesis, then that's a really weighty argument against those systems.)
Anyway, I'll try writing Quarry versions of the subset of REST API endpoints for which I want daily updates (edits, new pages, etc.—new users might be the hardest?), and see what the best way is to get daily updates for all of them (e.g., edit the query every day, create a new query for each day, etc.). Using Quarry seems much easier than generating these daily numbers from Wikimedia EventStreams:
https://stream.wikimedia.org/?doc
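(For comparison, my rough mental model of the EventStreams route is a long-running consumer like the sketch below, in Python; the stream URL is the recentchange stream from those docs, and the "wiki"/"type"/"timestamp" fields and the edit/new filtering are my reading of its event schema, so treat those as assumptions.)

import json
from collections import Counter
from datetime import datetime, timezone

import requests

# The recentchange stream documented at the URL above.
STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"
edits_per_day = Counter()

# Server-sent events: payload lines start with "data: " and carry one JSON
# event each. Tally enwiki edit events per UTC day as they arrive.
with requests.get(STREAM, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        event = json.loads(line[len(b"data: "):])
        # "edit" plus "new" (page creations), which I believe both count as edits.
        if event.get("wiki") == "enwiki" and event.get("type") in ("edit", "new"):
            key = datetime.fromtimestamp(
                event["timestamp"], tz=timezone.utc
            ).strftime("%Y%m%d")
            edits_per_day[key] += 1
            print(key, edits_per_day[key])  # running count for that UTC day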
Again, many, many thanks!
Hi Ahmed,
In my opinion, the 126-revision discrepancy is due to complex delete/restore patterns. The notion of 'corrected' is not super clear to me here :)
About the data being updated monthly because of a full history scan, you're mostly right. Here is a summary of my view on it:
- The user and page tables maintain 'states', and we wanted to be able to show historical values (e.g., what the page title of a given page was at the time of a given revision).
- This reconstruction process uses the log table, is quite complex, and was originally designed to run over the whole history.
- Indeed, history is sometimes updated in MediaWiki, and we want to reflect those patches.
- In any case, even if we used more up-to-date data from streams or regular queries to the databases, a full history reconstruction would have been needed as a starting point.
- And that's what we have now.
It has always been part of the plan to use streams to update recent data more reactively, but that has not been done yet.
Joseph
(I ask this because today we have a lot of interest in append-only logs, like in Dat, Secure Scuttlebutt, and of course blockchains—systems where information cannot be repudiated after it's published. If Wikipedia rejects append-only logs and allows official history to be changed, per this hypothesis, then that's a really weighty argument against those systems.)
Wikipedia does allow history to be changed, but this has caused lots of problems with the Analytics pipelines. It's the main reason you can't get fresh data in the same way you get it for pageviews, even though pageview data is 3 orders of magnitude bigger. So I wouldn't follow Wikipedia's example too closely here :)
Also, I followed up and added this to the FAQ: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Wikistats/Metrics/FAQ#...
On Mon, Mar 26, 2018 at 10:46 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
(I ask this because today we have a lot of interest in append-only logs, like in Dat, Secure Scuttlebutt, and of course blockchains—systems where information cannot be repudiated after it's published. If Wikipedia rejects append-only logs and allows official history to be changed, per this hypothesis, then that's a really weighty argument against those systems.)
Wikipedia does allow history to be changed, but this has caused lots of problems with the Analytics pipelines. It's the main reason you can't get fresh data in the same way you get it for pageviews, even though pageview data is 3 orders of magnitude bigger. So I wouldn't follow Wikipedia's example too closely here :)