On 23 March 2018 at 07:02, Ahmed Fasih <wuzzyview@gmail.com> wrote:
Neil, thank you so much for your insightful comments!
No problem. It's always a good feeling when you know the answer to someone else's question :)
I was able to use Quarry to get the number of edits on English Wikipedia yesterday, so I can indeed get recent data from it—hooray!!!
I also used it to cross-check against the REST API for February 28th:
https://quarry.wmflabs.org/query/25783
and I see that Quarry reports 168668 while the REST API reports 169754 edits for the same period (less than 1% error). I'll do some digging to see if the difference is from the denormalization you mentioned, or other reasons why they disagree.
The first thing to consider is that when a Wikipedia page is deleted, all the corresponding rows from the revision table are moved to a separate archive table (https://www.mediawiki.org/wiki/Manual:Archive_table), probably for reasons that made much more sense years ago. However, in the Data Lake, and therefore the REST API, there's no such separation.
This query is one way to get a combined count: https://quarry.wmflabs.org/query/25794
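For anyone reading along without opening Quarry, here is a rough sketch of the idea: count rows in both tables for the same day and add them up. This is not the text of the linked query. It runs against a toy in-memory SQLite database with made-up rows, and the helper name combined_edit_count is only for illustration. The only things taken from MediaWiki are the table names and the rev_timestamp / ar_timestamp columns, which store timestamps as YYYYMMDDHHMMSS strings.

```python
import sqlite3

# Toy stand-in for the wiki database: only the timestamp columns matter here.
# The rows are invented for the example and are not real Wikipedia data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_timestamp TEXT);
    CREATE TABLE archive  (ar_id  INTEGER PRIMARY KEY, ar_timestamp  TEXT);
    INSERT INTO revision (rev_timestamp) VALUES
        ('20180227235959'), ('20180228000000'),
        ('20180228120000'), ('20180301000000');
    INSERT INTO archive (ar_timestamp) VALUES
        ('20180228093000'), ('20180301000001');
""")

# Live revisions plus revisions of since-deleted pages, in a half-open
# [start, end) window so that midnight of the next day is excluded.
COMBINED_COUNT = """
    SELECT
        (SELECT COUNT(*) FROM revision
          WHERE rev_timestamp >= :start AND rev_timestamp < :end)
      + (SELECT COUNT(*) FROM archive
          WHERE ar_timestamp >= :start AND ar_timestamp < :end)
"""


def combined_edit_count(conn, start, end):
    """Count edits in [start, end) across the revision and archive tables."""
    row = conn.execute(COMBINED_COUNT, {"start": start, "end": end}).fetchone()
    return row[0]


total = combined_edit_count(conn, "20180228000000", "20180301000000")
print(total)  # 2 live revisions + 1 archived revision = 3
```

The half-open window is just a convenient way to select one calendar day when timestamps are fixed-width strings.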
However, combining the two tables yields 171346 edits, which makes the Data Lake count about 1% *lower* than the application database count.
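To make the percentages concrete, here is a small sketch of the arithmetic using only the three counts quoted in this thread. The variable names and the percent_diff helper are mine, purely for illustration.

```python
# Counts quoted in this thread for English Wikipedia edits on 28 February 2018.
quarry_revision_only = 168_668  # Quarry, revision table only
rest_api = 169_754              # REST API (backed by the Data Lake)
quarry_with_archive = 171_346   # Quarry, revision + archive tables


def percent_diff(value, reference):
    """Signed difference of value from reference, as a percentage of reference."""
    return 100.0 * (value - reference) / reference


# Revision-only count relative to the REST API count.
print(f"{percent_diff(quarry_revision_only, rest_api):+.2f}%")  # -0.64%

# REST API count relative to the combined revision + archive count.
print(f"{percent_diff(rest_api, quarry_with_archive):+.2f}%")   # -0.93%
```

So the revision table alone undercounts relative to the REST API by roughly 0.6%, while adding the archive table overshoots it by roughly 0.9%.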
At the moment, I can't think of a good reason for that, but I'm sure others on this list know.