On 23 March 2018 at 07:02, Ahmed Fasih <wuzzyview(a)gmail.com> wrote:
Neil, thank you so much for your insightful comments!
No problem. It's always a good feeling when you know the answer to someone
else's question :)
I was able to use Quarry to get the number of edits on English
Wikipedia yesterday, so I can indeed get recent data from it—hooray!!!
I also used it to cross-check against the REST API for February 28th:
https://quarry.wmflabs.org/query/25783
and I see that Quarry reports 168668 while the REST API reports 169754
edits for the same period (less than 1% error). I'll do some digging
to see if the difference is from the denormalization you mentioned, or
other reasons why they disagree.
The first thing to consider is that when a Wikipedia page is deleted, all
the corresponding rows from the revision table are moved to a separate archive
table <https://www.mediawiki.org/wiki/Manual:Archive_table> (probably for
reasons that made much more sense years ago). However, in the Data Lake and
therefore the REST API, there's no such separation.
This query is one way to get a combined count:
https://quarry.wmflabs.org/query/25794
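
As a rough illustration of the idea (this is not the actual Quarry query), here is a toy version that can be run locally with Python's built-in sqlite3 module. The two tables are tiny stand-ins for the real revision and archive tables, the rows are made up, and combined_edit_count is a hypothetical helper name. The only things borrowed from MediaWiki are the column names rev_timestamp and ar_timestamp and the 14-character YYYYMMDDHHMMSS timestamp format.

```python
import sqlite3

# Toy stand-ins for MediaWiki's revision and archive tables. Only the
# timestamp columns are modelled, and every row here is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_timestamp TEXT);
    CREATE TABLE archive  (ar_id  INTEGER PRIMARY KEY, ar_timestamp  TEXT);
    INSERT INTO revision (rev_timestamp) VALUES
        ('20180228000500'), ('20180228120000'), ('20180301000000');
    INSERT INTO archive (ar_timestamp) VALUES
        ('20180228093000'), ('20180227235959');
""")


def combined_edit_count(conn, start, end):
    """Count edits with start <= timestamp < end across both tables."""
    query = """
        SELECT
            (SELECT COUNT(*) FROM revision
              WHERE rev_timestamp >= :start AND rev_timestamp < :end)
          + (SELECT COUNT(*) FROM archive
              WHERE ar_timestamp >= :start AND ar_timestamp < :end)
    """
    return conn.execute(query, {"start": start, "end": end}).fetchone()[0]


# Edits made on 28 February 2018, whether or not the page was later deleted.
feb28_total = combined_edit_count(conn, "20180228000000", "20180301000000")
print(feb28_total)
```

Because the timestamps are fixed-width strings, plain string comparison is enough to select a day, and the half-open range avoids counting midnight of 1 March.
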
However, combining the two tables yields 171 346 edits, which makes the
Data Lake count about 1% *lower* than the application database count.
At the moment, I can't think of a good reason for that, but I'm sure others
on this list know.
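
For anyone who wants to check the percentages, here is a small sketch of the arithmetic using the three counts quoted in this thread. The variable names and the percent_diff helper are just illustrative, and the choice of which count to treat as the reference is an assumption (it shifts the result only slightly).

```python
# Counts for English Wikipedia on 28 February 2018, as quoted in this thread.
quarry_revision_only = 168_668  # Quarry, revision table alone
rest_api = 169_754              # REST API (backed by the Data Lake)
quarry_combined = 171_346       # Quarry, revision + archive tables


def percent_diff(value, reference):
    """Signed difference of value from reference, as a percentage."""
    return 100 * (value - reference) / reference


rest_vs_revision = percent_diff(rest_api, quarry_revision_only)
rest_vs_combined = percent_diff(rest_api, quarry_combined)
archived_edits = quarry_combined - quarry_revision_only

print(f"REST API vs revision only: {rest_vs_revision:+.2f}%")      # +0.64%
print(f"REST API vs revision + archive: {rest_vs_combined:+.2f}%")  # -0.93%
print(f"Edits found only in archive: {archived_edits}")            # 2678
```

So the REST API figure sits between the two Quarry figures: above the revision-only count, but below the combined count.
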