Hi Ahmed and Neil,
Super interesting project you have, Ahmed :)
Thanks Neil for the very precise answer you gave to Ahmed's question!
Some comments about number disparity below:
and I see that Quarry reports 168668 while the REST API reports 169754
edits for the same period (less than 1% error).
Those two metrics (Quarry and API) refer to the exact same dataset:
revisions from any user type on any page type for the day 2018-02-28, on enwiki.
The first thing to consider is that when a Wikipedia page is deleted, all
the corresponding rows from the revision table are moved to a separate archive
table <https://www.mediawiki.org/wiki/Manual:Archive_table> (probably for
reasons that made much more sense years ago). However, in the Data Lake and
therefore the REST API, there's no such separation.
This query is one way to get a combined count:
https://quarry.wmflabs.org/query/25794
However, combining the two tables yields 171346 edits, which makes the
Data Lake count about 1% *lower* than the application database count.
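To make the idea of a combined count concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is not the Quarry query linked above: the two toy tables only mimic the revision and archive tables with their timestamp columns (rev_timestamp, ar_timestamp), the rows are made up, and the helper name count_revisions_for_day is hypothetical.

```python
import sqlite3


def count_revisions_for_day(conn, day):
    """Return (live, archived, combined) revision counts for one day.

    `day` is a 'YYYYMMDD' string; timestamps are assumed to be stored as
    'YYYYMMDDHHMMSS' strings, so a lexicographic range check selects the day.
    """
    lo, hi = day + "000000", day + "235959"
    query = """
        SELECT
          (SELECT COUNT(*) FROM revision
            WHERE rev_timestamp BETWEEN ? AND ?) AS live,
          (SELECT COUNT(*) FROM archive
            WHERE ar_timestamp BETWEEN ? AND ?) AS archived
    """
    live, archived = conn.execute(query, (lo, hi, lo, hi)).fetchone()
    return live, archived, live + archived


# Toy stand-ins for the two tables (made-up rows, illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_timestamp TEXT);
    CREATE TABLE archive  (ar_id  INTEGER PRIMARY KEY, ar_timestamp  TEXT);
""")
conn.executemany(
    "INSERT INTO revision (rev_timestamp) VALUES (?)",
    [("20180228000500",), ("20180228120000",), ("20180228235959",),
     ("20180301000000",)],
)
conn.executemany(
    "INSERT INTO archive (ar_timestamp) VALUES (?)",
    [("20180228101010",), ("20180228180000",), ("20180227230000",)],
)

result = count_revisions_for_day(conn, "20180228")
print(result)
```

With these toy rows, three live and two archived revisions fall on the chosen day, so the combined count is five. Counting only the revision table would miss the two that were moved to archive when their pages were deleted.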
When computing revisions including deleted ones on the Data Lake, we end up
with exactly the same number as the Quarry query: 171346
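As a quick sanity check on the percentages quoted in this thread, the gaps between the three counts can be worked out as follows. This is only arithmetic on the numbers mentioned above; the variable names are mine.

```python
# Counts quoted in this thread for enwiki on 2018-02-28.
quarry_revision_only = 168668   # Quarry, revision table only
rest_api = 169754               # REST API (Data Lake)
combined = 171346               # revision + archive tables

# Gap between the Quarry revision-only count and the REST API count.
gap_quarry_api = rest_api - quarry_revision_only
pct_quarry_api = 100 * gap_quarry_api / rest_api

# Gap between the REST API count and the combined count.
gap_api_combined = combined - rest_api
pct_api_combined = 100 * gap_api_combined / combined

print(gap_quarry_api, round(pct_quarry_api, 2))
print(gap_api_combined, round(pct_api_combined, 2))
```

Both gaps come out below 1%, which matches the "less than 1%" and "about 1% lower" figures quoted above.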
Now, about the difference between Quarry and the API on revisions without
deletes: it is mostly due to recently deleted data (there is still a
difference of 126 revisions that I don't understand:
https://quarry.wmflabs.org/query/25796).
Cheers !
Joseph