Hi Ahmed,

In my opinion the 126-edit discrepancy is due to complex patterns of deletes/restores. The notion of 'fixed' is not super clear to me here :)

About the data being updated only monthly because it requires a full history scan, you're mostly right. Here is a summary of my view on it:
 - The user and page tables maintain 'state' (current values only), and we wanted to be able to show historical values (e.g., what the page title of a page was at the time of a given revision). This reconstruction uses the log table, is quite complex, and was originally designed to run over the whole history.
 - Indeed, history is sometimes updated in MediaWiki, and we want to reflect those changes.
 - In any case, even if we used more up-to-date data from streams or regular queries against the database, a full history reconstruction would still have been needed as a starting point, and that is what we have now.

It has always been part of the plan to use streams so that recent data is updated more reactively, but that has not yet been done.

Joseph

On Fri, Mar 23, 2018 at 8:12 PM, Ahmed Fasih <wuzzyview@gmail.com> wrote:
Thank you Joseph and Neil, that is so helpful!

Is it possible the 126-edit discrepancy for February 28th will be
corrected the next time the data regeneration/denormalization is run,
at the end of the month, to generate the daily data for the REST API?
I ask not because 126 edits (<0.01% error!) matter that much, but to
better understand this cool dataset's caveats. I'll double-check in a
few days, when the data is re-analyzed to generate March's daily data
for the REST API.
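
For concreteness, here is a rough, untested sketch of how I'd pull
those daily figures programmatically; the endpoint path, parameter
values, and response fields ('items'/'results') are my guesses from
the public REST API docs rather than anything confirmed here:

    import requests

    # Assumed endpoint pattern for daily edit counts (metrics/edits/aggregate).
    BASE = 'https://wikimedia.org/api/rest_v1/metrics/edits/aggregate'
    url = (BASE + '/en.wikipedia.org/all-editor-types/all-page-types'
           '/daily/20180228/20180301')

    resp = requests.get(url, headers={'User-Agent': 'daily-edits-check (example)'})
    resp.raise_for_status()

    # Print each daily bucket; the 2018-02-28 entry is the one to compare
    # before and after the next monthly regeneration.
    for item in resp.json().get('items', []):
        for result in item.get('results', []):
            print(result)

Comparing the 2018-02-28 entry from that output before and after the
monthly regeneration should show whether the 126-edit gap closes.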

Also, does the guess about why generating the daily data for the REST
API requires a full scan of all historical data make sense? I'd been
given the hypothesis that, because in exceptional situations Wikipedia
admins can edit the changelog to completely purge an edit (e.g., one
containing copyrighted information), the number of edits for some date
in the past might change. So, to account for this possibility, the
simplest thing to do is roll through all the data again. (I ask
because today there is a lot of interest in append-only logs, as in
Dat, Secure Scuttlebutt, and of course blockchains: systems where
information cannot be repudiated after it's published. If Wikipedia
rejects append-only logs and allows official history to be changed,
per this hypothesis, then that's a really weighty argument against
those systems.)

Anyway, I'll try writing Quarry versions of the subset of REST API
endpoints for which I want daily updates (edits, new pages, etc.; new
users might be the hardest?), and see what the best way is to get
daily updates for all of them (e.g., edit the query every day, create
a new query for each day, etc.). Using Quarry seems much easier than
generating these daily numbers from the Wikimedia EventStreams:

https://stream.wikimedia.org/?doc
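
To illustrate why the stream route feels heavier, here is a minimal,
untested sketch of counting daily edits from the recentchange stream;
it assumes the sseclient package and the v2/stream/recentchange
endpoint and field names ('wiki', 'type', 'meta.dt') described in the
EventStreams docs:

    import json
    from collections import Counter

    from sseclient import SSEClient as EventSource

    # Assumed endpoint, from the EventStreams documentation linked above.
    STREAM_URL = 'https://stream.wikimedia.org/v2/stream/recentchange'
    edits_per_day = Counter()

    for event in EventSource(STREAM_URL):
        if event.event != 'message' or not event.data:
            continue
        change = json.loads(event.data)
        # Count edits and page creations on English Wikipedia; skip log
        # and categorization events.
        if change.get('wiki') == 'enwiki' and change.get('type') in ('edit', 'new'):
            day = change['meta']['dt'][:10]  # e.g. '2018-03-23'
            edits_per_day[day] += 1
            print(day, edits_per_day[day])

Even then it only counts edits from the moment the script starts and
has to run continuously, whereas a Quarry query can be re-run over any
past day.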

Again, many, many thanks!



--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal