Hi Ahmed,
In my opinion the 126-edit discrepancy is due to complex patterns of deletes and restores. The notion of 'fixed' is not super clear to me here :)
About the data being updated monthly because of a full history scan, you're mostly right. Here is a summary of my view on it:
- The user and page tables maintain 'states', and we wanted to be able to show historical values (e.g., what the page title of a page was at the time of a given revision); see the toy sketch below.
- This reconstruction process uses the log table, is quite complex, and was originally designed to run over the whole history.
- Indeed, history is sometimes updated in MediaWiki, and we want to reflect those patches.
- In any case, even if we used more up-to-date data from the stream or regular queries to the database, a full history reconstruction would have been needed as a starting point, and that's what we have now.
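Just to make the "historical values" point concrete, here is a toy sketch in Python (the event format and field names are invented for illustration; the real job reads the MediaWiki log table and handles many more event types) of recovering the title a page had at a given time by undoing the moves that happened afterwards:

    # Toy illustration only: given page-move events (timestamp, page_id,
    # old_title, new_title) and the page's current title, recover the title
    # the page carried at an arbitrary timestamp.
    def title_at(page_id, ts, move_events, current_title):
        # Moves that happened strictly after `ts`, oldest first.
        later_moves = sorted(
            (e for e in move_events if e[1] == page_id and e[0] > ts),
            key=lambda e: e[0],
        )
        if later_moves:
            # The oldest move after `ts` tells us what the title still was
            # at `ts`: the title that move switched *away from*.
            return later_moves[0][2]
        return current_title

    # A page currently titled "C", moved A -> B in 2016 and B -> C in 2017:
    moves = [
        ("2016-05-01", 42, "A", "B"),
        ("2017-09-10", 42, "B", "C"),
    ]
    print(title_at(42, "2015-01-01", moves, "C"))  # "A"
    print(title_at(42, "2016-12-31", moves, "C"))  # "B"
    print(title_at(42, "2018-01-01", moves, "C"))  # "C"

The real reconstruction is much hairier because deletes, restores and renames interleave over time, and that is where most of the complexity (and edge cases like the 126-edit discrepancy) comes from.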
It has always been part of the plan to use streams so we can be more reactive in updating recent data, but that has not been done yet.
Joseph
On Fri, Mar 23, 2018 at 8:12 PM, Ahmed Fasih wuzzyview@gmail.com wrote:
Thank you Joseph and Neil, that is so helpful!
Is it possible that the 126-edit discrepancy for February 28th will be corrected the next time the data regeneration/denormalization is run, at the end of the month, to generate the daily data for the REST API? I ask not because 126 edits (<0.01% error!) matters much, but to better understand this cool dataset's caveats. I'll double-check in a few days, when the data is re-analyzed to generate March's daily data on the REST API.
Also, did my guess about why generating the daily data for the REST API requires a full scan of all historic data make sense? I'd been given the hypothesis that, because in exceptional situations Wikipedia admins can edit the changelog to completely purge an edit (e.g., one containing copyrighted information), the number of edits for some date in the past might change. To account for this possibility, the simplest thing to do is to roll through all the data again. (I ask because there is a lot of interest these days in append-only logs, as in Dat, Secure Scuttlebutt, and of course blockchains: systems where information cannot be repudiated after it's published. If Wikipedia rejects append-only logs and allows official history to be changed, per this hypothesis, then that's a really weighty argument against those systems.)
Anyway, I'll try writing Quarry versions of the subset of REST API endpoints for which I want daily updates (edits, new pages, etc.; new users might be the hardest?), and see what the best way is to get daily updates for all of them (e.g., editing the query every day, or creating a new query for each day). Using Quarry seems much easier than generating these daily numbers from the Wikimedia EventStreams:
https://stream.wikimedia.org/?doc
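For example, here's roughly what counting edits per day from the stream would look like in Python, using plain requests and hand-parsed server-sent events (I'm guessing at the exact event fields like "type", "wiki", and "timestamp", so those would need checking against the stream docs), whereas I expect the Quarry version to be little more than a COUNT over the revision table grouped by day:

    import json
    from collections import Counter
    from datetime import datetime, timezone

    import requests

    STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

    def daily_edit_counts(wiki="enwiki", max_events=1000):
        """Tally edits per UTC date from the recentchange SSE stream."""
        counts = Counter()
        seen = 0
        with requests.get(STREAM_URL, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            for raw in resp.iter_lines(decode_unicode=True):
                if not raw or not raw.startswith("data:"):
                    continue  # skip keep-alives and "event:"/"id:" lines
                try:
                    event = json.loads(raw[len("data:"):])
                except json.JSONDecodeError:
                    continue  # partial or multi-line payload; fine for a sketch
                if event.get("wiki") != wiki or event.get("type") != "edit":
                    continue
                day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).date()
                counts[day] += 1
                seen += 1
                if seen >= max_events:
                    break
        return counts

    print(daily_edit_counts(max_events=200))

That's already a lot of moving parts just to get a number that a single query would give me, and I'd have to keep the consumer running continuously to cover every day.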
Again, many, many thanks!