In my opinion the 126-edit discrepancy is due to complex delete/restore
patterns. The notion of 'fixed' is not super clear to me here :)
As for the data being updated monthly because of a full history scan, you're
mostly right. Here is a summary of my view on it:
- The user and page tables maintain 'states', and we wanted to be able to
show historical values (e.g., what the page title of a page was at the time
of a given revision). This process uses the log table, is quite complex, and
was originally designed to run over the whole history (see the sketch after
this list).
- Indeed, history is sometimes updated in mediawiki, and we want to reflect
those updates.
- In any case, even if we used more up-to-date data from streams or regular
queries to the database, a full history reconstruction would have been
needed as a starting point, and that's what we have now.
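
To make the first point concrete, here is a minimal Python sketch of the
idea: replay rename events from a log, oldest first, to recover the title a
page had at a given time. The event fields here are illustrative, not the
actual mediawiki log-table schema, and the real job handles many more event
types:

    # Sketch: recover the title a page had at time ts by replaying
    # rename events (timestamp, page_id, old_title, new_title),
    # sorted by timestamp ascending. Field names are illustrative.
    def title_at(page_id, ts, current_title, log_events):
        title = None
        for ev_ts, ev_page, old_title, new_title in log_events:
            if ev_page != page_id:
                continue
            if ev_ts <= ts:
                title = new_title    # last rename at or before ts wins
            elif title is None:
                return old_title     # first rename after ts: page still had its old title
        return title if title is not None else current_title

    events = [("2015-06-01", 42, "Foo", "Bar"),
              ("2017-02-10", 42, "Bar", "Baz")]
    print(title_at(42, "2016-01-01", "Baz", events))  # -> Bar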
It has always been the plan to use streams so we can update recent data more
reactively, but that has not been done yet.
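
For reference, here is roughly what consuming the public EventStreams
endpoint looks like, using the sseclient package (pip install sseclient);
the stream URL is real, but take the field handling as a sketch rather than
production code:

    # Sketch: count live edits to en.wikipedia from the recentchange stream.
    import json
    from sseclient import SSEClient as EventSource

    url = 'https://stream.wikimedia.org/v2/stream/recentchange'
    edits = 0
    for event in EventSource(url):
        if event.event != 'message' or not event.data:
            continue
        change = json.loads(event.data)
        if change.get('wiki') == 'enwiki' and change.get('type') == 'edit':
            edits += 1
            print(edits, change.get('title'))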
On Fri, Mar 23, 2018 at 8:12 PM, Ahmed Fasih <wuzzyview(a)gmail.com> wrote:
Thank you Joseph and Neil, that is so helpful!
Is it possible the 126-edit discrepancy for February 28th will be
corrected the next time the data regeneration/denormalization is run,
at the end of the month, to generate the daily data for the REST API?
I ask not because 126 edits (<0.01% error!) is that important, but to
better understand this cool dataset's caveats. I'll double-check this in a
few days, when the data is re-analyzed to generate March's daily data on
the REST API.
Also, did the guess about why generating the daily data for the REST
API requires a full scan of all historic data make sense? I'd
been given the hypothesis that, because in exceptional situations
Wikipedia admins can edit the changelog to completely purge an edit
(e.g., one containing copyrighted information), the number of edits for
some date in the past might change. To account for this possibility, the
simplest thing to do is roll through all the data again. (I ask because
today there is a lot of interest in append-only logs, as in Dat, Secure
Scuttlebutt, and of course blockchains: systems where information cannot be
repudiated after it's published. If Wikipedia rejects append-only logs and
allows official history to be changed, per this hypothesis, then that's a
really weighty argument against them.)
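
To illustrate the hypothesis with a toy example (made-up data): if daily
counts are recomputed from a mutable revision store, purging one old
revision silently changes a past day's total:

    # Toy example: purging a revision changes a past day's count.
    from collections import Counter

    revisions = [("2018-02-28", 1), ("2018-02-28", 2), ("2018-03-01", 3)]
    print(Counter(day for day, _ in revisions))  # 2018-02-28 -> 2

    # An admin purges revision 2 (say, copyrighted content)...
    revisions = [r for r in revisions if r[1] != 2]
    print(Counter(day for day, _ in revisions))  # 2018-02-28 -> 1: history changed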
Anyway, I'll try writing Quarry versions of the subset of REST API
endpoints for which I want daily updates (edits, new pages, etc.; new
users might be the hardest?), and see what the best way is to get daily
updates for all of them (e.g., edit the query every day, create a new
query for each day, etc.). Using Quarry seems much easier than
generating these daily numbers from the Wikimedia EventStreams.
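
For example, here is the kind of per-day query I have in mind, generated
from Python so I can change the date easily; the revision table and its
14-character YYYYMMDDHHMMSS rev_timestamp come from the MediaWiki schema as
I understand it, so please correct me if I've got that wrong:

    # Sketch: generate SQL to paste into Quarry for one day's enwiki edit count.
    def daily_edits_sql(day):
        """day: 'YYYYMMDD' string."""
        assert len(day) == 8 and day.isdigit()
        return ("SELECT COUNT(*) AS edits\n"
                "FROM revision\n"
                "WHERE rev_timestamp LIKE '{}%';".format(day))

    print(daily_edits_sql("20180228"))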
Again, many, many thanks!
Data Engineer @ Wikimedia Foundation