"Thus, note that incremental downloads of these dumps may generate inconsistent data. Consider using EventStreams for real time updates on MediaWiki changes (API docs)."
I can see how that's confusing. I'll try to re-word it and then answer your other questions below. So this is basically saying that if you want to process the whole history every month, this dataset will work ok. But if you plan on doing something like:
* It's 2021-09 and you download the whole dump and process it
* In 2021-10 the new dump comes out, you download it but only process the events with timestamps between 2021-09 and 2021-10
That won't work, because some of the updates might be made to historical records with timestamps long before 2021-09. That little quirk is what allows us to add high-value fields like "time between this revision and the next revision" or "is this revision deleted at some point in the future".
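To make the problem concrete, here is a toy sketch in Python. The two snapshots, the record layout, and the `incremental_merge` helper are all made up for illustration and are not the real dump schema; the only point is that a field on an old record can change between dumps, so filtering by timestamp misses it.

```python
# Toy snapshots: rev_id -> (event timestamp, "deleted at some point in the future" flag).
# Field layout and values are hypothetical, not the real mediawiki_history schema.
snapshot_2021_09 = {
    1: ("2021-03-14T10:00:00", False),
    2: ("2021-08-30T09:00:00", False),
}
snapshot_2021_10 = {
    1: ("2021-03-14T10:00:00", True),   # old record rewritten after a later deletion
    2: ("2021-08-30T09:00:00", False),
    3: ("2021-09-12T17:30:00", False),  # genuinely new event
}


def incremental_merge(old, new, since):
    """Naive approach: keep old records, take only new records with timestamp >= since."""
    merged = dict(old)
    for rev_id, (timestamp, flag) in new.items():
        # ISO 8601 strings in the same format compare correctly as plain strings.
        if timestamp >= since:
            merged[rev_id] = (timestamp, flag)
    return merged


merged = incremental_merge(snapshot_2021_09, snapshot_2021_10, "2021-09-01T00:00:00")

# Records whose merged value disagrees with the latest full dump.
stale = {rev_id for rev_id in snapshot_2021_10 if merged.get(rev_id) != snapshot_2021_10[rev_id]}
print(stale)  # {1}
```

Revision 1 keeps its March timestamp but its flag changed, so the incremental merge never sees the update. Reprocessing the full dump each month avoids this.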
I am planning to retrieve updated versions of the metadata regularly. So I guess I have to use EventStreams to access the recent changes? AFAIU, the recent changes there come from the RecentChanges table [1]. So what would be a proper sequence of actions? For example:
- Download the mediawiki_history dump once and parse it
- For every new update of my data pool, access recent changes through EventStreams as per [2]
Did I understand this correctly?
This would work ok, but would indeed be a bit more complicated. If you absolutely need data every minute, hour, or day, then this would be one choice. One downside is that it would be hard to compute some of the fields we provide in the whole dump, so if you can wait a month to get the refreshed dump, then that's better. The options Jaime gave might work better; it depends on your requirements and what you're comfortable with. In the long term we hope to release a version of this dataset updated more frequently, hopefully daily (but this is more than a year away).
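If you do go the EventStreams route, the "parse the dump once, then follow recent changes" idea could be sketched roughly like this in Python. This is only a minimal sketch: the helper names `iter_sse_events` and `changes_since` are hypothetical, the sample lines are invented, and I'm assuming each event carries `wiki`, `title` and a Unix-seconds `timestamp` field. The stream itself is delivered as Server-Sent Events, so the parser just collects `data:` lines until a blank line.

```python
import json


def iter_sse_events(lines):
    """Yield JSON payloads from an iterable of Server-Sent Events text lines."""
    data_parts = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            data_parts.append(line[len("data:"):].lstrip())
        elif line == "" and data_parts:
            # A blank line ends one event; other fields and comments are ignored.
            yield json.loads("\n".join(data_parts))
            data_parts = []


def changes_since(events, wiki, since_ts):
    """Keep events for one wiki that are newer than the dump you already parsed."""
    return [
        event for event in events
        if event.get("wiki") == wiki and event.get("timestamp", 0) > since_ts
    ]


# Invented sample input standing in for the live stream.
sample_lines = [
    "event: message",
    'data: {"wiki": "enwiki", "title": "A", "type": "edit", "timestamp": 1633046400}',
    "",
    ":keepalive comment",
    'data: {"wiki": "dewiki", "title": "B", "type": "new", "timestamp": 1633046460}',
    "",
    'data: {"wiki": "enwiki", "title": "C", "type": "edit", "timestamp": 1630000000}',
    "",
]

dump_cutoff = 1633000000  # hypothetical: last timestamp covered by your parsed dump
recent = changes_since(iter_sse_events(sample_lines), "enwiki", dump_cutoff)
titles = [event["title"] for event in recent]
print(titles)  # ['A']
```

Against the live service you would feed the same parser with lines from a streaming HTTP response, for example `requests.get("https://stream.wikimedia.org/v2/stream/recentchange", stream=True).iter_lines(decode_unicode=True)`. Keep in mind that changes collected this way won't carry the derived fields from the monthly dump.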
Last thing, in the pageview archive there are three types of files: automated, spider and user. Am I right in understanding that "user" relates to pageviews operated by real persons, while "automated" and "spider" by programs (not sure about the difference between the two)?
Yes, user is our best heuristic-algorithm guess at what part of our traffic is initiated by humans (or smart members of other species :)). Automated is our guess at bot traffic that doesn't identify itself. And spider is bot traffic that does identify itself (such as the Google crawler bot).