"Thus, note that incremental downloads of these dumps may generate
inconsistent data. Consider using EventStreams for real time updates on
MediaWiki changes (API docs)."
I can see how that's confusing. I'll try to re-word it and then answer
your other questions below. So this is basically saying that if you want
to process the whole history every month, this dataset will work ok. But
if you plan on doing something like:
* It's 2021-09 and you download the whole dump and process it
* In 2021-10 the new dump comes out, you download it but only process the
events with timestamps between 2021-09 and 2021-10
That won't work, because some of the updates may apply to historical records
with timestamps long before 2021-09. That little quirk is what allows us to
add high-value fields like "time between this revision and the next revision"
or "is this revision deleted at some point in the future".
I am planning to retrieve updated versions of the metadata regularly, so I
guess I have to use EventStreams to access the recent changes? AFAIU, the
recent changes come from the RecentChanges table [1]. So what would be a
proper sequence of actions? For example:
1. Download the mediawiki_history dump once and parse it
2. For every new update of my data pool, access recent changes through
EventStreams as per [2]
Did I understand this correctly?
This would work ok, but it would indeed be a bit more complicated. If you
absolutely need data every minute, hour, or day, then this would be one
choice. One downside is that it would be hard to compute some of the fields
we provide in the whole dump, so if you can wait a month for the refreshed
dump, that's better. The options Jaime gave might work better; it depends on
your requirements and what you're comfortable with. In the long term we hope
to release a version of this dataset updated more frequently, hopefully daily
(but this is more than a year away).
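For reference, here's roughly what step 2 could look like in Python against
the public recentchange stream (stream.wikimedia.org). It's a minimal sketch
that parses the Server-Sent Events with plain requests; a production consumer
would also remember the Last-Event-ID so it can resume after a disconnect:

    # Tail the EventStreams recentchange feed and print edits for one wiki.
    import json
    import requests

    STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

    def tail_recent_changes(wiki="enwiki"):
        resp = requests.get(
            STREAM_URL,
            stream=True,
            headers={"Accept": "text/event-stream"},
            timeout=60,
        )
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # SSE frames also carry "event:" and "id:" lines; we only need data.
            if not line or not line.startswith("data:"):
                continue
            event = json.loads(line[len("data:"):])
            if event.get("wiki") == wiki:
                yield event

    if __name__ == "__main__":
        for change in tail_recent_changes():
            print(change["timestamp"], change["type"], change["title"])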
Last thing: in the pageview archive there are three types of file: automated,
spider, and user. Am I right in understanding that "user" refers to pageviews
made by real people, while "automated" and "spider" refer to traffic from
programs (I'm not sure about the difference between the two)?
Yes, user is our best heuristic-algorithm guess at what part of our traffic
is initiated by humans (or smart members of other species :)). Automated is
our guess at bot traffic that doesn't identify itself, and spider is traffic
that identifies itself as a bot (such as the Google crawler).
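So if you only want human traffic, reading just the user files and ignoring
the other two is enough. Here is a rough Python sketch (the file name and the
whitespace-separated column layout are my assumptions here; double-check them
against the README that ships with the dumps):

    # Sum daily "user" pageviews for one article, broken down by access
    # method. Assumed line format (unverified): wiki title page_id
    # access_method daily_total hourly_string
    import bz2
    from collections import Counter

    def user_views(path, wiki="en.wikipedia", title="Main_Page"):
        totals = Counter()
        with bz2.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                parts = line.split()
                if len(parts) < 6:
                    continue
                line_wiki, line_title, _pid, access, daily, _hourly = parts[:6]
                if line_wiki == wiki and line_title == title:
                    totals[access] += int(daily)
        return totals

    # Spider/automated traffic lives in the sibling "-spider"/"-automated"
    # files, so reading only the "-user" file already excludes it.
    print(user_views("pageviews-20211001-user.bz2"))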