Hi Cristina!

Regarding your question:

Last thing, in the pageview archive there are three types of file: automated, spider and user. Am I right in understanding that "user" relates to pageviews operated by real persons, while "automated" and "spider" by programs (not sure about the difference between the two)?
 
Yes, "user" pageviews are those made by real people. "Spider" pageviews are those generated by self-declared bots, i.e. the ones that identify themselves as such in their User-Agent header (for instance, web crawlers). "Automated" pageviews are those generated by bots that do not identify themselves as such. They are labelled separately because we use different methods to detect them: spider pageviews are identified by parsing the User-Agent string, and automated ones are identified with request pattern heuristics.
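
To make the distinction a bit more concrete, here is a rough Python sketch of the two labelling approaches. Please take it as an illustration only, not as our actual pipeline code: the regular expression, the request threshold and all the names (label_pageviews, BOT_UA_PATTERN, MAX_REQUESTS_PER_WINDOW) are made up for this example, and the real heuristics are more involved.

```python
import re
from collections import Counter

# Hypothetical marker list: self-declared bots usually name themselves
# somewhere in the User-Agent string.
BOT_UA_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)

# Hypothetical threshold: a client sending more requests than this within
# the observed window is treated as automated traffic.
MAX_REQUESTS_PER_WINDOW = 100


def label_pageviews(requests):
    """Label each (client_id, user_agent) request as spider, automated or user."""
    # Count requests per client, ignoring clients that self-identify as bots.
    counts = Counter(
        client for client, ua in requests if not BOT_UA_PATTERN.search(ua)
    )
    labels = []
    for client, ua in requests:
        if BOT_UA_PATTERN.search(ua):
            # Method 1: parse the User-Agent string.
            labels.append("spider")
        elif counts[client] > MAX_REQUESTS_PER_WINDOW:
            # Method 2: request pattern heuristic (here, sheer volume).
            labels.append("automated")
        else:
            labels.append("user")
    return labels
```

In this sketch, a request with "Googlebot" in its User-Agent would come out as "spider", while a client with an ordinary browser User-Agent that sends hundreds of requests in the window would come out as "automated".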

Hope this helps!


On Fri, Sep 17, 2021 at 5:47 PM Cristina Gava via Analytics <analytics@lists.wikimedia.org> wrote:
Hi Dan,

Thanks a lot. I think I bumped into that link at some point and then wasn't able to find it again.
There is one point that is not entirely clear to me:

"Thus, note that incremental downloads of these dumps may generate inconsistent data. Consider using EventStreams for real time updates on MediaWiki changes (API docs)."

I am planning to retrieve updated versions of the metadata regularly, so I guess I have to use EventStreams to access the recent changes? AFAIU the recent changes come from the RecentChanges table [1]. So what would be a proper sequence of actions? For example:

1. Download the mediawiki_history dump once and parse it
2. For every new update of my data pool, access recent changes through EventStreams as per [2]

Did I understand this correctly?

Last thing, in the pageview archive there are three types of file: automated, spider and user. Am I right in understanding that "user" relates to pageviews operated by real persons, while "automated" and "spider" by programs (not sure about the difference between the two)?

Cristina

[1] https://www.mediawiki.org/wiki/Manual:Recentchanges_table
[2] https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams


--
Marcel Ruiz Forns (he/him)
Senior Software Engineer