Hi,
We are three graduate students at UC Berkeley, and we are currently working on a machine learning project for a class that we’re taking.
We’re using the page views data that we believe you maintain: https://dumps.wikimedia.org/other/pagecounts-raw/
We have two quick questions that we were hoping you could answer:
1) We found views with a size of -1 or 0. Does this mean the page doesn’t exist?
2) We found that some articles have a `size` that varies widely across the hourly snapshots of a day. Is that legitimate, or is there something odd with the data?
Thanks, Ugur
Hi Ugur,
The pagecounts-raw data is deprecated and hasn’t been updated for a few years. Have you seen the pagecounts-ez data? It is a merger of the old pagecounts-raw data and the newer, better pageviews data. You can find it here: https://dumps.wikimedia.org/other/pagecounts-ez/
As for the -1 view counts, this is the first time I’ve heard of that problem. If a page is in the file, it means the page exists, but I have no idea what a negative count means. It shouldn’t be possible, and I’m sure it doesn’t happen in the new data.
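To see how common the -1 and 0 values are, the rows could be filtered out and counted before anything else is done with them. The following is only a minimal sketch: it assumes each line of an hourly file has four whitespace-separated fields (project code, page title, view count, bytes served), and the names `PageCount`, `parse_lines`, and `suspicious_rows` are hypothetical helpers, not part of any Wikimedia tooling.

```python
from typing import Iterable, Iterator, List, NamedTuple


class PageCount(NamedTuple):
    project: str
    title: str
    views: int
    size: int  # bytes served, per the thread above


def parse_lines(lines: Iterable[str]) -> Iterator[PageCount]:
    """Parse lines assumed to look like 'project title views size'."""
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        project, title, views, size = parts
        try:
            yield PageCount(project, title, int(views), int(size))
        except ValueError:
            continue  # skip lines whose numeric fields are not integers


def suspicious_rows(rows: Iterable[PageCount]) -> List[PageCount]:
    """Return rows whose view count or size is zero or negative."""
    return [r for r in rows if r.views <= 0 or r.size <= 0]


# Made-up sample lines for illustration only.
sample = [
    "en Main_Page 42 1048576",
    "en Some_Page 3 0",
    "en Other_Page 1 -1",
    "malformed line",
]
flagged = suspicious_rows(parse_lines(sample))
for row in flagged:
    print(row)
```

For a real hourly file, the same functions could be fed from `gzip.open(path, "rt", encoding="utf-8", errors="replace")` instead of the sample list.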
The size field is the bytes served, so it could vary as the page is edited from one minute to the next. But I couldn’t tell you how reliable it is. One tip would be to look at the page history and see how many bytes the page has at each revision. You can do this using https://quarry.wmflabs.org and querying the revision table for rev_size during the hours you see pageviews. That way you can check the accuracy of the size data.
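Before comparing against rev_size, it may help to quantify how much the size actually moves within a day. The sketch below assumes the size field is the total bytes served for the page during that hour, so dividing by the view count gives a rough per-view figure; that interpretation is an assumption, and the function names and sample numbers are made up for illustration. Since rev_size measures the stored revision text rather than the bytes sent to a reader, the two would probably be comparable only in how they change over time, not in absolute value.

```python
from typing import Dict, Iterable, Tuple


def bytes_per_view(hourly: Dict[int, Tuple[int, int]]) -> Dict[int, float]:
    """Map hour -> size / views, skipping hours with non-positive values.

    `hourly` maps an hour of the day to a (views, size) pair.
    """
    return {
        hour: size / views
        for hour, (views, size) in hourly.items()
        if views > 0 and size > 0
    }


def relative_spread(values: Iterable[float]) -> float:
    """Return (max - min) / min as a crude measure of variation."""
    vals = list(values)
    if not vals:
        raise ValueError("no values to compare")
    return (max(vals) - min(vals)) / min(vals)


# Hypothetical (views, size) pairs for one article over four hours.
hourly = {
    0: (10, 500000),
    1: (20, 1000000),
    2: (5, 400000),
    3: (4, 0),  # a zero size, excluded from the comparison
}
per_view = bytes_per_view(hourly)
spread = relative_spread(per_view.values())
print(per_view)
print(f"relative spread: {spread:.2f}")
```

A large spread during hours in which the page history shows no edits would suggest that the variation comes from something other than revisions.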
Good luck, and we’re here to help.
On Fri, Nov 17, 2017 at 10:18 Ugur Yildirim ugur.yildirim@berkeley.edu wrote:
Hi,
We are three graduate students at UC Berkeley, and we are currently working on a machine learning project for a class that we’re taking.
We’re using the page views data that we believe you maintain: https://dumps.wikimedia.org/other/pagecounts-raw/
We have two quick questions that we were hoping you could answer:
- We found views with a size of -1 or 0. Does this mean the page doesn’t exist?
- We found that some articles have a `size` that varies widely across the hourly snapshots of a day. Is that legitimate, or is there something odd with the data?
Thanks, Ugur
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics