Sounds great. I believe that normalization of the title will be very useful for future researchers and usages, as will adding the pageId. Currently it is not always straightforward to correlate the Wikipedia page with the unnormalized title.
On Mar 14, 2015, at 14:00, analytics-request@lists.wikimedia.org wrote:
Send Analytics mailing list submissions to analytics@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit https://lists.wikimedia.org/mailman/listinfo/analytics or, via email, send a message with subject or body 'help' to analytics-request@lists.wikimedia.org
You can reach the person managing the list at analytics-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of Analytics digest..."
Today's Topics:
- [Technical][Request for Comment] A new format for the pageview dumps (Oliver Keyes)
Message: 1
Date: Fri, 13 Mar 2015 15:06:00 -0400
From: Oliver Keyes okeyes@wikimedia.org
To: "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org, Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org
Subject: [Analytics] [Technical][Request for Comment] A new format for the pageview dumps
Message-ID: CAAUQgdCSVg8htCS4VFDZJaL2rUEjmk5ZEA+zXcp3o9OWn-udfQ@mail.gmail.com
Content-Type: text/plain; charset=UTF-8
So, we've got a new pageviews definition; it's nicely integrated and spitting out TRUE/FALSE values on each row with the best of 'em. But what does that mean for third-party researchers?
Well...not much, at the moment, because the data isn't being released anywhere. But one resource we do have that third parties use a heck of a lot is the per-page pageviews dumps on dumps.wikimedia.org.
Due to historical size constraints and decision-making (and by historical I mean: last decade) these have a number of weirdnesses in formatting terms: project identification is done using a notation style not really used anywhere else, mobile/zero/desktop appear on different lines, and the files are space-separated. I'd like to put some volunteer time into spitting out dumps in an easier-to-work-with format, using the new definition, to run in /parallel/ with the existing logs.
*The new format*
At the moment we have the format:
project_notation - encoded_title - pageviews - bytes
This puts zero and mobile requests to pageX in a different place to desktop requests, requires some reconstruction of project_notation, and contains (for some use cases) extraneous information - that being the byte-count. The files are also headerless, unquoted and space-separated, which saves space but is sometimes...I think the term is "eeeeh-inducing".
What I'd like to use as a new format is:
full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews
This file would:
- Include a header row;
- Be formatted as a tab-separated, rather than space-separated, file;
- Exclude bytecounts;
- Include desktop and mobile pageview counts on the same line;
- Use the full project URL ("en.wikivoyage.org") instead of the pagecounts-specific notation ("en.v")
So, as a made-up example, instead of:
de.m.v Florence 32 9024
de.v Florence 920 7570
we'd end up with:
de.wikivoyage.org Florence 920 32
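As a rough illustration of the conversion described above (not the script mentioned later in this message), the following Python sketch merges legacy lines into the proposed tab-separated layout. The names `SUFFIX_TO_DOMAIN`, `parse_project` and `convert` are hypothetical, the suffix table contains only the pair from the made-up example, and the sketch assumes the notation is always `lang.suffix` or `lang.m.suffix`.

```python
from collections import defaultdict

# Assumption: only the suffix from the made-up example is mapped here;
# a real converter would need the full table of legacy project suffixes.
SUFFIX_TO_DOMAIN = {"v": "wikivoyage"}

HEADER = "full_project_url\tencoded_title\tdesktop_pageviews\tmobile_and_zero_pageviews"


def parse_project(notation):
    """Turn 'de.v' or 'de.m.v' into ('de.wikivoyage.org', is_mobile)."""
    parts = notation.split(".")
    if len(parts) == 2:
        lang, suffix = parts
        is_mobile = False
    elif len(parts) == 3 and parts[1] == "m":
        lang, _, suffix = parts
        is_mobile = True
    else:
        raise ValueError(f"unrecognised project notation: {notation!r}")
    return f"{lang}.{SUFFIX_TO_DOMAIN[suffix]}.org", is_mobile


def convert(legacy_lines):
    """Merge legacy lines into header + one tab-separated row per page."""
    # (url, title) -> [desktop_views, mobile_views]
    counts = defaultdict(lambda: [0, 0])
    for line in legacy_lines:
        # Legacy fields are space-separated; the byte count is dropped.
        notation, title, views, _byte_count = line.split()
        url, is_mobile = parse_project(notation)
        counts[(url, title)][1 if is_mobile else 0] += int(views)
    rows = [HEADER]
    for (url, title), (desktop, mobile) in sorted(counts.items()):
        rows.append(f"{url}\t{title}\t{desktop}\t{mobile}")
    return rows
```

Calling `convert(["de.m.v Florence 32 9024", "de.v Florence 920 7570"])` returns the header row followed by a single row for Florence with the desktop and mobile counts side by side.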
In the future we could also work to /normalise/ the title - replacing it with the page title that refers to the actual pageID. This won't impact legacy files, and is currently blocked on the Apps team, but should be viable as soon as that blocker goes away.
I've written a script capable of parsing and reformatting the legacy files, so we should be able to backfill in this new format too, if that's wanted (see below).
*The size constraints*
There really aren't any. Like I said, the historical rationale for a lot of these decisions seems to have been keeping the files small. But by putting requests to the same title from different site versions on the same line, and dropping byte-count, we save enough space that the resulting files are approximately the same size as the old ones - or in many cases, actually smaller.
*What I'm asking for*
Feedback! What do people think of the new format? What would they like to see that they don't? What don't they need, here? How useful would normalisation be? How useful would backfilling be?
*What I'm not asking for*
WMF time! Like I said, this is a spare-time project; I've also got volunteers for Code Review and checking, too (Yuvi and Otto).
The replacement of the old files! Too many people depend on that format and that definition, and I don't want to make them sad.
Thoughts?
-- Oliver Keyes Research Analyst Wikimedia Foundation
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
End of Analytics Digest, Vol 37, Issue 33
Makes sense :). So, normalised titles at a minimum, and ideally both normalised titles and pageID? (the disadvantage of just pageID is, of course, having to look the darn thing up. The disadvantage of title is that redirects that happen after-the-fact are a thing. Both would solve for this).
On 14 March 2015 at 17:35, Roni Wiener roni.wiener@keotic.com wrote:
Sounds great. I believe that normalization of the title will be very useful for future researchers and usages, as will adding the pageId. Currently it is not always straightforward to correlate the Wikipedia page with the unnormalized title.
It seems that this schema will need to be modified to include new types of view counts. For example, it doesn't seem like views via the Wikipedia App are included (or maybe they are within "mobile"). If we wanted to have them separate, we'd need to add another column -- changing the data format/schema. For any additional sources of views that we'd need to separate in the future, we'd need to add another column again.
So, in a way, this format is less 'normalized' than the previous one, which would have put the counts from different sources on different lines. This will require a data consumer who wants to process historic files to be able to handle files with differing numbers of columns and think of the incoming data as "full_project_url - title - desktop_counts - mobile_counts - ..." where "..." could contain <something> or <nothing> but must be handled regardless.
I think that this is undesirable -- but not *that* undesirable.
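To make the consumer-side cost concrete, here is a minimal Python sketch of a reader that copes with files of differing widths by relying on the proposed header row. The column name `app_pageviews` and the function `read_counts` are hypothetical; only the four column names from the proposal come from this thread.

```python
import csv
import io


def read_counts(text, wanted="app_pageviews"):
    """Yield (project, title, count) for one column, using 0 when the
    file predates that column (i.e. the header does not list it)."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # the proposal describes unquoted fields
    )
    for row in reader:
        # row.get() returns None when the column is absent from the header.
        yield (
            row["full_project_url"],
            row["encoded_title"],
            int(row.get(wanted) or 0),
        )
```

An older four-column file yields a count of 0 for every row, while a newer file with the extra column yields its real values, so the same consumer code runs over both.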
-Aaron
On Sat, Mar 14, 2015 at 5:58 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Makes sense :). So, normalised titles at a minimum, and ideally both normalised titles and pageID? (the disadvantage of just pageID is, of course, having to look the darn thing up. The disadvantage of title is that redirects that happen after-the-fact are a thing. Both would solve for this).
+1 to having both page_id and (current?) page_title
Yay for +1s!
Aaron: I think your commentary (re: future maintainability being an argument for distinct rows) makes sense. Let's operate on that basis - it actually makes the query easier to write ;p.
Kevin: I'm not sure what value there'd be. I mean, there's page-size, maybe? But pageID gives us that (or should).
On 16 March 2015 at 14:20, Andrew Otto aotto@wikimedia.org wrote:
+1 to having both page_id and (current?) page_title
On Mon, Mar 16, 2015 at 3:14 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Kevin: I'm not sure what value there'd be. I mean, there's page-size, maybe? But pageID gives us that (or should).
Time-traveling with MediaWiki is very hard. Calculating the length of wikitext for a given pageID at a given time is cumbersome (instead of simple text processing, you are now dealing with DB queries, need to set up a local DB mirror etc). Finding out what title it had at the time is prohibitively hard (you have to parse semi-structured objects which are serialized into strings in the log table and follow the chain of renames). Finding out the byte size of the rendered HTML is practically impossible (templates and interface messages change; flagged revisions/pending changes might result in older versions of articles being shown). If you omit the bytecounts, there is no way people will be able to reconstruct them from the logs.
Not saying that's a problem - I personally don't see much use for them. Just don't expect pageID to be very useful for "normalizing" logs.
Yep, but bytecounts as an approximation of information density or content size are themselves not terribly useful (mobile web or desktop, two different bytecounts, same content). The goal with this is more, I think, to enable a reference point to work out "okay, what version of the article were these people prrrobably looking at, and what did it look like?"
On 20 March 2015 at 00:18, Gergo Tisza gtisza@wikimedia.org wrote:
[..]
Hm.. interesting. A few random ideas:
I suppose extra columns wouldn't have to break existing scripts.
When querying the total count, having the columns on one line makes it easier to add columns later, provided consumers' logic can do "sum any columns after title". That way, if we add another column (which, to some degree, means fragmenting traffic into more buckets, so you'd want to add up whatever columns there are at all times), their total count would stay accurate.
When querying for a specific type of traffic in the future (e.g. only traffic from *.applewatch.*.org ::troll::) one would have to be careful to account for the column not existing in older dumps.
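The "sum any columns after title" idea can be sketched in a few lines of Python. This assumes the tab-separated layout from the proposal, with the project and title in the first two fields and nothing but integer counts after them; `total_views` is a hypothetical name.

```python
def total_views(data_line):
    """Sum every count column that follows the project and title fields."""
    fields = data_line.rstrip("\n").split("\t")
    return sum(int(value) for value in fields[2:])
```

Because the function never names a column, a row with two count columns and a later row with three are both totalled correctly. Picking out one specific traffic type is the case that still needs a check for whether the column exists.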
Perhaps:
* add a column for total count per title (adding up N+2 columns feels like something query tools generally would not support).
* give each type of traffic its own column (pv_desktop, pv_mobile, pv_zero, pv_app).
Or perhaps:
* go back to separate lines for each traffic type (requiring users to be aware of the different types of traffic and their url permutation; or possibly we could use a normalised url and a dedicated column for traffic type).
This would mean users don't have to know the url or special identifier permutations and can simply skip that column if they want total counts. Very query-friendly.
canonical_project_hostname - traffic_source - encoded_title - pageviews_count
de.wikivoyage.org, desktop, "Florence", 920
de.wikivoyage.org, mobile, "Florence", 32
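A short Python sketch of how a consumer might query this one-row-per-traffic-source layout. It assumes the rows are literally comma-separated with quoted titles, as in the example, which differs from the tab-separated proposal earlier in the thread; `totals` is a hypothetical helper.

```python
import csv
from collections import Counter


def totals(lines, source=None):
    """Sum pageviews per (hostname, title); optionally keep one traffic source."""
    out = Counter()
    # skipinitialspace lets the reader handle ', "Florence"' style fields.
    for host, src, title, views in csv.reader(lines, skipinitialspace=True):
        if source is None or src == source:
            out[(host, title)] += int(views)
    return out


sample = [
    'de.wikivoyage.org, desktop, "Florence", 920',
    'de.wikivoyage.org, mobile, "Florence", 32',
]
all_traffic = totals(sample)
mobile_only = totals(sample, source="mobile")
```

Total counts come from ignoring the traffic_source column, and a new traffic type only adds rows rather than columns, so older files stay readable by the same code.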
/me goes back to lurking from the bushes,
— Timo
On 15 Mar 2015, at 15:44, Aaron Halfaker ahalfaker@wikimedia.org wrote:
[..]