Hi,
when doing some basic sanity checks on the output of the existing zero_country and zero_carrier Pig scripts, I found that the per-day sum of requests in the zero_country output is ~40k larger than in the zero_carrier output.
First, I've been told that the sum of the number of requests has to match.
Afterwards, I've been told that this is ok, as zero_country should hold all of the mobile requests from a country, and zero_carrier is a drill-down on the specific carriers.
When reading the Pig scripts/Java code, it is obvious that the first explanation does not match the code. The scripts take completely different paths through our code base and count completely different things :-(
However, the latter explanation does not make much sense to me either, as it's hard to believe that the requests from our zero partners make up >90% of each country's mobile requests. Besides, this explanation would not match how we generate the raw log files.
Whom could I ask about what the desired semantics of zero_{carrier,country} are?
Best regards, Christian
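The sanity check described above can be sketched roughly as follows. This is only an illustration: the row layout `(day, key, requests)`, the function names, and the numbers are assumptions made for the example, not the real output format of the Pig scripts.

```python
from collections import defaultdict


def daily_totals(rows):
    """Sum the request counts per day for rows shaped (day, key, requests)."""
    totals = defaultdict(int)
    for day, _key, requests in rows:
        totals[day] += requests
    return dict(totals)


def daily_gap(country_rows, carrier_rows):
    """Return {day: zero_country total minus zero_carrier total}."""
    country = daily_totals(country_rows)
    carrier = daily_totals(carrier_rows)
    days = sorted(set(country) | set(carrier))
    return {day: country.get(day, 0) - carrier.get(day, 0) for day in days}


# Made-up rows, only to show the shape of the check (not real data).
country_rows = [("2013-07-20", "AA", 120_000), ("2013-07-20", "BB", 80_000)]
carrier_rows = [
    ("2013-07-20", "carrier-1", 100_000),
    ("2013-07-20", "carrier-2", 60_000),
]
print(daily_gap(country_rows, carrier_rows))
```

A non-zero gap on any day is the kind of discrepancy this thread is about; whether a gap is expected depends on the intended semantics of the two outputs.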
Diederik and Evan
D
Hi Diederik,
[ rearranged response due to top-posting ]
On Mon, Jul 22, 2013 at 06:28:18AM -0700, Diederik van Liere wrote:
On Mon, Jul 22, 2013 at 9:27 AM, Christian Aistleitner christian@quelltextlich.at wrote:
Whom could I ask about what the desired semantics of zero_{carrier,country} are?
Diederik and Evan
so this means that:
zero_country should hold all of the mobile requests from a country, and zero_carrier is a drill-down on the specific carriers.
would be the correct interpretation?
If so, we cannot compute zero_country with the input files we get.
But even if we had all the logs, the code of zero_carrier.pig would not produce a subset of zero_country.pig, as it uses a completely different approach to determine a count-worthy log line [1].
But should not zero_carrier be a subset of zero_country then?
Best regards, Christian
[1] A rough transcript of the predicates is:
* zero_carrier :<==>
     (carrier is set)
  && (host contains "wikipedia")
  && (url path contains "/wiki/")
  && (url is mobile and mobile is free for carrier || url is zero and zero is free for carrier)
  && (language is included for carrier)
-------------------
* zero_country :<==>
     (host contains "wiki")
  && query does not contain "action=opensearch"
  && query does not contain "action=search"
  && query does not contain "title=Special%3ASearch&search"
  && file does not contain "wiki?search"
  && host does not contain "bits"
  && host does not contain "upload"
  && (url is not for image || (url is for image && mime type contains "image"))
  && (url is not for api || (url is for api && mime type is "application" && mime subtype is "json"))
  && ((url path does not contain "wiki" or "/w/index.php") || ((url path contains "wiki" or "/w/index.php") && mime type is "text" && (mime subtype is "html" || mime subtype is "vnd.wap.wml")))
  && (url is not for image && url is not for api && url path does not contain "wiki" && url path does not contain "/w/index.php" && mime type is "text" && mime subtype is "html")
  && response code matches ".*(20\d|302|304).*"
  && lower cased request method contains "get"
  && ip address is not in 10.0.0.0/8
  && ip address is not in 208.80.152.0/22
  && ip address is not in 91.198.174.0/24
  && user agent does not contain "bot"
  && user agent does not contain "spider"
  && user agent does not contain "http"
  && user agent does not contain "crawler"
  && (url is for api || (url path contains "wiki" or "/w/index.php"))
  && (url is for api && referrer is null && isApiPageViewRequest(url))
  && (url is for api && referrer is not null && isApiPageViewRequest(referrer) && (canonical titles of paramA and paramB are both null || canonical titles of paramA and paramB are not equivalent))
  && (url is for api && referrer is not null && !isApiPageViewRequest(referrer))
  && (url path contains "/wiki/" or "/w/index.php")
-------------------
isApiPageViewRequest(param) :<==>
     (path of param contains "/w/api.php")
  && (param's query contains "action=view" || param's query contains "action=mobileview" || param's query contains "action=query")
-------------------
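To make the two shorter predicates from the transcript easier to follow, here is a minimal Python sketch of zero_carrier and isApiPageViewRequest. It is not the Kraken Java code: `CarrierConfig`, the function names, and the way "mobile", "zero", and the language are read off the host name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class CarrierConfig:
    """Hypothetical per-carrier settings; not the real Kraken configuration."""
    mobile_is_free: bool
    zero_is_free: bool
    languages: frozenset


def is_api_pageview_request(url: str) -> bool:
    """Transcript of isApiPageViewRequest: api.php path plus one of three actions."""
    parts = urlsplit(url)
    actions = ("action=view", "action=mobileview", "action=query")
    return "/w/api.php" in parts.path and any(a in parts.query for a in actions)


def is_zero_carrier_pageview(url: str, carrier: Optional[CarrierConfig]) -> bool:
    """Transcript of the zero_carrier predicate under the assumptions above."""
    if carrier is None:  # "carrier is set"
        return False
    parts = urlsplit(url)
    host = parts.hostname or ""
    language = host.split(".")[0]  # assumption: language is the first host label
    is_mobile = ".m." in host      # assumption: mobile site marker
    is_zero = ".zero." in host     # assumption: zero site marker
    return (
        "wikipedia" in host
        and "/wiki/" in parts.path
        and ((is_mobile and carrier.mobile_is_free) or (is_zero and carrier.zero_is_free))
        and language in carrier.languages
    )
```

The zero_country predicate is left out on purpose: it is much longer, and the rough transcript above is not meant to be taken as an exact boolean expression.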
On Mon, Jul 22, 2013 at 11:14 AM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi Diederik,
[ rearranged response due to top-posting ]
On Mon, Jul 22, 2013 at 06:28:18AM -0700, Diederik van Liere wrote:
On Mon, Jul 22, 2013 at 9:27 AM, Christian Aistleitner christian@quelltextlich.at wrote:
Whom could I ask about what the desired semantics of zero_{carrier,country} are?
Diederik and Evan
so this means that:
zero_country should hold all of the mobile requests from a country, and zero_carrier is a drill-down on the specific carriers.
would be the correct interpretation?
If so, we cannot compute zero_country with the input files we get.
That's correct, we need the full mobile stream from Kraken, so we have to let go of this for now.
But even if we had all the logs, the code of zero_carrier.pig would not produce a subset of zero_country.pig, as it uses a completely different approach to determine a count-worthy log line [1].
I think you are looking at the wrong zero_country.pig script; the logic should be the same for zero_country and zero_carrier.
But should not zero_carrier be a subset of zero_country then?
Yes, zero_carrier is a subset of zero_country.
Best regards, Christian
--
---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a
4040 Linz, Austria
Email: christian@quelltextlich.at
Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hi Diederik,
On Mon, Jul 22, 2013 at 11:34:09AM -0400, Diederik van Liere wrote:
I think you are looking at the wrong zero_country.pig script, the logic should be the same for zero_country and zero_carrier.
Mhmm ... that would explain things :-) I've been told to use pig/zero_country.pig and pig/zero_carrier.pig from the Kraken repository.
Which scripts should I use instead?
Best regards, Christian
I think most of the logic you are referring to is built into the two pageview UDFs, right?
zero_country.pig: DEFINE IS_PAGEVIEW org.wikimedia.analytics.kraken.pig.PageViewFilterFunc();
zero_carrier.pig: DEFINE IS_ZERO_PAGEVIEW org.wikimedia.analytics.kraken.pig.ZeroFilterFunc('default');
Hi Andrew,
On Mon, Jul 22, 2013 at 12:43:09PM -0400, Andrew Otto wrote:
I think most of the logic you are referring to is built into the two pageview UDFs, right?
Yes. The problem however is that the UDFs filter different things.
Consider, for example, a request to a search page. The UDF of zero_carrier.pig allows it to be counted. The UDF of zero_country.pig would filter it out.
An example for the other direction is http://ar.m.wikipedia.org/w/index.php?title=%D9%85%D9%84%D9%81:Abha_01.jpg&a... A request for it would be counted by zero_country.pig, but zero_carrier.pig's UDF would filter it out.
So neither script's set of counted rows is a subset of the other's :-(
Best regards, Christian
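The two counter-examples above can be illustrated with a small Python sketch. The predicates below are heavily simplified stand-ins for the two UDFs (only the host, path, and search checks from the rough transcript are kept), and the three URLs are hypothetical examples chosen for this illustration, not entries from the logs.

```python
from urllib.parse import urlsplit

SEARCH_MARKERS = ("action=opensearch", "action=search", "title=Special%3ASearch&search")


def counted_by_carrier_udf(url):
    """Simplified stand-in for the zero_carrier check: host and /wiki/ path only."""
    parts = urlsplit(url)
    return "wikipedia" in (parts.hostname or "") and "/wiki/" in parts.path


def counted_by_country_udf(url):
    """Simplified stand-in for the zero_country check: page path, no search."""
    parts = urlsplit(url)
    is_search = any(marker in parts.query for marker in SEARCH_MARKERS)
    is_page_path = "/wiki/" in parts.path or "/w/index.php" in parts.path
    return "wiki" in (parts.hostname or "") and is_page_path and not is_search


# Hypothetical example URLs, not taken from the logs.
ARTICLE = "http://en.m.wikipedia.org/wiki/Main_Page"
SEARCH = "http://en.m.wikipedia.org/wiki/Special:Search?action=search&search=kraken"
FILE_PAGE = "http://ar.m.wikipedia.org/w/index.php?title=File:Example.jpg"

requests = [ARTICLE, SEARCH, FILE_PAGE]
carrier_set = {u for u in requests if counted_by_carrier_udf(u)}
country_set = {u for u in requests if counted_by_country_udf(u)}

only_carrier = carrier_set - country_set  # counted by zero_carrier only
only_country = country_set - carrier_set  # counted by zero_country only
print(sorted(only_carrier), sorted(only_country))
```

Both differences are non-empty under these simplified predicates, so neither set contains the other.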
Hi,
let me post updates to this problem, so we also have that in the email archives.
On Mon, Jul 22, 2013 at 08:21:33PM +0200, quelltextlich e.U. - Christian Aistleitner wrote:
The problem however is that the UDFs filter different things.
Diederik agreed that we're gonna change that and make the UDFs agree. We'll use the detailed pageview definition of PageViewFilterFunc in both places so we can finally compare apples to apples.
Thanks Diederik!
Best regards, Christian
P.S.: I've been told we should not use this list for technical discussions. Sorry for the noise.
On 07/22/2013 04:52 PM, Christian Aistleitner wrote:
P.S.: I've been told we should not use this list for technical discussions. Sorry for the noise.
I disagree; I do not think it's noise. On the contrary I think it's a great idea.
--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation
Hi,
although the decision was already made some weeks ago, let me add it here to keep our archives happy:
On Mon, Jul 22, 2013 at 10:52:09PM +0200, Christian Aistleitner wrote:
P.S.: I've been told we should not use this list for technical discussions. Sorry for the noise.
After some internal discussion, the analytics team decided that analytics-related /technical/ questions (that do not contain private information) are indeed on-topic on this list. So let's rather have such discussions here in public instead of in private emails :-)
Please speak up if the signal-to-noise ratio gets disconcerting for you.
Best regards, Christian
Afterwards, I've been told that this is ok, as zero_country should hold all of the mobile requests from a country, and zero_carrier is a drill-down on the specific carriers.
You know, since we are filtering on the X-Analytics header to capture these logs, we are not going to be able to get zero_country from them. These logs are not all of the mobile webrequest logs.
Hi Andrew,
On Mon, Jul 22, 2013 at 10:02:55AM -0400, Andrew Otto wrote:
Afterwards, I've been told that this is ok, as zero_country should hold all of the mobile requests from a country, and zero_carrier is a drill-down on the specific carriers.
You know, since we are filtering on the X-Analytics header to capture these logs, we are not going to be able to get zero_country from them. These logs are not all of the mobile webrequest logs.
Just repeating what I've been told… But yes, I completely agree with you. In fact that was what I meant by:
Besides, this explanation would not match how we generate the raw log files.
Best regards, Christian