Gonna paste your reply on the ticket
<https://phabricator.wikimedia.org/T184793> and respond there.
On Wed, Feb 7, 2018 at 1:29 PM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:
On Wed, Feb 7, 2018 at 9:19 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
> It will create significant discrepancies with the existing geolocation
> data we record for pageviews
If you only need country (or whatever is in the cookie), then likely
whatever the output dataset is would only include country when selecting
from pageviews. If you need more than country (it sounded like you
didn’t), then we can get into doing the IP geocoding in EventLogging, but
there are a few technical complications here, and we’d prefer not to have
to do this if we don’t have to.
As mentioned repeatedly in this thread (see e.g. Sam's Jan 29 email),
the goal is to record metadata consistent with our existing
content consumption measurement, concretely: the fields available in
the pageview_hourly table. See
https://phabricator.wikimedia.org/T186728 for details (also regarding
other fields that are not in EL by default but are likewise generated
in a standard fashion for webrequest/pageview data).
I appreciate it will need a bit of engineering work to implement your
proposal of reusing the existing UDF that underlies the pageview data
for the new preview data. But it will serve to avoid a lot of data
limitations and headaches for years to come. To highlight just one
aspect: If we relied on the cookie, the data would be inconsistent
from the start because not all clients accept cookies. When we want to
know (say) the ratio of previews to pageviews in a particular country,
we don't want to have to embark on a research project estimating the
number of cookie-less pageviews in that country. And so on.
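
For illustration only, here is a minimal sketch of the client-side cookie approach discussed further down in this thread, and of the gap it leaves for clients that do not send cookies. The helper names `countryFromCookies` and `buildPreviewEvent`, the event shape, and the assumed cookie layout (a cookie named `GeoIP` whose value starts with a country code followed by `:`) are assumptions made for the example, not the actual EventLogging or Page Previews code.

```javascript
// Hypothetical helper: pull the country code out of a cookie string such as
// document.cookie. Assumes a cookie named "GeoIP" whose value begins with a
// country code followed by ":" (an assumption for this sketch).
function countryFromCookies(cookieString) {
  for (const part of cookieString.split(';')) {
    const [name, ...rest] = part.trim().split('=');
    if (name === 'GeoIP') {
      const country = rest.join('=').split(':')[0];
      return country || null; // empty value: treat as unknown
    }
  }
  return null; // no GeoIP cookie at all, e.g. the client rejects cookies
}

// Hypothetical event builder: the country is simply missing (null) for
// cookie-less clients, which is the inconsistency described above.
function buildPreviewEvent(cookieString, pageTitle) {
  return { pageTitle, country: countryFromCookies(cookieString) };
}
```

In a browser this would be called as `buildPreviewEvent(document.cookie, title)`. Events from clients without the cookie end up with `country: null`, whereas geocoding from the IP on the backend would not depend on the client's cookie settings.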
On Wed, Feb 7, 2018 at 12:09 PM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:
>
> Thanks everyone! Separate from Sam's mapping out the frontend
> instrumentation work at https://phabricator.wikimedia.org/T184793 , I have
> created a task for the backend work at
> https://phabricator.wikimedia.org/T186728 based on this thread.
>
> Regarding the last few posts about the geolocation information, from the
> data analysis perspective, there is indeed another, more serious concern
> about using the GeoIP cookie: It will create significant discrepancies with
> the existing geolocation data we record for pageviews, where we have chosen
> to derive this information from the IP instead. (Remember the overarching
> goal here of measuring page previews the same way we measure page views
> currently; the basic principle is that if a reader visits a page and then
> uses the page preview feature on that page to read preview cards, all the
> metadata that is recorded for both should have identical values for both the
> preview and the pageview.) Therefore, we should go with the kind of solution
> Andrew outlined above (adapting/reusing GetGeoDataUDF or such).
>
> On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
>>
>> Wow Sam, yeah, if this cookie works for you, it will make many things
>> much easier for us. Check it out and let us know. If it doesn’t work for
>> some reason, we can figure out the backend geocoding part.
>>
>>
>>
>> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith <samsmith(a)wikimedia.org> wrote:
>>>
>>> On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
>>>>
>>>> > Using the GeoIP cookie will require reconfiguring the EventLogging
>>>> > varnishkafka instance [0]
>>>>
>>>> I’m not familiar with this cookie, but, if we used it, I thought it
>>>> would be sent back by the client in the event. E.g. event.country =
>>>> response.headers.country; EventLogging.emit(event);
>>>>
>>>> That way, there’s no additional special logic needed on the server side
>>>> to geocode or populate the country in the event.
>>>
>>>
>>> Hah! I didn't think about accessing the GeoIP cookie on the client. As
>>> you say, the implementation is quite easy.
>>>
>>> My only concern with this approach is the duplication of the value
>>> between the cookie, which is sent in every HTTP request to the /beacon/event
>>> endpoint, and the event itself. This duplication seems reasonable when
>>> balanced against capturing either: the client IP and then doing similar
>>> geocoding further along in the pipeline; or the cookie for all requests to
>>> that endpoint and then discarding them further along in the pipeline. It
>>> also reflects a seemingly core principle of the EventLogging system: that it
>>> doesn't capture potentially PII by default.
>>>
>>> -Sam
>
>
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB