Hi,
As part of my first assignment, I'll recompute our historical webrequest dataset, adding client_ip and geocoded information.
While it seems correct to compute the historical client_ip from the existing ip and x_forwarded_for fields, using the current state of the MaxMind geocoding database to compute historical data is more error-prone.
I can either compute it anyway, knowing that there will be some errors, or write null values for data older than a given point in time.
I'll launch the script to recompute the data as soon as max(a consensus is found on this matter, operations gives me the go-ahead to run the script) :)
Thanks
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal
If I remember correctly, Chris had the MaxMind db on GitHub, with a script that updates it and commits the changes, thus making it possible to "play back time" and get the state of the db as it was when the data was calculated.
I think Dan has that script & cron running in his homedir; if we could productionize this, or at least document it on wikitech, that would be great.
Thanks,
Nuria
On Mon, Feb 23, 2015 at 4:58 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
I think it should be fine-ish; it depends on what we're calculating. When you say "geocoded information", what do you mean? Country? City? I wouldn't expect country to move about a lot in 60 days (which is the range of our data); I would expect city to.
What's the status on getting an Oozie job or similar to do this computation going forward? To me that's more of a priority than the historical data.
On 23 February 2015 at 13:59, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
As per the IRC discussion, we won't recompute historical data, but will start computing the new values from deploy time onward. A new "version" field and associated documentation will also be provided, allowing changes to be followed over time. Thanks for your input!
Best
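A minimal HiveQL sketch of how that cut-over could later be tracked in queries (the field name "version" comes from this thread; the wmf.webrequest table and its time partitions are assumptions on my side, not confirmed schema):

  -- Count requests per dataset version for a sample day
  -- (table and partition names are hypothetical)
  SELECT version, COUNT(*) AS requests
  FROM wmf.webrequest
  WHERE year = 2015 AND month = 3 AND day = 1
  GROUP BY version;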
On Mon, Feb 23, 2015 at 8:24 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
Gotcha. So, for transparency... what are we calculating? Country? City? :D
On 23 February 2015 at 15:00, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Oops, sorry, I forgot to answer this question :) A new map field named "geocoded_data" will contain, when available:
- continent
- country
- country_code
- subdivision
- postal_code
- city
- timezone
- latitude
- longitude
For instance:
  {
    "city": "Mukilteo",
    "country_code": "US",
    "longitude": "-122.3042",
    "subdivision": "Washington",
    "timezone": "America/Los_Angeles",
    "postal_code": "98275",
    "continent": "North America",
    "latitude": "47.913",
    "country": "United States"
  }
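In Hive, the map values can then be read with bracket syntax, along the lines of the sketch below (the wmf.webrequest table name and its year/month/day/hour partitions are assumptions on my side, not confirmed here):

  -- Top countries by request count for one sample hour
  -- (table and partition names are hypothetical)
  SELECT
    geocoded_data['country_code'] AS country_code,
    COUNT(*) AS requests
  FROM wmf.webrequest
  WHERE year = 2015 AND month = 2 AND day = 23 AND hour = 14
  GROUP BY geocoded_data['country_code']
  ORDER BY requests DESC
  LIMIT 10;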
Cheers Joseph
On Feb 23, 2015, at 15:13, Oliver Keyes <okeyes@wikimedia.org> wrote:
Neat! And those can then be accessed with, say, geocoded_data['country_code'] in Hive?
Are there any plans to integrate the connection type binary? (Sorry to ask endless questions, but this is my jam :D)
Oliver, you are our end user, guide us!