Neat! And those can then be accessed with, say, geocoded_data['country_code'] in Hive?
Are there any plans to integrate the connection type binary? (Sorry to ask endless questions, but this is my jam :D)
On 23 February 2015 at 15:00, Joseph Allemandou jallemandou@wikimedia.org wrote:
Oops, sorry, I forgot to answer this question :) A new map field named "geocoded_data" will contain, when available:
continent, country, country_code, subdivision, postal_code, city, timezone, latitude, longitude
For instance: {"city":"Mukilteo","country_code":"US","longitude":"-122.3042","subdivision":"Washington","timezone":"America/Los_Angeles","postal_code":"98275","continent":"North America","latitude":"47.913","country":"United States"}
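And to the Hive question above: yes, bracket syntax works on map fields. A minimal sketch of such a query, with the caveat that the wmf.webrequest table name and partition columns here are assumptions, not the final schema:

  -- Minimal sketch: reading a map field with bracket syntax
  -- (table name and partition columns are assumptions)
  SELECT geocoded_data['country_code'], COUNT(*) AS requests
  FROM wmf.webrequest
  WHERE year = 2015 AND month = 2 AND day = 23
  GROUP BY geocoded_data['country_code'];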
Cheers Joseph
On Mon, Feb 23, 2015 at 8:24 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Gotcha. So, for transparency...what are we calculating? Country? City? :D
On 23 February 2015 at 13:59, Joseph Allemandou jallemandou@wikimedia.org wrote:
As per the IRC discussion, we won't recompute historical data, but will start computing new values from deploy time onward. A new "version" field and associated documentation will also be provided, allowing changes to be tracked over time. Thanks for your input! Best
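As an illustration of how such a version field might be used to follow changes, a hedged sketch (the version field name, table name, and partition columns are all assumptions here):

  -- Hypothetical sketch: list which geocoding versions appear per day,
  -- to spot when the computation logic changed
  SELECT year, month, day, version, COUNT(*) AS row_count
  FROM wmf.webrequest
  GROUP BY year, month, day, version;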
On Mon, Feb 23, 2015 at 4:58 PM, Oliver Keyes okeyes@wikimedia.org wrote:
I think it should be fine-ish; it depends on what we're calculating. When you say "geocoded information", what do you mean? Country? City? I wouldn't expect country to move about a lot in 60 days (which is the range of our data); I would expect city to.
What's the status on getting an Oozie job or similar to compute going forward? To me that's more of a priority than historical data.
On 23 February 2015 at 10:53, Joseph Allemandou jallemandou@wikimedia.org wrote:
Hi,
As part of my first assignment, I'll recompute our historical webrequest dataset, adding client_ip and geocoded information.
While it seems correct to compute historical client_ip from the existing ip and x_forwarded_for fields, using the current state of the MaxMind geocoding database to compute historical data is more error-prone.
I can either compute it anyway, knowing that there'll be some errors, or put null values for data older than a given point in time.
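For concreteness, the client_ip computation might look roughly like this in Hive. This is only a sketch under assumptions: the real logic would also need to validate x_forwarded_for entries, e.g. only trusting known proxies.

  -- Sketch: take the leftmost x_forwarded_for entry when one exists,
  -- otherwise fall back to the connecting ip. Assumption: '-' marks
  -- an empty x_forwarded_for, as in the raw webrequest logs.
  SELECT CASE
           WHEN x_forwarded_for IS NOT NULL AND x_forwarded_for != '-'
             THEN trim(split(x_forwarded_for, ',')[0])
           ELSE ip
         END AS client_ip
  FROM wmf.webrequest;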
I'll launch the script to recompute the data as soon as max(a consensus is reached on this matter, operations gives me the rights to run the script) :)
Thanks
--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics