Heyo, Discovery team!
(Analytics CC'd)
This is just a quick writeup of the Scalable Event Systems meeting that Erik, Dan, Stas, and I went to (although just from my perspective).
For people not on the initial thread: this is a proposal to replace the internal architecture of EventLogging and similar services with Apache Kafka brokers (http://www.confluent.io/blog/stream-data-platform-1/). What that means in practice is that the current 1-2k events/second limit on EventLogging will disappear, and we can stop worrying about sampling and accidentally bringing down the system. We can be a lot less cautious about both our schemas and our sampling rates!
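To make that concrete, here's a rough sketch of what producing an event straight to a Kafka topic could look like. Everything below (broker address, topic name, schema fields) is made up; it's just to show the shape of the thing, not the actual design:

    # Illustrative only: broker, topic, and schema names are hypothetical.
    import json
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="kafka1001.example.org:9092",  # hypothetical broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # An EventLogging-style event: schema name/revision plus the payload.
    event = {
        "schema": "TestSearchSatisfaction",  # hypothetical schema
        "revision": 1,
        "event": {"action": "searchResultPage", "hitsReturned": 20},
    }

    # Brokers scale horizontally, which is what lifts the current
    # 1-2k events/second ceiling.
    producer.send("eventlogging_TestSearchSatisfaction", event)
    producer.flush()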
It also opens up a lot of opportunities around streaming data and making it available in a layered fashion. While I don't think we want to explore that right now, it's nice to have as an option for when we better understand our search data and how we can safely distribute it.
I'd like to thank the Analytics team, particularly Andrew, for putting this together; it was a super-helpful discussion to be in and this sort of product is precisely what I, at least, have been hoping for out of the AnEng brain trust. Full speed ahead!
On Mon, Aug 3, 2015 at 3:19 PM, Tomasz Finc tfinc@wikimedia.org wrote:
Very excited to see this moving forward.
On Aug 3, 2015, at 18:38, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
What are the implications (if any) for event validation?
Dario Taraborelli followed up:
nm, clarified with Kevin.
On 4 August 2015 at 04:24, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Oliver Keyes, 04/08/2015 00:12:
> a lot less cautious about our sampling rate!
A bit, perhaps, not a lot. Sampling is not just a performance matter.

On Tue, Aug 4, 2015 at 4:27 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Could you expand on that?
On 13 August 2015 at 10:36, Dan Andreescu dandreescu@wikimedia.org wrote:
Not to speak for Nemo, but we don't want reckless abandon just because the system won't break. Thrift is one of our values at the Foundation, and if we don't need to scale out with more hardware, we shouldn't. So I think that if collected data has value beyond the cost of wrangling it through Kafka / HDFS / dashboards, it should be collected; if not, it should be sampled until it does. There may not be an easy way to measure this, so we'll have to rely on good old subjective consensus. I promise we won't be too strict about it; we'll just kindly ask people to think twice before collecting a lot of data.
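(A toy illustration of the kind of sampling in question: hash a session token into a bucket so that a fixed fraction of sessions is kept, and a given session is always either in or out. The field names and rate here are made up.)

    import hashlib

    def in_sample(session_id: str, rate_percent: int) -> bool:
        """Deterministically keep ~rate_percent% of sessions by hashing
        the session token into a bucket from 0-99."""
        bucket = int(hashlib.md5(session_id.encode("utf-8")).hexdigest(), 16) % 100
        return bucket < rate_percent

    # Keep roughly 10% of sessions; the same session always gets the
    # same answer, so its events stay together.
    if in_sample("some-session-token", 10):
        pass  # produce the event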
Oliver Keyes replied:
Indeed; I'm familiar with the WMF's values ;). I was trying to work out whether it was a hardware-cost thing, a privacy thing, or something else.