I'm starting to look at some machine learning projects I've wanted to do for a while (e.g. sock-puppet detection). This quickly leads to having to make decisions about data storage formats: CSV, JSON, protobufs, etc. Left to my own devices, I'd probably use protos, but I don't want to be swimming upstream.
Are there any standards in wiki-land for how people store data? If there's some common way that "everybody does it", that's how I want to do it too. Or, does every project just do their own thing?
Hi Roy,
We had to evaluate data formats for event streaming systems as part of WMF's Modern Event Platform program (https://phabricator.wikimedia.org/T185233, https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019/TEC2:_Modern_Event_Platform). The event streaming world mostly uses Avro, but there is plenty of Protobuf around too.
We ultimately decided to use JSON with JSONSchema as our transport format. While it lacks some advantages of the binary options, JSON is just more ubiquitous and easier to work with in a distributed, open-source-focused developer community. (You don't need the schema to read the data.)
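To make that last point concrete, here is a tiny made-up example (not one of our real schemas) using the Python jsonschema library: the event is plain JSON that any consumer can parse on its own, and the schema only comes into play when you want to validate.

    import json
    import jsonschema  # pip install jsonschema

    # Toy schema, purely illustrative -- not a real Event Platform schema.
    toy_schema = {
        "type": "object",
        "properties": {
            "meta": {
                "type": "object",
                "properties": {
                    "stream": {"type": "string"},
                    "dt": {"type": "string"},
                },
                "required": ["stream", "dt"],
            },
            "page_id": {"type": "integer"},
        },
        "required": ["meta"],
    }

    raw = '{"meta": {"stream": "toy.page_touch", "dt": "2019-11-25T16:00:00Z"}, "page_id": 12345}'

    event = json.loads(raw)                 # readable with nothing but a JSON parser
    jsonschema.validate(event, toy_schema)  # the schema is only needed to validate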
More reading:
- Choose Schema Tech RFC: https://phabricator.wikimedia.org/T198256
- An old JSON justification blog post: https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/
Our choice of JSONSchema and JSON is mostly about canonical data schemas for in-flight data transport and protocols. For data at rest, it might make more sense to serialize into something completely different (we use Parquet in Hadoop for most data there). You can read some WIP documentation about how we use JSONSchema here: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas.
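For a small-scale illustration of that at-rest step (this is just pandas + pyarrow on a laptop, nothing like the actual Hadoop pipeline, and the file names are made up):

    import pandas as pd  # needs pandas and pyarrow installed

    # events.json: newline-delimited JSON, one event per line (hypothetical file)
    events = pd.read_json("events.json", lines=True)

    # Re-serialize the same data into a columnar format for analytics at rest
    events.to_parquet("events.parquet")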
Hi Andrew,
Thanks for that info. I'd never heard of JSON Schema before, so I've done a bit of reading on it. As far as I can tell, it's pretty much a 1:1 mapping to the proto specification language. What's not clear to me is how you actually use JSON Schema in real life.
I get that it provides documentation of the schema. That's pretty obvious.
It's not clear how the validation part works. Does a production data consumer validate every incoming JSON object it receives? Do producers validate every object they send?
Beyond that, what else do you do with JSON Schema? Is there some sort of code generation aspect to it, as with the proto compiler?
Over the past few days, I've been playing with some code a bit, and thinking a lot. I'm slowly coming to the conclusion that JSON is the way for me to go, for pretty much the reasons you outlined in T198256. The biggest advantage I can see to protos (outside of the immersive google infrastructure) is efficiency. If I need better performance later, swapping out one for the other doesn't seem like it would be a major problem.
On a vaguely related note, I saw a question in the recent Cloud Services Survey that mentioned MongoDB. Is mongo used inside of WMF? It seems like it would be a natural fit in a JSON shop. I don't see it running on the bastion hosts.
> It's not clear how the validation part works.

For us, we use the schema to validate the incoming data before accepting and persisting it. We have various disparate producers of data (client-side JavaScript, internal PHP, other services and languages), and we need to ensure that the data we receive is consistent and usable in both loosely and strongly typed systems. Since we are using JSONSchema for event streams, we have a separate service (EventGate, https://github.com/wikimedia/eventgate) that receives the events over HTTP and validates them before sending them downstream (to Kafka).
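EventGate itself is a Node.js service, but the gist of the gate pattern is simple. Here is a rough Python sketch (the names are made up, this is not EventGate's actual API):

    import jsonschema

    def handle_post(event, schema, produce_to_kafka):
        """Validate one incoming event; only forward it downstream if it is valid.

        `produce_to_kafka` stands in for whatever actually sends the event on.
        """
        try:
            jsonschema.validate(event, schema)
        except jsonschema.ValidationError as err:
            # Invalid: reject it and tell the producer why (e.g. HTTP 400)
            return {"status": 400, "error": err.message}
        produce_to_kafka(event)  # valid: send downstream
        return {"status": 201}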
> Beyond that, what else do you do with JSON Schema? Is there some sort of code generation aspect to it, as with the proto compiler?

We also use the schema to do downstream integration between data stores. The JSONSchemas are used to create RDBMS tables into which we can parse and insert the JSON event data. We also use the JSONSchema for language integration: if not code generation directly, then automatic deserializers between a JSON event and Java (or whatever) objects. E.g. we map from JSONSchema to Spark's schema format (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/sql/JsonSchemaConverter.scala), or (one day?) from JSONSchema to Kafka Connect's schema format (https://github.com/ottomata/kafka-connect-jsonschema).
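As a rough sketch of the RDBMS side of that (nothing like the real refinery code linked above, and the type mapping is deliberately oversimplified):

    # Map JSONSchema scalar types to SQL column types (illustrative only;
    # real converters also handle nesting, arrays, formats, etc.)
    JSONSCHEMA_TO_SQL = {
        "string": "TEXT",
        "integer": "BIGINT",
        "number": "DOUBLE",
        "boolean": "BOOLEAN",
    }

    def create_table_ddl(table_name, schema):
        """Generate a CREATE TABLE statement from a flat JSONSchema."""
        columns = ", ".join(
            f"{name} {JSONSCHEMA_TO_SQL.get(prop.get('type'), 'TEXT')}"
            for name, prop in schema["properties"].items()
        )
        return f"CREATE TABLE {table_name} ({columns})"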
> Does a production data consumer validate every incoming JSON object it receives?

It could, but usually not. In our usage, we assume that data has been validated before it enters the system, so consumers can be sure the data is valid.
> The biggest advantage I can see to protos (outside of the immersive google infrastructure) is efficiency.

I'm not sure how Protobuf works here, but an advantage of Avro was its schema evolution features. These make it easier for consumers and producers to work with different versions of the same schema without having to upgrade their code. We accomplish this by only allowing a very strict type of change to JSONSchemas: only optional field additions are allowed (no renames, no field removals, no type changes, etc.).
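A rough sketch of what that compatibility rule amounts to (simplified to flat schemas; our actual tooling does more than this):

    def is_compatible_change(old_schema, new_schema):
        """Allow only optional field additions: no removals, no type changes,
        and nothing newly required."""
        old_props = old_schema.get("properties", {})
        new_props = new_schema.get("properties", {})

        # Every old field must still exist, with the same type
        for name, prop in old_props.items():
            if name not in new_props:
                return False
            if new_props[name].get("type") != prop.get("type"):
                return False

        # Any newly added field must not be required
        added = set(new_props) - set(old_props)
        return not (added & set(new_schema.get("required", [])))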
> Is mongo used inside of WMF?

Not that I know of. I could be wrong, but I think most application state for WMF services is either in MariaDB (MediaWiki uses this) or in Redis or Cassandra (which are really derivative caches of data canonically stored in MariaDB).
Thanks again for the additional info. I've been doing more reading about JSON Schema and am growing fond of it. The idea of schema-driven fuzz testing with https://pypi.org/project/hypothesis-jsonschema/ is pretty cool.
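For example, something along these lines (process_event is just a stand-in for whatever consumer code I end up writing, and the schema is a toy one):

    from hypothesis import given
    from hypothesis_jsonschema import from_schema

    # A toy schema, just for illustration
    schema = {
        "type": "object",
        "properties": {"page_id": {"type": "integer"}},
        "required": ["page_id"],
    }

    def process_event(event):
        # Placeholder for whatever consumer code I actually write
        return event["page_id"]

    @given(from_schema(schema))
    def test_handles_any_valid_event(event):
        # hypothesis generates arbitrary events that conform to the schema
        process_event(event)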