Hi Andrew,
Thanks for that info. I'd never heard of JSON Schema before, so I've done a bit of
reading on it. As far as I can tell, it's pretty much a 1:1 mapping to the proto
specification language. What's not clear to me is how you actually use JSON Schema in
real life.
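Here's the sort of correspondence I have in mind (a hypothetical example I cooked up
while reading, with the schema written as a Python dict; the field names are mine):

    # A proto message like:
    #
    #   message User {
    #     required string name = 1;
    #     optional int32 age = 2;
    #   }
    #
    # seems to map more or less 1:1 onto a JSON Schema like:
    user_schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name"],
    }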
I get that it provides documentation of the schema. That's pretty obvious.
It's not clear how the validation part works. Does a production data consumer
validate every incoming JSON object it receives? Do producers validate every object they
send?
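To make the question concrete, here's what I imagine consumer-side validation might
look like, using the Python jsonschema package (the helper and the events are made up):

    import jsonschema  # pip install jsonschema

    def valid_event(event, schema):
        """Return True if an incoming event conforms to its schema."""
        try:
            jsonschema.validate(instance=event, schema=schema)
            return True
        except jsonschema.ValidationError as err:
            print(f"rejecting bad event: {err.message}")
            return False

    # Reusing the hypothetical user_schema from above:
    valid_event({"name": "roy", "age": 42}, user_schema)  # True
    valid_event({"age": "forty-two"}, user_schema)        # False

Is that roughly the shape of it, and if so, is it done on every object in production?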
Beyond that, what else do you do with JSON Schema? Is there some sort of code-generation
aspect to it, as with the proto compiler?
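From a little searching, it looks like third-party generators exist; here's one example
from the Python world (no idea whether WMF actually uses anything like this):

    # datamodel-code-generator is one third-party tool I turned up; it
    # emits typed Python classes from a JSON Schema, vaguely like protoc:
    #
    #   pip install datamodel-code-generator
    #   datamodel-codegen --input user.schema.json \
    #       --input-file-type jsonschema --output user_model.py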
Over the past few days, I've been playing with some code a bit, and thinking a lot.
I'm slowly coming to the conclusion that JSON is the way for me to go, for pretty much
the reasons you outlined in T198256. The biggest advantage I can see to protos (outside
of being immersed in Google's infrastructure) is efficiency. If I need better performance
later, swapping one out for the other doesn't seem like it would be a major problem.
On a vaguely related note, I saw a question in the recent Cloud Services Survey that
mentioned MongoDB. Is Mongo used inside WMF? It seems like it would be a natural fit in a
JSON shop, but I don't see it running on the bastion hosts.
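The reason it seems natural: as I understand it, JSON events would go into Mongo
essentially as-is. A sketch with pymongo (server address and collection names made up):

    from pymongo import MongoClient  # pip install pymongo

    client = MongoClient("mongodb://localhost:27017")  # hypothetical server
    events = client["events_db"]["user_events"]

    # A JSON event (as a Python dict) is stored directly; no ORM layer needed.
    events.insert_one({"name": "roy", "age": 42})
    print(events.find_one({"name": "roy"}))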
On Nov 25, 2019, at 11:14 AM, Andrew Otto <otto@wikimedia.org> wrote:
Hi Roy,
We had to evaluate data formats for event streaming systems as part of WMF's Modern
Event Platform program (task: <https://phabricator.wikimedia.org/T185233>; program page:
<https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019/TEC2:_Modern_Event_Platform>).
The event streaming world mostly uses Avro, but there are plenty of Protobufs around too.
We ultimately decided to use JSON with JSONSchema as our transport format. While it lacks
some advantages of the binary options, JSON is just more ubiquitous and easier to
work with in a distributed and open source focused developer community. (You don't
need the schema to read the data.)
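To illustrate that last point (a trivial sketch, with a made-up event):

    import json

    raw = b'{"event": "page_edit", "user": "roy"}'
    print(json.loads(raw))  # readable with nothing but a JSON parser

    # With Avro or Protobuf, the same payload would be opaque bytes unless
    # you also had the writer's schema or the compiled message classes.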
More reading:
- Choose Schema Tech RFC <https://phabricator.wikimedia.org/T198256>
- An old JSON justification blog post
<https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/>
Our choice of JSONSchema and JSON is mostly about canonical data schemas for in-flight
data transport and protocols. For data at rest, it might make more sense to serialize
into something completely different (we use Parquet in Hadoop for most data there). You
can read some WIP documentation about how we use JSONSchema here
<https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas>.
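For a rough idea of the at-rest side, newline-delimited JSON events can be rewritten as
Parquet; this pyarrow sketch is just an illustration, not what our pipelines literally
run:

    import pyarrow.json as pajson   # pip install pyarrow
    import pyarrow.parquet as pq

    # events.jsonl: one JSON event per line (hypothetical file)
    table = pajson.read_json("events.jsonl")
    pq.write_table(table, "events.parquet")  # columnar, better for analytics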
On Fri, Nov 22, 2019 at 10:14 PM Roy Smith <roy@panix.com> wrote:
I'm starting to look at some machine learning projects I've wanted to do for a
while (ex: sock-puppet detection). This quickly leads to having to make decisions about
data storage formats: CSV, JSON, protobufs, etc. Left to my own devices, I'd
probably use protos, but I don't want to be swimming upstream.
Are there any standards in wiki-land for how people store data? If there's some
common way that "everybody does it", that's how I want to do it too. Or does every
project just do its own thing?
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud