Hi Roy,
We had to evaluate data formats for event streaming systems as part of WMF's Modern Event Platform program (task: https://phabricator.wikimedia.org/T185233, plan: https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019/TEC2:_Modern_Event_Platform). The event streaming world mostly uses Avro, but there is plenty of Protobuf around too.
We ultimately decided to use JSON with JSONSchema as our transport format. While it lacks some advantages of the binary options, JSON is simply more ubiquitous and easier to work with in a distributed, open-source-focused developer community. (You don't need the schema to read the data.)
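To make that concrete, here's a minimal sketch of validating an incoming JSON event against a JSONSchema with the Python jsonschema library. The schema and event fields below are made up for illustration; they aren't our actual event schemas.

import json
import jsonschema

# A hypothetical schema for a simple event (not a real WMF schema).
schema = {
    "type": "object",
    "properties": {
        "meta": {
            "type": "object",
            "properties": {
                "dt": {"type": "string"},
                "stream": {"type": "string"},
            },
            "required": ["dt", "stream"],
        },
        "action": {"type": "string"},
    },
    "required": ["meta"],
}

# An event as it might arrive off the wire, readable without any schema.
event = json.loads(
    '{"meta": {"dt": "2019-11-22T22:14:00Z", "stream": "test.event"}, "action": "edit"}'
)

# Raises jsonschema.exceptions.ValidationError if the event doesn't conform.
jsonschema.validate(instance=event, schema=schema)

Note that the json.loads() step needs no schema at all; the schema only comes into play when you want validation. That's the part that makes JSON so easy for third parties to consume.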
More reading:
- Choose Schema Tech RFC: https://phabricator.wikimedia.org/T198256
- An older JSON justification blog post: https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/
Our choice of JSONSchema and JSON is mostly about canonical data schemas for in-flight data transport and protocols. For data at rest, it can make more sense to serialize into something completely different (we use Parquet in Hadoop for most data there). You can read some WIP documentation about how we use JSONSchema at https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas.
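As an illustration of the at-rest side, here's a minimal sketch of batch-converting newline-delimited JSON events into a columnar Parquet file using pandas and pyarrow. The file names are hypothetical, and this isn't our actual pipeline (we do this at much larger scale in Hadoop); it just shows the shape of the JSON-in-flight, Parquet-at-rest split.

import pandas as pd

# Read newline-delimited JSON events (one JSON object per line).
events = pd.read_json("events.ndjson", lines=True)

# Write them back out as a columnar Parquet file for analytics at rest.
events.to_parquet("events.parquet", engine="pyarrow")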
On Fri, Nov 22, 2019 at 10:14 PM Roy Smith roy@panix.com wrote:
I'm starting to look at some machine learning projects I've wanted to do for a while (e.g. sock-puppet detection). This quickly leads to having to make decisions about data storage formats, e.g. CSV, JSON, protobufs, etc. Left to my own devices, I'd probably use protos, but I don't want to be swimming upstream.
Are there any standards in wiki-land for how people store data? If there's some common way that "everybody does it", that's how I want to do it too. Or does every project just do its own thing?