Hi Roy,
We had to evaluate data formats for event streaming systems as part of WMF's Modern Event Platform program (task: https://phabricator.wikimedia.org/T185233, plan: https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019/TEC2:_Modern_Event_Platform). The event streaming world mostly uses Avro, but there is plenty of Protobuf around too.
We ultimately decided to use JSON with JSONSchema as our transport format. While it lacks some advantages of the binary options, JSON is simply more ubiquitous and easier to work with in a distributed, open-source-focused developer community. (You don't need the schema to read the data.)
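To make that concrete, here's a minimal sketch of validating an incoming JSON event against a JSONSchema with the Python jsonschema library. The schema and event fields below are made up for illustration; they aren't our actual event schemas.

import json
import jsonschema

# A hypothetical schema for a simple event (not a real WMF schema).
schema = {
    "type": "object",
    "properties": {
        "meta": {
            "type": "object",
            "properties": {
                "dt": {"type": "string"},
                "stream": {"type": "string"},
            },
            "required": ["dt", "stream"],
        },
        "action": {"type": "string"},
    },
    "required": ["meta"],
}

# An event as it might arrive off the wire, readable without any schema.
event = json.loads(
    '{"meta": {"dt": "2019-11-22T22:14:00Z", "stream": "test.event"}, "action": "edit"}'
)

# Raises jsonschema.exceptions.ValidationError if the event doesn't conform.
jsonschema.validate(instance=event, schema=schema)

Note that the json.loads() step needs no schema at all; the schema only comes into play when you want validation. That's the part that makes JSON so easy for third parties to consume.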
More reading:
- Choose Schema Tech RFC: https://phabricator.wikimedia.org/T198256
- An older JSON justification blog post: https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/
Our choice of JSONSchema and JSON is mostly about canonical data schemas for in-flight data transport and protocols. For data at rest, it can make more sense to serialize into something completely different (we use Parquet in Hadoop for most data there). You can read some WIP documentation about how we use JSONSchema at https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas.
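As an illustration of the at-rest side, here's a minimal sketch of batch-converting newline-delimited JSON events into a columnar Parquet file using pandas and pyarrow. The file names are hypothetical, and this isn't our actual pipeline (we do this at much larger scale in Hadoop); it just shows the shape of the JSON-in-flight, Parquet-at-rest split.

import pandas as pd

# Read newline-delimited JSON events (one JSON object per line).
events = pd.read_json("events.ndjson", lines=True)

# Write them back out as a columnar Parquet file for analytics at rest.
events.to_parquet("events.parquet", engine="pyarrow")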
On Fri, Nov 22, 2019 at 10:14 PM Roy Smith roy@panix.com wrote:
I'm starting to look at some machine learning projects I've wanted to do for a while (e.g. sock-puppet detection). This quickly leads to having to make decisions about data storage formats, e.g. CSV, JSON, protobufs, etc. Left to my own devices, I'd probably use protos, but I don't want to be swimming upstream.
Are there any standards in wiki-land for how people store data? If there's some common way that "everybody does it", that's how I want to do it too. Or does every project just do its own thing?