Daniel, thanks, inline:
The structure looks sane and future-proof to me, but since it's all-in-one-blob, it'll be hard to scale it to more than a few tens of thousands of lines or so. I like this model, but if you want to go beyond that (DO we want to go beyond that?!) you will need a different approach, which may be incompatible.
We do *eventually* want to go beyond that, towards large data. We had this discussion with Brion, see here:
* https://phabricator.wikimedia.org/T120452#2224764
I do not think my approach blocks larger datasets: we can later add a simple SQL-like interface capable of reading both from these pages and from a large backend database. The 2 MB page limit will keep any single page's data from growing too large. Also, larger datasets are a different target, one we should approach when we are ready.
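To illustrate, a minimal sketch of what such a read-only, SQL-like interface over a page-stored JSON blob could look like. The `fields`/`rows` layout and the `select` helper here are hypothetical, not part of the proposal:

```python
import json

# Hypothetical page blob: "fields" describe columns, "rows" hold the data.
PAGE_JSON = """
{
  "fields": [{"name": "city", "type": "string"},
             {"name": "population", "type": "number"}],
  "rows": [["Oslo", 634293], ["Bergen", 271949]]
}
"""

def select(page_json, where):
    """Yield rows matching a predicate -- a SELECT ... WHERE over one page."""
    data = json.loads(page_json)
    names = [f["name"] for f in data["fields"]]
    for row in data["rows"]:
        record = dict(zip(names, row))
        if where(record):
            yield record

big = list(select(PAGE_JSON, lambda r: r["population"] > 300000))
# → [{'city': 'Oslo', 'population': 634293}]
```

The same `select` signature could later be backed by a real database for large datasets, which is why I don't see the page-blob format as a dead end.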
One thing that should be specified very rigorously from the start is the supported data types, along with their exact syntax and semantics. Your example has string, number, boolean, and localized. So:
- what's the length limit for string?
Good question. Do you have a limit for Wikidata labels and other string values?
- what's the range and precision of number? Is it the same as for JSON?
For now, the same as JSON.
- does boolean only accept JSON primitives, or also strings?
true/false only, no strings
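To pin these answers down, a sketch of strict per-cell validation under the semantics described above. The type names follow the example in the proposal; the function itself is illustrative, not the actual implementation:

```python
def is_valid(value, declared_type):
    """Validate a cell value against its declared column type.

    - string: any JSON string (length limit still to be decided)
    - number: JSON numbers, i.e. int or float (bool is excluded, since
      bool is a subclass of int in Python)
    - boolean: the JSON primitives true/false only, never "true"/"false"
    - localized: an object mapping language codes to strings
    """
    if declared_type == "string":
        return isinstance(value, str)
    if declared_type == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if declared_type == "boolean":
        return isinstance(value, bool)
    if declared_type == "localized":
        return (isinstance(value, dict) and
                all(isinstance(k, str) and isinstance(v, str)
                    for k, v in value.items()))
    return False
```

Note the boolean case: `is_valid(True, "boolean")` passes, while `is_valid("true", "boolean")` is rejected, matching the "no strings" answer above.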
- what language codes are valid for localized? Is language fallback
applied for display?
Same rules as for wiki language codes (but without validation against the actual list). Automatic fallback is already implemented using the Language class. If everything else fails and there is no English, it takes an arbitrary first value (unlike the Language class, which stops at English and fails otherwise).
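The lookup order described above can be sketched as follows. The fallback chains here are illustrative stand-ins; the real code delegates to MediaWiki's Language class:

```python
# Illustrative fallback chains; MediaWiki's Language class provides the real ones.
FALLBACKS = {
    "nb": ["no", "nn"],
    "frc": ["fr"],
}

def pick_localized(values, lang):
    """Pick the best string from a {language code: text} map.

    Order: exact match, then the language's fallback chain, then English.
    Unlike the Language class, which stops at English and fails, this falls
    back to an arbitrary first value if even English is missing.
    """
    for code in [lang, *FALLBACKS.get(lang, []), "en"]:
        if code in values:
            return values[code]
    return next(iter(values.values()), None)
```

So `pick_localized({"de": "Katze"}, "fr")` returns `"Katze"` rather than failing, which is the behaviour difference from the Language class noted above.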
You write in your proposal: "Hard to define types like Wikidata ID, datetime, and URL could be stored as a string until we can reuse Wikidata's type system". Well, what's keeping you from using it now? DataValue and friends are standalone Composer modules; you can find them on GitHub.
I was told by the Wikidata team at the Jerusalem hackathon that the JavaScript code is too entangled, and I won't be able to reuse it for non-Wikidata purposes. I will be very happy to adapt it if that turns out to be possible. Still, I do not think this is a requirement for the first release.