We have had some good feedback on the new shared tabular data feature, and we are getting ready to deploy it in production. It would be amazing if you could give it a final look-over to see if there are any blockers left.
The first stage will be to enable Data:*.tab pages on Commons, and to allow all other wikis direct access to them via Lua code and the Graph extension. All data at this point must be licensed under CC0. More licensing options are still under discussion and can easily be added later.
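To give a sense of the Lua side, here is a minimal sketch of a Scribunto module that renders such a page as a wikitable. It assumes the accessor ends up exposed as mw.ext.data.get; the page name "Sample.tab" and its fields are invented for illustration:

    -- Scribunto module sketch: render a shared Data: page as a wikitable.
    -- Assumes the JsonConfig Lua binding is mw.ext.data.get; the page name
    -- and its fields are placeholders.
    local p = {}

    function p.render(frame)
        local tab = mw.ext.data.get('Sample.tab')  -- fetched from Commons
        local out = { '{| class="wikitable"' }
        -- Header row from the declared field names.
        local header = {}
        for _, field in ipairs(tab.schema.fields) do
            table.insert(header, field.name)
        end
        table.insert(out, '! ' .. table.concat(header, ' !! '))
        -- One wikitable row per data row.
        for _, row in ipairs(tab.data) do
            local cells = {}
            for _, value in ipairs(row) do
                table.insert(cells, tostring(value))
            end
            table.insert(out, '|-')
            table.insert(out, '| ' .. table.concat(cells, ' || '))
        end
        table.insert(out, '|}')
        return table.concat(out, '\n')
    end

    return p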
In line with the "release early, release often", we will not have any elaborate data editing interface beyond the raw JSON code editor for the first release. Our initial target audience is the more experienced users who will evaluate and test the new technology. Once the underlying tech is stable and prooven, we will work on making it more accessible to the general audience.
Links:
- Task: https://phabricator.wikimedia.org/T134426
- Demo: http://data.wmflabs.org
- Technical: https://www.mediawiki.org/wiki/Extension:JsonConfig/Tabular
- Discussion: https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Tabular_da...
- Facebook: https://www.facebook.com/groups/wikipediaweekly/permalink/997545366959961/
Am 04.06.2016 um 18:47 schrieb Yuri Astrakhan:
In line with the "release early, release often", we will not have any elaborate data editing interface beyond the raw JSON code editor for the first release.
A word of caution about this strategy: this is great for user-facing things, but it really sucks if you are creating artifacts, such as page revisions. You will have to stay compatible with your very first data format, and the second, and the third, etc., forever. Similarly, once you have an ecosystem of tools that rely on your API and data model, changing it becomes rather troublesome.
So, for anything that is supposed to offer a stable API, or creates persistent data, "release early, release often" is not a good strategy in my experience. A lot of pain lies this way. Remember: wikitext syntax was once a "let's just make it work, we will fix it later" hack...
-- daniel
Daniel, I agree about the data/api versioning. I was mostly talking about features and capabilities. For example, we could spend the next year developing a visual table editor, implement support for unlimited table sizes, provide import/export from other table formats, introduce elaborate schema validation, and many other cool features. And after that year realize that users don't need this whole thing at all, or need something similar but very different. Or we could release one small, well defined, stable subset of that functionality, get feedback, and move forward.
Do you have any thoughts about the proposed data structure?
On Mon, Jun 6, 2016 at 6:40 AM, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
Daniel, I agree about the data/api versioning. I was mostly talking about features and capabilities. For example, we could spend the next year developing a visual table editor, implement support for unlimited table sizes, provide import/export from other table formats, introduce elaborate schema validation, and many other cool features. And after that year realize that users don't need this whole thing at all, or need something similar but very different. Or we could release one small, well defined, stable subset of that functionality, get feedback, and move forward.
Hi Yuri,
I think one thing that would be helpful for me (and I suspect many people who want to help) is some more specifics about this statement from your original email: "We have had some good feedback on the new shared tabular data feature, and we are getting ready to deploy it in production." Which "we" are you referring to, and by "getting ready to deploy it in production", does that mean it's about to be usable, such that someone could upload gigabytes of production data in this format to Commons by the end of the week? Is there a more measured plan published somewhere?
This all sounds very cool, but also an area where we could accidentally accrue a crushing load of technical debt without fully realizing it (per Daniel's comment). I'll confess to being ignorant on everything that's been going on, and I'm wondering now how desperately I should study your documentation to make up for it (and how important it is to drop other work to make time for this).
Rob
Rob, thanks for your offer to help! Always welcome :)
By discussion and positive feedback I meant Facebook and Commons comments, and a very old and elaborate phab ticket discussion:
- https://phabricator.wikimedia.org/T120452
- https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Tabular_da...
- https://www.facebook.com/groups/wikipediaweekly/permalink/997545366959961/
I do not think this feature will have a huge uptake immediately. It will simplify graph design, as it will be possible to store data outside of the graph. It will also allow data from tables and lists to be moved into separate wiki pages, though the work of moving existing data into these pages might not go very fast. In short, the data will only be accessible from Lua and graphs, each page will be limited to 2 MB, and editing the raw JSON will require some technical skill until better tools are created.
Am 06.06.2016 um 15:40 schrieb Yuri Astrakhan:
Do you have any thoughts about the proposed data structure?
The structure looks sane and future-proof to me, but since it's all in one blob, it'll be hard to scale it to more than a few tens of thousands of lines or so. I like this model, but if you want to go beyond that (DO we want to go beyond that?!) you will need a different approach, which may be incompatible.
One thing that should be specified very rigorously from the start is the supported data types, along with their exact syntax and semantics. Your example has string, number, boolean, and localized. So:

- what's the length limit for string?
- what's the range and precision of number? Is it the same as for JSON?
- does boolean only accept JSON primitives, or also strings?
- what language codes are valid for localized? Is language fallback applied for display?
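To make these questions concrete, here is a hypothetical dataset using all four types, written as the Lua table a consumer would receive (the JSON page body has the same shape; every name and value here is invented):

    -- Hypothetical dataset exercising the four proposed types.
    local dataset = {
        schema = {
            fields = {
                { name = 'city',    type = 'string'    },  -- max length? bytes or characters?
                { name = 'pop',     type = 'number'    },  -- IEEE 754 double, like JSON?
                { name = 'capital', type = 'boolean'   },  -- true/false only, or "true" as well?
                { name = 'label',   type = 'localized' },  -- which language codes are valid?
            },
        },
        data = {
            -- A 'localized' cell carries one string per language code.
            { 'Berlin', 3520031, true,  { en = 'Berlin', de = 'Berlin' } },
            { 'Bern',   140634,  false, { en = 'Bern',   de = 'Bern'   } },
        },
    }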
Not answering these questions now may leave us with data that can no longer be properly interpreted later. If you get into quantities with precision, or dates, this becomes a lot more fun. In that case, you would want to re-use the DataValues module(s) that Wikidata uses.
You write in your proposal: "Hard to define types like Wikidata ID, datetime, and URL could be stored as a string until we can reuse Wikidata's type system". Well, what's keeping you from using it now? DataValues and friends are standalone Composer modules; you can find them on GitHub.
-- daniel
Daniel, thanks, inline:
The structure looks sane and future-proof to me, but since it's all in one blob, it'll be hard to scale it to more than a few tens of thousands of lines or so. I like this model, but if you want to go beyond that (DO we want to go beyond that?!) you will need a different approach, which may be incompatible.
We do *eventually* want to go beyond that towards large data. We had this discussion with Brion; see:
- https://phabricator.wikimedia.org/T120452#2224764
I do not think my approach is a blocker for larger datasets, because one could later add a simple SQL-like interface capable of reading data both from these pages and from large backend databases. The 2 MB page limit will prevent page data from growing too large. Also, larger datasets are a different target that we should approach when we are ready.
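For example, simple "SQL-like" selection over a fetched dataset is already just a few lines of Lua, with no query backend needed at this scale (reusing the dataset shape you sketched above; all names are illustrative):

    -- "SELECT * WHERE pop > 1000000", done in plain Lua over an
    -- in-memory dataset.
    local function selectWhere(tab, colName, pred)
        -- Resolve the column index from the schema.
        local idx
        for i, field in ipairs(tab.schema.fields) do
            if field.name == colName then
                idx = i
                break
            end
        end
        -- Keep every row whose cell satisfies the predicate.
        local rows = {}
        for _, row in ipairs(tab.data) do
            if idx and pred(row[idx]) then
                table.insert(rows, row)
            end
        end
        return rows
    end

    -- Usage: cities with population above one million.
    local big = selectWhere(dataset, 'pop', function(v) return v > 1000000 end)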
One thing that should be specified very rigorously from the start is the supported data types, along with their exact syntax and semantics. Your example has string, number, boolean, and localized. So:
- what's the length limit for string?
Good question. Do you have a limit for Wikidata labels and other string values?
- what's the range and precision of number? Is it the same as for JSON?
For now, same as JSON.
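If that ends up meaning IEEE 754 doubles (as in JavaScript, and in Lua itself), integers above 2^53 silently lose precision:

    print(2^53 == 2^53 + 1)  -- true: both values round to the same double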
- does boolean only accept JSON primitives, or also strings?
true/false only, no strings
- what language codes are valid for localized? Is language fallback applied for display?
Same rules as for wiki language codes (but without validation against the actual list). Automatic fallback is already implemented, using the Language class. If everything else fails and there is no English value, it takes the first available one (unlike Language, which stops at English and fails otherwise).
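The lookup behaves roughly like this sketch (behaviour only; the real code goes through the Language class):

    -- Fallback for a 'localized' cell such as { en = 'Colour', de = 'Farbe' }:
    -- walk the wiki's fallback chain, then English, then take whatever exists.
    local function pickLocalized(value, fallbackChain)
        for _, code in ipairs(fallbackChain) do  -- e.g. { 'de-at', 'de' }
            if value[code] then
                return value[code]
            end
        end
        if value.en then
            return value.en
        end
        local _, any = next(value)  -- no English either: take the first entry,
        return any                  -- where the Language class would give up
    end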
You write in your proposal: "Hard to define types like Wikidata ID, datetime, and URL could be stored as a string until we can reuse Wikidata's type system". Well, what's keeping you from using it now? DataValues and friends are standalone Composer modules; you can find them on GitHub.
I was told by the Wikidata team at the Jerusalem hackathon that the JavaScript code is too entangled, and that I won't be able to reuse it for non-Wikidata purposes. I will be very happy to adapt it if possible. Still, I do not think this is a requirement for the first release.
Let's revive this thread for this week's ArchCom RFC meeting. I'll doll up a more formal announcement as I finish cleaning up some of our notes documents, but for now, the short version:

URL: https://phabricator.wikimedia.org/E213
Time: 2016-06-15, Wednesday 21:00 UTC (2pm PDT, 23:00 CEST)
Location: #wikimedia-office IRC channel
Rob