Awesome. Dmitry, I shall make you explain "How I wrote a new RESTBase service" :)

Will this replace or subsume http://www.mediawiki.org/wiki/Extension:TextExtracts ? Will clients be able to request first paragraph, first 3 sentences, etc.?

In the case of entities existing within a paragraph, we can decide (with a little help from Design) whether it's important to keep them inline with the text, or strip and move them outside of the paragraph.

Will clients be able to request different kinds of stripping? It seems really hard. If you look at the Vincent van Gogh article, its opening sentence is

Vincent Willem van Gogh (Dutch: [ˈvɪnsɛnt ˈʋɪləm vɑn ˈɣɔx] ( listen);[note 1] 30 March 1853 – 29 July 1890) was a major Post-Impressionist painter.

and it seems every client shows this differently:

Google search results snippet displays

Vincent Willem van Gogh (Dutch: [ˈvɪnsɛnt ˈʋɪləm vɑn ˈɣɔx] ( listen); 30 March 1853 – 29 July 1890) was a major ...

Clearly " (listen)" shouldn't be there. Meanwhile, the Wikipedia Beta Android app and Google's Knowledge Graph box remove everything in parentheses (how?) and show two sentences:

Vincent Willem van Gogh was a major Post-Impressionist painter. He was a Dutch artist whose work had a far-reaching influence on 20th-century art. Wikipedia

But the Wikipedia Beta Android app's Share as image gives me:
[attached image: Vincent_van_Gogh.jpg]
(I filed https://phabricator.wikimedia.org/T102208 ).

It looks like the mobile view service http://appservice.wmflabs.org/en.wikipedia.org/v1/mobile/app/page/lite/Vincent_van_Gogh also renders the full HTML, including the "listen" speaker icon.

There's no single correct form for this snippet. It can't be decomposed into separate bits of JSON, and the pronunciation isn't cleanly nested in the HTML, so clients can't easily remove the right parts of it. The mobile view service could have an ill-defined "Do the right thing" API, or implement a lot of named transform styles, or have some kind of domain-specific language 8-), or always return structured Parsoid HTML that clients strip, or ??
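To make the "remove everything in parentheses" style concrete, here is a minimal sketch of what one such named transform might look like. It is only a guess at the behaviour, not what the Android app or Google actually does: it works on plain text rather than HTML, the function name strip_parentheticals is made up for illustration, and it would also eat parentheses that an article legitimately needs.

```python
import re


def strip_parentheticals(text: str) -> str:
    """Drop every parenthesized span, including nested ones (hypothetical transform)."""
    kept = []
    depth = 0  # current nesting level of "(" ... ")"
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    # Collapse the doubled spaces left behind and remove any space before punctuation.
    cleaned = re.sub(r"\s+", " ", "".join(kept))
    return re.sub(r"\s+([.,;:])", r"\1", cleaned).strip()


lead = ("Vincent Willem van Gogh (Dutch: [ˈvɪnsɛnt ˈʋɪləm vɑn ˈɣɔx] ( listen);"
        "[note 1] 30 March 1853 – 29 July 1890) was a major Post-Impressionist painter.")
print(strip_parentheticals(lead))
# Vincent Willem van Gogh was a major Post-Impressionist painter.
```

Counting depth rather than using a single regex is what lets the nested "( listen)" disappear along with its enclosing parenthetical.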

Cheers,

- - - - originals to end - - - - -

On Jun 11, 2015 10:05 AM, "Dmitry Brant" <dbrant@wikimedia.org> wrote:
Yes, we should definitely do both, keeping in mind that the JSON-only service will be much more important for apps in the long run.
The part that worries me a little bit is not knowing when exactly these services can be deployed to production at full scale. Since so many of our brainstorming ideas for Q1 and beyond are dependent on these services, we should have a concrete time frame for this.

The JSON service basically already exists[1] (in its infancy), and experimenting with changes to the output JSON structure is absurdly easy.
I would suggest that we take an inventory[2] of all the non-text entities that one might find in articles (infoboxes, tables, references, images, math formulas, etc), and update the service to structure them as JSON. Then we'll be free to decide how we want to present these entities natively in the apps.

In the case of entities existing within a paragraph, we can decide (with a little help from Design) whether it's important to keep them inline with the text, or strip and move them outside of the paragraph.  For entities that are important to keep inline, we can still strip them out and restructure them as JSON, but also replace the inline occurrence with a syntax marker that the apps will recognize, and decide how to handle natively.
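As a rough illustration of the marker idea, the sketch below pulls inline images and reference superscripts out of a paragraph's HTML, replaces each with a placeholder, and returns the text plus a list of entities that can be serialized as JSON. Everything specific here is an assumption for the sake of the example: the [[entity:N]] marker syntax, the field names, the choice of entity types, and the class name InlineEntityExtractor are not taken from the existing service. It also drops all other HTML formatting, which is exactly the open question about formatting loss.

```python
import json
from html.parser import HTMLParser


class InlineEntityExtractor(HTMLParser):
    """Replace inline <img> and <sup class="reference"> with markers (hypothetical format)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []        # plain-text pieces, with markers in place of entities
        self.entities = []    # extracted entities, in document order
        self._capture = None  # reference entity whose label is being collected

    def _add(self, kind, **fields):
        entity = {"id": len(self.entities), "type": kind, **fields}
        self.entities.append(entity)
        self.text.append(f"[[entity:{entity['id']}]]")  # assumed marker syntax
        return entity

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self._add("image", src=attrs.get("src"))
        elif tag == "sup" and attrs.get("class") == "reference":
            self._capture = self._add("reference", label="")

    def handle_endtag(self, tag):
        if tag == "sup":
            self._capture = None

    def handle_data(self, data):
        if self._capture is not None:
            self._capture["label"] += data  # text inside the reference, e.g. "[1]"
        else:
            self.text.append(data)


def extract(html: str) -> dict:
    parser = InlineEntityExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return {"text": "".join(parser.text), "entities": parser.entities}


result = extract('A wombat<sup class="reference">[1]</sup> is <img src="w.jpg"> cute.')
print(json.dumps(result, indent=2))
```

An app would then walk the text, find each marker, and look up the matching entity by id to decide how to render it natively.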

Whether to preserve HTML formatting might also be a question for design / UX research. However, at least on Android, the native TextView does support some limited HTML tags, and we can do additional formatting with Spans if necessary.


[1] http://appservice.wmflabs.org/en.wikipedia.org/v1/mobile/app/page/lite/Wombat
[2] https://etherpad.wikimedia.org/p/json-content-service-structure


On Thu, Jun 11, 2015 at 11:28 AM, Corey Floyd <cfloyd@wikimedia.org> wrote:

Mostly apps have been talking about this, but I think it would be good to get web folks involved as well.

We have a lot of ideas, and this is at the top of the list of things we need in order to accomplish our potential goals for the quarter. It also seems there are at least 2 ideas for how we should do this:

1. Deploy a service backed by Parsoid that delivers better marked-up HTML than mobile View.
2. Deploy a service that converts HTML to JSON and delivers that instead.

My suggestion would be to do both… deliver better HTML first (this should be "easier"), then build another service upon that to serve JSON.

The JSON spec for an article needs some discussion, but I think it will be pretty easy to settle on.

A couple of questions we will have to solve:
- Should text still be marked up in HTML? If not, what about formatting loss? 
- Entities like images that are between paragraphs are easy to handle, but what about entities within paragraphs? 


Any other thoughts here?