CommonsMetadata API returning HTML?


Dan Garry

8 Dec 2014

8:29 p.m.

Greetings, Multimedia Team!

*Background:* The Mobile Apps Team is working on a restyling of the way the first fold of content is presented in the Wikipedia app. You can see this image http://i.imgur.com/dxqfJKd.png to see what this looks like. Having a high-resolution image so prominently at the top of the page will likely drive a lot of clicks, so we're working on a lightweight image viewer to deal with file pages, which are poorly styled monstrosities on the mobile app. We're going to use the CommonsMetadata API to help us out. :-)

*Problem:* The CommonsMetadata API can sometimes return HTML [1]. Having HTML in the API response is a bit problematic for us. Native apps make next to no use of HTML when creating links or layouts, so we have to strip the HTML from every API response, lest it be displayed as plaintext to the user. In the short term this is fine: we can strip it and throw the information away. But in the long run it'd be better if the API didn't return HTML.

*Our ask:* Can the CommonsMetadata API please not return HTML in its responses? :-)

Thanks, Dan

[1]: Run this query <https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&format=xml&iiprop=extmetadata&iilimit=10&titles=File%3ACommon%20Kingfisher%20Alcedo%20atthis.jpg> and look at the "artist" key. The API response has an HTML link in it.
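For readers following along, here is a minimal sketch of how a client might spot the markup described above. The response fragment is a made-up sample shaped like an `imageinfo`/`extmetadata` result in JSON form; the page ID, file title, artist name, link target, and the helper names `contains_html` and `fields_with_html` are all assumptions for illustration, not output copied from the API.

```python
import re

# Hypothetical response fragment (assumption): real responses carry many
# more fields, and the values here are invented for the example.
SAMPLE_RESPONSE = {
    "query": {"pages": {"12345": {
        "title": "File:Example.jpg",
        "imageinfo": [{"extmetadata": {
            "Artist": {"value": '<a href="//commons.wikimedia.org/wiki/User:Example">Example</a>'},
            "LicenseShortName": {"value": "CC BY-SA 3.0"},
        }}],
    }}}
}

# Matches something that looks like an opening or closing tag.
TAG_RE = re.compile(r"<[a-zA-Z/][^>]*>")


def contains_html(value):
    """Return True if the string appears to contain an HTML tag."""
    return bool(TAG_RE.search(value))


def fields_with_html(response):
    """List the extmetadata keys whose value appears to contain markup."""
    found = []
    for page in response["query"]["pages"].values():
        for info in page.get("imageinfo", []):
            for key, field in info.get("extmetadata", {}).items():
                if contains_html(str(field.get("value", ""))):
                    found.append(key)
    return found


print(fields_with_html(SAMPLE_RESPONSE))
```

The regular expression is only a cheap detector; it is not meant to parse or clean the markup.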

-- Dan Garry Associate Product Manager, Mobile Apps Wikimedia Foundation


Dan Garry

8 Dec 2014

8:30 p.m.

Sorry, the example query I provided was incorrect. Use this instead: https://en.wikipedia.org/w/api.php?action=query&prop=imageinfo&forma...

Thanks, Dan

On 8 December 2014 at 11:29, Dan Garry dgarry@wikimedia.org wrote:


[1]: Run this query https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&format=xml&iiprop=extmetadata&iilimit=10&titles=File%3ACommon%20Kingfisher%20Alcedo%20atthis.jpg and look at the "artist" key. The API response has an HTML link in it.


Derk-Jan Hartman

9:52 p.m.

New subject: [WikimediaMobile] CommonsMetadata API returning HTML?

Welcome to the problem of 'there is no structured metadata for files' :)

This is a garbage in, garbage out problem and probably when you start filtering you will break attribution requirements (more than the community will appreciate).


Mobile-l mailing list Mobile-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mobile-l

Dan Garry

11:04 p.m.


On 8 December 2014 at 12:52, Derk-Jan Hartman d.j.hartman@gmail.com wrote:

...

Welcome to the problem of 'there is no structured metadata for files' :)

This is a garbage in, garbage out problem and probably when you start filtering you will break attribution requirements (more than the community will appreciate).

I figured. :-(

So, given that we can't do anything meaningful with the HTML in a native app, that means we only have three options:

- Display the raw HTML directly to the user
- Try to parse the HTML for interesting information and update the relevant view's properties using native code
- Strip any and all HTML tags that are given to us in the JSON

The first two aren't sounding workable at all to me; the first is unworkable from a product standpoint, and the second is an absolutely gigantic can of worms. So I guess we'll be stripping the HTML until such time that this is fixed. :-)
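As a rough illustration of the third option, here is a platform-neutral sketch in Python using the standard library's event-driven `html.parser`. The class and function names (`TagStripper`, `strip_tags`) are hypothetical, and the sample strings are invented; the apps themselves would use their own platform facilities.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text content and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with all tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts)


print(strip_tags('<a href="//example.org/wiki/User:Example">Example</a>'))
```

Note that this throws away the link target along with the tag, which is exactly the attribution concern raised earlier in the thread.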

Thanks, Dan


Jon Robson

9 Dec 2014

midnight


It would actually be great to get this problem fixed rather than add yet more band-aids to it. What can we actually do to start moving towards structured metadata on files? What needs to happen? Can we lean on Wikidata in any way?


-- Jon Robson * http://jonrobson.me.uk * https://www.facebook.com/jonrobson * @rakugojon

Federico Leva (Nemo)

10:51 a.m.


Dan Garry, 08/12/2014 23:04:

...

So, given that we can't do anything meaningful with the HTML in a native app, that means we only have three options:

Display the raw HTML directly to the user

Try to parse the HTML for interesting information and update the relevant view's properties using native code

Strip any and all HTML tags that are given to us in the JSON

There is a simpler option: don't display a file if you can't attribute it.

Nemo

Gergo Tisza

12:03 a.m.

Hi Dan!

On Mon, Dec 8, 2014 at 11:29 AM, Dan Garry dgarry@wikimedia.org wrote:

...

*Background:* The Mobile Apps Team is working on a restyling of the way the first fold of content is presented in the Wikipedia app. You can see this image http://i.imgur.com/dxqfJKd.png to see what this looks like.

That looks awesome, can't wait to see it live! Any chance of something like this eventually hitting the desktop site? :-)

Having a high-resolution image so prominently at the top of the page will likely drive a lot of clicks, so we're working on a lightweight image viewer to deal with file pages, which are poorly styled monstrosities on the mobile app. We're going to use the CommonsMetadata API to help us out. :-)

Keep in mind that there is no guarantee the API output is an accurate representation of the file page (lack of machine-readable template markup etc. - for example, CommonsMetadata can't figure out the license name for about 5% of the MediaViewer pageviews), so you'll still need a link to the raw file page somewhere.

*Problem:* The CommonsMetadata API can sometimes return HTML [1]. Having HTML in the API response is a bit problematic for us. Native apps make next to no use of HTML when creating links or layouts, so we have to strip the HTML from every API response, lest it be displayed as plaintext to the user. In the short term this is fine, we can strip it and throw the information away. But in the long run it'd be better if the API didn't return HTML.

In the long run CommonsMetadata should die in a fire, together with the Commons paradigm of storing information in license parameters. You can see the related plans at Commons:Structured data https://commons.wikimedia.org/wiki/Commons:Structured_data; these include migrating most information to plaintext (file descriptions will probably remain rich text).

In the not so long run, some HTML markup is fairly important. Links can be necessary for the attribution, paragraphs for making long descriptions more readable; removing lists and tables makes some descriptions unreadable (map legends tend to use tables, for example). So I think the API would be much less useful if it started stripping HTML. (It does that already in a few cases where the intent is clear, such as stripping the enclosing <p> generated by MediaWiki, or stripping certain kinds of purely presentational markup such as creator templates https://commons.wikimedia.org/wiki/Template:Creator, but that only works when the source and intent of the markup are known.)

We could add an API parameter to provide a plaintext version, but that would split the cache (both varnish and memcached). Not a huge deal, but tag stripping is very easy, so if you don't need anything more specific than that, I would say it is simpler to do it on the client side. If more complex logic is needed (e.g. turning <ul>s into star lists), it makes sense to do that in the API instead of forcing each client to reimplement it, but I am not sure how generic such a text representation would be.
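To make the "star list" idea concrete, here is a small Python sketch of what such a plaintext conversion could look like. It is only one possible text representation, not what the API does or would do; the names `PlainTextRenderer` and `to_plain_text`, the set of tags treated as block-level, and the sample markup are all assumptions, and tables are not handled.

```python
from html.parser import HTMLParser


class PlainTextRenderer(HTMLParser):
    """Turns list items into '* ' lines and block tags into line breaks."""

    # Tags treated as line breaks (an illustrative choice, not exhaustive).
    BLOCK_TAGS = {"p", "ul", "ol", "br", "div"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.out.append("\n* ")
        elif tag in self.BLOCK_TAGS:
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.out.append("\n")

    def handle_data(self, data):
        self.out.append(data)


def to_plain_text(html):
    """Render an HTML fragment as plain text with star-style lists."""
    renderer = PlainTextRenderer()
    renderer.feed(html)
    renderer.close()
    text = "".join(renderer.out)
    # Collapse runs of whitespace on each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


print(to_plain_text(
    "<p>Legend:</p><ul><li>Red: roads</li><li>Blue: <b>rivers</b></li></ul>"
))
```

Even this small sketch has to make choices (which tags break lines, whether blank lines between paragraphs survive), which illustrates why a single generic text representation is hard to pin down.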

So, given that we can't do anything meaningful with the HTML in a native app, that means we only have three options:

- Display the raw HTML directly to the user
- Try to parse the HTML for interesting information and update the relevant view's properties using native code
- Strip any and all HTML tags that are given to us in the JSON

The first two aren't sounding workable at all to me; the first is unworkable from a product standpoint, and the second is an absolutely gigantic can of worms. So I guess we'll be stripping the HTML until such time that this is fixed. :-)

I'm not sure some limited HTML parsing is that bad. The low-hanging fruit is links (MediaViewer currently strips everything else, and most of the time that works decently), and those are never nested, so they can be processed by a trivial SAX parser, for which all platforms surely have libraries.
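A sketch of the kind of limited parsing meant here, written in Python with the standard library's event-driven `html.parser` standing in for a SAX-style parser. It keeps the plain text and records where each (non-nested) link sits in it, which is roughly what a native label would need. The names `LinkExtractor` and `extract_links` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Builds plain text plus (start, end, href) spans for <a> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = ""
        self.links = []    # (start offset, end offset, href)
        self._open = None  # (start offset, href) of the <a> being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = (len(self.text), dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            start, href = self._open
            if href:
                self.links.append((start, len(self.text), href))
            self._open = None

    def handle_data(self, data):
        # Text from every other tag is kept; the tags themselves are dropped.
        self.text += data


def extract_links(html):
    """Return (plain_text, links) for an HTML fragment."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.text, parser.links


text, links = extract_links(
    'Photo by <a href="https://example.org/u">Jane Doe</a>, <i>2014</i>'
)
print(text, links)
```

The offsets index into the returned text, so a client could attach a tappable span to `text[start:end]` without rendering any HTML.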

Bernd Sitzmann

12:18 a.m.


On Android we could use Html.fromHtml() <http://developer.android.com/reference/android/text/Html.html#fromHtml(java...., android.text.Html.ImageGetter, android.text.Html.TagHandler)> to strip the HTML tags.

Bernd


Dan Garry

2:10 a.m.


Hey Gergo,

Responses in-line.

On 8 December 2014 at 15:03, Gergo Tisza gtisza@wikimedia.org wrote:

...

Hi Dan!

On Mon, Dec 8, 2014 at 11:29 AM, Dan Garry dgarry@wikimedia.org wrote:

...
*Background:* The Mobile Apps Team is working on a restyling of the way the first fold of content is presented in the Wikipedia app. You can see this image http://i.imgur.com/dxqfJKd.png to see what this looks like.

That looks awesome, can't wait to see it live! Any chance of something like this eventually hitting the desktop site? :-)

Hah, complicated question! I'd love to see that happen, but unfortunately it seems unlikely in the near future. :-(

Keep in mind that there is no guarantee the API output is an accurate representation of the file page (lack of machine-readable template markup etc. - for example, CommonsMetadata can't figure out the license name for about 5% of the MediaViewer pageviews), so you'll still need a link to the raw file page somewhere.

Fortunately, we knew this going in! We'll be dumping a link to the file page into the overflow menu. :-)

...

In the long run CommonsMetadata should die in a fire, together with the Commons paradigm of storing information in license parameters. You can see the related plans at Commons:Structured data https://commons.wikimedia.org/wiki/Commons:Structured_data; these include migrating most information to plaintext (file descriptions will probably remain rich text).

Yay! Looking forward to this. \o/

...

In the not so long run, some HTML markup is fairly important. Links can be necessary for the attribution, paragraphs for making long descriptions more readable; removing lists and tables makes some descriptions unreadable (map legends tend to use tables, for example). So I think the API would be much less useful if it started stripping HTML. (It does that already in a few cases where the intent is clear, such as stripping the enclosing <p> generated by MediaWiki, or stripping certain kinds of purely presentational markup such as creator templates https://commons.wikimedia.org/wiki/Template:Creator, but that only works when the source and intent of the markup is known.)

Given that this API is hopefully going to soon die a painful death, it probably just makes sense for us to strip the HTML ourselves rather than making you deal with that.

Unfortunately, tables are going to be an issue. On Android, we get some limited HTML parsing for free using the Html class [1], but the native TextView class doesn't support displaying tables. On iOS, it's worse, because we don't get *any* HTML parsing for free, and we actually have to strip the HTML manually too.

In the interests of keeping this simple, we'll probably be able to handle links on Android, but not on iOS. And tables will probably just be totally stripped.

Thanks for your help!

Dan

[1]: On Android, this apparently does the trick where the HTML only contains links:

    // Configure linkification first; the auto-link mask is applied when the text is set.
    textView.setAutoLinkMask(Linkify.WEB_URLS);
    textView.setLinksClickable(true);
    textView.setText(htmlStringWithLinks);


Monte Hurd

3:05 a.m.


"On iOS, it's worse, because we don't get any HTML parsing for free, and we actually have to strip the HTML manually too"

Oops, Dan I may have misspoken - on iOS we can strip html w/NSXMLParser which is SAX style. What we don't get for free is labels which can render html links like the android ones you showed me.


Dan Garry

3:26 a.m.


On 8 December 2014 at 18:05, Monte Hurd mhurd@wikimedia.org wrote:

...

"On iOS, it's worse, because we don't get *any* HTML parsing for free, and we actually have to strip the HTML manually too"

Oops, Dan I may have misspoken - on iOS we can strip html w/NSXMLParser which is SAX style. What we don't get for free is labels which can render html links like the android ones you showed me.

Okay, thanks for clarifying! Still, we will have to omit links for simplicity. :-)

Dan


Monte Hurd

7:06 a.m.


Ya could always do a UIWebView for the descriptions but that just seems icky :)


Dan Garry

7:17 a.m.


*Darth WebView:* Your native code is weak, old man.

*Objecti-Cee Kenobi:* You can't win, WebView. If you strike me down, I shall become more native than you could possibly imagine.

;-)

Dan


Monte Hurd

7:23 a.m.


Bahaha!!

