Re: [Multimedia] CommonsMetadata API returning HTML?

9 Dec 2014

Hi Dan!
On Mon, Dec 8, 2014 at 11:29 AM, Dan Garry dgarry@wikimedia.org wrote:
*Background:* The Mobile Apps Team is working on a restyling of the way
the first fold of content is presented in the Wikipedia app. You
can see this image http://i.imgur.com/dxqfJKd.png to see what this
looks like.
That looks awesome, can't wait to see it live! Any chance of something like
this eventually hitting the desktop site? :-)
Having a high-resolution image so prominently at the top of the page will
likely drive a lot of clicks, so we're working on a lightweight image
viewer to deal with file pages, which are poorly styled monstrosities on
the mobile app. We're going to use the CommonsMetadata API to help us out.
:-)
Keep in mind that there is no guarantee the API output is an accurate
representation of the file page (lack of machine-readable template markup
etc. - for example, CommonsMetadata can't figure out the license name for
about 5% of the MediaViewer pageviews), so you'll still need a link to the
raw file page somewhere.
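
A minimal sketch of that fallback, in Python for brevity. It assumes the metadata arrives as a dict of fields that each carry a `value` key, with the license name under `LicenseShortName`; the helper name `license_label`, the fallback wording and the sample data are all made up for illustration.

```python
def license_label(extmetadata, file_page_url):
    """Return the license name, or point at the file page if it is missing.

    `extmetadata` is assumed to map field names to {"value": ...} objects.
    """
    entry = extmetadata.get("LicenseShortName") or {}
    name = (entry.get("value") or "").strip()
    if name:
        return name
    # The API could not work out the license: send the user to the file page.
    return "See file page for license: " + file_page_url


# Hypothetical sample data, not real API output.
known = {"LicenseShortName": {"value": "CC BY-SA 3.0"}}
unknown = {}
page = "https://commons.wikimedia.org/wiki/File:Example.jpg"
print(license_label(known, page))
print(license_label(unknown, page))
```

Either way the client keeps the file page URL at hand, so the link can be shown even when every field parses correctly.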
*Problem:* The CommonsMetadata API can sometimes return HTML [1]. Having
HTML in the API response is a bit problematic for us. Native apps make next
to no use of HTML when creating links or layouts, so we have to strip the
HTML from every API response, lest it be displayed as plaintext to the
user. In the short term this is fine, we can strip it and throw the
information away. But in the long run it'd be better if the API didn't
return HTML.
In the long run CommonsMetadata should die in a fire, together with the
Commons paradigm of storing information in license parameters.
You can see the related plans at Commons:Structured data
https://commons.wikimedia.org/wiki/Commons:Structured_data; these include
migrating most information to plaintext (file descriptions will probably
remain rich text).
In the not so long run, some HTML markup is fairly important. Links can be
necessary for the attribution, paragraphs for making long descriptions more
readable; removing lists and tables makes some descriptions unreadable (map
legends tend to use tables, for example). So I think the API would be much
less useful if it started stripping HTML. (It does that already in a few
cases where the intent is clear, such as stripping the enclosing <p>
generated by MediaWiki, or stripping certain kinds of purely presentational
markup such as creator templates
https://commons.wikimedia.org/wiki/Template:Creator, but that only works
when the source and intent of the markup is known.)
We could add an API parameter to provide a plaintext version, but that
would split the cache (both Varnish and memcached). Not a huge deal, but
tag stripping is very easy, so if you don't need anything more specific
than that, I would say it is simpler to do it on the client side. If more
complex logic is needed (e.g. turning <ul>s into star lists), it makes
sense to do that in the API instead of forcing each client to reimplement
it, but I am not sure how generic such a text representation would be.
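
To give an idea of how little client-side code tag stripping takes, here is a rough sketch in Python using the standard library's `html.parser`. The names `TextExtractor` and `strip_tags` are mine, and the handling of `<li>`, `<p>` and `<br>` is just one possible text representation, not something the API does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps text content and drops every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # Turn list items into a plain-text star list.
            self.parts.append("\n* ")
        elif tag in ("p", "br"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


print(strip_tags('<a href="https://example.org">Jane Doe</a> / <b>CC BY-SA</b>'))
print(strip_tags("Legend:<ul><li>red: roads</li><li>blue: rivers</li></ul>"))
```

Tables are the hard case: flattening them this way keeps the words but loses the layout, which is part of why I doubt a generic text representation exists.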
So, given that we can't do anything meaningful with the HTML in a native
app, that means we only have three options:

1. Display the raw HTML directly to the user
2. Try to parse the HTML for interesting information and update the
   relevant view's properties using native code
3. Strip any and all HTML tags that are given to us in the JSON

The first two aren't sounding workable at all to me; the first is
unworkable from a product standpoint, and the second is an absolutely
gigantic can of worms. So I guess we'll be stripping the HTML until such
time that this is fixed. :-)
I'm not sure some limited HTML parsing is that bad. The low-hanging fruit
is links (MediaViewer currently strips everything else, and most of the
time that works decently), and those are never nested, so they can be
processed by a trivial SAX parser, for which all platforms surely have
libraries.
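
As a sketch of what I mean, here is the idea in Python, with the event-driven `html.parser` standing in for a SAX parser. It returns the plain text plus the character ranges that were links, which a native view could then make tappable. The names `LinkExtractor` and `extract_links` are hypothetical, and the code relies on the assumption above that links are never nested.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Strips all tags but remembers where each <a href> was in the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []     # text chunks seen so far
        self.length = 0    # total length of those chunks
        self.links = []    # (start, end, href) offsets into the plain text
        self._open = None  # (start, href) of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (self.length, href)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            start, href = self._open
            self.links.append((start, self.length, href))
            self._open = None

    def handle_data(self, data):
        self.text.append(data)
        self.length += len(data)


def extract_links(html):
    """Return (plain_text, [(start, end, href), ...]) for an HTML fragment."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.text), parser.links


text, links = extract_links(
    'Photo by <a href="https://example.org/u/jane">Jane Doe</a>, CC BY-SA'
)
for start, end, href in links:
    print(text[start:end], "->", href)
```

Everything that is not a link is simply dropped, which matches what MediaViewer does today.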
