With the Structured Data for Commons project about to move into high
gear, it seems to me that there's something the Wikidata community needs
to have a serious discussion about, before APIs start getting designed
and set in stone.
Specifically: when should an object have an item with its own Q-number
created for it on Wikidata? What are the limits? (Are there any limits?)
The position so far seems to have been, essentially, that a Wikidata item
is only created for an object when it either already has a fully-fledged
Wikipedia article written about it, or reasonably could have.
So objects that aren't particularly notable typically have not had
Wikidata items made for them.
Indeed, practically the first message Lydia sent me when I started
trying to work on Commons and Wikidata was to underline that
Wikidata items should generally not be created for individual Commons files.
But, if I'm reading the initial plans and API thoughts of the Multimedia
team correctly,
there seems to be a key assumption that, for any image that contains
information relating to something beyond the immediate photograph or
scan, there will be some kind of 'original work' item on main Wikidata
that the file page will be able to reference, such that the 'original
work' Wikidata item will be able to act as a place to locate any
information specifically relating to the original work.
Now in many ways this is a very clean division to be able to make. It
removes any question of having to judge "notability"; and it removes any
ambiguity or diversity of where information might be located -- if the
information relates to the original work, then it will be stored on
Wikidata.
But it would appear to imply a potentially *huge* widening of the
inclusion criteria for Wikidata, and increase in the number of Wikidata
items created.
So it seems appropriate that the Wikidata community should discuss and
sign off just what should and should not be considered appropriate,
before things get much further.
For example, a year ago the British Library released 1 million
illustrations from out-of-copyright books, which increasingly have been
uploaded to Commons. Recently the Internet Archive has announced plans
to release a further 12 million, with more images either already
uploading or to follow from other major repositories including eg the
NYPL, the Smithsonian, the Wellcome Foundation, etc, etc.
How many of these images, all scanned from old originals, are going to
need new Q-numbers for those originals? Is this okay? Or are some of
them too much?
For example, for maps (cf. the relevant data schema),
each map sheet will have separate Northernmost, Southernmost,
Easternmost, Westernmost bounding co-ordinates. Does that mean each map
sheet should have its own Wikidata item?
For book illustrations, perhaps it would be enough just to reference
the edition of the book. But if individual illustrations have their own
artist and engraver details, does that mean the illustration needs to
have its own Wikidata item? Similarly, if the same engraving has
appeared in many books, is that also a sign that it should have its own
item?
Similarly, what about old photographs or old postcards? When should
these have their own Wikidata item? If they have their own known
creator, and creation date, then is it most simple just to give them a
Wikidata item, so that such information about an original underlying
work is always looked for on Wikidata? What if multiple copies of the
same postcard or photograph are known, published or re-published at
different times? But the potential number of old postcards and
photographs, like the potential number of old engravings, is *huge*.
What if an engraving was re-issued in different "states" (eg a
re-issued engraving of a place might have been modified after a tower had
been built)? When should these get different items?
In the thread where I raised some of these issues a couple of weeks ago, there has
even been the suggestion that particular individual impressions of an
engraving might deserve their own separate items; or even everything
with a separate accession number, so if a museum had three copies of an
engraving, we would make three separate items, each carrying their own
accession number, identifying the particular copy it belonged to.
(See also other sections of that discussion for further relevant
thoughts on how to represent often quite complicated relations with
Wikidata properties.)
With enough items, we could re-create and represent essentially the
entire FRBR tree.
We could do this. We may even need to do this, if the Multimedia team's
outline for Commons is to be implemented in its apparent current form.
But it seems to me that we shouldn't just sleepwalk into it.
It does seem to me that this represents (at least potentially) a
*very* large expansion in the number of items, and a widening of the
inclusion criteria, for what Wikidata is going to encompass.
I'm not saying it isn't the right thing to do, but given the potential
scale of the implications, I do think it is something we do need to have
properly worked through as a community, and confirmed that it is indeed
what we *want* to do.
(Note that this is a slightly different discussion, though related, to
the one I raised a few weeks ago as to whether Commons categories -- eg
for particular sets of scans -- should necessarily have their own
Q-number on Wikidata. Or whether some -- eg some intersection
categories -- should just have an item on Commons data. But it's
clearly related: is the simplest thing just to put items for everything
on Wikidata? Or does one try to keep Wikidata lean, and no larger than
it absolutely needs to be; albeit then having to cope with the
complexity that some categories would have a Q-number, and some would not.)
A little more detail from the funnel analysis of UploadWizard (if you
haven't been following the other funnel thread,
<https://www.mediawiki.org/wiki/UploadWizard/Funnel_analysis> has a quick
summary).
*Users repeat the upload process many times*
The main thing I am trying to understand at this point is why people use
the "upload another file" button so much. UploadWizard allows uploading up
to 50 files at the same time, which should be more than enough for the
average user, but our click-tracking data shows that most people click
through the tutorial-file-deed-details-thanks screens, then click on the
upload more button (which effectively resets the process and starts again
from the file screen), then click through the screens again, then click on
the upload more button again, then do the same again, and again, and again.
(Doing this fifty times in a row is not uncommon.) This suggests some
fundamental failing in UW - Sage suggested it is the instability of
uploading more than a few files at the same time. I wonder if others have
other theories.
*Errors do not seem to be the main problem*
I have tried to identify the reason for failed UploadWizard sessions (a
series of UploadWizard events logged on the same page which are not
terminated by reaching the thanks page) by checking what the last event
was, and assuming that for failed sessions caused by errors, that error
would be the last event. Assuming this is sound, errors do not seem to be
the main problem - they only appear at the end of ~25% of the failed
sessions (which is ~8% of the total sessions).
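The last-event heuristic described above can be sketched as follows; the event tuples and the `error:` prefix are hypothetical stand-ins for the real EventLogging schema:

```python
from collections import Counter

def classify_sessions(events):
    """Group (session_id, event_name) tuples into sessions, in logged
    order, and classify each failed session (one that never reached the
    thanks page) by its last event -- the heuristic described above."""
    sessions = {}
    for session_id, event_name in events:
        sessions.setdefault(session_id, []).append(event_name)

    error_tail = Counter()
    failed = 0
    for steps in sessions.values():
        if "thanks" in steps:
            continue  # reached the thanks page: successful session
        failed += 1
        if steps[-1].startswith("error:"):
            error_tail[steps[-1]] += 1  # error was the last event
    return failed, error_tail

# Toy data: one success, one failure ending in an error, one silent dropoff.
events = [
    ("a", "tutorial"), ("a", "file"), ("a", "deeds"), ("a", "details"), ("a", "thanks"),
    ("b", "tutorial"), ("b", "file"), ("b", "error:badtoken"),
    ("c", "tutorial"), ("c", "file"),
]
failed, error_tail = classify_sessions(events)
print(failed, dict(error_tail))  # 2 failed sessions, 1 ending in an error
```

On this toy data only half of the failed sessions end in an error, which mirrors the ~25% figure above: most dropoffs are silent, so errors alone cannot explain them.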
That said, here is a list of error codes (these are mostly API error codes,
but a few are internal to UploadWizard) sorted by frequency, collected over
the logging period:

| error code           | count |
|----------------------|-------|
| filename             |    20 |
| badtoken             |    19 |
| missingresult        |    14 |
| title                |    13 |
| publishfailed        |    11 |
| stasherror           |     7 |
| server-error         |     3 |
| fileexists-forbidden |     2 |
| filetype-banned-type |     1 |
| unknown              |     1 |
| verification-error   |     1 |
| unknownerror         |     1 |
A little explanation about the more frequent ones:
- filename: these seem to be user errors - most often an invalid filetype
(doc, bmp etc), sometimes no extension at all, or trying to add the same
file twice
- badtoken: some sort of CSRF token expiration; bug 69691
- missingresult: returned by the upload API in the details step when the
uploaded file has gone missing; bug 43967
- title: an error about duplicate files (i.e. the same file already
exists on Commons) that somehow happens in the details step instead of the
file step
- publishfailed: this seems to be some sort of race condition: the first API
call to publish a file from stash puts it into the job queue and sets its
status to pending; a second call will then throw this error.
- stasherror: could be lots of things. bug 56302
<https://bugzilla.wikimedia.org/show_bug.cgi?id=56302>, bug 54028
<https://bugzilla.wikimedia.org/show_bug.cgi?id=54028> and more.
*Some suggestions based on the findings so far*
- review UX for "fatal user errors" (i.e. when UploadWizard says "you
can't upload this file type") - is the error message helpful?
- review and improve API error messages (api-error-*), possibly overriding
them with UW-specific ones. Do they identify next steps? Do they even
exist? (e.g. api-error-publishfailed does not.)
- renew token on badtoken error (bug 69691)
- make sure that the specific error message thrown by
ApiUpload::dieUsage gets logged somewhere. Currently we only log a generic
message derived from the API error code, so e.g. all the dozen different
UploadStashException subclasses are reported with the same message.
- poll for success on publishfailed error (despite what its name suggests,
it seems to actually mean something like "publish in progress")
- understand better why people repeat the upload process so often. This
might reveal serious UX deficiencies or functional errors (e.g. in an older
thread about funnel analysis, Sage claims uploading more than three files
at the same time is too unreliable for him).
- Investigate if there is a low-effort way to recover entered details
when the upload process has to be restarted. (There are drop-in solutions
like garlic.js <http://garlicjs.org/> or sisyphus.js
<https://github.com/simsalabim/sisyphus> but the very dynamic nature of
UW forms might be a problem.)
- figure out why some title errors are only reported in the details step
- log information about uploaded files to better identify size- or
filetype-specific issues
Bigger / longer-term effort:
- figure out a way to retry when the user already entered all the
details but publishing the file failed. (This points towards the
- make stashed / async uploads rely on the database instead of the
session (bug 43967 <https://bugzilla.wikimedia.org/show_bug.cgi?id=43967>)
We have recently added some funnel logging to UploadWizard. A nice
dashboard is in the works, but here are some preliminary results, showing
the number of virtual pageviews for each step of UploadWizard.
mysql:email@example.com [log]> select event_step,
count(*), count(*)/3623 as survival_rate from UploadWizardStep_8612364
group by event_step order by survival_rate desc;
| event_step | count(*) | survival_rate |
|------------|----------|---------------|
| tutorial   |     3623 |        1.0000 |
| file       |     3496 |        0.9649 |
| deeds      |     2433 |        0.6715 |
| details    |     2373 |        0.6550 |
| thanks     |     2109 |        0.5821 |
This is based on about a day's worth of logs (25.5 hours) - the logging
code was deployed to Commons yesterday.
The big drop is apparently in the file upload step (almost 30% - well over
1000 uploads a day). Some of that might be intentional (upload caught by
badtitle filter etc), but even so the drop is huge. Given that that step is
rather simple from a UX point of view, it seems that upload bugs are a
bigger problem right now than design issues.
(The license selection - deeds -> details - on the other hand is
unexpectedly unproblematic; I would have expected it to be the main source
of confusion, but actually adding description etc. seems worse.)
The next step would be to log JS/upload errors, I suppose.
Also, it would be nice to know which dropoffs are final and which are
reloads/restarts. The Navigation Timing API can tell apart reloads and
normal navigation, alternatively we could maybe group by IP + useragent +
time bucket to find retries.
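The IP + useragent + time-bucket grouping mentioned above could be sketched like this; the tuple layout and the one-hour bucket are assumptions for illustration, not the real log schema:

```python
from collections import defaultdict

def group_retries(rows, bucket_seconds=3600):
    """Group upload attempts by (ip, user_agent, time bucket), so that
    several attempts in the same bucket can be treated as retries by one
    user -- a rough stand-in for real session reconstruction."""
    buckets = defaultdict(list)
    for ip, ua, ts in rows:
        buckets[(ip, ua, ts // bucket_seconds)].append(ts)
    # Keep only the groups with more than one attempt: the likely retries.
    return {key: len(times) for key, times in buckets.items() if len(times) > 1}

rows = [
    ("10.0.0.1", "Firefox", 100),
    ("10.0.0.1", "Firefox", 900),   # same user, same hour: likely a retry
    ("10.0.0.2", "Chrome", 200),    # different user: not grouped
]
print(group_retries(rows))  # {('10.0.0.1', 'Firefox', 0): 2}
```

Fixed buckets will split a retry pair that straddles a bucket boundary, so this undercounts slightly; it is only meant to separate "final dropoffs" from obvious restarts, as suggested above.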
After working on an article on English Wikipedia, I came to realise that it
might be useful if we had a slideshow feature for media for use in articles.
I was informed that Hebrew Wikipedia has a fantastic slideshow template
that can be used in articles.
The design is very sleek and it would no doubt be a fantastic addition to
English Wikipedia.
I've left a message for the person responsible for this template on he.wp
asking if they can help create it for English Wikipedia, but I have been
informed that they are basically semi-retired/on extended wikibreak.
Would anyone out there like to take this on board and get it created for
English Wikipedia at the earliest convenience? It can be tested live on the
article I am working on at the moment if need be.
CC'ing Multimedia team
Maryana, this could be something interesting for the Mobile Web team
to look at to optimize image delivery.
Have you guys done any perf work around images?
On Thu, Jun 5, 2014 at 4:10 PM, Yuri Astrakhan <yastrakhan(a)wikimedia.org> wrote:
> The reduced-quality images feature is now live in production. To see it for
> yourself, compare the original with the low-quality image (253KB => 99.9KB,
> ~60% smaller).
> The quality reduction is triggered by adding "qlow-" in front of the file
> name's pixel size.
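The naming convention described in the quoted message ("qlow-" prefixed to the pixel size in the file name) could be sketched as follows; the exact thumbnail URL layout here is an assumption for illustration:

```python
import re

def to_low_quality(thumb_url):
    """Prefix 'qlow-' to the pixel-size part of a thumbnail file name,
    e.g. '640px-Foo.jpg' -> 'qlow-640px-Foo.jpg'.
    The thumbnail URL layout used here is an assumption."""
    return re.sub(r"/(\d+px-[^/]+)$", r"/qlow-\1", thumb_url)

url = "https://upload.example.org/thumb/a/ab/Foo.jpg/640px-Foo.jpg"
print(to_low_quality(url))
# .../thumb/a/ab/Foo.jpg/qlow-640px-Foo.jpg
```

Because the variant lives at a distinct URL, either the client (JS) or an edge rewrite can select it, which is exactly the trade-off discussed below.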
> Continuing our previous discussion, now we need to figure out how to best
> use this feature. As covered before, there are two main approaches:
> * JS-based rewrite - the client picks which image to load based on
> network/device/user preference conditions. Issues may include multiple
> downloads of the same image (if the browser starts the download before JS
> runs), parser cache fragmentation.
> * Varnish-based rewrite - Varnish decides which image to serve under the
> same URL. This approach requires Varnish to know everything needed to make
> that decision.
> Zero plans to go the first route, but if we make it mobile-wide, or even
> site-wide, all the better.
As most of you know, we document WMF engineering activities on
mediawiki.org using "activity pages", which is just a fancy word for
pages that have an infobox. We can then list the activities in many
places, like the Wikimedia Engineering portal (
https://www.mediawiki.org/wiki/Wikimedia_Engineering ) and the status
dashboard ( https://www.mediawiki.org/wiki/Wikimedia_Engineering/Dashboard ).
Most of the activities are about a particular project, like
"Phabricator migration" or "Flow". Multimedia is a bit awkward because
it's about a team rather than the projects you guys work on.
It might have made sense previously (for example if the team was
touching a lot of different pieces of Multimedia) but my understanding
from the Wikimania workshops is that the Multimedia team plans to
mostly focus on two main projects this fiscal year: UploadWizard and
Structured Data.
Therefore, I'd like to recommend that we make those two projects
actual "activities", with a dedicated infobox and status updates.
Other, smaller multimedia-related bits like MediaViewer could still be
in the catch-all "Multimedia" activity.
This wouldn't change anything for most of you; the only visible
difference would be that you would report on UploadWizard and
Structured Data on a different page. It would be more consistent with
the rest of WMF engineering, and it would be easier for the rest of
the community to follow your work on each project.
Unless there are strong objections to this proposal, I'm happy to add
the infoboxes myself, but I wanted to ask here first :) Let me know if
you have any questions.
Technical Communications Manager — Wikimedia Foundation