SUMMARY: This week I experienced an issue when uploading several
hundred very high resolution maps as part of the NYPL maps project.[1]
Discussion has been going on in several places and this thread is an
attempt to share a discussion in one place so all users can benefit.
[Gilles, Could you join this low volume open email list to keep track
of GWT issues and be a voice for WMF Operations to help us reach a
recommendation for end user best practices?]
HISTORY
Compared with our other GLAM projects, my upload was unusually stressful
for the WMF servers. Individual map scans are images of up to 300 MB, and
resolutions can exceed 80 megapixels (80 million pixels). There are 20,000
tiff images to be uploaded; I have completed around 12%. I used the
GLAMtoolset at full capacity (20 threads) though I had broken the xml
file up, so runs were a few hundred images at a time. My intention was
to ramp this up to a couple of thousand per upload "tranche".
I was contacted on Tuesday by operations asking for me to suspend the
upload as the demand for attempted thumbnail rendering of the tiff
images was too high a load on WMF servers.[2] Over 500 of the tiff
images were greater than 50 megapixels and as a consequence Commons
fails to render any thumbnails (thumbnails are created for jpegs greater
than this limit; this is a tiff-specific constraint).[3]
CURRENT STATE
With no obvious immediate fix/work-around on the table from WMF ops, I
have proposed to re-start my uploads for this project with an
effective throttle by using 2 threads (this is a setting on the first
screen of the GWToolset). In practice, having tried a run of a couple
of hundred, this means that the tool is uploading 100MB sized images
at a rate of 2 every 5 minutes. This does not seem to be causing any
issues.
WAY FORWARD
In the longer term the WMF is looking at alternatives for rendering
tiff thumbnails which will enable 50MP+ images to be handled; this may
or may not help solve the problem seen this week.[4]
I recommend that the GWToolset on-wiki guides include a recommendation
about how to choose the number of processing threads based on the
types of images to be uploaded. To date, no other project has seen
these problems, probably because the image resolutions fall well under
the 50MP threshold. The maximum allowed number of threads is 20, with
a default of 10. For the time being I suggest that we agree a best
practice that upload projects with tiffs over 50MP use no more
than 2 threads; these problems do not appear to exist for
projects uploading smaller resolution files.
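
As a rough illustration of that proposed rule of thumb, here is a minimal sketch in Python. The function name `recommended_threads` and its input format are hypothetical and not part of GWToolset; only the 50MP threshold, the 2-thread cap and the default of 10 threads come from the discussion above.

```python
TIFF_THUMBNAIL_LIMIT_MP = 50  # tiffs above this fail to get thumbnails on Commons
DEFAULT_THREADS = 10          # GWToolset default mentioned in this thread
LARGE_TIFF_THREADS = 2        # proposed cap for batches containing 50MP+ tiffs


def recommended_threads(images):
    """Suggest a thread count for one upload batch.

    `images` is an iterable of (file_extension, megapixels) pairs; that
    input shape is an assumption made for this example only.
    """
    for extension, megapixels in images:
        is_tiff = extension.lower().lstrip(".") in ("tif", "tiff")
        if is_tiff and megapixels > TIFF_THUMBNAIL_LIMIT_MP:
            # One oversized tiff is enough to throttle the whole batch.
            return LARGE_TIFF_THREADS
    return DEFAULT_THREADS
```

A batch such as `[("tiff", 80), ("jpg", 90)]` would get 2 threads under this rule, while a batch of large jpegs or small tiffs would stay at the default.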
I propose that WMF Operations consider finding ways of testing the
peak loads possible from the GWT and decide if this can be fixed by
future operational improvements, whether the tool might benefit from
some simple "load management" changes, or if establishing a best
practice for our (relatively) small number of GWT users would be a
sufficient community based control.
Links
1. https://commons.wikimedia.org/wiki/Commons:Batch_uploading/NYPL_Maps
2. https://commons.wikimedia.org/wiki/Commons_talk:Batch_uploading/NYPL_Maps
3. https://commons.wikimedia.org/wiki/Category:NYPL_maps_%28over_50_megapixels…
4. https://bugzilla.wikimedia.org/show_bug.cgi?id=52045
Fae
--
faewik(a)gmail.com https://commons.wikimedia.org/wiki/User:Fae
Dear Erik,
(Also copying in the Cultural Partners and GLAMwiki Toolset mailing lists
as Erik's email below is directly related to them).
Thank you for this email with the explicit invitation for groups in the
Wikimedia movement to directly take responsibility for supporting the
technology needs of GLAM partnerships. Different groups in the movement
have different capacities and different areas of priority - and that is how
it should be :-) We each need to try and 'bite off what we can chew' in a
way that is coordinated, mutually beneficial, and not a duplication of each
others' efforts.
To that end...
Over the last couple of years *Europeana*[1] has been increasingly involved
in supporting tech development for mediawiki that is specifically targeted
at addressing the needs of the GLAMwiki community. I note that the report
you linked to on the stats that GLAMs want[2] and also the GLAMwiki Toolset
for mass multimedia upload which you also mentioned[3] are both
*Europeana* projects, in collaboration with several European Wikimedia
Chapters.
On behalf of *Europeana* I would like to confirm that we wish to become
even more involved in this area and have every intention of supporting
further development in partnership with interested Chapters when possible.
In the fullness of time, we intend to apply for a WMF grant in order to
enable precisely that.
On the mediawiki.org discussion page for the 2014/15 Engineering goals
there has been a fair bit of discussion about GLAM-related projects that
are not in the WMF's own plans[4]. Fabrice, as "process owner" for the
Multimedia section of those goals, has proposed on that talkpage a couple
of meetings of interested parties to discuss how we can all work together
effectively on this, notably in person at Wikimania, an offer which we
definitely accept :-) I also agree with Illario's point that formalising WMF
support for externally-developed software is an important criterion in any
grant decisions and for organisational reputation. Fortunately Fabrice has
specifically addressed this issue in relation to the GLAMwiki
Toolset, which is very helpful.[5]
Sincerely,
Liam / Wittylama
GLAMWIKI coordinator, Europeana.
[1] http://pro.europeana.eu/
[2] https://upload.wikimedia.org/wikipedia/commons/a/a2/Report_on_requirements_for_usage_and_reuse_statistics_for_GLAM_content.pdf
[3] https://commons.wikimedia.org/wiki/Commons:GLAMwiki_Toolset_Project
[4] https://www.mediawiki.org/wiki/Talk:Wikimedia_Engineering/2014-15_Goals#Image_view_analytics
[5] https://www.mediawiki.org/wiki/Talk:Wikimedia_Engineering/2014-15_Goals#GLAMwiki_Toolset
wittylama.com
Peace, love & metadata
On 26 June 2014 05:54, Erik Moeller <erik(a)wikimedia.org> wrote:
> Hi folks,
>
> At the Zurich Hackathon, I met with a couple of folks from WM-CH who
> were interested in talking about ways that chapters can get involved
> in engineering/product development, similar to WM-DE's work on
> Wikidata.
>
> My recommendation to them was to consider working on GLAM-related
> tooling. This includes helping improve some of the reporting tools
> currently running in Labs (primarily developed by the illustrious and
> wonderful Magnus Manske in his spare time), but also meeting other
> requirements identified by the GLAM community [1] and potentially
> helping with the development of more complex MediaWiki-integrated
> tools like the GLAMWiki-Toolset.
>
> There's work that only WMF is well positioned to do (like feeding all
> media view data into Hadoop and providing generalized reports and
> APIs), but a lot of work in the aforementioned categories could be
> done by any chapter and could easily be scaled up from 1 to 2 to 3
> FTEs and beyond as warranted. That's because a lot of the tools are
> separate from MediaWiki, so code review and integration requirements
> are lower, and it's easier for technically proficient folks to help.
>
> In short, I think this could provide a nice on-ramp for a chapter or
> chapters to support the work of volunteers in the cultural sector with
> appropriate technology. This availability of appropriate technology is
> clearly increasingly a distinguishing factor for Wikimedia relative to
> more commercial offerings in its appeal to the cultural sector.
>
> At the same time, WMF itself doesn't currently prioritize work with
> the cultural sector very highly, which I think is appropriate given
> all the other problems we have to solve. So if this kind of work has
> to compete for attention with much more basic improvements to say the
> uploading pipeline or the editing tools, it's going to lose. Therefore
> I think having a "cultural tooling" team or teams in the larger
> movement would be appropriate.
>
> I've not heard back from WM-CH yet on this, but I also don't think
> it's an exclusive suggestion, so wanted to put the idea in people's
> heads in case other organizations in the movement want to help with
> it. I do want WMF to solve the larger infrastructure problems, but the
> more specialized tooling is likely _not_ going to be high on our
> agenda anytime soon.
>
> Thanks,
> Erik
>
> [1]
> https://upload.wikimedia.org/wikipedia/commons/a/a2/Report_on_requirements_…
>
> --
> Erik Möller
> VP of Engineering and Product Development, Wikimedia Foundation
>
> _______________________________________________
> Wikimedia-l mailing list, guidelines at:
> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> Wikimedia-l(a)lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
This is a simple-sounding question, but I have two uploads going on in
parallel right now, one using 8 processing threads and the other using
16, so a total of 24. None of these files is huge; they seem to be
under 15MB, with an occasional outlier around 45MB (though quite a few
drawing scans break the TIFF max size barrier of 50MP even though
these are only a minuscule ~2.5MB in filesize).
GWT was designed for a maximum of 20 threads, and I don't know whether
to feel guilty at running 24 threads this way, even though these
uploads are unlikely to break anything.
Any thoughts? If what I'm doing is somehow self-regulating, I would be
tempted to add another job and bump the "volume" to 40 or more
threads, as this particular upload has over 100,000 images
(potentially 200,000) and I'd rather it didn't take over a month to
complete (which is what it is looking like right now at a rate of
2,800 images per day).
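
For reference, the arithmetic behind that estimate can be sketched as follows. This is a back-of-the-envelope calculation that assumes the observed rate of 2,800 images per day stays constant; the helper name `days_to_complete` is made up for this example.

```python
import math


def days_to_complete(total_images, images_per_day):
    """Whole days needed to upload `total_images` at a constant daily rate."""
    return math.ceil(total_images / images_per_day)


print(days_to_complete(100_000, 2_800))  # 36
print(days_to_complete(200_000, 2_800))  # 72
```

So the job takes a little over five weeks at the lower estimate and over ten weeks if the collection turns out to hold 200,000 images.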
Fae
--
faewik(a)gmail.com https://commons.wikimedia.org/wiki/User:Fae
I have an unusual scenario where the GLAM might make images available
either under a one-shot deal where the source website might vanish
after upload (possibly a transient FTP share) or where I am allowed to
systematically use high resolution links that are not normally
directly declared but only passed on after CAPTCHA checks that the
requester is human. In part, the CAPTCHA is in place to avoid having
bots bring down their servers with a flood of requests; there seems to
be little in the way of automated throttling to handle these events.
Can we have the option of not saving the link-to-media-file in the
metadata on the image page? I could remove it with a post-upload bot,
however it would remain in the history and therefore be potentially
data-mine-able.
The link would effectively be replaced by a link to the catalogue
page, where a user can navigate to the same high-resolution file after
passing the CAPTCHA.
Fae
--
faewik(a)gmail.com https://commons.wikimedia.org/wiki/User:Fae
This email (below) just got sent out by Erik Moeller. Do you think it is
appropriate for us to contact him (and/or Fabrice - who is responsible for
the 'multimedia' section) to ask where the GWT might sit within the WMF
Tech plans?
The document is specifically the WMF's own plans, not about grants, but
there are dependencies and aspects of what we want to do with GWT (on the
assumption that we successfully apply for a grant to fund round 2 of
development) that interrelate with the WMF's core priorities.
-Liam
---------- Forwarded message ----------
From: *Erik Moeller* <erik(a)wikimedia.org>
Date: Tuesday, 10 June 2014
Subject: [Wikimedia-l] First _draft_ goals for WMF engineering/product
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>, Wikimedia
Mailing List <wikimedia-l(a)lists.wikimedia.org>
Hi all,
We've got the first DRAFT (sorry for shouting, but can't hurt to
emphasize :)) of the annual goals for the engineering/product
department up on mediawiki.org. We're now mid-point in the process,
and will finalize through June.
https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals
Note that at this point in the process, teams have flagged
inter-dependencies, but they've not necessarily been taken into
account across the board, i.e. team A may say "We depend on X from
team B" and team B may not have sufficiently accounted for X in its
goals. :P Identifying common themes, shared dependencies, and
counteracting silo tendencies is the main focus of the coming weeks.
We may also add whole new sections for cross-functional efforts not
currently reflected (e.g. UX standardization). Site performance will
likely get its own section as well.
My own focus will be on fleshing out the overall narrative, aligning
around organization-wide objectives, and helping to manage scope.
As far as quantitative targets are concerned, we will aim to set them
where we have solid baselines and some prior experience to work with
(a good example is Wikipedia Zero, where we now have lots of data to
build targets from). Otherwise, though, our goal should be to _obtain_
metrics that we want to track and build targets from. This, in itself,
is a goal that needs to be reflected, including expectations e.g. from
Analytics.
Like last year, these goals won't be set in stone. At least on a
quarterly basis, we'll update them to reflect what we're learning.
Some areas (e.g. scary new features like Flow) are more likely to be
significantly revised than others.
With this in mind: Please leave any comments/questions on the talk
page (not here). Collectively we're smarter than on our own, so we do
appreciate honest feedback:
- What are our blind spots? Obvious, really high priority things we're
not paying sufficient attention to?
- Where are we taking on too much? Which projects/goals make no sense
to you and require a stronger rationale, if they're to be undertaken at all?
- Which projects are a Big Deal from a community perspective, or from
an architecture perspective, and need to be carefully coordinated?
These are all conversations we'll have in coming weeks, but public
feedback is very helpful and may trigger conversations that otherwise
wouldn't happen.
Please also help to carry this conversation into the wikis in coming
weeks. Again, this won't be the only opportunity to influence, and
I'll be thinking more about how the quarterly review process can also
account for community feedback.
Warmly,
Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
--
wittylama.com
Peace, love & metadata
Thanks a lot, Dan,
I will make sure that they have all this information today. We will also fix the date with them; it would be awesome if you could join on Friday 13!
Cheers
Charles
___________________________________________________________
Charles ANDRES, Chief Science Officer
"Wikimedia CH" – Association for the advancement of free knowledge –
www.wikimedia.ch
Office +41 (0)21 340 66 21
Mobile +41 (0)78 910 00 97
Skype: charles.andres.wmch
IRC://irc.freenode.net/wikimedia-ch
http://prezi.com/user/Andrescharles/
On 5 June 2014 at 08:11, dan entous <d_entous(a)yahoo.com> wrote:
> hi charles,
>
> a couple of logistic items:
>
> 1. are the google developers already familiar with the mediawiki framework?
> 2. do they already have developer environments set-up?
> 3. this page may be helpful http://www.mediawiki.org/wiki/MediaWiki-Vagrant
>
>
> with kind regards,
> dan
>
> sent from my mind to yours
> --------------------------
> (function signature() {
> return [
> 'dan entous',
> 't. +31 (20) 684.5005',
> 'm. +31 (64) 024.6187'
> ].join('\n');
> }());
>
> On Jun 3, 2014, at 15:46 , David Haskiya <david.haskiya(a)europeana.eu> wrote:
>
>> Hi Charles,
>> Sounds like an opportunity!
>>
>> From the Europeana side we would certainly volunteer to make Dan available remotely. And he'd be willing to fly down to Zurich as well if you think it valuable.
>>
>> Cheers,
>> David
>> From: glamtools-bounces(a)lists.wikimedia.org [glamtools-bounces(a)lists.wikimedia.org] on behalf of charles andrès [charles.andres(a)wikimedia.ch]
>> Sent: 03 June 2014 14:03
>> To: Conversations revolving around the development of GLAM Digital Tools; europeana-steering-group(a)lists.wmnederland.nl
>> Cc: Muriel Staub
>> Subject: [Glamtools] Google Serve in Zurich, 9-14 June 2014
>>
>> We have been contacted by the Google office in Zurich; they would be interested in doing one day of hacking for the wikimedia movement.
>>
>> Google hosts a "Google Serve" Day each year, where Googlers volunteer for one day for a project of their choice. David Furrer, who organizes "Google Serve" for the Google office Zurich, had the idea of having several Googlers volunteer for Wikimedia for this one day. We are hoping for 10 google developers.
>>
>> At Wikimedia CH we would like to focus on open bugs of the GWToolset and Kiwix, but we would need the help of developers to guide the google developers.
>>
>> Because they are google, they are used to interacting remotely, so the presence of wikimedians in the Zurich office is not mandatory, but it would be best.
>>
>> The bad news is that the window for this event is really short: it’s next week!
>>
>> That’s why I created a doodle to see who can be available to help us during this day: http://doodle.com/2db4m8hqc59hyip9
>>
>> If you want to come to Zurich in person, WMCH can cover the cost if reasonable.
>>
>> If you are not able to join in person or remotely, it would be awesome if you could help by reviewing the open bugs, helping us to identify the most important ones, and opening new ones if you think it’s appropriate.
>>
>> Thanks for your help
>>
>>
>> Charles
>>
>>
>> ___________________________________________________________
>>
>> Charles ANDRES, Chief Science Officer
>> "Wikimedia CH" – Association for the advancement of free knowledge –
>> www.wikimedia.ch
>> Office +41 (0)21 340 66 21
>> Mobile +41 (0)78 910 00 97
>> Skype: charles.andres.wmch
>> IRC://irc.freenode.net/wikimedia-ch
>> http://prezi.com/user/Andrescharles/
>>
>>
>> David Haskiya
>> Product Development Manager
>>
>> T: +31 (0)70 314 0696
>> M: +31 (0)64 217 2542
>> E: david.haskiya(a)europeana.eu
>> Skype: davidhaskiya
>>
>> If you’re interested in Europe’s cultural heritage, sign up for our newsletter at http://eepurl.com/SAaC5 and start receiving our monthly eNews, in English or French!
>>
>> Europeana makes Europe’s culture available for all, across borders and generations and for creative re-use – follow how at #AllezCulture
>>
>> _______________________________________________
>> Glamtools mailing list
>> Glamtools(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/glamtools
>
> _______________________________________________
> Europeana-steering-group mailing list
> Europeana-steering-group(a)lists.wmnederland.nl
> https://lists.wmnederland.nl/cgi-bin/mailman/listinfo/europeana-steering-gr…
I managed to figure out what the problem was with the filename: the
source xml had a <xml:lang="nl"> element in it that contained quotation
marks, so the tool wouldn't accept that line as a title, since most
punctuation marks are not allowed in Commons filenames. As Fae suggested
earlier, it would be nice if this flaw resulted in an error message rather
than an obscure filename. Maybe that is something that can be fixed at a
later time.
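
To make the suggestion concrete, here is a minimal sketch in Python of the kind of check that would report the problem instead of silently producing an odd filename. It is not GWToolset code: the function `build_title` is hypothetical, and the set of rejected characters is an assumption for illustration, not the authoritative Commons rule set.

```python
# Characters rejected by this example. The set is an assumption made for
# illustration and may not match what Commons actually disallows.
FORBIDDEN_TITLE_CHARS = set('#<>[]|{}"')


def build_title(parts, extension):
    """Join mapped metadata fields into a filename, or raise on bad input."""
    for part in parts:
        if not isinstance(part, str):
            # Refuse non-text values rather than turning them into text.
            raise TypeError(f"title part is not a string: {part!r}")
        bad = sorted(set(part) & FORBIDDEN_TITLE_CHARS)
        if bad:
            raise ValueError(
                f"title part {part!r} contains forbidden characters: {bad}"
            )
    return "-".join(parts) + "." + extension
```

With a check like this, a title field containing quotation marks would stop the upload with a readable message naming the offending field, rather than ending up as something like 'Array-array.ogv'.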
Best,
Jesse
2014-06-02 15:34 GMT+02:00 Jesse de Vos <jdvos(a)beeldengeluid.nl>:
> Hi everyone,
>
> If I run an upload in both Beta and production the title always changes to
> 'Array-array.ogv'. In the metadata mapping I create a title by combining
> the fields: 'oi:title' and 'oi:identifier'. 'Array-array' is nowhere to be
> found in my xml.
>
> Can anyone tell how I can make sure the title comes out correctly?
>
> Cheers,
> Jesse
>
> --
>
> Kind regards,
>
> *Jesse de Vos*
> GLAM-wiki coordinator
>
> *T* 035 - 677 39 37
> *Present:* Mon, Tue, Thu
>
> <http://www.beeldengeluid.nl/>
>
> *Nederlands Instituut voor Beeld en Geluid*
>
> *Media Parkboulevard 1, 1217 WE Hilversum | Postbus 1060, 1200 BB Hilversum | *
> *beeldengeluid.nl* <http://www.beeldengeluid.nl/>
>
--
Kind regards,
*Jesse de Vos*
GLAM-wiki coordinator
*T* 035 - 677 39 37
*Present:* Mon, Tue, Thu
<http://www.beeldengeluid.nl/>
*Nederlands Instituut voor Beeld en Geluid*
*Media Parkboulevard 1, 1217 WE Hilversum | Postbus 1060, 1200 BB
Hilversum | *
*beeldengeluid.nl* <http://www.beeldengeluid.nl/>