Re: [Commons-l] GSoC 2016 | Porting catimages to pywikibot-core

18 May 2016

(Just replying on Commons-l with a non-tech observation. If more tech
stuff arises I'll add it to Phabricator instead)

This looks like a useful contained project, though a lot to be done in
12 weeks. :-)

I was not familiar with catimages.py. It would be great if the module
were quick and easy to try out for the preparation or housekeeping of
large batch uploads. As Commons grows we are seeing more donations of
over 10,000 images, and have had a few with over 1 million.
Uploads of this size make manual categorization a huge hurdle, so
automatic 'tagging' of image characteristics would be a useful way of
breaking down such a large batch to highlight the more interesting
outliers or mistakes, which can then be prioritized on a backlog for
human review.

For example, in my upload projects I have problems detecting:
* incomplete uploads resulting from server failures. Checksum
comparisons would mean re-downloading files, which would waste
bandwidth, but local image analysis would highlight these.
* uploads that are mostly blank pages in old scanned books. I have a
simple detection process, but it would be neat to have a common,
standard way of doing this.
* distinguishing between scans with diagrams and line
drawings/cartoons, printed old photographs, newsprint and text pages.
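
To make the first two cases concrete, here is a minimal Python sketch
using only the standard library. It is not how catimages.py works and
not my own detection process. The function names, the thresholds, and
the idea of checking the JPEG end-of-image marker are assumptions made
purely for illustration. It also assumes the page has already been
decoded to 8-bit grayscale pixel values by some image library.

```python
def looks_truncated_jpeg(data: bytes) -> bool:
    """Heuristic check for an incomplete JPEG (hypothetical helper).

    A complete JPEG stream starts with the SOI marker (FF D8) and ends
    with the EOI marker (FF D9). An upload cut short by a server
    failure will usually be missing the trailing EOI marker.
    """
    if not data.startswith(b"\xff\xd8"):
        return True  # header missing or damaged
    # Assumption: tolerate trailing zero padding after the EOI marker.
    return not data.rstrip(b"\x00").endswith(b"\xff\xd9")


def blank_fraction(pixels, white_threshold=230):
    """Fraction of 8-bit grayscale pixels at or above white_threshold."""
    pixels = list(pixels)
    if not pixels:
        raise ValueError("no pixels supplied")
    return sum(1 for p in pixels if p >= white_threshold) / len(pixels)


def is_mostly_blank(pixels, white_threshold=230, min_fraction=0.995):
    """True when nearly every pixel is close to white.

    Both thresholds are illustrative guesses and would need tuning
    against real scans (yellowed paper, show-through, page edges).
    """
    return blank_fraction(pixels, white_threshold) >= min_fraction


# Made-up data, just to show the calls.
complete = b"\xff\xd8" + b"\x00" * 16 + b"\xff\xd9"
print(looks_truncated_jpeg(complete))       # False
print(looks_truncated_jpeg(complete[:-5]))  # True

blank_page = [255] * 9990 + [0] * 10
text_page = [255] * 9000 + [0] * 1000
print(is_mostly_blank(blank_page))  # True
print(is_mostly_blank(text_page))   # False
```

Both checks are cheap enough to run over a whole batch before deciding
which files go on a backlog for human review. The marker check can
miss files that are damaged in the middle, so it only narrows down the
candidates.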

It would be great if the testing routines you use during the project
could tackle any of these and be written up as practical case studies.

As well as the Phabricator write-up/tracking of the project, it would
be useful to have an on-wiki user guide on Commons or MediaWiki. Perhaps
this can be sketched out as you go along during the project, giving an
insight into what other users or amateur Python programmers might do
to customize or make better use of the module? Having an easy-to-find
manual might keep others from going off on their own tangents with
various off-the-shelf image modules, when they could just plug in
catimages with a smallish amount of configuration.

P.S. If you would like to test the tool on some large collections with
predictable formats, try looking through <
https://commons.wikimedia.org/wiki/User:Fae/Project list >. The half
million images in the book plates project would be an interesting
sample set.

Thanks,
Fae

On 18 May 2016 at 02:53, Abdeali Kothari <abdealikothari(a)gmail.com> wrote:
...
  Hi,

 I'm a student from Chennai, India and my project is going to be related to
 performing image processing on the images on commons.wikimedia to automate
 categorization. DrTrigon wrote the script catimages.py a few years ago
 for the old pywikipedia-bot framework. I'll be working towards
 updating the script to the pywikibot-core framework, updating its
 dependencies, and using newer techniques when possible.

 catimages.py is a script that analyzes an image using various computer
 vision algorithms and assigns categories to the image on Commons. For
 example, consider algorithms that detect faces, barcodes, etc. The script
 uses these to categorize images to Category:Unidentified People,
 Category:Barcode, and so on.

 If you have any suggestions and categorizations you think might be useful to
 you, drop in at #gsoc-catimages on freenode or my talk page[0]. You can find
 out more about me on User:AbdealiJK[1] and about the project at T129611[2].

 Regards

 [0] - https://commons.wikimedia.org/wiki/User_talk:AbdealiJK
 [1] - https://meta.wikimedia.org/wiki/User:AbdealiJK
 [2] - https://phabricator.wikimedia.org/T129611

 _______________________________________________
 Commons-l mailing list
 Commons-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/commons-l

-- 
faewik(a)gmail.com https://commons.wikimedia.org/wiki/User:Fae
Personal and confidential, please do not circulate or re-quote.
