Thanks a lot for the ideas!
The ideas you mentioned are awesome, and something I'll definitely look into.
The second and third ideas mentioned are, I believe, doable within the
scope of my GSoC. For the first idea to be implemented, as you mentioned,
local image analysis would be needed, which we've not planned (but I'll add
it to the "to plan" list :) ). Currently we're planning on downloading the
image and performing the analysis on ToolsLab or a personal computer.
Thank you for the project list! I was looking for a good dataset to test
things out on and this will be immensely helpful.
On Wed, May 18, 2016 at 5:25 PM, Fæ <faewik(a)gmail.com> wrote:
(Just replying on Commons-l with a non-tech observation. If more tech
stuff arises I'll add it to Phabricator instead)
This looks like a useful, contained project, though there is a lot to be
done in 12 weeks. :-)
I was not familiar with catimages.py. It would be great if using the
module for the preparation or housekeeping of large batch uploads were
easy and not time-consuming to try. As Commons grows we are seeing
more donations of over 10,000 images, and have had a few with over 1 million.
Uploads of this size make manual categorization a huge hurdle, so
automatic 'tagging' of image characteristics would be a useful way of
breaking down such a large batch to highlight the more interesting
outliers or mistakes, which can then be prioritized on a backlog for review.
For example, in my upload projects I have problems detecting:
* incomplete uploads resulting from server failures. Checksum
comparisons would mean re-downloading files, which would be
unnecessarily bandwidth expensive, but local image analysis would avoid
that.
* uploads that are mostly blank pages in old scanned books. I have a
simple detection process, but it would be neat to have a common,
standard way of doing this.
* distinguishing between scans with diagrams and line
drawings/cartoons, printed old photographs, newsprint and text pages.
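The first two checks above can be prototyped with very little code. Here is a rough sketch; the function names and thresholds are my own illustrative guesses, not anything from catimages.py, and a real batch would need tuning:

```python
# Two cheap local heuristics: a truncation check that only inspects a
# file's leading and trailing bytes, and a blank-page score for scans.

JPEG_EOI = b"\xff\xd9"            # JPEG End Of Image trailer
PNG_IEND = b"IEND\xaeB`\x82"      # last 8 bytes of a well-formed PNG

def looks_complete(data):
    """Cheap truncation check: does the file end with its format's trailer?"""
    if data[:2] == b"\xff\xd8":               # JPEG magic number
        return data.endswith(JPEG_EOI)
    if data[:8] == b"\x89PNG\r\n\x1a\n":      # PNG magic number
        return data.endswith(PNG_IEND)
    return True  # unknown format: don't flag what we can't judge

def blank_score(pixels, white_threshold=245):
    """Fraction of near-white pixels in a flat list of 0-255 grayscale values."""
    return sum(1 for p in pixels if p >= white_threshold) / len(pixels)

def is_mostly_blank(pixels, ratio=0.99):
    """Treat a scanned page as blank if almost every pixel is near-white."""
    return blank_score(pixels) >= ratio
```

An upload that stopped mid-transfer usually loses its final trailer bytes, so the first check flags it without re-downloading or even decoding the image; the blank-page ratio would need adjusting per scan batch (paper tone, scanner noise).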
It would be great if the testing routines you use during the project
could tackle any of these and be written up as practical case studies.
As well as the Phabricator write-up/tracking of the project, it would
be useful to have an on-wiki Commons or Mediawiki user guide. Perhaps
this can be sketched out as you go along during the project, giving an
insight into what other users or amateur Python programmers might do
to customize or make better use of the module? Having a more easily
found manual might avoid others going off on their own tangents with
various off-the-shelf image modules, when they could just plug in
catimages with a smallish amount of configuration.
P.S. If you would like to test the tool on some large collections with
predictable formats, try looking through <
list >. The 1/2
million images in the book plates project would be an interesting test
case.
On 18 May 2016 at 02:53, Abdeali Kothari <abdealikothari(a)gmail.com>
wrote:
I'm a student from Chennai, India, and my project is going to involve
performing image processing on the images on commons.wikimedia to aid
categorization. DrTrigon made the script catimages.py a few years ago
in the old pywikipedia-bot framework. I'll be working on updating the
script to the pywikibot-core framework, updating its dependencies, and
using newer techniques where possible.
catimages.py is a script that analyzes an image using various computer
vision algorithms and allots categories to the image on commons. For
example, consider algorithms that detect faces, barcodes, etc. The
script uses these to categorize images into Category:Unidentified People,
Category:Barcode, and so on.
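As a toy illustration of that last step (the detector names and the dispatch code here are my own sketch, not catimages.py's actual internals):

```python
# Hypothetical mapping from fired detectors to Commons categories,
# illustrating the categorization step described above.
DETECTOR_CATEGORIES = {
    "face": "Category:Unidentified People",
    "barcode": "Category:Barcode",
}

def categories_for(detections):
    """Return the Commons categories implied by a set of detector hits."""
    return sorted(DETECTOR_CATEGORIES[d]
                  for d in detections if d in DETECTOR_CATEGORIES)
```

Detectors the mapping doesn't know about are simply ignored, so new computer-vision checks can be added without touching the category-assignment code.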
If you have any suggestions or categorizations you think might be of use
to you, drop in at #gsoc-catimages on freenode or my talk page. You can
find out more about me on User:AbdealiJK and about the project at
 - https://commons.wikimedia.org/wiki/User_talk:AbdealiJK
 - https://meta.wikimedia.org/wiki/User:AbdealiJK
 - https://phabricator.wikimedia.org/T129611
Commons-l mailing list
Personal and confidential, please do not circulate or re-quote.