[Foundation-l] Wikix Image Detection in enwiki dumps

Jeffrey V. Merkey jmerkey at wolfmountaingroup.com
Thu Mar 1 09:56:47 UTC 2007


I have integrated the wikix program, which extracts image files and 
analyzes image template usage in the enwiki dumps, into the AI engine I 
use for machine translation projects. I use wikix to sync up with the 
Wikipedia image repository. As a side effect, it has yielded some 
results which may be useful to the community on the English Wikipedia.

During analysis of the last dumps, posted as enwiki-20070206, the program 
identified every tag used in templates for image tagging on the English 
Wikipedia, as well as all suspect image files which may be trojans, 
viruses, or other types of content that have been uploaded to the site 
as images.

The image files and data from the English Wikipedia are grouped into the 
output logs listed below. Not all of the files are trojans, and some of 
them are probably OK, but some may not be, particularly files named 
"spoof" and MS Word files, which can contain VB5 virus code if downloaded 
from Wikipedia. At any rate, the list of files and the articles which 
link to them is provided, and it may be useful for someone to review 
these files, since they appear to be file types which can harbor viruses 
and trojans. They are files I will not be hosting or pulling into 
Wikigadugi, since they may contain malicious code.

images.log - all image files referenced in the last enwiki dumps
reject.log - all suspect files which may be viruses or trojans, listed by 
the title of the article which links to each file
fragment.log - all templates and image tags used in templates which 
alias to the Image: directive at some point through the website logic, 
along with the first article title in which they appear. (It is 
interesting to see how many tags people create in templates to map to 
Image:.)
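
For anyone who wants to run a similar check on their own copy of a dump, 
here is a rough Python sketch of the general idea behind a reject list. 
It is not the wikix code. The extension allowlist, the regular 
expression, and the function names are all assumptions made for 
illustration, and it only looks at direct Image: links, not at template 
aliases like those recorded in fragment.log.

```python
import re

# Assumed allowlist of extensions that normally denote real image files.
# The criteria wikix actually applies are not described in this message.
IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "svg"}

# Matches [[Image:Name.ext]] or [[Image:Name.ext|options...]] in wikitext.
IMAGE_LINK = re.compile(r"\[\[\s*Image\s*:\s*([^\]|]+)", re.IGNORECASE)


def find_image_names(wikitext):
    """Return the file names referenced through direct Image: links."""
    return [name.strip() for name in IMAGE_LINK.findall(wikitext)]


def is_suspect(filename):
    """Flag a file whose extension is missing or not in the allowlist."""
    _, dot, ext = filename.rpartition(".")
    return not dot or ext.lower() not in IMAGE_EXTENSIONS


def split_references(title, wikitext):
    """Return (accepted, rejected) lists of (article title, file name)."""
    accepted, rejected = [], []
    for name in find_image_names(wikitext):
        (rejected if is_suspect(name) else accepted).append((title, name))
    return accepted, rejected
```

Calling split_references("Example", text) on an article's wikitext gives 
two lists that could be written out as lines of an images-style log and a 
reject-style log. A check on the extension alone says nothing about what 
a file really contains, so anything it flags still needs a manual review.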

These logs can be downloaded from:

ftp://www.wikigadugi.org/wiki/xml/wikix-logs.tar.gz

Jeff



