[Foundation-l] Wikix Image Detection in enwiki dumps
Jeffrey V. Merkey
jmerkey at wolfmountaingroup.com
Thu Mar 1 09:56:47 UTC 2007
I have integrated the wikix program, which extracts image files and
analyzes image template usage in the enwiki dumps, into the AI engine I
use for machine translation projects. I use wikix to sync up with the
Wikipedia image repository. As a side effect, it has yielded some
results which may be useful to the community on the English Wikipedia.
During analysis of the last dumps posted as enwiki-20070206, the program
identified all tag usages in templates for image tagging in use on the
English Wikipedia, as well as all suspect image files which may be
trojans, viruses, or other types of content which have been uploaded as
images to the site.
The image files and data from the English Wikipedia are grouped into the
output logs listed below. Not all the files are trojans and some of them
are probably OK, but some may not be, particularly files named "spoof"
and MS Word files, which can contain VB5 virus code if downloaded from
Wikipedia. At any rate, the list of files and the articles which link to
them is provided, and it may be useful for someone to review these files
since they appear to be file types which can harbor viruses and trojans.
They are files I will not be hosting or pulling into Wikigadugi since
they may contain malicious code.
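For anyone who wants to do a similar screening pass on their own copy of
the dumps, here is a rough Python sketch of the general idea. It is not
the wikix code and does not reflect how wikix actually decides what is
suspect. The extension list, the "spoof" name hint, and the function
names are all assumptions made for illustration.

```python
import os

# Hypothetical screening rules: these are illustrative guesses, not the
# rules wikix applies.
SUSPECT_EXTENSIONS = {".doc", ".exe", ".scr", ".bat", ".com", ".vbs"}
SUSPECT_NAME_HINTS = ("spoof",)


def is_suspect(filename):
    """Return True if the file name matches one of the assumed rules."""
    name = filename.lower()
    _, ext = os.path.splitext(name)
    return ext in SUSPECT_EXTENSIONS or any(h in name for h in SUSPECT_NAME_HINTS)


def split_references(references):
    """Split (article_title, filename) pairs into two lists.

    The first list holds every referenced file name. The second holds
    (article_title, filename) pairs for files flagged as suspect.
    """
    images, rejects = [], []
    for title, filename in references:
        images.append(filename)
        if is_suspect(filename):
            rejects.append((title, filename))
    return images, rejects
```

A check by file name only says that a file deserves a closer look; it
says nothing about what the file actually contains.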
images.log - all image files referenced in the last enwiki dumps
reject.log - all suspect files which may be viruses or trojans, listed by
the title of the article which links to the image file
fragment.log - all templates and image tags used in templates which
alias to the Image: directive at some point through the website logic,
and the first article title in which they appear. (It is interesting
to see how many tags people create in templates to map to Image:)
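To give a feel for the kind of scan behind fragment.log, here is a small
Python sketch that finds templates whose wikitext contains a direct
[[Image:...]] link. This is an assumption-laden illustration, not the
wikix implementation: the function name and the input format (a dict of
template name to wikitext) are hypothetical, and it only catches direct
links, not templates that reach Image: through other templates.

```python
import re

# Matches the target of a direct [[Image:...]] link, up to the first
# "|" or "]". Case-insensitive, since the prefix case varies in wikitext.
IMAGE_LINK = re.compile(r"\[\[\s*Image\s*:\s*([^|\]\n]+)", re.IGNORECASE)


def templates_using_image(templates):
    """Map each template name to the Image: targets found in its wikitext.

    `templates` is assumed to be a dict of template name -> raw wikitext.
    Templates with no direct Image: link are left out of the result.
    """
    result = {}
    for name, wikitext in templates.items():
        targets = [t.strip() for t in IMAGE_LINK.findall(wikitext)]
        if targets:
            result[name] = targets
    return result
```

For example, a template body of "[[Image:Stub icon.png|30px]] This is a
stub." would be reported with the target "Stub icon.png".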
These logs can be downloaded from:
ftp://www.wikigadugi.org/wiki/xml/wikix-logs.tar.gz
Jeff