I don't know if there's anyone within Wikimedia doing any kind of
research like this, but it might be interesting if we had someone
attending this to see if there's anything that we could use to improve
our own image search techniques.
I recall we had a bot at one stage on Commons that would grab the
local captions of images used in various projects and then put those
captions back on the Commons image page. Seems a bit similar :)
cheers,
Brianna
user:pfctdayelise
See message 2 at http://linguistlist.org/issues/18/18-2794.html#2
Date: 21-Jan-2008 - 21-Jan-2008
Location: Funchal, Madeira, Portugal
Web Site: http://www.visapp.org/MMIU.htm
Scope
The number of digital images being generated, stored, managed and shared
through the internet is growing at a phenomenal rate. Press and photo
agencies receive and manage thousands or millions of images per day and
end-users (e.g. amateur reporters) can easily participate in the related
professional workflows. In an environment of approximately one billion
photos, searchable in online databases worldwide, finding the most
relevant or the most appealing image for a given task (e.g. to illustrate a
story) has become an extremely difficult process. In these huge
repositories, many images have additional information coming from different
sources.
Information related to the image capture, such as date, location, camera
settings or the name of the photographer, is often available from the digital
camera used to take the photograph. The owner can further add a relevant
title, filename and/or descriptive caption or any other textual reference. If
the image is uploaded to a shared photo collection, additional comments are
frequently added to the image by other users. On the other hand, images
used in documents, e.g. web pages, frequently have captions and surrounding
text. All this information can be considered image metadata and is of value
for organizing, sharing, and processing images.
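As a concrete (if simple) illustration of the capture-time metadata mentioned
above, the sketch below reads EXIF tags with the Pillow library in Python.
This is only an assumed example, not part of the workshop announcement, and
the file name is a placeholder.

```python
# Minimal sketch: reading camera-supplied metadata (EXIF) from a photo
# with Pillow. The file name is just a placeholder.
from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open("example.jpg")      # hypothetical input file
exif = img.getexif()                 # EXIF mapping; may be empty

for tag_id, value in exif.items():
    name = TAGS.get(tag_id, tag_id)  # map numeric tag id to a readable name
    print(f"{name}: {value}")        # e.g. DateTime, Model, Orientation
```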
However, it is not always evident how to exploit the information contained
in such metadata in an intelligent, generic or task-specific way. Linking
this information with the actual image content is still an open challenge.
The aim of this workshop is to offer a meeting opportunity for researchers,
content providers and related user-service providers to elaborate on the
needs and practices of digital image management, to share ideas that will
point to new directions on using metadata for image understanding and to
demonstrate related technology representative of the state of the art and
beyond.
Research paper topics that may be addressed include, but are not limited to:
- image metadata pattern discovery and mining
- interaction of image metadata and visual content
- image and video metadata enrichment
- automatic metadata creation
- hybrid collaborative and machine learning techniques for metadata
creation and/or fusion
- cross image-text categorization and retrieval
- image auto-captioning and annotation transfer
- learning user preferences, aesthetical and emotional measures from
opinion mining
- integration of camera settings with image categorization, retrieval or
enhancement
- application-specific issues of metadata mining:
  - integration of visual and geo-location information for improved
    virtual tourism
  - stock-photo
  - web-based image retrieval
Important Dates
Full Paper Submission: October 15, 2007
Authors Notification: November 7, 2007
Final Paper Submission and Registration: November 19, 2007
--
They've just been waiting in a mountain for the right moment:
http://modernthings.org/
PanImages Image Search Tool Speaks Hundreds Of Languages
http://www.lockergnome.com/nexus/news/2007/09/12/panimages-image-search-too…
quote:
PanImages' powerful brains were created by scanning more than 350
machine-readable online dictionaries. Some of these were
"wiktionaries," online multilingual dictionaries written by
volunteers. The PanImages software scans these dictionaries and uses
an algorithm to check the accuracy of the results. It then assembles
the results in a matrix that allows translation in combinations that
may never have been attempted — for instance, from Gujarati to
Lithuanian.
The actual search engine is here:
http://www.panimages.org/index.jsp?displang=eng
And the research paper detailing the algorithm and method is here:
http://turing.cs.washington.edu/papers/EtzioniMTSummit07.pdf
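The composition step described in the quote can be illustrated with a toy
sketch (an assumed illustration of the idea, not the PanImages code):
bilingual dictionary entries are merged into a graph, and a query term is
expanded across links that no single dictionary provides directly.

```python
# Toy sketch of the dictionary-composition idea described above (my own
# illustration, not the PanImages software): merge bilingual dictionary
# entries into a graph and expand a word into other languages.
from collections import defaultdict

# Hypothetical dictionary entries: (lang, word) pairs linked by a dictionary.
entries = [
    (("en", "moon"), ("lt", "mėnulis")),
    (("en", "moon"), ("gu", "chandra")),   # transliterated placeholder
    (("de", "Mond"), ("en", "moon")),
]

graph = defaultdict(set)
for a, b in entries:
    graph[a].add(b)
    graph[b].add(a)          # treat translation links as bidirectional

def expand(term, max_hops=2):
    """Collect translations reachable within max_hops dictionary links."""
    seen, frontier = {term}, {term}
    for _ in range(max_hops):
        frontier = {t for node in frontier for t in graph[node]} - seen
        seen |= frontier
    return seen - {term}

# Gujarati -> Lithuanian never appears in a single dictionary here, but the
# composed graph still connects them through English.
print(expand(("gu", "chandra")))
```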
An idea to use Wiktionary or interwiki links to improve image search
for Commons has long been kicked around. Maybe we could collaborate
with them to improve the Mayflower search engine for Commons? (Or else
ask them to index upload.wikimedia.org and pay attention to license
metadata :)) After all, we supplied them with all this useful data for
free.
cheers,
Brianna
--
They've just been waiting in a mountain for the right moment:
http://modernthings.org/
---------- Forwarded message ----------
From: Lars Aronsson <lars(a)aronsson.se>
Date: 06-Sep-2007 21:32
Subject: [Wikitech-l] Statistics on templates and references
To: wikitech-l(a)lists.wikimedia.org
A year ago, I wrote a little script for extracting template calls
from the XML database dump. The idea is that many templates are
infoboxes that provide structured information, such as the
population density of a country or bibliographic information in
book citations. The script is now updated to also extract ISBNs
and <ref> tags, as if these had been templates.
http://meta.wikimedia.org/wiki/User:LA2/Extraktor
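For anyone curious what such an extraction pass roughly looks like, here is a
small Python sketch along the lines Lars describes. It is an approximation of
the idea, not the Extraktor script itself, and the dump file name at the end
is only a placeholder.

```python
# Rough sketch of the kind of extraction described above (an approximation,
# not the Extraktor script): stream a Wikipedia XML dump and pull out
# template names, ISBNs and <ref> contents.
import bz2
import re
import xml.etree.ElementTree as ET

TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+)")            # {{Template name|...}}
ISBN_RE     = re.compile(r"ISBN[ =:]*([0-9Xx][0-9Xx -]{8,16})")
REF_RE      = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)

def extract(dump_path):
    """Yield (kind, value) pairs from every page's wikitext."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag.endswith("}text") and elem.text:
                wikitext = elem.text
                for m in TEMPLATE_RE.finditer(wikitext):
                    yield "template", m.group(1).strip()
                for m in ISBN_RE.finditer(wikitext):
                    yield "isbn", m.group(1)
                for m in REF_RE.finditer(wikitext):
                    yield "ref", m.group(1).strip()
            elem.clear()                                   # keep memory bounded

# Example: count the most common templates in a (hypothetical) dump file.
# from collections import Counter
# counts = Counter(v for k, v in extract("svwiki-pages-articles.xml.bz2")
#                  if k == "template")
# print(counts.most_common(50))
```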
I downloaded the reasonably small Wikipedia dumps for the
Scandinavian and Baltic languages and compiled some statistics,
such as the 50 most used templates, the 20 most cited ISBNs and
the 15 most common things to find inside <ref> tags.
http://meta.wikimedia.org/wiki/User:LA2/Extraktor_stats_200709
Of these languages, Swedish is the biggest (the uncompressed
database dump is 600 MB) followed by Finnish (481 MB) and
Norwegian (415 MB). But Finnish is far ahead in the use of
references and templates. One way to describe this degree of
structure is the size of my script's output compared to its input:
Language            Dump size   Extraktor output
-----------------   ---------   ----------------
lt = Lithuanian       152 MB     18.4 % or 28 MB
no = Norwegian        415 MB     16.9 %
nn = Nynorsk           85 MB     15.3 %
fi = Finnish          481 MB     14.1 %
is = Icelandic         66 MB     12.7 %
se = Sami             5.1 MB     10.8 %
da = Danish           209 MB     10.5 %
sv = Swedish          600 MB     10.2 %
fo = Faroese          7.8 MB      8.9 %
et = Estonian         116 MB      8.3 %
lv = Latvian           45 MB      8.2 %
fiu-vro = Võro        3.5 MB      6.4 %
I can't fully explain why the Lithuanian WP ranks so high.
Perhaps there is an opening <ref> that doesn't close, causing many
bytes to be included? If so, my script could help to find and
hunt down such errors. (I also tried the Yiddish Wikipedia and
got an even higher ranking, but I can't understand anything of
that language, so I'm totally clueless.)
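A quick check along the lines below might help hunt for such unbalanced
<ref> tags; this is a sketch of the idea, not part of the Extraktor script.

```python
# Quick-and-dirty check (a sketch, not part of Extraktor) for pages where
# <ref> tags are opened more often than they are closed, which would drag
# large amounts of text into the extracted output.
import re

OPEN_RE  = re.compile(r"<ref\b[^>/]*>", re.IGNORECASE)    # <ref> / <ref name=...>
CLOSE_RE = re.compile(r"</ref\s*>", re.IGNORECASE)
SELF_RE  = re.compile(r"<ref\b[^>]*/>", re.IGNORECASE)    # <ref name=... /> reuse

def unbalanced_refs(title, wikitext):
    opened = len(OPEN_RE.findall(wikitext))
    closed = len(CLOSE_RE.findall(wikitext))
    self_closed = len(SELF_RE.findall(wikitext))
    if opened != closed:
        return title, opened, closed, self_closed
    return None

# Feed it (title, wikitext) pairs from the dump and print the offenders:
# for title, wikitext in pages:
#     hit = unbalanced_refs(title, wikitext)
#     if hit:
#         print(hit)
```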
And the ranking doesn't quite capture the fact that the Finnish
Wikipedia contains 59365 <ref> tags and 15108 ISBNs, while Swedish
has 28956 and 10742, respectively, and the Norwegian 19078 and
9060. The main dividing line seems to be between the "good" examples above
12% and the laggards below it. Swedish and Danish should learn from
Norwegian and Finnish.
My conclusions are not final. The message is that the script
exists, and you are all free to help in digging out interesting
information.
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
--
They've just been waiting in a mountain for the right moment:
http://modernthings.org/
I'm writing a paper on cyberstalking and harassment, which I hope to
hand to the Foundation with a view to educating people about the
extent of the problem on Wikimedia.
I'd like to include some concrete examples of the cyberstalking or
offline stalking of users as a result of their participation in any of
the Wikimedia projects, and particularly where the target was picked
on because they were an administrator.
If you've been a target of this yourself, or if you know someone who
has, I'd appreciate hearing from you at slimvirgin at gmail dot com.
All replies will be received in strictest confidence. The target's
name and story will not be included in the final document without
consent, and all identifying details will be changed on request.
What I'm most interested in is how the cyberstalking or harassment
made you feel, and what happened when you tried to find support. I'd
like to hear whether it frightened you or made you anxious; whether it
affected your sleep, your appetite, or your health in any other way;
and whether you considered ending your association with the project
you were involved in, or did end it.
Many thanks,
Sarah
http://en.wikipedia.org/wiki/User:SlimVirgin
Hi all,
after quite some work into improving the DBpedia information
extraction framework, we have released a new version of the DBpedia
dataset today.
DBpedia is a community effort to extract structured information from
Wikipedia and to make this information available on the Web. DBpedia
allows you to ask sophisticated queries against Wikipedia and to link
other datasets on the Web to Wikipedia data.
The DBpedia dataset describes 1,950,000 "things", including at least
80,000 persons, 70,000 places, 35,000 music albums and 12,000 films. It
contains 657,000 links to images, 1,600,000 links to relevant external
web pages and 440,000 external links into other RDF datasets.
Altogether, the DBpedia dataset consists of around 103 million RDF
triples.
The dataset has been extracted from the July 2007 dumps of the English,
German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch,
Japanese, Chinese, Russian, Finnish and Norwegian versions of Wikipedia.
It contains descriptions in all these languages.
Compared to the last version, we did the following:
1. Improved the Data Quality
We increased the quality of the data by improving the DBpedia
information extraction algorithms. So if you had decided that the old
version of the dataset was too dirty for your application, please look
again; you will be surprised :-)
2. Third Classification Schema Added
We have added a third classification schema to the dataset. Besides
the Wikipedia categorization and the YAGO classification, concepts are
now also classified by associating them with WordNet synsets.
3. Geo-Coordinates
The dataset contains geo-coordinates for geographic locations.
Geo-coordinates are expressed using the W3C Basic Geo Vocabulary. This
enables location-based SPARQL queries.
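As a rough illustration of what such a location-based query could look like
(an assumed example using the W3C Basic Geo lat/long properties and the
public endpoint's JSON result format, not taken from the announcement),
something like this can be sent to the SPARQL endpoint:

```python
# Minimal sketch of a location-based query against the public DBpedia
# SPARQL endpoint, using the W3C Basic Geo lat/long properties. The
# bounding box values are arbitrary (roughly around Berlin).
import json
import urllib.parse
import urllib.request

QUERY = """
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?thing ?lat ?long WHERE {
  ?thing geo:lat ?lat ;
         geo:long ?long .
  FILTER (?lat  > 52.3 && ?lat  < 52.7 &&
          ?long > 13.1 && ?long < 13.8)
}
LIMIT 10
"""

params = urllib.parse.urlencode({
    "query": QUERY,
    "format": "application/sparql-results+json",   # assumed: ask for JSON results
})
with urllib.request.urlopen("http://dbpedia.org/sparql?" + params) as resp:
    results = json.load(resp)

for row in results["results"]["bindings"]:
    print(row["thing"]["value"], row["lat"]["value"], row["long"]["value"])
```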
4. RDF Links to other Open Datasets
We interlinked DBpedia with further open datasets and ontologies. The
dataset now contains 440,000 external RDF links into the Geonames,
Musicbrainz, WordNet, World Factbook, EuroStat, Book Mashup, DBLP
Bibliography and Project Gutenberg datasets. Altogether, the network
of interlinked data sources around DBpedia currently amounts to around
2 billion RDF triples, which are accessible as Linked Data on the Web.
The DBpedia dataset is licensed under the terms of the GNU Free Documentation
License. The dataset can be accessed online via a SPARQL endpoint and
as Linked Data. It can also be downloaded in the form of RDF dumps.
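As a small illustration of the Linked Data access path (an assumed example,
not from the announcement): dereference a resource URI with an RDF Accept
header and let the server redirect to the RDF document.

```python
# Sketch of Linked Data access: request a resource URI with an RDF Accept
# header and follow the redirect to the RDF description. The example
# resource is just an illustration.
import urllib.request

req = urllib.request.Request(
    "http://dbpedia.org/resource/Berlin",
    headers={"Accept": "application/rdf+xml"},     # ask for RDF, not HTML
)
with urllib.request.urlopen(req) as resp:
    print(resp.geturl())                           # final URL after redirects
    rdf_xml = resp.read()
    print(rdf_xml[:200])                           # first bytes of the RDF/XML
```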
Please refer to the DBpedia webpage for more information about the
dataset and its use cases:
http://dbpedia.org/
Many thanks to the following people for their excellent work:
1. Georgi Kobilarov (Freie Universität Berlin) who redesigned and
improved the extraction framework and implemented many of the
interlinking algorithms.
2. Piet Hensel (Freie Universität Berlin) who improved the infobox
extraction code and wrote the unit test suite.
3. Richard Cyganiak (Freie Universität Berlin) for his advice on
redesigning the architecture of the extraction framework and for
helping to solve many annoying Unicode and URI problems.
4. Zdravko Tashev (OpenLink Software) for his patience in trying several
times to import buggy versions of the dataset into Virtuoso.
5. OpenLink Software altogether for providing the server that hosts
the DBpedia SPARQL endpoint.
6. Sören Auer, Jens Lehmann and Jörg Schüppel (Universität Leipzig)
for the original version of the infobox extraction code.
7. Tom Heath and Peter Coetzee (Open University) for the RDFS version
of the YAGO class hierarchy.
8. Fabian M. Suchanek, Gjergji Kasneci (Max-Planck-Institut
Saarbrücken) for allowing us to integrate the YAGO classification.
9. Christian Becker (Freie Universität Berlin) for writing the
geo-coordinates and the homepage extractor.
10. Ivan Herman, Tim Berners-Lee, Rich Knopman and many others for
their bug reports.
Have fun exploring the new dataset :-)
Cheers
Chris
--
Chris Bizer
Freie Universität Berlin
Phone: +49 30 838 54057
Mail: chris(a)bizer.de
Web: www.bizer.de