[Foundation-l] [Commons-l] Wikidata

Erik Moeller erik at wikimedia.org
Wed Nov 24 19:25:14 UTC 2010


Hi all,

as you may know I've been involved in the structured data community
for a few years (through the original "Wikidata" proposal in 2004 as
well as architecting and developing OmegaWiki, together with the
OpenProgress team and others from 2005 to 2007). I've been following
Semantic MediaWiki, Freebase and other projects from the beginning.
You don't need to sell me on the value or importance of structured
data.

The problem space is very complex, especially when taking into account
that Wikimedia is a fully multilingual system. There is still
low-hanging fruit, especially for a project like Wikimedia Commons, but I
agree with Michael that a more holistic approach to how to access and
manage data in WMF projects is much preferable to, for example,
throwing SMW into some wikis and not others, etc.

When I joined WMF, I couldn't justify arguing that data tech projects
should take priority over, for example, the 2009-10 usability
initiative and continuing efforts in this area, especially given that
we still have only a tiny engineering staff. I don't believe that
structured data is going to be the principal driver of participation
-- that problem space is more about social and technical barriers to
entry, interaction with new users, mentoring, etc. And we're
continuing to fall behind the rest of the web in terms of usability.

That being said, it's clear that it's a key enabling technology
(including for _some_ usability improvements, although many of them
can be made without a full-fledged structured data support system). I
particularly think it has huge potential in bootstrapping small
languages by more closely interconnecting useful and translatable bits
of information (start a page about "Germany" in a new language and
immediately pull all relevant data, possibly including translations of
labels if available).

Danese and I have been working on a "Data Summit" this year to bring
together the key players in the structured data field (DBpedia,
SMW, etc.) and some of the research and analytics community.
Unfortunately we've had to reschedule it, but it'll happen in Q1 2011.
We're not going to be able to dedicate lots of resources to
engineering in this area in the near future, but since there are
already so many disparate efforts that focus on making WP data usable,
we do hope that we can partner up with others to move things forward.

In a nutshell, I think we should aim to establish a “Wikidata Commons”
project at data.wikimedia.org which serves all Wikimedia projects with
structured data in a language-neutral fashion, analogous to “Wikimedia
Commons” for multimedia files, and which becomes the central location
to curate, maintain and discuss such data. Wikidata Commons should
provide standard interfaces for querying, importing, and exporting
data. This project could be built incrementally (starting with clunky
but reasonably future-proof ways to manage and retrieve data).
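
To make "language-neutral" a bit more concrete, here is a rough sketch
of what a clunky first iteration might look like. It is purely
illustrative: the class names, the "Q1"/"P1" identifiers, the fallback
behaviour and the population value are all made-up placeholders, not an
existing design or API. Concepts and properties are stored once under
opaque IDs, labels are kept per language, and a wiki asks for the data
rendered in its own language.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A concept or property stored once under an opaque, language-neutral ID."""
    entity_id: str
    labels: dict = field(default_factory=dict)      # language code -> label
    statements: dict = field(default_factory=dict)  # property ID -> value

    def label(self, lang, fallbacks=()):
        # Try the requested language, then each fallback; the bare ID is
        # the last resort so that no single language is privileged.
        for code in (lang, *fallbacks):
            if code in self.labels:
                return self.labels[code]
        return self.entity_id


class Repository:
    """Hypothetical central store with import, export and query operations."""

    def __init__(self):
        self._entities = {}

    def import_entity(self, entity):
        self._entities[entity.entity_id] = entity

    def export_entity(self, entity_id):
        entity = self._entities[entity_id]
        return {
            "id": entity.entity_id,
            "labels": dict(entity.labels),
            "statements": dict(entity.statements),
        }

    def render(self, entity_id, lang, fallbacks=()):
        # Return the entity's label and its statements, with property IDs
        # replaced by labels in the requested language where available.
        entity = self._entities[entity_id]
        rendered = {}
        for prop_id, value in entity.statements.items():
            prop = self._entities.get(prop_id)
            name = prop.label(lang, fallbacks) if prop else prop_id
            rendered[name] = value
        return entity.label(lang, fallbacks), rendered


repo = Repository()
repo.import_entity(Entity("P1", labels={"en": "population", "de": "Einwohnerzahl"}))
repo.import_entity(Entity(
    "Q1",
    labels={"en": "Germany", "de": "Deutschland"},
    statements={"P1": 1000},  # placeholder value, not real data
))

print(repo.render("Q1", "de"))
print(repo.render("Q1", "sw", fallbacks=("en",)))
```

A new-language wiki with no labels of its own would still get the data
immediately through a fallback language, and translating the labels
later would not require touching the statements.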

The key challenges as I see them continue to be, as ever: 1)
maintaining predictable and reasonable system performance as the DB
scales, more and increasingly complex queries are performed, etc., 2)
consistently improving rather than degrading user experience, 3)
handling multilingual representations of all translatable content well
without giving undue prominence to any one language, 4) effectively
caching and purging data wherever it's used, 5)
versioning/transactioning relational data to be maximally useful and
conducive to collaboration.
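
As a rough illustration of challenge 4, one possible approach (an
assumption on my part, not a description of how MediaWiki does or
should do it) is to record which data items each rendered page used,
and to purge exactly those pages when an item changes. The class and
the page and item names below are hypothetical.

```python
from collections import defaultdict


class DependencyCache:
    """Caches rendered pages and purges them when data they used changes."""

    def __init__(self, render):
        # render is a callable(page, record_use) -> str; it must call
        # record_use(item_id) for every data item it reads.
        self._render = render
        self._cache = {}
        self._users = defaultdict(set)  # item ID -> pages that used it

    def get(self, page):
        if page not in self._cache:
            used = set()
            self._cache[page] = self._render(page, used.add)
            for item_id in used:
                self._users[item_id].add(page)
        return self._cache[page]

    def purge_item(self, item_id):
        # Drop every cached page that used this item. Dependency entries
        # for other items may linger; at worst that causes a redundant
        # purge later, never a stale page.
        pages = self._users.pop(item_id, set())
        for page in pages:
            self._cache.pop(page, None)
        return pages


data = {"Q1": "Berlin"}
render_calls = []


def render(page, record_use):
    record_use("Q1")
    render_calls.append(page)
    return f"{page}: capital={data['Q1']}"


cache = DependencyCache(render)
first = cache.get("de:Deutschland")
again = cache.get("de:Deutschland")   # served from cache, no re-render
data["Q1"] = "Bonn"
purged = cache.purge_item("Q1")
updated = cache.get("de:Deutschland")  # re-rendered with the new value
```

The hard part at Wikimedia scale is of course doing this across
hundreds of wikis and cache layers, which this sketch does not attempt.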

Earlier this week, Danese and I met with Denny Vrandecic from SMW,
who's recently put together a prototype called "Shortipedia" that
allows language-independent (using multilingual labels) annotation of
concepts with SMW-style properties through a minimal form-based
interface, interfacing with whichever triple store is configured for
SMW. It's still very much a hack, and he's aiming to clean it up for
the summit. But it looks potentially very interesting, and like a
concept we could rally energy behind. The data from such a repository
could then be pulled into WP templates, accessed through "wizards"
that auto-generate template data for new articles, etc.

Anyone who wants to advance the thinking in this space should also
consider what can be done today with Wikimedia Commons and SMW. Since
Wikimedia Commons is an intrinsically multilingual database with focus
on annotating individual files, its operational requirements are
somewhat different from those of most other projects. It would be
useful to have an instance of SMW running using a copy of the
Wikimedia Commons database and possibly Semantic Forms to see what
such annotation could look like in practice. Anyone with time and
technical skills can put together prototypes like this that'll help us
move forward.

Again, I think the likely path forward here is for us to ally
effectively with the key players in the space, rather than doing all
the work ourselves.

-- 
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate


