On Tue, Jul 3, 2012 at 9:32 AM, Michael Smethurst
<michael.smethurst(a)bbc.co.uk> wrote:
A few notes on the BBC's use of DBpedia which Dan thought might be of
interest to this list:
It's great to see real world use cases to inform the development
priorities of Wikidata.
Not sure how familiar you are with bbc web stuff so a brief introduction
=== some problems we've found when using dbpedia ===
I'm really looking forward to Wikidata, but it sounds like you might
not be familiar with Freebase which already provides solutions to some
of your problems today.
1. it's not really intended for data extraction. The semantics of
extraction depend on the infobox data and this isn't always applied
correctly. So
http://en.wikipedia.org/wiki/Fox_News_Channel and
http://en.wikipedia.org/wiki/Fox_News_Channel_controversies share the same
main infobox, meaning dbpedia sees them both as tv channels
This is partly (mostly) a social problem which Wikidata will need to
solve at the community level, rather than through technical means.
2. wikipedia tends to conflate many objects into a single item / page. Eg
http://en.wikipedia.org/wiki/Penny_Lane has composer details, duration
details and release information, conflating composition with recording
with release
They do the opposite too, splitting long articles about topics (e.g.
WW II) into arbitrary chunks.
For cases like Penny Lane, you may find that Freebase has teased apart
the conflated material. The Freebase topic:
http://www.freebase.com/view/en/penny_lane
is about the song, since that's principally what the Wikipedia article
is about and includes links to the various records, each with their
own times, etc.
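
To make the distinction concrete, here is a minimal sketch in Python of
a model that keeps the composition, its recordings and their releases
apart. The class and field names are hypothetical (this is not
Freebase's actual schema) and the values are placeholders, not real
data about any song.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Release:
    """A published product (single, album, ...) that carries a recording."""
    title: str


@dataclass
class Recording:
    """One performance captured on tape; duration belongs here, not on the song."""
    performer: str
    duration_seconds: Optional[int] = None
    releases: List[Release] = field(default_factory=list)


@dataclass
class Composition:
    """The song as written; composer credits belong here."""
    title: str
    composers: List[str] = field(default_factory=list)
    recordings: List[Recording] = field(default_factory=list)


def recording_durations(composition: Composition) -> List[Optional[int]]:
    """Return the duration of each recording of a composition, in order."""
    return [recording.duration_seconds for recording in composition.recordings]


# Placeholder values only; not real data about any song.
single = Release(title="Example Single")
album = Release(title="Example Album")
studio_take = Recording("Example Band", duration_seconds=180, releases=[single, album])
live_take = Recording("Example Band", duration_seconds=200)
song = Composition(
    title="Example Song",
    composers=["Composer A", "Composer B"],
    recordings=[studio_take, live_take],
)

print(recording_durations(song))
```

With this shape a single composition can carry several durations, one
per recording, which is exactly what a single infobox cannot express.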
Additionally, in cases where a conflated topic was split, there's
usually a split_to property that you can follow:
http://www.freebase.com/inspect/wikipedia/en/Penny_Lane
5. wikipedia uris are derived from the article title. If the article
title changes the uri changes. Dbpedia uris are derived from wikipedia
uris so they also change when wikipedia uris / titles change. This has
caused us no end of upsets. An example: bbc.co.uk/nature uses
wiki|dbpedia uri slugs. So
http://en.wikipedia.org/wiki/Stoat on wikipedia is
http://www.bbc.co.uk/nature/life/Stoat on bbc.co.uk
Apparently people in the UK call stoats stoats and people in the US call
them ermine (or the other way round) which led to an edit war on
wikipedia which caused the dbpedia uri to flip repeatedly and our
aggregations to break. We've had similar problems with music artists
(can't quite remember the details but seem to remember some arguments
about how the "and" should appear in Florence and the Machine:
http://en.wikipedia.org/wiki/Florence_and_the_Machine)
6. Titles do change often enough to cause us problems. Particularly names
for people
Nic (cced) has done some work on dbpedia lite (http://dbpedialite.org/)
which aims to provide stable identifiers for dbpedia concepts based on (I
think) wikipedia table row identifiers (which wikimedia do claim are
guaranteed)
Freebase includes both the Wikipedia textual keys (including
redirects) and the numeric keys, but the primary link to Wikipedia is
the numeric key because of just this problem. It's not completely
foolproof because there can be some pretty complex machinations by
Wikipedia editors repurposing articles or flipping them between
disambiguation pages and regular article pages, but it handles most of
the cases.
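
As a rough illustration of why keying on the numeric identifier helps,
here is a minimal Python sketch of an index that treats the numeric key
as primary and titles as mere aliases. The TopicIndex class, its method
names and the page id 12345 are all hypothetical; this is not how
Freebase or dbpedia lite is actually implemented.

```python
from typing import Dict, Optional, Set


class TopicIndex:
    """Index topics by a stable numeric page id; titles are only aliases."""

    def __init__(self) -> None:
        self._current_title: Dict[int, str] = {}
        self._known_titles: Dict[int, Set[str]] = {}
        self._title_to_id: Dict[str, int] = {}

    def record_title(self, page_id: int, title: str) -> None:
        """Note that page_id is now titled `title`, remembering older titles."""
        self._current_title[page_id] = title
        self._known_titles.setdefault(page_id, set()).add(title)
        self._title_to_id[title] = page_id

    def resolve(self, title: str) -> Optional[int]:
        """Map any title ever seen (current or former) to its page id."""
        return self._title_to_id.get(title)

    def current_title(self, page_id: int) -> str:
        return self._current_title[page_id]


index = TopicIndex()
HYPOTHETICAL_PAGE_ID = 12345  # placeholder, not the article's real id

# Simulate the title flipping back and forth during an edit war.
for title in ["Stoat", "Ermine", "Stoat"]:
    index.record_title(HYPOTHETICAL_PAGE_ID, title)

# Links built from either title still land on the same topic.
print(index.resolve("Ermine") == index.resolve("Stoat"))
print(index.current_title(HYPOTHETICAL_PAGE_ID))
```

Anything aggregated against the numeric key survives a rename; only the
display title changes. As noted above, this does not cover articles that
are repurposed outright.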
7. wikipedia has a policy that aims toward one outbound link per infobox.
So for a person or organisation page, eg, they tend to settle on that
person / org's homepage and not their social media accounts or web
presence(s) elsewhere. Which makes dbpedia less useful as an identifier
triangulation point
Freebase includes a variety of links to Twitter handles, MySpace
pages, home pages, etc, in addition to the links which are imported
from Wikipedia.
I'm not arguing that Freebase is or should be a replacement for
Wikidata, since Wikidata will be solving some very useful problems for
Freebase's Wikipedia imports, but if you've got these problems today
and want a solution now, Freebase can probably help. And, of course,
it's all open data.
Tom