On Tue, Jul 3, 2012 at 9:32 AM, Michael Smethurst <michael.smethurst@bbc.co.uk> wrote:
A few notes on the BBC's use of DBpedia which Dan thought might be of interest to this list:
It's great to see real world use cases to inform the development priorities of Wikidata.
Not sure how familiar you are with bbc web stuff so a brief introduction
=== some problems we've found when using dbpedia ===
I'm really looking forward to Wikidata, but it sounds like you might not be familiar with Freebase which already provides solutions to some of your problems today.
- it's not really intended for data extraction. The semantics of extraction depend on the infobox data and this isn't always applied correctly. So http://en.wikipedia.org/wiki/Fox_News_Channel and http://en.wikipedia.org/wiki/Fox_News_Channel_controversies share the same main infobox, meaning dbpedia sees them both as tv channels.
This is partly (mostly) a social problem which Wikidata will need to solve at the community level, rather than through technical means.
- wikipedia tends to conflate many objects into a single item / page. E.g. http://en.wikipedia.org/wiki/Penny_Lane has composer details, duration details and release information, conflating composition with recording with release.
They do the opposite too, splitting long articles about topics (e.g. WW II) into arbitrary chunks.
For cases like Penny Lane, you may find that Freebase has teased apart the conflated material. The Freebase topic http://www.freebase.com/view/en/penny_lane is about the song, since that's principally what the Wikipedia article is about, and it includes links to the various records, each with their own times, etc.
Additionally, in cases where a conflated topic was split, there's usually a split_to property that you can follow: http://www.freebase.com/inspect/wikipedia/en/Penny_Lane
- wikipedia uris are derived from the article title. If the article title changes, the uri changes. Dbpedia uris are derived from wikipedia uris, so they also change when wikipedia uris / titles change. This has caused us no end of upsets. An example: bbc.co.uk/nature uses wiki|dbpedia uri slugs. So http://en.wikipedia.org/wiki/Stoat on wikipedia is http://www.bbc.co.uk/nature/life/Stoat on bbc.co.uk. Apparently people in the UK call stoats stoats and people in the US call them ermine (or the other way round), which led to an edit war on wikipedia, which caused the dbpedia uri to flip repeatedly and our aggregations to break. We've had similar problems with music artists (can't quite remember the details, but I seem to remember some arguments about how the "and" should appear in Florence and the Machine: http://en.wikipedia.org/wiki/Florence_and_the_Machine).
- Titles do change often enough to cause us problems, particularly names for people. Nic (cc'd) has done some work on dbpedia lite (http://dbpedialite.org/), which aims to provide stable identifiers for dbpedia concepts based on (I think) wikipedia table row identifiers (which wikimedia do claim are guaranteed).
Freebase includes both the Wikipedia textual keys (including redirects) and the numeric keys, but the primary link to Wikipedia is the numeric key because of just this problem. It's not completely foolproof because there can be some pretty complex machinations by Wikipedia editors repurposing articles or flipping them between disambiguation pages and regular article pages, but it handles most of the cases.
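To make the idea concrete, here is a minimal sketch of keying on the numeric page ID while keeping title-derived slugs as aliases. It is only an illustration, not how Freebase or dbpedia lite actually implement it: the `PageIndex` class, its method names, and the page ID 12345 are all made up for the example, and it does not handle the harder cases mentioned above (repurposed articles, flips to disambiguation pages).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


def title_to_slug(title: str) -> str:
    """Title-derived slug in the Wikipedia URL style: spaces become underscores."""
    return title.strip().replace(" ", "_")


@dataclass
class PageIndex:
    """Hypothetical index: numeric page ID is the primary key, slugs are aliases."""
    current_slug: Dict[int, str] = field(default_factory=dict)
    slug_to_id: Dict[str, int] = field(default_factory=dict)

    def add(self, page_id: int, title: str) -> None:
        slug = title_to_slug(title)
        self.current_slug[page_id] = slug
        self.slug_to_id[slug] = page_id

    def rename(self, page_id: int, new_title: str) -> None:
        # The old slug is deliberately kept as an alias (like a redirect),
        # so anything stored against it still resolves to the same ID.
        self.add(page_id, new_title)

    def resolve(self, slug: str) -> Optional[int]:
        return self.slug_to_id.get(slug)


index = PageIndex()
index.add(12345, "Stoat")       # 12345 is a made-up page ID
index.rename(12345, "Ermine")   # simulate one round of the title edit war
stoat_id = index.resolve("Stoat")
ermine_id = index.resolve("Ermine")
```

After the rename, both `stoat_id` and `ermine_id` point at the same numeric key, so an aggregation stored against the ID survives the title flipping back and forth.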
- wikipedia has a policy that aims toward one outbound link per infobox. So for a person or organisation page, for example, they tend to settle on that person's / org's homepage and not their social media accounts or web presence(s) elsewhere, which makes dbpedia less useful as an identifier triangulation point.
Freebase includes a variety of links to Twitter handles, MySpace pages, home pages, etc, in addition to the links which are imported from Wikipedia.
I'm not arguing that Freebase is or should be a replacement for Wikidata, since Wikidata will be solving some very useful problems for Freebase's Wikipedia imports, but if you've got these problems today and want a solution now, Freebase can probably help. And, of course, it's all open data.
Tom