[Wikidata-l] DBpedia usage in the bbc - Wikidata

3 Jul 2012

Hello

A few notes on the BBC's use of DBpedia which Dan thought might be of
interest to this list:

Not sure how familiar you are with bbc web stuff, so here's a brief introduction:

<skip-me>
We have a large and somewhat sprawling website with 2 main sections: news
article related stuff (including sports) and programme related stuff (tv and
radio). In between these sections are various other domain specific bits
(http://www.bbc.co.uk/music, http://www.bbc.co.uk/food,
http://www.bbc.co.uk/nature etc)

In the main we have actual content / data for news articles and programmes.
Most of the other bits of co.uk are really just different ways of cutting
this content / new aggregations. Because we don't have data for these
domains we borrow from elsewhere (mostly from the LOD cloud). So /music is
based on a backbone of musicbrainz data, /nature is based on numerous data
sources (open and not so open) all tied together with dbpedia identifiers...

In the main we don't really use dbpedia as a data source but rather as a
source of identifiers to triangulate with other data sources

So for example, we have 2 tools for "tagging" programmes with dbpedia
identifiers. Short clips are tagged with one tool using dbpedia information
resource uris, full episodes are tagged with another tool using dbpedia
non-information resource uris (< don't ask)

Taking /music as an example: because it's based on musicbrainz and because
musicbrainz includes wikipedia uris for artists we can easily derive dbpedia
uris (of whatever flavour) and query the programme systems for programmes
tagged with that artist's dbpedia uri
</skip-me>
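
As a rough illustration of the /music derivation (not the BBC's actual code),
the step from a wikipedia uri to a dbpedia uri might look something like the
Python sketch below. The function name and the "flavour" labels are made up
for the example. It assumes English wikipedia urls only, and that the dbpedia
uri reuses the wikipedia slug unchanged, which glosses over any encoding
differences between the two.

```python
from urllib.parse import urlsplit

# Two dbpedia uri flavours for the same article slug.
DBPEDIA_PREFIXES = {
    "resource": "http://dbpedia.org/resource/",  # non-information resource (the thing itself)
    "page": "http://dbpedia.org/page/",          # information resource (a page about the thing)
}


def dbpedia_uri(wikipedia_url, flavour="resource"):
    """Derive a dbpedia uri from an English wikipedia article url."""
    parts = urlsplit(wikipedia_url)
    if parts.netloc != "en.wikipedia.org" or not parts.path.startswith("/wiki/"):
        raise ValueError("not an English wikipedia article url: %s" % wikipedia_url)
    # Keep the slug exactly as wikipedia spells it (urlsplit does not decode it).
    slug = parts.path[len("/wiki/"):]
    if not slug:
        raise ValueError("no article title in url: %s" % wikipedia_url)
    return DBPEDIA_PREFIXES[flavour] + slug


artist = "http://en.wikipedia.org/wiki/Florence_and_the_Machine"
print(dbpedia_uri(artist))          # uri used to look up tagged full episodes
print(dbpedia_uri(artist, "page"))  # uri used to look up tagged short clips
```

Because the slug is copied straight from the article title, any title change
upstream changes both flavours of dbpedia uri at once.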

=== some problems we've found when using dbpedia ===

1. wikipedia isn't really intended for data extraction. The semantics of
the extraction depend on the infobox data and this isn't always applied
correctly. So http://en.wikipedia.org/wiki/Fox_News_Channel and
http://en.wikipedia.org/wiki/Fox_News_Channel_controversies share the same
main infobox, meaning dbpedia sees them both as tv channels

2. wikipedia tends to conflate many objects into a single item / page. Eg
http://en.wikipedia.org/wiki/Penny_Lane has composer details, duration
details and release information conflating composition with recording with
release

3. the data extraction is a bit flakey in parts. Mainly because it's been
done by a small team and it covers so many different domains.

4. wikipedia doesn't do redirects properly. So
http://en.wikipedia.org/wiki/Spring_watch and
http://en.wikipedia.org/wiki/Autumn_watch are based on the same data /
return the same content and are flagged as a redirect internally but they
don't actually 30x. This makes it hard for editorial staff to know which uri
to "tag" with

5. wikipedia uris are derived from the article title. If the article title
changes the uri changes. Dbpedia uris are derived from wikipedia uris so
they also change when wikipedia uris / titles change. This has caused us no
end of upsets. An example: bbc.co.uk/nature uses wiki|dbpedia uri slugs. So
http://en.wikipedia.org/wiki/Stoat on wikipedia is
http://www.bbc.co.uk/nature/life/Stoat on bbc.co.uk
Apparently people in the UK call stoats stoats and people in the US call
them ermine (or the other way round), which led to an edit war on wikipedia
which caused the dbpedia uri to flip repeatedly and our aggregations to
break. We've had similar problems with music artists (can't quite remember
the details but seem to remember some arguments about how the "and" should
appear in Florence and the Machine:
http://en.wikipedia.org/wiki/Florence_and_the_Machine)

6. Titles do change often enough to cause us problems, particularly names
for people.
Nic (cced) has done some work on dbpedia lite (http://dbpedialite.org/)
which aims to provide stable identifiers for dbpedia concepts based on (I
think) wikipedia table row identifiers (which wikimedia do claim are
guaranteed)

7. wikipedia has a policy that aims toward one outbound link per infobox. So
for a person or organisation page, for example, they tend to settle on that
person's / org's homepage and not their social media accounts or web
presence(s) elsewhere, which makes dbpedia less useful as an identifier
triangulation point

=== end of problems (at least the ones I can remember) ===
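
To make problems 5 and 6 concrete, here is a toy Python sketch of keying
aggregations on a stable identifier rather than on a title-derived slug. It
is an assumption-laden illustration, not how dbpedia lite or any BBC system
actually works: the class, the numeric id 1001 and the programme ids are all
invented, and the Stoat to Ermine rename is hypothetical.

```python
class TagIndex:
    """Toy index of tagged programmes keyed by a stable id, not by title."""

    def __init__(self):
        self._id_for_title = {}   # every title ever seen -> stable id
        self._current_title = {}  # stable id -> current article title
        self._programmes = {}     # stable id -> set of programme ids

    def set_title(self, stable_id, title):
        """Set the current title; earlier titles keep resolving to the same id."""
        self._id_for_title[title] = stable_id
        self._current_title[stable_id] = title

    def tag(self, title, programme_id):
        """Tag a programme using whichever title the editor happened to pick."""
        stable_id = self._id_for_title[title]
        self._programmes.setdefault(stable_id, set()).add(programme_id)

    def programmes_for(self, title):
        """Return all programmes for the concept, whatever title is used."""
        stable_id = self._id_for_title[title]
        return sorted(self._programmes.get(stable_id, set()))

    def current_slug(self, title):
        """Return the title the article currently lives under."""
        return self._current_title[self._id_for_title[title]]


index = TagIndex()
index.set_title(1001, "Stoat")       # 1001 is a made-up stable id
index.tag("Stoat", "programme-a")    # made-up programme id
index.set_title(1001, "Ermine")      # hypothetical rename after an edit war
index.tag("Ermine", "programme-b")

print(index.programmes_for("Stoat"))  # both programmes, despite the rename
print(index.current_slug("Stoat"))    # the slug to link to today
```

With a title-keyed index the rename would split the programmes across two
keys; here the title is only a mutable label on the stable id, so a rename
(or a flip back) leaves the aggregation intact.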

So I think we'd be interested in wikidata for 2 (maybe 3) reasons:
1. as a source of data for domains where there's no established (open)
authority (eg the equivalent of musicbrainz for films)
2. as a better, more stable source of identifiers to triangulate to other
data sources
?3?. Possibly as a place to contribute some of our data (eg we're
donating our classical music data to musicbrainz; there may be data we have
that would be useful to wikidata)

Have glanced quickly at the proposed wikidata uri scheme
(http://meta.wikimedia.org/wiki/Wikidata/Notes/URI_scheme#Proposal_for_Wikidata)
and
<snip>
http://{site}.wikidata.org/item/{Title} is a semi-persistent convenience URI
for the item about the article Title on the selected site
Semi-persistent refers to the fact that Wikipedia titles can change over
time, although this happens rarely
</snip>
Not sure on the definition of "rarely" but I know it's caused us
problems.

Wondering if the id in http://wikidata.org/id/Q{id} is the wikipedia row ID
(as used by dbpedialite)? Also wondering why there's a different set of URIs
for machine-readable access rather than just using content negotiation?

Cheers
Michael
