Hello
A few notes on the BBC's use of DBpedia which Dan thought might be of interest to this list:
Not sure how familiar you are with BBC web stuff, so a brief introduction:
We have a large and somewhat sprawling website with two main sections: news article related stuff (including sport) and programme related stuff (TV and radio). In between these sections are various other domain-specific bits (http://www.bbc.co.uk/music, http://www.bbc.co.uk/food, http://www.bbc.co.uk/nature etc.)
In the main we have actual content / data for news articles and programmes. Most of the other bits of co.uk are really just different ways of cutting this content / new aggregations. Because we don't have data for these domains we borrow from elsewhere (mostly from the LOD cloud). So /music is based on a backbone of musicbrainz data, /nature is based on numerous data sources (open and not so open) all tied together with dbpedia identifiers...
In the main we don't really use dbpedia as a data source but rather as a source of identifiers to triangulate with other data sources
So for example, we have 2 tools for "tagging" programmes with dbpedia identifiers. Short clips are tagged with one tool using dbpedia information resource uris, full episodes are tagged with another tool using dbpedia non-information resource uris (< don't ask)
Taking /music as an example: because it's based on MusicBrainz, and because MusicBrainz includes Wikipedia URIs for artists, we can easily derive DBpedia URIs (of whatever flavour) and query the programme systems for programmes tagged with that artist's DBpedia URI
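Since MusicBrainz supplies Wikipedia URIs and DBpedia URIs mirror Wikipedia's slugs, the derivation is just a string rewrite. A minimal sketch (the function name is our own invention, not any BBC or MusicBrainz API):

```python
# Derive a DBpedia resource URI from a Wikipedia article URL, as supplied
# by MusicBrainz artist records. Purely illustrative: the slug carries over
# unchanged, which is exactly why title changes break downstream links.
from urllib.parse import unquote, urlparse

def dbpedia_uri_from_wikipedia(wikipedia_url):
    path = urlparse(wikipedia_url).path           # e.g. "/wiki/Penny_Lane"
    title = unquote(path.split("/wiki/", 1)[1])   # e.g. "Penny_Lane"
    return "http://dbpedia.org/resource/" + title

print(dbpedia_uri_from_wikipedia("http://en.wikipedia.org/wiki/Penny_Lane"))
# http://dbpedia.org/resource/Penny_Lane
```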
=== some problems we've found when using dbpedia ===
1. Wikipedia isn't really intended as a source for data extraction. The semantics of extraction depend on the infobox data and these aren't always applied correctly. So http://en.wikipedia.org/wiki/Fox_News_Channel and http://en.wikipedia.org/wiki/Fox_News_Channel_controversies share the same main infobox, meaning DBpedia sees them both as TV channels
2. wikipedia tends to conflate many objects into a single item / page. Eg http://en.wikipedia.org/wiki/Penny_Lane has composer details, duration details and release information conflating composition with recording with release
3. The data extraction is a bit flaky in parts, mainly because it's been done by a small team and it covers so many different domains.
4. Wikipedia doesn't do redirects properly. So http://en.wikipedia.org/wiki/Spring_watch and http://en.wikipedia.org/wiki/Autumn_watch are based on the same data / return the same content and are flagged as a redirect internally, but they don't actually return a 30x. This makes it confusing for editorial staff to know which URI to "tag" with
5. Wikipedia URIs are derived from the article title. If the article title changes, the URI changes. DBpedia URIs are derived from Wikipedia URIs, so they also change when Wikipedia URIs / titles change. This has caused us no end of upsets. An example: bbc.co.uk/nature uses wiki|dbpedia URI slugs, so http://en.wikipedia.org/wiki/Stoat on Wikipedia is http://www.bbc.co.uk/nature/life/Stoat on bbc.co.uk. Apparently people in the UK call stoats stoats and people in the US call them ermine (or the other way round), which led to an edit war on Wikipedia, which caused the DBpedia URI to flip repeatedly and our aggregations to break. We've had similar problems with music artists (can't quite remember the details but seem to remember some arguments about how the "and" should appear in Florence and the Machine: http://en.wikipedia.org/wiki/Florence_and_the_Machine)
6. Titles do change often enough to cause us problems, particularly names for people. Nic (cc'ed) has done some work on dbpedia lite (http://dbpedialite.org/), which aims to provide stable identifiers for DBpedia concepts based on (I think) wikipedia table row identifiers (which Wikimedia do claim are guaranteed)
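For what it's worth, the stable-identifier lookup dbpedia lite depends on can be sketched against the standard MediaWiki API: the numeric page ID survives renames, while the title-based URI does not. The helper names below are our own; only the API parameters (action=query, prop=info) are real MediaWiki ones:

```python
# Fetch a Wikipedia article's numeric page ID, the identifier dbpedialite
# uses to stay stable across title changes. Building the URL and parsing
# the response are split out so the parsing can be checked without network.
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def page_id_query(title):
    """URL that asks the MediaWiki API for basic page info as JSON."""
    return API + "?" + urlencode(
        {"action": "query", "prop": "info", "titles": title, "format": "json"})

def extract_page_id(response_body):
    """Pull the page ID out of the API response (the key of the pages map).
    Missing pages come back with the sentinel key "-1"."""
    pages = json.loads(response_body)["query"]["pages"]
    return int(next(iter(pages)))

# Live use would be something like:
#   urllib.request.urlopen(page_id_query("Stoat")).read()
```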
7. Wikipedia has a policy that aims toward one outbound link per infobox. So for a person or organisation page, e.g., they tend to settle on that person / org's homepage and not their social media accounts or web presence(s) elsewhere, which makes DBpedia less useful as an identifier triangulation point
=== end of problems (at least the ones I can remember) ===
So I think we'd be interested in wikidata for 2 (maybe 3) reasons:
1. as a source of data for domains where there's no established (open) authority (eg the equivalent of musicbrainz for films)
2. as a better, more stable source of identifiers to triangulate to other data sources
?3?. possibly as a place to contribute some of our data (eg we're donating our classical music data to musicbrainz; there may be data we have that would be useful to wikidata)
Have glanced quickly at the proposed wikidata uri scheme (http://meta.wikimedia.org/wiki/Wikidata/Notes/URI_scheme#Proposal_for_Wikidata) and <snip> http://{site}.wikidata.org/item/{Title} is a semi-persistent convenience URI for the item about the article Title on the selected site. Semi-persistent refers to the fact that Wikipedia titles can change over time, although this happens rarely </snip> Not sure on the definition of rarely but I know it's caused us problems.
Wondering if the id in http://wikidata.org/id/Q{id} is the wikipedia row ID (as used by dbpedialite)? Also wondering why there's a different set of URIs for machine-readable access rather than just using content negotiation?
Cheers Michael
http://www.bbc.co.uk/ This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated. If you have received it in error, please delete it from your system. Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately. Please note that the BBC monitors e-mails sent or received. Further communication will signify your consent to this.
On Tue, Jul 3, 2012 at 9:32 AM, Michael Smethurst michael.smethurst@bbc.co.uk wrote:
A few notes on the BBC's use of DBpedia which Dan thought might be of interest to this list:
It's great to see real world use cases to inform the development priorities of Wikidata.
Not sure how familiar you are with bbc web stuff so a brief introduction
=== some problems we've found when using dbpedia ===
I'm really looking forward to Wikidata, but it sounds like you might not be familiar with Freebase which already provides solutions to some of your problems today.
- it's not really intended for use for data extraction. The semantics of
extraction depend on the infobox data and this isn't always applied correctly. So http://en.wikipedia.org/wiki/Fox_News_Channel and http://en.wikipedia.org/wiki/Fox_News_Channel_controversies share the same main infobox meaning dbpedia sees them both as tv channels
This is partly (mostly) a social problem which Wikidata will need to solve at the community level, rather than through technical means.
- wikipedia tends to conflate many objects into a single item / page. Eg
http://en.wikipedia.org/wiki/Penny_Lane has composer details, duration details and release information conflating composition with recording with release
They do the opposite too, splitting long articles about topics (e.g. WW II) into arbitrary chunks.
For cases like Penny Lane, you may find that Freebase has teased apart the conflated material. The Freebase topic: http://www.freebase.com/view/en/penny_lane is about the song, since that's principally what the Wikipedia article is about and includes links to the various records, each with their own times, etc.
Additionally, in cases where a conflated topic was split, there's usually a split_to property that you can follow: http://www.freebase.com/inspect/wikipedia/en/Penny_Lane
- wikipedia uris are derived from the article title. If the article title
changes the uri changes. Dbpedia uris are derived from wikipedia uris so they also change when wikipedia uris / titles change. This has caused us no end of upsets. An example: bbc.co.uk/nature uses wiki|dbpedia uri slugs. So http://en.wikipedia.org/wiki/Stoat on wikipedia is http://www.bbc.co.uk/nature/life/Stoat on bbc.co.uk. Apparently people in the UK call stoats stoats and people in the US call them ermine (or the other way round) which led to an edit war on wikipedia which caused the dbpedia uri to flip repeatedly and our aggregations to break. We've had similar problems with music artists (can't quite remember the details but seem to remember some arguments about how the "and" should appear in Florence and the Machine http://en.wikipedia.org/wiki/Florence_and_the_Machine)
- Titles do change often enough to cause us problems. Particularly names
for people Nic (cced) has done some work on dbpedia lite (http://dbpedialite.org/) which aims to provide stable identifiers for dbpedia concepts based on (I think) wikipedia table row identifiers (which wikimedia do claim are guaranteed)
Freebase includes both the Wikipedia textual keys (including redirects) and the numeric keys, but the primary link to Wikipedia is the numeric key because of just this problem. It's not completely foolproof because there can be some pretty complex machinations by Wikipedia editors repurposing articles or flipping them between disambiguation pages and regular article pages, but it handles most of the cases.
- wikipedia has a policy that aims toward one outbound link per infobox. So
for a person or organisation page eg they tend to settle on that person / org's homepage and not their social media accounts or web presence(s) elsewhere. Which makes dbpedia less useful as an identifier triangulation point
Freebase includes a variety of links to Twitter handles, MySpace pages, home pages, etc, in addition to the links which are imported from Wikipedia.
I'm not arguing that Freebase is or should be a replacement for Wikidata, since Wikidata will be solving some very useful problems for Freebase's Wikipedia imports, but if you've got these problems today and want a solution now, Freebase can probably help. And, of course, it's all open data.
Tom
On 3 July 2012 19:19, Tom Morris tfmorris@gmail.com wrote:
A few notes on the BBC's use of DBpedia which Dan thought might be of interest to this list:
It's great to see real world use cases to inform the development priorities of Wikidata.
Amen to that.
=== some problems we've found when using dbpedia ===
- it's not really intended for use for data extraction. The semantics of
extraction depend on the infobox data and this isn't always applied correctly. So http://en.wikipedia.org/wiki/Fox_News_Channel and http://en.wikipedia.org/wiki/Fox_News_Channel_controversies share the same main infobox meaning dbpedia sees them both as tv channels
This is partly (mostly) a social problem which Wikidata will need to solve at the community level, rather than through technical means.
Indeed; and if we can better explain these issues to the community we might be more successful at persuading the blockers that such matters are important.
Of course, there will always be some Luddites who see Wikipedia as a prose encyclopedia rather than the database of encyclopedic content which it really is ;-)
On 03/07/2012 19:19, "Tom Morris" tfmorris@gmail.com wrote:
On Tue, Jul 3, 2012 at 9:32 AM, Michael Smethurst michael.smethurst@bbc.co.uk wrote:
I'm really looking forward to Wikidata, but it sounds like you might not be familiar with Freebase which already provides solutions to some of your problems today.
Hi Tom
Short answer is we are familiar with Freebase and have talked about using it, but haven't done so for a variety of reasons. Mainly because other data sets we use (like MusicBrainz) tend to link to Wikipedia and not Freebase (except through Wikipedia)
Should have probably split my list of problems into 2 parts:
- using the data from dbpedia
- using identifiers from dbpedia
Freebase would solve some of the data normalisation problems but, as I said, mainly we use dbpedia as a source of identifiers and for identifier triangulation. We use a standard dump of MusicBrainz (rather than the MusicBrainz data in Freebase), so to triangulate to Freebase we'd need to go through DBpedia and rely on its identifiers being stable
Cheers michael
Hello Michael,
thank you for your input, this is extremely valuable.
In general I expect that Wikidata will serve your needs better than an extraction from Wikipedia could. First, yes, we will have more stable identifiers. Second, it should be better at identifying items of interest. Some of the reasons why several meanings are conflated into one article or spread over several articles in Wikipedia is that it simply makes sense for a text encyclopedia. I don't see a reason for Wikidata doing the same.
I do not expect Wikidata to solve all problems. In some glorious future, Wikidata will have a community. This community will decide on criteria for inclusion, both with regards to the coverage of items and with regards to what they are saying about them. The community will decide on the kind of sources they accept. Etc.
(Actually, "decide" is too nice a word for the process I expect will unfold... )
We will keep the problems you mentioned in mind, and I fully think that we will improve on every single one of them.
2012/7/3 Michael Smethurst michael.smethurst@bbc.co.uk:
So I think we'd be interested in wikidata for 2 (maybe 3) reasons:
- as a source of data for domains where there's no established (open)
authority (eg the equivalent of musicbrainz for films) 2. as a better, more stable source of identifiers to triangulate to other data sources
Yes, I expect that both use cases will be covered by Wikidata.
?3?. Possibly as a place to contribute some of our data (eg we're donating our classical music data to musicbrainz; there may be data we have that would be useful to wikidata)
It will be up to the community to accept data donations -- the development team does not speak for the community. Personally I would be thrilled to see such donations happen. See also:
http://meta.wikimedia.org/wiki/Wikidata/FAQ#I_have_a_lot_of_data_to_contribute._How_can_I_do_that.3F
Have glanced quickly at the proposed wikidata uri scheme (http://meta.wikimedia.org/wiki/Wikidata/Notes/URI_scheme#Proposal_for_Wikidata) and
<snip> http://{site}.wikidata.org/item/{Title} is a semi-persistent convenience URI for the item about the article Title on the selected site. Semi-persistent refers to the fact that Wikipedia titles can change over time, although this happens rarely </snip> Not sure on the definition of rarely but I know it's caused us problems.
Fully agree. But they make for nice looking URIs. The canonical URI though is the ID-based one, and these are stable. The pretty ones are for convenience only. I will take a look at the note to see if this needs to be made more explicit.
Wondering if the id in http://wikidata.org/id/Q%7Bid%7D is the wikipedia row ID (as used by dbpedialite)? Also wondering why there's a different set of URIs for machine-readable access rather than just using content negotiation?
No it is not. There is no such thing as the "wikipedia row ID", what you mean is the "page ID on the English Wikipedia". As there are plenty of items that have articles only in Wikipedia other than English, a reliance on the English Page ID would be problematic. We introduce new IDs for Wikidata, but we will provide mappings to page IDs in the different Wikipedia language editions.
Thank you again for your input, and I hope the answers help.
Cheers, Denny
On 04/07/2012 10:48, "Denny Vrandečić" denny.vrandecic@wikimedia.de wrote:
Hello Michael,
thank you for your input, this is extremely valuable.
In general I expect that Wikidata will serve your needs better than an extraction from Wikipedia could. First, yes, we will have more stable identifiers. Second, it should be better at identifying items of interest. Some of the reasons why several meanings are conflated into one article or spread over several articles in Wikipedia is that it simply makes sense for a text encyclopedia. I don't see a reason for Wikidata doing the same.
I do not expect Wikidata to solve all problems. In some glorious future, Wikidata will have a community. This community will decide on criteria for inclusion, both with regards to the coverage of items and with regards to what they are saying about them. The community will decide on the kind of sources they accept. Etc.
(Actually, "decide" is too nice a word for the process I expect will unfold... )
We will keep the problems you mentioned in mind, and I fully think that we will improve on every single one of them.
Look forward to seeing it unfold :-)
2012/7/3 Michael Smethurst michael.smethurst@bbc.co.uk:
So I think we'd be interested in wikidata for 2 (maybe 3) reasons:
- as a source of data for domains where there's no established (open)
authority (eg the equivalent of musicbrainz for films) 2. as a better, more stable source of identifiers to triangulate to other data sources
Yes, I expect that both use cases will be covered by Wikidata.
?3?. Possibly as a place to contribute some of our data (eg we're donating our classical music data to musicbrainz; there may be data we have that would be useful to wikidata)
It will be up to the community to accept data donations -- the development team does not speak for the community.
Yes, that goes for musicbrainz too. We can offer data but it's up to the community whether or not they accept it
Personally I would be thrilled to see such donations happen. See also:
Have glanced quickly at the proposed wikidata uri scheme (http://meta.wikimedia.org/wiki/Wikidata/Notes/URI_scheme#Proposal_for_Wikidata) and
<snip> http://{site}.wikidata.org/item/{Title} is a semi-persistent convenience URI for the item about the article Title on the selected site. Semi-persistent refers to the fact that Wikipedia titles can change over time, although this happens rarely </snip> Not sure on the definition of rarely but I know it's caused us problems.
Fully agree. But they make for nice looking URIs.
Aesthetic concerns about uris tend to make me shiver :-)
The canonical URI though is the ID-based one, and these are stable. The pretty ones are for convenience only. I will take a look at the note to see if this needs to be made more explicit.
Think it is explicit. Just that there are so many flavours of URI knocking about that it feels a bit confusing. The separation of the human-readable and the machine-readable feels like it's following the dbpedia design pattern and conflating the NIR > IR step with the content negotiation, which feels (to me) like a mistake.
Have talked about this is the past on the LOD list so to save typing: http://lists.w3.org/Archives/Public/public-lod/2012Mar/0337.html
Not sure putting /data in a URI is ever a good idea. Shouldn't whether you want data or not be decided by your Accept headers? Same for ?format=json etc.
For reference, we use hash URIs for things but only reference those in RDF and never link to them. One information resource URI gets exposed in links / the browser bar and does content negotiation for format (and eventually language), and the response comes with a Content-Location header of the IR URI dot the_format
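The single-URI approach described above can be sketched as a tiny server-side dispatch on the Accept header. This is only an illustration of the pattern (with q-value preference ordering ignored for brevity), not BBC or Wikidata code:

```python
# Pick a representation from the request's Accept header rather than from
# a /data path or a ?format= query parameter. Real content negotiation also
# weighs q-values; this sketch just takes the first recognised media type.
FORMATS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "application/json": ".json",
}

def negotiate(accept_header):
    """Return the extension used in the Content-Location header."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()   # drop any ;q=... weight
        if media_type in FORMATS:
            return FORMATS[media_type]
    return ".html"   # default representation

print(negotiate("application/rdf+xml;q=0.9, text/html"))
# .rdf
```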
Wondering if the id in http://wikidata.org/id/Q{id} is the wikipedia row ID (as used by dbpedialite)? Also wondering why there's a different set of URIs for machine-readable access rather than just using content negotiation?
No it is not. There is no such thing as the "wikipedia row ID", what you mean is the "page ID on the English Wikipedia".
Ah, ok. Think someone once said that was the ID of the underlying database row of the page record. Looking at dbpedialite, it seems it only supports en.wikipedia
As there are plenty of items that have articles only in Wikipedia other than English, a reliance on the English Page ID would be problematic. We introduce new IDs for Wikidata, but we will provide mappings to page IDs in the different Wikipedia language editions.
Cool. Those mappings would be very useful for us. We're using Wikiminer (https://secure.wikimedia.org/wikipedia/meta/wiki/WikiMiner) for entity extraction on archive media, which also returns the page ID, so some systems only know that ID. It'd be good to be able to query wikidata by it
Thank you again for your input, and I hope the answers help.
Yes, thanks michael
Cheers, Denny
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
On 4 July 2012 15:40, Michael Smethurst michael.smethurst@bbc.co.uk wrote: "Aesthetic concerns about uris tend to make me shiver :-)"
As a side note: I believe the point about semantic, human-intelligible Wikipedia-style URIs and page identifiers is that they allow DEBUGGING. Wikipedia uses non-opaque identifiers for the same reason that programmers tend to write code with "Ticket.VoucherID" rather than "T12321.A237423".
In my observation, numeric-URI-based systems like Drupal tend to have minimal links inside their content pages (i.e. beyond the menu system), while MediaWiki-based systems tend to have hundreds of links inside their content. I believe this is because links inside Drupal pages usually point to something like http://drupal.org/node/21947/ which makes it impossible for humans to easily check whether this is an intentional or erroneous link.
There are certainly areas where software can completely eliminate the need for humans to check links; in these cases opaque links are desirable. The question is:
Which is the case for Wikidata?
Gregor
On Thu, Jul 5, 2012 at 12:08 AM, Gregor Hagedorn g.m.hagedorn@gmail.comwrote:
In my observation, numeric-URI-based systems like Drupal tend to have minimal links inside their content pages (i.e. beyond the menu system), while MediaWiki-based systems tend to have hundreds of links inside their content. I believe this is because links inside Drupal pages usually point to something like http://drupal.org/node/21947/ which makes it impossible for humans to easily check whether this is an intentional or erroneous link.
This is off-topic, but for Drupal this is a configuration issue. One of the early lessons in books and tutorial series is how to configure this, and many Drupal sites are configured to use human-readable paths. Drupal.org is not because it has millions of nodes which often change names.
You are correct that most Drupal sites have fewer internal links than wikis, but I think that holds for Drupal sites that are configured to use human-readable paths as well. The cause is more likely in a different interface issue.
I don't mean to spin this out into a tangent about Drupal, just wanted to point out that correlation doesn't imply causation in this case.
-Lin
I don't mean to spin this out into a tangent about Drupal.
Me neither; my discussion point here is: there are advantages to both opaque (like http://something.org/node123456) and non-opaque (http://something.org/Bonn,_Northrhine-Westfalia,_Germany) URI/IRI identifiers.
In the light of the use-case of interlinking discussed here: which is right for Wikidata? Does Wikidata need both in parallel (I believe this is the current plan)?
Gregor
Yes, we are planning to do both in parallel, as this page explains:
https://meta.wikimedia.org/wiki/Wikidata/Notes/URI_scheme
Cheers, Denny
2012/7/5 Gregor Hagedorn g.m.hagedorn@gmail.com:
I don't mean to spin this out into a tangent about Drupal.
Me neither; my discussion point here is: there are advantages to both opaque (like http://something.org/node123456) and non-opaque (http://something.org/Bonn,_Northrhine-Westfalia,_Germany) URI/IRI identifiers.
In the light of the use-case of interlinking discussed here: which is right for Wikidata? Does Wikidata need both in parallel (I believe this is the current plan)?
Gregor