I notice lines in the dbpedia dumps that look like
<http://dbpedia.org/resource/Boston%2C_MA> <http://dbpedia.org/property/redirect> <http://dbpedia.org/resource/Boston> .
Note the URL-encoded comma (%2C = ",").
Anyhow, if I go to
http://dbpedia.org/page/Boston%2C_MA
I see two redirects [one of which unescapes the comma] and ultimately end up at
http://dbpedia.org/page/Boston
If I go to Wikipedia
http://wikipedia.org/page/Boston%2C_MA
I get redirected to
http://wikipedia.org/page/Boston,_MA
which, oddly, displays the same content as "Boston" [rather than 301 redirecting...]
When I do
curl -H "Accept: application/rdf+xml" http://dbpedia.org/data/Boston.xml
I see stuff like
<rdf:Description rdf:about="http://dbpedia.org/resource/Harvey_Mason%2C_Jr."><dbpedia-owl:birthPlace xmlns:dbpedia-owl="http://dbpedia.org/ontology/" rdf:resource="http://dbpedia.org/resource/Boston"/></rdf:Description>
Now if I run the SPARQL query
select ?Predicate where { <http://dbpedia.org/resource/Harvey_Mason,_Jr.> ?Predicate <http://dbpedia.org/resource/Boston> }
I get nothing, but if I run
select ?Predicate where { <http://dbpedia.org/resource/Harvey_Mason%2C_Jr.> ?Predicate <http://dbpedia.org/resource/Boston> }
I get
http://dbpedia.org/ontology/birthPlace
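A minimal sketch of sending that query over HTTP without losing the %2C along the way. The endpoint URL and the helper names (`predicate_query`, `query_url`) are assumptions for illustration, not something taken from the dbpedia docs. The point is only that the query text has to be form-encoded once more, so "%2C" travels as "%252C" and arrives at the server as "%2C" again.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed endpoint; it is not named anywhere in this message.
SPARQL_ENDPOINT = "http://dbpedia.org/sparql"


def predicate_query(subject_uri: str, object_uri: str) -> str:
    """Return a SPARQL query listing the predicates that link subject to object."""
    return (
        "select ?Predicate where { "
        f"<{subject_uri}> ?Predicate <{object_uri}> "
        "}"
    )


def query_url(query: str) -> str:
    """Build a GET URL for the query.

    urlencode escapes '%' as '%25', so a '%2C' inside a resource URI
    is sent as '%252C' and decoded back to '%2C' on the other side.
    """
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query})


escaped = predicate_query(
    "http://dbpedia.org/resource/Harvey_Mason%2C_Jr.",
    "http://dbpedia.org/resource/Boston",
)
url = query_url(escaped)

# Decoding the query string gives back the original query text, %2C included.
round_tripped = parse_qs(urlsplit(url).query)["query"][0]
```

Pasting the %-encoded resource URI straight into a hand-built query string, without this second round of escaping, would risk the server decoding it to a plain comma, which is the spelling that returned nothing above.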
So it looks like the %-encoded URI is the "real URI" in dbpedia. Obviously I ought to keep it around in case I want to run a SPARQL query now and then. Also, dbpedia encodes Wikipedia URLs this way as well:
<http://en.wikipedia.org/wiki/Harvey_Mason%2C_Jr.> <http://xmlns.com/foaf/0.1/primaryTopic> <http://dbpedia.org/resource/Harvey_Mason%2C_Jr.> .
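One way to "keep it around" might look like the sketch below: store the URI exactly as dbpedia gives it, and build a second index keyed by the %-decoded form so that either spelling finds the stored one. The names `stored_uris`, `by_decoded` and `exact_uri` are made up for this example.

```python
from urllib.parse import unquote

# URIs exactly as they appear in the dump lines quoted above.
stored_uris = [
    "http://dbpedia.org/resource/Harvey_Mason%2C_Jr.",
    "http://dbpedia.org/resource/Boston",
]

# Index keyed by the %-decoded form, pointing back at the exact stored URI.
by_decoded = {unquote(uri): uri for uri in stored_uris}


def exact_uri(candidate: str):
    """Return the stored URI that matches candidate after %-decoding, or None."""
    return by_decoded.get(unquote(candidate))
```

With this, both `exact_uri("http://dbpedia.org/resource/Harvey_Mason,_Jr.")` and the %2C spelling return the stored %-encoded URI. A known limitation of the sketch: decoding can collapse URIs that are really distinct (for example "%2F" versus "/"), so it is a lookup aid and not a statement of dbpedia's actual rules.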
------
I took a look at some standards docs and found:
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-URI-reference
I see that we encode the text as UTF-8 octets, and if the octets aren't US-ASCII characters, we %-encode them. However, the spec also says that
"*Note:* Because of the risk of confusion between RDF URI references that would be equivalent if derefenced, the use of %-escaped characters in RDF URI references is strongly discouraged."
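To make that reading of the spec concrete, here is a rough sketch of the rule as paraphrased above: encode as UTF-8, then %-escape only the octets outside US-ASCII. The function name `escape_non_ascii` and the "Zürich" example are illustrative assumptions. The spec's full wording also restricts which ASCII characters are permitted, and this simplified version ignores that.

```python
def escape_non_ascii(uri_ref: str) -> str:
    """%-escape every UTF-8 octet above 0x7F and leave ASCII octets untouched.

    Simplified reading of the rule paraphrased above; it does not check
    which ASCII characters are actually permitted in a URI.
    """
    out = []
    for octet in uri_ref.encode("utf-8"):
        if octet < 0x80:
            out.append(chr(octet))
        else:
            out.append(f"%{octet:02X}")
    return "".join(out)


# A non-ASCII character gets escaped octet by octet ...
zurich = escape_non_ascii("http://dbpedia.org/resource/Zürich")
# ... but a comma is plain US-ASCII, so this rule alone leaves it as-is.
boston = escape_non_ascii("http://dbpedia.org/resource/Boston,_MA")
```

Under this reading the comma would not need escaping at all, so the %2C in the dbpedia URIs looks like a choice made on top of the rule, not something the rule forces.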
------
Now the problem I've got with the Ookaboo API is that I know people are going to punch in
http://wikipedia.org/page/Boston,_MA
and I need to turn this into the right dbpedia URL. My plan for dealing with this is to
(i) store the exact URI I get out of dbpedia, (ii) always give people the exact URI out of dbpedia (if I publish RDFa or JSON data), (iii) give the same URI for wikipedia that dbpedia gives (in HTML, RDFa, etc.) (iv) if I get a query, apply the same canonicalization rules that dbpedia uses...
Which begs the question of what exactly those rules are. What are they?
On Wed, Oct 13, 2010 at 2:29 PM, Paul Houle wrote: [snip]
> Now the problem I've got with the Ookaboo API is that I know people are going to punch in
> http://wikipedia.org/page/Boston,_MA
> and I need to turn this into the right dbpedia URL. My plan for dealing with this is to
> (i) store the exact URI I get out of dbpedia, (ii) always give people the exact URI out of dbpedia (if I publish RDFa or JSON data), (iii) give the same URI for wikipedia that dbpedia gives (in HTML, RDFa, etc.) (iv) if I get a query, apply the same canonicalization rules that dbpedia uses...
> Which begs the question of what exactly those rules are. What are they?
It sounds like you need to map from "URL which contains an English Wikipedia article title" to "URI identifier for the DBPedia node describing the concept which that article is about".
The good news is that you can probably get away without caring too much about the actual encoding in the source Wikipedia URL you're looking at. :)
DBPedia's documentation says that their resource URIs are "of the form http://dbpedia.org/resource/Name, where Name is taken from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Thus, each resource is tied directly to an English-language Wikipedia article." -- http://wiki.dbpedia.org/Datasets#h18-4
They may or may not actually mean that with regard to normalization of %-encoding... let's assume that they do indeed copy it exactly.
My off-the-cuff recommendation might be something like this:
1) Resolve the redirect:
   a) fetch the URL, following any HTTP redirects -- this will let you avoid worrying about domain aliases you don't recognize, etc.
   b) grab the <link rel="canonical"> URL, if any -- this will resolve any in-wiki redirects for you
2) Get the article name!
   a) confirm the URL is in the format you expect: http://en.wikipedia.org/wiki/(.*)
   b) divide the title from the rest of the URL
In a sane world, you'd unescape the %-encoding here, replace underscores with spaces, then take the UTF-8 name and run it through whatever DBPedia specifies as their encoding style. But if they're just copying the URL fragments directly from Wikipedia, you can just take the string as it is. :)
3) Prepend http://dbpedia.org/resource/ to the URL fragment.
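The steps above could be sketched roughly as follows, under the same assumption that DBPedia copies the title exactly as it appears in the Wikipedia URL. The network fetch in step 1a is left out: the page HTML is passed in as a string. The names `canonical_url` and `dbpedia_uri` are invented for this example.

```python
import re
from html.parser import HTMLParser
from typing import Optional

# Step 2a: the URL format we expect.
WIKIPEDIA_ARTICLE = re.compile(r"^https?://en\.wikipedia\.org/wiki/(.+)$")
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


class _CanonicalLinkFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            attr = dict(attrs)
            if (attr.get("rel") or "").lower() == "canonical":
                self.canonical = attr.get("href")


def canonical_url(html: str) -> Optional[str]:
    """Step 1b: return the canonical URL declared in the page, if any."""
    finder = _CanonicalLinkFinder()
    finder.feed(html)
    return finder.canonical


def dbpedia_uri(wikipedia_url: str) -> str:
    """Steps 2 and 3: split off the title and prepend the DBPedia prefix.

    The title is copied verbatim, %-escapes included (an assumption).
    """
    match = WIKIPEDIA_ARTICLE.match(wikipedia_url)
    if match is None:
        raise ValueError(f"not an English Wikipedia article URL: {wikipedia_url}")
    return DBPEDIA_PREFIX + match.group(1)
```

For example, `dbpedia_uri("http://en.wikipedia.org/wiki/Harvey_Mason%2C_Jr.")` gives `http://dbpedia.org/resource/Harvey_Mason%2C_Jr.`. Whether an unescaped comma in the input should be turned into %2C first depends on what DBPedia's encoding rules actually are, which this sketch does not attempt to guess. Any query string on the URL is also left untouched.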
Alternatively, it wouldn't surprise me if DBPedia contained metadata or a search helper to look up by Wikipedia article name. But I can't get any of the SPARQL examples I've found on the web to work on their online lookup just now, so I'm too lazy to go looking further. ;)
-- brion
wikitech-l@lists.wikimedia.org