On Wed, Oct 13, 2010 at 2:29 PM, Paul Houle wrote: [snip]
Now the problem I've got with the Ookaboo API is that I know people are going to punch in
http://wikipedia.org/page/Boston,_MA
and I need to turn this into the right dbpedia URL. My plan for dealing with this is to
(i) store the exact URI I get out of dbpedia,
(ii) always give people the exact URI out of dbpedia (if I publish RDFa or JSON data),
(iii) give the same URI for wikipedia that dbpedia gives (in HTML, RDFa, etc.),
(iv) if I get a query, apply the same canonicalization rules that dbpedia uses...
Which raises the question: what exactly are those rules?
It sounds like you need to map from "URL which contains an English Wikipedia article title" to "URI identifier for the DBPedia node describing the concept which that article is about".
The good news is that you can probably get away without caring too much about the actual encoding in the source Wikipedia URL you're looking at. :)
DBPedia's documentation says that their resource URIs are "of the form http://dbpedia.org/resource/Name, where Name is taken from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Thus, each resource is tied directly to an English-language Wikipedia article." -- http://wiki.dbpedia.org/Datasets#h18-4
They may or may not actually mean that with regard to normalization of %-encoding... let's assume that they do indeed copy it exactly.
My off-the-cuff recommendation might be something like this:
1) Resolve the redirect:
   a) fetch the URL, following any HTTP redirects -- this will let you avoid worrying about domain aliases you don't recognize, etc.
   b) grab the <link rel="canonical"> URL, if any -- this will resolve any in-wiki redirects for you
2) Get the article name!
   a) confirm the URL is in the format you expect: http://en.wikipedia.org/wiki/(.*)
   b) split the title off from the rest of the URL
In a sane world, you'd unescape the %-encoding here, replace underscores with spaces, then take the UTF-8 name and run it through whatever DBPedia specifies as their encoding style. But if they're just copying the title straight out of the Wikipedia URL, you can take the string as-is. :)
3) Prepend http://dbpedia.org/resource/ to the article name from step 2.
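
For what it's worth, steps 1b, 2 and 3 could be sketched in Python roughly as below. This is only an illustration, not anything DBPedia publishes: the function names are made up, accepting https as well as http is my own assumption, and the whole thing leans on the guess above that DBPedia copies the title from the Wikipedia URL exactly. Step 1a (actually fetching the page) is left out so the sketch has no network dependency; you'd pass the fetched HTML to canonical_url() yourself.

```python
import re
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import unquote

# Assumption: only English Wikipedia article URLs are accepted (http or https).
WIKIPEDIA_ARTICLE = re.compile(r"^https?://en\.wikipedia\.org/wiki/(.+)$")
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


class CanonicalLinkFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr.get("href")


def canonical_url(html: str) -> Optional[str]:
    """Step 1b: pull the canonical URL out of already-fetched page HTML."""
    finder = CanonicalLinkFinder()
    finder.feed(html)
    return finder.canonical


def article_name(url: str) -> str:
    """Step 2: check the URL shape and split off the (still encoded) title."""
    match = WIKIPEDIA_ARTICLE.match(url)
    if match is None:
        raise ValueError("not an en.wikipedia.org article URL: " + url)
    # Drop any query string or #anchor that trails the title.
    name = re.split(r"[?#]", match.group(1), maxsplit=1)[0]
    if not name:
        raise ValueError("no article title in URL: " + url)
    return name


def dbpedia_uri(url: str) -> str:
    """Step 3: prepend the DBPedia prefix, leaving the encoding untouched."""
    return DBPEDIA_PREFIX + article_name(url)


def readable_title(name: str) -> str:
    """The 'sane world' decoding: undo %-encoding, underscores to spaces."""
    return unquote(name).replace("_", " ")
```

With this, dbpedia_uri("http://en.wikipedia.org/wiki/Boston") gives http://dbpedia.org/resource/Boston, while an input like http://wikipedia.org/page/Boston,_MA raises ValueError, which is the signal to go back to step 1 and resolve it to a canonical URL first.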
Alternatively, it wouldn't surprise me if DBPedia contained metadata or a search helper to look up by Wikipedia article name. But I can't get any of the SPARQL examples I've found on the web to work on their online lookup just now, so I'm too lazy to go looking further. ;)
-- brion