On Wed, Oct 13, 2010 at 2:29 PM, Paul Houle wrote:
[snip]
Now the problem I've got with the Ookaboo API is that I know people are
going to punch in
http://wikipedia.org/page/Boston,_MA
and I need to turn this into the right dbpedia URL. My plan for dealing
with this is to
(i) store the exact URI I get out of dbpedia,
(ii) always give people the exact URI out of dbpedia (if I publish RDFa
or JSON data),
(iii) give the same URI for wikipedia that dbpedia gives (in HTML,
RDFa, etc.)
(iv) if I get a query, apply the same canonicalization rules that
dbpedia uses...
Which begs the question of what exactly those rules are. What are they?
It sounds like you need to map from "URL which contains an English Wikipedia
article title" to "URI identifier for the DBPedia node describing the
concept which that article is about".
The good news is that you can probably get away without caring too much
about the actual encoding in the source Wikipedia URL you're looking at. :)
DBPedia's documentation says that their resource URIs are "of the form
http://dbpedia.org/resource/Name, where Name is taken from the URL of the
source Wikipedia article, which has the form
http://en.wikipedia.org/wiki/Name. Thus, each resource is tied directly
to an English-language Wikipedia article." --
http://wiki.dbpedia.org/Datasets#h18-4
They may or may not actually mean that with regard to normalization of
%-encoding... let's assume that they do indeed copy it exactly.
My off-the-cuff recommendation might be something like this:
1) Resolve the redirect:
a) fetch the URL, following any HTTP redirects -- this will let you avoid
worrying about domain aliases you don't recognize, etc
b) grab the <link rel="canonical"> url if any -- this will resolve any
in-wiki redirects for you
2) Get the article name!
a) confirm the URL is in the format you expect:
http://en.wikipedia.org/wiki/(.*)
b) divide the title from the rest of the URL
In a sane world, you'd unescape the %-encoding here, replace underscores with
spaces, then take the UTF-8 name and run it through what DBPedia specifies
as their encoding style. But if they're just copying the URL fragments
directly from Wikipedia, you can just take the string now. :)
3) Prepend http://dbpedia.org/resource/ to the URL fragment.
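Steps 1-3 could be sketched in Python roughly as follows. This is an untested-against-DBPedia illustration, not their actual rule set: it assumes the title is copied verbatim from the Wikipedia URL, as discussed above. The names resolve, canonical_url, dbpedia_uri and readable_title are made up for this example, as is the User-Agent string; accepting https and dropping any "#section" or "?query" suffix are extra assumptions of mine.

```python
import re
import urllib.request
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import unquote

# Step 2a pattern; allowing https and requiring a non-empty title are assumptions.
WIKIPEDIA_ARTICLE = re.compile(r"^https?://en\.wikipedia\.org/wiki/(.+)$")
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


class CanonicalLinkFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")


def canonical_url(html: str, fetched_url: str) -> str:
    """Step 1b: prefer the page's canonical link, else keep the fetched URL."""
    finder = CanonicalLinkFinder()
    finder.feed(html)
    return finder.canonical or fetched_url


def resolve(url: str) -> str:
    """Step 1: fetch the page (urlopen follows HTTP redirects), then apply 1b."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-resolver/0.1"}  # hypothetical value
    )
    with urllib.request.urlopen(request) as response:
        final_url = response.geturl()
        html = response.read().decode("utf-8", errors="replace")
    return canonical_url(html, final_url)


def dbpedia_uri(wikipedia_url: str) -> str:
    """Steps 2 and 3: check the URL shape, split off the title, prepend the prefix."""
    # Assumption: a "#section" or "?query" suffix is not part of the article name.
    base = wikipedia_url.split("#", 1)[0].split("?", 1)[0]
    match = WIKIPEDIA_ARTICLE.match(base)
    if match is None:
        raise ValueError("not an English Wikipedia article URL: " + wikipedia_url)
    # The still-encoded title is copied as-is, per the assumption above.
    return DBPEDIA_PREFIX + match.group(1)


def readable_title(encoded_title: str) -> str:
    """The 'sane world' decoding: undo %-encoding, turn underscores into spaces."""
    return unquote(encoded_title).replace("_", " ")
```

With that in place, dbpedia_uri(resolve(user_supplied_url)) would do the whole lookup over the network, and dbpedia_uri("http://en.wikipedia.org/wiki/Boston") on its own gives http://dbpedia.org/resource/Boston. The readable_title helper is only there for display or for re-encoding later, should DBPedia's rules turn out to differ from a straight copy.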
Alternately, it wouldn't surprise me if DBPedia contained metadata or a
search helper to look up by Wikipedia article name, but I can't get any of
the SPARQL examples I've found on the web to work on their online lookup
just now so I'm too lazy to go looking further. ;)
-- brion