Perhaps you could get properties data from Sqid? It has property
frequencies for each class:
Hi,
You may know me as the author of the reference book "Working with
MediaWiki" (shameless plug -
http://workingwithmediawiki.com). I'm also a MediaWiki extension
developer, who has focused on creating generic interfaces for editing
and viewing structured data. Of these, the best known is the extension
Page Forms, which displays user-editable forms for editing template
calls and sections within pages:
https://www.mediawiki.org/wiki/Extension:Page_Forms
However, I've also created various applications that provide a
"drill-down" interface for browsing data. There is Semantic Drilldown,
which provides such an interface for Semantic MediaWiki's data:
https://www.mediawiki.org/wiki/Extension:Semantic_Drilldown
...Cargo, which provides browsing for its own data:
https://www.mediawiki.org/wiki/Extension:Cargo/Browsing_data
...and Miga, a JavaScript application that is not directly
MediaWiki-related but was nonetheless originally intended to browse
data from MediaWiki instances:
http://migadv.com/
I've been thinking quite a bit recently about creating this kind of
drill-down interface for the entirety of Wikidata's own data.
In terms of the interface, my idea is that it would most closely
resemble Miga: like Miga, it would be an all-JavaScript "single-page
application", and I think it makes sense to copy Miga's general
interface approach. You can see an example of Miga's browsing UI here
- note the green bar at the top, holding the filter options:
http://migadv.com/miga/?fictional#_cat=Fictional%20nonhumans
The Wikidata browser could have a somewhat similar interface, though
it would get its data via SPARQL queries rather than by querying data
stored in the browser, as Miga does. Another difference would be how
people get to "classes" in the first place. I'm envisioning an
interface where people start at the highest-level class ("Entity", I
guess), then click down into child classes until they find the one
they're looking for, then drill down from there. A text search could
help with locating classes as well.
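To make that top-down navigation concrete, here is a minimal sketch (in Python) of the SPARQL that could fetch a class's direct subclasses via "subclass of" (P279). The endpoint, prefixes, and the P279/Q35120 IDs are real Wikidata conventions; the function name is hypothetical.

```python
# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def subclass_query(class_id: str, lang: str = "en") -> str:
    """Build a SPARQL query listing direct subclasses (P279) of class_id,
    with labels in the requested language."""
    return f"""
SELECT ?child ?childLabel WHERE {{
  ?child wdt:P279 wd:{class_id} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}". }}
}}
LIMIT 200
"""

# e.g. subclass_query("Q35120") asks for the children of "entity" (Q35120),
# the root class the browsing would presumably start from.
```

Running each level's query as the user clicks down would give exactly the class-tree navigation described above.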
There are a few potential complications with creating a browsing
interface for Wikidata, but I believe they can all be overcome. One
complication is that there's no easy way to know which properties can
be filtered on for any class - for instance that, for pages in the
class "country", it makes sense to be able to filter on "population".
It's my belief that Wikidata should directly store, and make use of,
the expected "domain" and "range" for every property - I've shared
this opinion with the Wikidata developers, who have tended to
disagree. But what can be done instead of modifying Wikidata - and
what I think would have to be done for this project to work - is to
create a separate site that scrapes the "domain" data from Wikidata's
property talk pages, stores that information in a database, and
creates an API that returns, for any class name, the "data structure"
for that class - i.e., the set of properties that have that class in
their domain.
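The core lookup of that proposed API could be as simple as the following sketch. The domain table here is purely illustrative (a real service would populate it by scraping the property talk pages), though the property and class IDs themselves are real Wikidata ones.

```python
# Hypothetical scraped "domain" table: property ID -> set of class IDs
# in that property's domain. Illustrative data only.
DOMAIN_TABLE = {
    "P1082": {"Q6256"},  # population -> country
    "P36":   {"Q6256"},  # capital -> country
    "P569":  {"Q5"},     # date of birth -> human
}

def properties_for_class(class_id: str) -> list[str]:
    """Return the 'data structure' for a class: all property IDs
    whose domain includes that class."""
    return sorted(p for p, domains in DOMAIN_TABLE.items()
                  if class_id in domains)
```

The API would then serve this list as JSON for any requested class ID.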
(This outside service, once created, could potentially be used for
other things - like alternate form-based editing of Wikidata entities
in which the form had pre-set fields for each expected property.
That's outside the scope of this potential project, though.)
Another big complication is the massive amount of data involved.
Wikidata has around 1,000 times the amount of data that the other
applications I listed usually handle. But I think it's all doable,
using some well-placed logic. See this Cargo drilldown interface, for
example:
http://discoursedb.org/wiki/Special:Drilldown/Items
The "Author" field holds too many values to display on the screen, so
it's just a text input with autocompletion. As you drill down through
the values, though, the set of options gets reduced, and at some point
all the options are shown on the screen. That's the sort of interface
logic that could be used to keep the Wikidata browsing manageable.
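That widget-switching logic can be sketched in a few lines. The cutoff value is an arbitrary placeholder, and the widget names are hypothetical; the rule is the one described above.

```python
# Hypothetical cutoff: how many distinct values fit on screen at once.
MAX_DISPLAYED_VALUES = 30

def filter_widget(values: list) -> str:
    """Choose how to render a filter, given the values still remaining
    after the drill-downs applied so far."""
    if len(set(values)) <= MAX_DISPLAYED_VALUES:
        return "value-list"      # show every option as a clickable link
    return "autocomplete-input"  # too many options; use a text input
```

As drilling down shrinks the remaining value set, the same filter would naturally switch from a text input to a clickable list.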
A related complication is the large number of properties that could
show up as filters: if all of them are displayed on the screen, it
could overwhelm the interface. Miga already handles this problem, by
calculating the "diffusion" of each property - the number of unique
values divided by the number of total values - and then only
displaying filters for properties with a small-enough diffusion value.
I assume that this Wikidata browser could use a similar approach - and
also automatically ignore properties of certain types, like "ID",
which don't make sense to drill down on.
Another complication is that some (or maybe all?) properties can hold
values that are time-specific - the "population" property I mentioned
before is a perfect example of that, since it can hold a different
value for each year. I don't know what an ideal solution for that is, but I
think it's fine for now to just always use the most recent value for
any such property.
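That "most recent value wins" rule is straightforward to implement. In this sketch a statement is modeled as a (value, year) pair; on real Wikidata the time would come from the "point in time" (P585) qualifier, and the function name is hypothetical.

```python
def latest_value(statements):
    """Given (value, year) pairs for one time-qualified property,
    return the value with the most recent time qualifier."""
    if not statements:
        return None
    value, _year = max(statements, key=lambda s: s[1])
    return value

# e.g. three population statements for one country, keyed by year:
# latest_value([(100, 2000), (110, 2010), (125, 2020)]) picks the 2020 value.
```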
By the way, I believe it would also be fairly easy to
"internationalize" this tool - i.e., let the user select a language, and then show
the interface, and as much of the "data structure" (class and property
names) and data as possible, in that language.
Why do this whole thing? I can think of a number of important uses
this tool could have:
1) A new way to explore all the data on Wikidata - allowing both
aggregation and finding specific results.
2) A way to run specific queries, for those who don't know SPARQL or
understand Wikidata's specific data structure. This could open up
Wikidata querying to a wide range of people who otherwise would never
be able to do it.
3) Tied in with that, an API to create SPARQL queries - I didn't
mention this before, but it probably makes sense to add, to any page
in the display, a "View SPARQL" link, which retrieves the SPARQL
query that was used to get the current set of results.
4) Potentially, a visualization tool - I didn't mention this either,
but Miga shows maps and timelines for data that contain coordinate and
date information, and it makes sense for this tool to do the same
thing, whether that happens in the first version or later.
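As a sketch of point 3 above, here is roughly how the query behind a "View SPARQL" link could be rebuilt from the current class and filter state. The function name and the filter representation are hypothetical; wdt:P31 ("instance of") and the wd:/wdt: prefixes are Wikidata's own conventions.

```python
def drilldown_sparql(class_id: str, filters: dict) -> str:
    """Reconstruct the SPARQL query behind the current drill-down view:
    one 'instance of' triple for the class, plus one triple per
    applied filter (property ID -> selected value ID)."""
    triples = [f"?item wdt:P31 wd:{class_id} ."]  # P31 = instance of
    for prop, value in sorted(filters.items()):
        triples.append(f"?item wdt:{prop} wd:{value} .")
    body = "\n  ".join(triples)
    return f"SELECT ?item WHERE {{\n  {body}\n}}"
```

For example, drilling into "country" (Q6256) and filtering "continent" (P30) to "Europe" (Q46) would yield a two-triple query the user could paste straight into the query service.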
So that's my explanation. This is a lot of information to throw out at
one time. Ideally, I would be creating a whole wiki page for this
idea, with mockup images and so forth; and maybe I'll do that at some
point. But for now, I really just wanted to hear people's general
views on this sort of thing. And if some people think it's a good
idea, I'm also very curious to hear what the best strategy might be to
get funding for this. I could try to get a Wikimedia Individual
Engagement Grant (IEG) to fund it - that's actually how Miga was
funded - but I wonder if another option is to get Wikimedia
Deutschland itself, or some other organization, to sponsor it, and
perhaps to take ownership of the resulting application. But maybe
that's getting too far ahead.
-Yaron
--
WikiWorks · MediaWiki Consulting ·
http://wikiworks.com
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata