Hi,
You may know me as the author of
the reference book "Working with MediaWiki" (shameless plug -
http://workingwithmediawiki.com). I'm also a
MediaWiki extension developer, who has focused on creating generic
interfaces for editing and viewing structured data. Of these, the best
known is the extension Page Forms, which displays user-editable forms
for editing template calls and sections within pages:
However,
I've also created various applications that provide a "drill-down"
interface for browsing data. There is Semantic Drilldown, which provides
such an interface for Semantic MediaWiki's data:
...Cargo,
which provides browsing for its own data:
...and
Miga, a JavaScript application that is not directly MediaWiki-related
but was nonetheless originally intended to browse data from MediaWiki
instances:
I've
been thinking quite a bit recently about creating this kind of
drill-down interface for the entirety of Wikidata's own data.
In
terms of the interface, my idea is that it would actually most resemble
Miga - like Miga, it would be an all-JavaScript "single-page
application", and I think it makes sense to copy Miga's general
interface approach. You can see an example of Miga's browsing UI here -
note the green bar at the top, holding the filter options:
The
Wikidata browser could have a somewhat similar interface, though it
would get its data via SPARQL queries rather than by querying data
stored in the browser, as Miga does. Another difference would be how
people get to "classes" in the first place. I'm envisioning an interface
where people start at the highest-level class ("Entity", I guess), then
click down into child classes until they find the one they're looking
for, then drill down from there. A text search could help with locating
classes as well.
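
Just to make that concrete, here is a minimal sketch (in TypeScript, since
the tool would be all-JavaScript anyway) of the class-navigation step,
assuming the public Wikidata Query Service endpoint. It fetches the direct
subclasses of a given class via the "subclass of" property (P279), labeled
in the user's language:

const WDQS_ENDPOINT = "https://query.wikidata.org/sparql";

interface ClassNode {
  id: string; // e.g. "Q6256" for "country"
  label: string;
}

// Fetch the direct subclasses (P279) of a class, so the user can click
// down through the class tree one level at a time.
async function fetchSubclasses(classId: string, lang = "en"): Promise<ClassNode[]> {
  const query = `
    SELECT ?sub ?subLabel WHERE {
      ?sub wdt:P279 wd:${classId} .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "${lang}" . }
    } LIMIT 500`;
  const response = await fetch(
    WDQS_ENDPOINT + "?format=json&query=" + encodeURIComponent(query)
  );
  const json = await response.json();
  return json.results.bindings.map((b: any) => ({
    id: b.sub.value.split("/").pop(),
    label: b.subLabel.value,
  }));
}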
There are a few potential
complications with creating a browsing interface for Wikidata, but I
believe they can all be overcome. One complication is that there's no
easy way to know which properties can be filtered on for any class - for
instance, that for pages in the class "country" it makes sense to be
able to filter on "population". It's my belief that Wikidata should
directly store, and make use of, the expected "domain" and "range" for
every property - I've shared this opinion with the Wikidata developers,
who have tended to disagree. But what can be done instead of modifying
Wikidata - and what I think would have to be done for this project to
work - is to create a separate site that scrapes the "domain" data from
Wikidata's property talk pages, stores that information in a database,
and creates an API that returns, for any class name, the "data
structure" for that class - i.e., the set of properties that have that
class in their domain.
(This outside service,
once created, could potentially be used for other things - like
alternate form-based editing of Wikidata entities in which the form had
pre-set fields for each expected property. That's outside the scope of
this potential project, though.)
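
To show what I mean by a "data structure" API, here is a sketch of how the
browser might consume such a service. The endpoint and response shape here
are entirely hypothetical - the point is just that, given a class ID, the
service answers with the set of properties whose scraped "domain" includes
that class:

interface PropertyInfo {
  id: string;       // e.g. "P1082" for "population"
  label: string;
  datatype: string; // e.g. "quantity", "time", "wikibase-item", "external-id"
}

// Hypothetical URL for the scraped-domain service described above.
const STRUCTURE_API = "https://example.org/api/class-structure";

// e.g. fetchDataStructure("Q6256") -> the properties that apply to countries
async function fetchDataStructure(classId: string): Promise<PropertyInfo[]> {
  const response = await fetch(STRUCTURE_API + "?class=" + classId);
  return response.json();
}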
Another
big complication is the massive amount of data involved. Wikidata has
around 1,000 times the amount of data that the other applications I
listed usually handle. But I think it's all doable, using some
well-placed logic. See this Cargo drilldown interface, for example:
The
"Author" field holds too many values to display on the screen, so it's
just a text input with autocompletion. As you drill down through the
values, though, the set of options gets reduced, and at some point all
the options are shown on the screen. That's the sort of interface logic
that could be used to keep the Wikidata browsing manageable.
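
In code, that interface logic could be as simple as a threshold check - a
sketch, with the cutoff being an arbitrary number of my own choosing:

const MAX_DISPLAYED_OPTIONS = 50; // assumed cutoff, to be tuned

type FilterWidget =
  | { kind: "options"; values: string[] } // clickable list of values
  | { kind: "autocomplete" };             // text input with autocompletion

// Show the remaining values directly once the drill-down has narrowed them
// enough; until then, fall back to an autocompleting text input.
function chooseFilterWidget(remainingValues: string[]): FilterWidget {
  return remainingValues.length <= MAX_DISPLAYED_OPTIONS
    ? { kind: "options", values: remainingValues }
    : { kind: "autocomplete" };
}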
A
related complication is the large number of properties that could show
up as filters: if all of them are displayed on the screen, it could
overwhelm the interface. Miga already handles this problem, by
calculating the "diffusion" of each property - the number of unique
values divided by the number of total values - and then only displaying
filters for properties with a small-enough diffusion value. I assume
that this Wikidata browser could use a similar approach - and also
automatically ignore properties of certain types, like "ID", which don't
make sense to drill down on.
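
The diffusion calculation itself is simple - here is a sketch, where the
cutoff value is again an assumption on my part:

// Miga-style "diffusion": unique values divided by total values. A property
// that is nearly one-value-per-page (diffusion close to 1, like an ID) makes
// a poor filter; one whose values are shared by many pages makes a good one.
function diffusion(values: string[]): number {
  if (values.length === 0) return 1;
  return new Set(values).size / values.length;
}

const MAX_FILTER_DIFFUSION = 0.5; // assumed cutoff

function isUsableAsFilter(values: string[], datatype: string): boolean {
  if (datatype === "external-id") return false; // IDs make poor filters
  return diffusion(values) <= MAX_FILTER_DIFFUSION;
}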
Another
complication is that some (or maybe all?) properties can hold values
that are time-specific - the "population" property I mentioned before is
a perfect example of that, since it can hold a different value for each
year. I don't know what the ideal solution is, but I think it's
fine for now to just always use the most recent value for any such
property.
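
"Most recent value" is at least easy to express in SPARQL, assuming the
statements carry a "point in time" (P585) qualifier, as "population" (P1082)
statements on Wikidata generally do - a sketch:

// Build a query for the latest value of a time-qualified property. It walks
// the full statement node (p:/ps:) rather than the "truthy" wdt: triple so
// that the P585 qualifier is reachable, then keeps the newest-dated value.
function latestValueQuery(itemId: string, propertyId: string): string {
  return `
    SELECT ?value ?when WHERE {
      wd:${itemId} p:${propertyId} ?statement .
      ?statement ps:${propertyId} ?value ;
                 pq:P585 ?when .
    }
    ORDER BY DESC(?when)
    LIMIT 1`;
}

// e.g. latestValueQuery("Q30", "P1082") -> the latest U.S. population figure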
By the way, I believe it would also be fairly easy to
"internationalize" this tool - i.e., let the user
select a language, and then show the interface, and as much of the "data
structure" (class and property names) and data as possible, in that
language.
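
That part should come almost for free, in fact, since the Wikidata label
service can return labels in any language, with a fallback chain - a sketch:

// The same query serves every interface language: results come back already
// labeled in the user's chosen language, falling back to English as needed.
function labeledInstancesQuery(classId: string, userLang: string): string {
  return `
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:${classId} .
      SERVICE wikibase:label {
        bd:serviceParam wikibase:language "${userLang},en" .
      }
    } LIMIT 100`;
}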
Why do this whole thing? I can think
of a number of important uses this tool could have:
1) A new way to explore all the data on Wikidata - allowing users both
to aggregate data and to find specific results.
2) A
way to run specific queries, for those who don't know SPARQL or
understand Wikidata's specific data structure. This could open up
Wikidata querying to a wide range of people who otherwise would never be
able to do it.
3) Tied in with that, an API to create SPARQL
queries - I didn't mention this before, but it probably makes sense to
add, to any page in the display, a "View SPARQL" link, which retrieves
the SPARQL query that was used to get the current set of results (see
the sketch after this list).
4)
Potentially, a visualization tool - I didn't mention this either, but
Miga shows maps and timelines for data that contain coordinate and date
information, and it makes sense for this tool to do the same thing,
whether that happens in the first version or later.
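
As for the "View SPARQL" link mentioned in point 3, that part is nearly a
one-liner, since the Wikidata Query Service UI loads whatever query is
passed after the "#" in its URL:

// Link the current result set to an editable copy of its query in the
// Wikidata Query Service UI.
function viewSparqlUrl(query: string): string {
  return "https://query.wikidata.org/#" + encodeURIComponent(query);
}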
So
that's my explanation. This is a lot of
information to throw out at one time. Ideally, I would be creating a
whole wiki page for this idea, with mockup images and so forth; and
maybe I'll do that at some point. But for now, I really just wanted to
hear people's general views on this sort of thing. And if some people
think it's a good idea, I'm also very curious to hear what the best
strategy might be to get funding for this. I could try to get a Wikimedia
Individual Engagement Grant (IEG) to fund it - that's actually how Miga
was funded - but I wonder if another option is to get Wikimedia
Deutschland itself, or some other organization, to sponsor it, and
perhaps to take ownership of the resulting application. But maybe that's
getting too far ahead.