Eric,
Two general remarks first:
(1) Protege is for small and medium ontologies, but not really for such
large datasets. To get SPARQL support for the whole data, you could
install Virtuoso. It also comes with a simple Web query UI. Virtuoso
does not do much reasoning, but you can use SPARQL 1.1 transitive
closure in queries (using "*" after properties), so you can find "all
subclasses" there too. (You could also try this in Protege ...)
(2) If you want to explore the class hierarchy, you can also try our new
class browser:
http://tools.wmflabs.org/wikidata-exports/miga/?classes
It has the whole class hierarchy, but without the "leaves" (= instances
of classes, and subclasses that have no subclasses or instances of their
own). For example, it tells you that "lepton" has 5 direct subclasses,
but shows only one:
http://tools.wmflabs.org/wikidata-exports/miga/?classes#_item=3338
On the other hand, it includes relationships of classes and properties
that are not part of the RDF (we extract this from the data by
considering co-occurrence). Example:
"Classes that have no superclasses but at least 10 instances, and which
are often used with the property 'sex or gender'":
http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Direct…
I have already added superclasses for some of those in Wikidata; the
data in the browser is updated with some delay, based on the dump files.
More answers below:
On 14/06/14 05:52, emw wrote:
Markus,
Thank you very much for this. Translating Wikidata into the language of
the Semantic Web is important. Being able to explore the Wikidata
taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive
queries) is really neat, e.g.
SELECT ?subject
WHERE {
  ?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> .
}
This is more of an issue of my ignorance of Protege, but I notice that
the above query returns only the direct subclasses of Q82586. The full
set of subclasses for Q82586 ("lepton") is visible at
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&la…
-- a few of the 2nd-level subclasses (muon neutrino, tau neutrino,
electron neutrino) are shown there but not returned by that SPARQL
query. It seems rdfs:subClassOf isn't being treated as a transitive
property in Protege. Any ideas?
You need a reasoner to compute this properly. For a plain class
hierarchy as in our case, ELK should be a good choice [1]. You can
install the ELK Protege plugin and use it to classify the ontology [2].
Protege will then show the computed class hierarchy in the browser; I am
not sure what happens to the SPARQL queries (it's quite possible that
they don't use the reasoner).
[1] https://code.google.com/p/elk-reasoner/
[2] https://code.google.com/p/elk-reasoner/wiki/ElkProtege
Do you know when the taxonomy data in OWL will have labels available?
We had not thought of this as a use case. A challenge is that the label
data is quite big because of the many languages. Should we maybe create
an English label file for the classes? Descriptions too or just labels?
Also, regarding the complete dumps, would it be possible to export a
smaller subset of the faithful data? The files under "Complete Data
Dumps" in
http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too
big to load into Protege on most personal computers, and would likely
require adjusting JVM settings on higher-end computers to load. If it's
feasible to somehow prune those files -- and maybe even combine them
into one file that could be easily loaded into Protege -- that would be
especially nice.
What kind of "pruning" do you have in mind? You can of course take a
subset of the data, but then some of the data will be missing.
A general remark on mixing and matching RDF files. We use the N-Triples
format, in which every triple in the ontology is a single self-contained
line (no multi-line constructs, no header, no namespace declarations).
Therefore, any subset of the lines of any of our files is still a valid
file. So if you want only a slice of the data (maybe to experiment
with), you could simply do something like:
gunzip -c wikidata-statements.nt.gz | head -n 10000 > partial-data.nt
"head" simply selects the first 10000 lines here. You could also use
grep to select specific triples instead, such as:
zgrep "http://www.w3.org/2000/01/rdf-schema#label" wikidata-terms.nt.gz \
  | grep "@en ." > en-labels.nt
This selects all English labels. I am using zgrep here for a change; you
can also use gunzip as above. Similar methods can also be used to count
things in the ontology (use grep -c to count lines = triples).
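For instance, here is a small sketch using a made-up three-line sample file (the file name and contents are invented for illustration; the same commands apply unchanged to the real dump files):

```shell
# Build a tiny gzipped N-Triples sample (3 triples, 2 of them English labels):
printf '%s\n' \
  '<http://www.wikidata.org/entity/Q1> <http://www.w3.org/2000/01/rdf-schema#label> "universe"@en .' \
  '<http://www.wikidata.org/entity/Q1> <http://www.w3.org/2000/01/rdf-schema#label> "Universum"@de .' \
  '<http://www.wikidata.org/entity/Q2> <http://www.w3.org/2000/01/rdf-schema#label> "Earth"@en .' \
  | gzip > sample-terms.nt.gz

# Count all triples (one triple per line in N-Triples):
zgrep -c '' sample-terms.nt.gz        # prints 3

# Count only the English labels:
zgrep -c '@en .' sample-terms.nt.gz   # prints 2
```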
Finally, you can combine multiple files into one by simply concatenating
them in any order:
cat partial-data-1.nt > mydata.nt
cat partial-data-2.nt >> mydata.nt
...
Maybe you can experiment a bit and let us know if there is any export
that would be particularly meaningful for you.
Cheers,
Markus
Thanks,
Eric
https://www.wikidata.org/wiki/User:Emw
[1] http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/wikidata-tax…
[2] http://protege.stanford.edu/
On Tue, Jun 10, 2014 at 4:43 AM, Markus Kroetzsch
<markus.kroetzsch@tu-dresden.de> wrote:
Hi all,
We are now offering regular RDF dumps for the content of Wikidata:
http://tools.wmflabs.org/wikidata-exports/rdf/
RDF is the Resource Description Framework of the W3C that can be
used to exchange data on the Web. The Wikidata RDF exports consist
of several files that contain different parts and views of the data,
and which can be used independently. Details on the available
exports and the RDF encoding used in each can be found in the paper
"Introducing Wikidata to the Linked Data Web" [1].
The available RDF exports can be found in the directory
http://tools.wmflabs.org/wikidata-exports/rdf/exports/. New
exports are generated regularly from current data dumps of Wikidata
and will appear in this directory shortly afterwards.
All dump files have been generated using Wikidata Toolkit [2]. There
are some important differences in comparison to earlier dumps:
* Data is split into several dump files for convenience. Pick
whatever you are most interested in.
* All dumps are generated using the OpenRDF library for Java (better
quality than ad hoc serialization; much slower too ;-)
* All dumps are in N-Triples format, the simplest RDF serialization
format there is.
* In addition to the faithful dumps, some simplified dumps are also
available (one statement = one triple; no qualifiers and references).
* Links to external data sets are added to the data for Wikidata
properties that point to datasets with RDF exports. That's the
"Linked" in "Linked Open Data".
Suggestions for improvements and contributions on github are welcome.
Cheers,
Markus
[1] http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web
[2] https://www.mediawiki.org/wiki/Wikidata_Toolkit
--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l