Markus,
Thanks for the thorough reply!
you can use SPARQL 1.1 transitive closure in queries (using "*" after
properties), so you can find "all subclasses" there too. (You could also
try this in Protege ...)
I had a feeling I was missing something basic. (I'm also new to SPARQL.)
Using "*" after the property got me what I was looking for by default in
Protege. That is,
SELECT ?subject
WHERE
{
?subject rdfs:subClassOf* <http://www.wikidata.org/entity/Q82586> .
}
-- with an asterisk after rdfs:subClassOf -- got me the transitive closure
and returned all subclasses of Q82586 / "lepton".
Should we maybe create an English label file for the classes? Descriptions
too or just labels?
A file with English labels and descriptions for classes would be great and,
I think, address this use case. Per your note, I suppose one would simply
concatenate that English terms file and wikidata-taxonomy.nt into a new .nt
file, then import that into Protege to explore the class hierarchy.
(Having every line in the ontology be self-contained in N3 is very
convenient!)
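For reference, that concatenation step could look like the sketch below. The filename en-labels.nt is hypothetical (no such file is published yet), and tiny stand-in triples are used so the commands run as-is; the Q-id of the placeholder subclass is made up.

```shell
# Stand-in data: one placeholder taxonomy triple and one label triple.
# "en-labels.nt" is a hypothetical name for the proposed English-terms file,
# and Q999999999 is a made-up item id used only for illustration.
printf '<http://www.wikidata.org/entity/Q999999999> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://www.wikidata.org/entity/Q82586> .\n' > wikidata-taxonomy.nt
printf '<http://www.wikidata.org/entity/Q82586> <http://www.w3.org/2000/01/rdf-schema#label> "lepton"@en .\n' > en-labels.nt

# Every N-Triples line stands on its own, so plain concatenation yields a
# valid file:
cat wikidata-taxonomy.nt en-labels.nt > taxonomy-with-labels.nt
```

The merged file could then be opened in Protege like any other RDF file.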
Regarding the pruned subset, I think the command-line approach in your
examples is enough for me to get started making my own.
I won't have time to experiment with these things for a few weeks, but I
will return to this then and let you know any interesting findings.
Cheers,
Eric
On Sat, Jun 14, 2014 at 4:41 AM, Markus Krötzsch
<markus@semantic-mediawiki.org> wrote:
Eric,
Two general remarks first:
(1) Protege is for small and medium ontologies, but not really for such
large datasets. To get SPARQL support for the whole data, you could
install Virtuoso. It also comes with a simple Web query UI. Virtuoso does
not do much reasoning, but you can use SPARQL 1.1 transitive closure in
queries (using "*" after properties), so you can find "all subclasses"
there too. (You could also try this in Protege ...)
(2) If you want to explore the class hierarchy, you can also try our new
class browser:
http://tools.wmflabs.org/wikidata-exports/miga/?classes
It has the whole class hierarchy, but without the "leaves" (= instances of
classes + subclasses that have no subclasses/instances of their own). For
example, it tells you that "lepton" has 5 direct subclasses, but shows only
one:
http://tools.wmflabs.org/wikidata-exports/miga/?classes#_item=3338
On the other hand, it includes relationships of classes and properties
that are not part of the RDF (we extract this from the data by considering
co-occurrence). Example:
"Classes that have no superclasses but at least 10 instances, and which
are often used with the property 'sex or gender'":
http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Direct%20superclasses=__null/Number%20of%20direct%20instances=10%20-%2020000/Related%20properties=sex%20or%20gender
I already added superclasses for some of those in Wikidata now -- data in
the browser is updated with some delay based on dump files.
More answers below:
On 14/06/14 05:52, emw wrote:
Markus,
Thank you very much for this. Translating Wikidata into the language of
the Semantic Web is important. Being able to explore the Wikidata
taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive
queries) is really neat, e.g.
SELECT ?subject
WHERE
{
?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> .
}
This is more of an issue of my ignorance of Protege, but I notice that
the above query returns only the direct subclasses of Q82586. The full
set of subclasses for Q82586 ("lepton") is visible at
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&la…
-- a few of the 2nd-level subclasses (muon neutrino, tau neutrino,
electron neutrino) are shown there but not returned by that SPARQL
query. It seems rdfs:subClassOf isn't being treated as a transitive
property in Protege. Any ideas?
You need a reasoner to compute this properly. For a plain class hierarchy
as in our case, ELK should be a good choice [1]. You can install the ELK
Protege plugin and use it to classify the ontology [2]. Protege will then
show the computed class hierarchy in the browser; I am not sure what
happens to the SPARQL queries (it's quite possible that they don't use the
reasoner).
[1]
https://code.google.com/p/elk-reasoner/
[2]
https://code.google.com/p/elk-reasoner/wiki/ElkProtege
Do you know when the taxonomy data in OWL will
have labels available?
We had not thought of this as a use case. A challenge is that the label
data is quite big because of the many languages. Should we maybe create an
English label file for the classes? Descriptions too or just labels?
Also, regarding the complete dumps, would it be
possible to export a
smaller subset of the faithful data? The files under "Complete Data
Dumps" in
http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too
big to load into Protege on most personal computers, and would likely
require adjusting JVM settings on higher-end computers to load. If it's
feasible to somehow prune those files -- and maybe even combine them
into one file that could be easily loaded into Protege -- that would be
especially nice.
What kind of "pruning" do you have in mind? You can of course take a
subset of the data, but then some of the data will be missing.
A general remark on mixing and matching RDF files. We use the N-Triples
format, where every line in the file is a self-contained triple (no
multi-line constructs, no header, no namespace prefixes). Therefore, any
subset of the lines of any of our files is still a valid file. So if you
want to have only a slice of the data (maybe to experiment with), then you
could simply do something like:
gunzip -c wikidata-statements.nt.gz | head -10000 > partial-data.nt
"head" simply selects the first 10000 lines here. You could also use grep
to select specific triples instead, such as:
zgrep "http://www.w3.org/2000/01/rdf-schema#label" wikidata-terms.nt.gz \
  | grep "@en ." > en-labels.nt
This selects all English labels. I am using zgrep here for a change; you
can also use gunzip as above. Similar methods can also be used to count
things in the ontology (use grep -c to count lines = triples).
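As a concrete illustration of the counting idea, here is a runnable sketch
that uses a tiny stand-in file (with made-up example.org triples) instead
of the real dump:

```shell
# Build a two-triple sample file standing in for a real dump, then gzip it.
printf '<http://example.org/a> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://example.org/b> .\n<http://example.org/a> <http://www.w3.org/2000/01/rdf-schema#label> "a"@en .\n' > sample.nt
gzip -f sample.nt

# grep -c prints the number of matching lines; in N-Triples, one line = one
# triple, so this counts subClassOf triples in the compressed file.
zgrep -c "rdf-schema#subClassOf" sample.nt.gz
```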
Finally, you can combine multiple files into one by simply concatenating
them in any order:
cat partial-data-1.nt > mydata.nt
cat partial-data-2.nt >> mydata.nt
...
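Since cat accepts several input files (or a shell glob) and writes them out
in order, the repeated redirections above also collapse into a single
command; a runnable sketch with tiny stand-in files:

```shell
# Two small stand-in partial files (made-up example.org triples):
printf '<http://example.org/a> <http://example.org/p> "1" .\n' > partial-data-1.nt
printf '<http://example.org/b> <http://example.org/p> "2" .\n' > partial-data-2.nt

# One command instead of repeated >> redirections:
cat partial-data-*.nt > mydata.nt
```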
Maybe you can experiment a bit and let us know if there is any export that
would be particularly meaningful for you.
Cheers,
Markus
Thanks,
Eric
https://www.wikidata.org/wiki/User:Emw
1. http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/wikidata-taxonomy.nt.gz
2.
http://protege.stanford.edu/
On Tue, Jun 10, 2014 at 4:43 AM, Markus Kroetzsch
<markus.kroetzsch@tu-dresden.de> wrote:
Hi all,
We are now offering regular RDF dumps for the content of Wikidata:
http://tools.wmflabs.org/wikidata-exports/rdf/
RDF is the Resource Description Framework of the W3C that can be
used to exchange data on the Web. The Wikidata RDF exports consist
of several files that contain different parts and views of the data,
and which can be used independently. Details on the available
exports and the RDF encoding used in each can be found in the paper
"Introducing Wikidata to the Linked Data Web" [1].
The available RDF exports can be found in the directory
http://tools.wmflabs.org/wikidata-exports/rdf/exports/. New
exports are generated regularly from current data dumps of Wikidata
and will appear in this directory shortly afterwards.
All dump files have been generated using Wikidata Toolkit [2]. There
are some important differences in comparison to earlier dumps:
* Data is split into several dump files for convenience. Pick
whatever you are most interested in.
* All dumps are generated using the OpenRDF library for Java (better
quality than ad hoc serialization; much slower too ;-)
* All dumps are in N-Triples format, the simplest RDF serialization
format there is
* In addition to the faithful dumps, some simplified dumps are also
available (one statement = one triple; no qualifiers and references).
* Links to external data sets are added to the data for Wikidata
properties that point to datasets with RDF exports. That's the
"Linked" in "Linked Open Data".
Suggestions for improvements and contributions on github are welcome.
Cheers,
Markus
[1]
http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web
[2]
https://www.mediawiki.org/wiki/Wikidata_Toolkit
--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l