Markus,

Thanks for the thorough reply!

you can use SPARQL 1.1 transitive closure in queries (using "*" after properties), so you can find "all subclasses" there too. (You could also try this in Protege ...)

I had a feeling I was missing something basic.  (I'm also new to SPARQL.)  Using "*" after the property got me what I was looking for in Protege, without any reasoner configured.  That is,

SELECT ?subject
WHERE
{
   ?subject rdfs:subClassOf* <http://www.wikidata.org/entity/Q82586> .
}

-- with an asterisk after rdfs:subClassOf -- got me the transitive closure and returned all subclasses of Q82586 / "lepton".  (One thing I noticed: since "*" also matches the empty path, Q82586 itself appears in the results; rdfs:subClassOf+ returns only the proper subclasses.)

Should we maybe create an English label file for the classes? Descriptions too or just labels?

A file with English labels and descriptions for classes would be great and would, I think, address this use case.  Per your note, I suppose one would simply concatenate that English terms file and wikidata-taxonomy.nt into a new .nt file, then import that into Protege to explore the class hierarchy.  (Having every line be self-contained in the N-Triples format is very convenient!)
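
To make this concrete, here is the sort of command I would try (wikidata-class-terms-en.nt.gz is just a placeholder name, since no such file exists yet):

# wikidata-class-terms-en.nt.gz is a hypothetical name for the proposed labels file
gunzip -c wikidata-taxonomy.nt.gz wikidata-class-terms-en.nt.gz > taxonomy-with-labels.nt

gunzip -c decompresses each file in turn to standard output, so the result is a plain concatenation that can be opened directly in Protege.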

Regarding the pruned subset, I think the command-line approach in your examples is enough for me to get started making my own.
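
For example, a toy slice around "lepton" might look like this (the Q82586 filter is purely illustrative, and en-labels.nt is the output of your zgrep example; note this only keeps lines that mention the Q82586 URI directly, so statement details recorded on other lines would be missing):

# keep only lines that mention the lepton entity URI (illustrative filter)
zgrep "entity/Q82586>" wikidata-statements.nt.gz > lepton-slice.nt
# add the English labels and load the result into Protege
cat lepton-slice.nt en-labels.nt > lepton-with-labels.nt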

I won't have time to experiment with these things for a few weeks, but I will return to this then and let you know any interesting findings.

Cheers,
Eric


On Sat, Jun 14, 2014 at 4:41 AM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Eric,

Two general remarks first:

(1) Protege is for small and medium ontologies, but not really for such large datasets. To get SPARQL support for the whole data, you could install Virtuoso. It also comes with a simple Web query UI. Virtuoso does not do much reasoning, but you can use SPARQL 1.1 transitive closure in queries (using "*" after properties), so you can find "all subclasses" there too. (You could also try this in Protege ...)

(2) If you want to explore the class hierarchy, you can also try our new class browser:

http://tools.wmflabs.org/wikidata-exports/miga/?classes

It has the whole class hierarchy, but without the "leaves" (= instances of classes, and subclasses that have no subclasses or instances of their own). For example, it tells you that "lepton" has 5 direct subclasses, but shows only one:

http://tools.wmflabs.org/wikidata-exports/miga/?classes#_item=3338

On the other hand, it includes relationships between classes and properties that are not part of the RDF (we extract these from the data by looking at co-occurrence). Example:

"Classes that have no superclasses but at least 10 instances, and which are often used with the property 'sex or gender'":

http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Direct%20superclasses=__null/Number%20of%20direct%20instances=10%20-%2020000/Related%20properties=sex%20or%20gender

I have already added superclasses for some of those in Wikidata -- the data in the browser is updated with some delay, based on the dump files.


More answers below:


On 14/06/14 05:52, emw wrote:
Markus,

Thank you very much for this.  Translating Wikidata into the language of
the Semantic Web is important.  Being able to explore the Wikidata
taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive
queries) is really neat, e.g.

SELECT ?subject
WHERE
{
    ?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> .
}

This is more of an issue of my ignorance of Protege, but I notice that
the above query returns only the direct subclasses of Q82586.  The full
set of subclasses for Q82586 ("lepton") is visible at
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&lang=en
-- a few of the 2nd-level subclasses (muon neutrino, tau neutrino,
electron neutrino) are shown there but not returned by that SPARQL
query.  It seems rdfs:subClassOf isn't being treated as a transitive
property in Protege.  Any ideas?

You need a reasoner to compute this properly. For a plain class hierarchy as in our case, ELK should be a good choice [1]. You can install the ELK Protege plugin and use it to classify the ontology [2]. Protege will then show the computed class hierarchy in the browser; I am not sure what happens to the SPARQL queries (it's quite possible that they don't use the reasoner).

[1] https://code.google.com/p/elk-reasoner/
[2] https://code.google.com/p/elk-reasoner/wiki/ElkProtege



Do you know when the taxonomy data in OWL will have labels available?

We had not thought of this as a use case. A challenge is that the label data is quite big because of the many languages. Should we maybe create an English label file for the classes? Descriptions too or just labels?



Also, regarding the complete dumps, would it be possible to export a
smaller subset of the faithful data?  The files under "Complete Data
Dumps" in
http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too
big to load into Protege on most personal computers, and would likely
require adjusting JVM settings on higher-end computers to load.  If it's
feasible to somehow prune those files -- and maybe even combine them
into one file that could be easily loaded into Protege -- that would be
especially nice.

What kind of "pruning" do you have in mind? You can of course take a subset of the data, but then some of the data will be missing.

A general remark on mixing and matching RDF files. We use the N-Triples format, where every line is self-contained (no multi-line constructs, no header, no namespaces). Therefore, any subset of the lines of any of our files is still a valid file. So if you want only a slice of the data (maybe to experiment with), then you could simply do something like:

gunzip -c wikidata-statements.nt.gz | head -10000 > partial-data.nt

"head" simply selects the first 10000 lines here. You could also use grep to select specific triples instead, such as:

zgrep "http://www.w3.org/2000/01/rdf-schema#label" wikidata-terms.nt.gz | grep "@en ." > en-labels.nt

This selects all English labels. I am using zgrep here for a change; you can also use gunzip as above. Similar methods can also be used to count things in the ontology (use grep -c to count lines = triples).
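
For example, this counts the subclass-of triples in the taxonomy export (the pattern simply matches part of the rdfs:subClassOf URI):

zgrep -c "rdf-schema#subClassOf" wikidata-taxonomy.nt.gz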

Finally, you can combine multiple files into one by simply concatenating them in any order:

cat partial-data-1.nt > mydata.nt
cat partial-data-2.nt >> mydata.nt
...
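
Since the order of lines does not matter, a shell glob does the same in one step:

cat partial-data-*.nt > mydata.nt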

Maybe you can experiment a bit and let us know if there is any export that would be particularly meaningful for you.

Cheers,

Markus


Thanks,
Eric
https://www.wikidata.org/wiki/User:Emw

1. http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/wikidata-taxonomy.nt.gz
2. http://protege.stanford.edu/





On Tue, Jun 10, 2014 at 4:43 AM, Markus Kroetzsch <markus.kroetzsch@tu-dresden.de> wrote:

    Hi all,

    We are now offering regular RDF dumps for the content of Wikidata:

    http://tools.wmflabs.org/wikidata-exports/rdf/

    RDF is the Resource Description Framework of the W3C that can be
    used to exchange data on the Web. The Wikidata RDF exports consist
    of several files that contain different parts and views of the data,
    and which can be used independently. Details on the available
    exports and the RDF encoding used in each can be found in the paper
    "Introducing Wikidata to the Linked Data Web" [1].

    The available RDF exports can be found in the directory
    http://tools.wmflabs.org/wikidata-exports/rdf/exports/. New
    exports are generated regularly from current data dumps of Wikidata
    and will appear in this directory shortly afterwards.

    All dump files have been generated using Wikidata Toolkit [2]. There
    are some important differences in comparison to earlier dumps:

    * Data is split into several dump files for convenience. Pick
    whatever you are most interested in.
    * All dumps are generated using the OpenRDF library for Java (better
    quality than ad hoc serialization; much slower too ;-)
    * All dumps are in N-Triples format, the simplest RDF serialization format there is.
    * In addition to the faithful dumps, some simplified dumps are also
    available (one statement = one triple; no qualifiers and references).
    * Links to external data sets are added to the data for Wikidata
    properties that point to datasets with RDF exports. That's the
    "Linked" in "Linked Open Data".

    Suggestions for improvements and contributions on github are welcome.

    Cheers,

    Markus

    [1] http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web
    [2] https://www.mediawiki.org/wiki/Wikidata_Toolkit

    --
    Markus Kroetzsch
    Faculty of Computer Science
    Technische Universität Dresden
    +49 351 463 38486
    http://korrekt.org/


_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l