Eric,
Two general remarks first:
(1) Protege is for small and medium ontologies, but not really for such
large datasets. To get SPARQL support for the whole data, you could
install Virtuoso. It also comes with a simple Web query UI. Virtuoso
does not do much reasoning, but you can use SPARQL 1.1 transitive
closure in queries (using "*" after properties), so you can find "all
subclasses" there too. (You could also try this in Protege ...)
(2) If you want to explore the class hierarchy, you can also try our new
class browser:
http://tools.wmflabs.org/wikidata-exports/miga/?classes
It has the whole class hierarchy, but without the "leaves" (= instances
of classes, and subclasses that have no subclasses or instances of their
own). For example, it tells you that "lepton" has 5 direct subclasses,
but shows only one:
http://tools.wmflabs.org/wikidata-exports/miga/?classes#_item=3338
On the other hand, it includes relationships of classes and properties
that are not part of the RDF (we extract this from the data by
considering co-occurrence). Example:
"Classes that have no superclasses but at least 10 instances, and which
are often used with the property 'sex or gender'":
http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Direct…
I have already added superclasses for some of those in Wikidata; the
data in the browser is updated with some delay, based on the dump files.
More answers below:
On 14/06/14 05:52, emw wrote:
Markus,
Thank you very much for this. Translating Wikidata into the language of
the Semantic Web is important. Being able to explore the Wikidata
taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive
queries) is really neat, e.g.
SELECT ?subject
WHERE {
  ?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> .
}
This is more of an issue of my ignorance of Protege, but I notice that
the above query returns only the direct subclasses of Q82586. The full
set of subclasses for Q82586 ("lepton") is visible at
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&la…
-- a few of the 2nd-level subclasses (muon neutrino, tau neutrino,
electron neutrino) are shown there but not returned by that SPARQL
query. It seems rdfs:subClassOf isn't being treated as a transitive
property in Protege. Any ideas?
You need a reasoner to compute this properly. For a plain class
hierarchy as in our case, ELK should be a good choice [1]. You can
install the ELK Protege plugin and use it to classify the ontology [2].
Protege will then show the computed class hierarchy in the browser; I am
not sure what happens to the SPARQL queries (it's quite possible that
they don't use the reasoner).
[1] https://code.google.com/p/elk-reasoner/
[2] https://code.google.com/p/elk-reasoner/wiki/ElkProtege
Do you know when the taxonomy data in OWL will have labels available?
We had not thought of this as a use case. A challenge is that the label
data is quite big because of the many languages. Should we maybe create
an English label file for the classes? Descriptions too or just labels?
Also, regarding the complete dumps, would it be possible to export a
smaller subset of the faithful data? The files under "Complete Data
Dumps" in
http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too
big to load into Protege on most personal computers, and would likely
require adjusting JVM settings on higher-end computers to load. If it's
feasible to somehow prune those files -- and maybe even combine them
into one file that could be easily loaded into Protege -- that would be
especially nice.
What kind of "pruning" do you have in mind? You can of course take a
subset of the data, but then some of the data will be missing.
A general remark on mixing and matching RDF files. We use the N-Triples
format, in which every triple in the ontology is a single self-contained
line (no multi-line constructs, no header, no namespace declarations).
Therefore, any subset of the lines of any of our files is still a valid
file. So if you want only a slice of the data (maybe to experiment
with), you could simply do something like:
gunzip -c wikidata-statements.nt.gz | head -n 10000 > partial-data.nt
"head" simply selects the first 10000 lines here. You could also use
grep to select specific triples instead, such as:
zgrep "http://www.w3.org/2000/01/rdf-schema#label" wikidata-terms.nt.gz \
  | grep "@en ." > en-labels.nt
This selects all English labels. I am using zgrep here for a change; you
can also use gunzip as above. Similar methods can also be used to count
things in the ontology (use grep -c to count lines = triples).
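For instance, here is a small sketch using a made-up three-line sample file (the file name and contents are invented for illustration; the same commands apply unchanged to the real dump files):

```shell
# Build a tiny gzipped N-Triples sample (3 triples, 2 of them English labels):
printf '%s\n' \
  '<http://www.wikidata.org/entity/Q1> <http://www.w3.org/2000/01/rdf-schema#label> "universe"@en .' \
  '<http://www.wikidata.org/entity/Q1> <http://www.w3.org/2000/01/rdf-schema#label> "Universum"@de .' \
  '<http://www.wikidata.org/entity/Q2> <http://www.w3.org/2000/01/rdf-schema#label> "Earth"@en .' \
  | gzip > sample-terms.nt.gz

# Count all triples (one triple per line in N-Triples):
zgrep -c '' sample-terms.nt.gz        # prints 3

# Count only the English labels:
zgrep -c '@en .' sample-terms.nt.gz   # prints 2
```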
Finally, you can combine multiple files into one by simply concatenating
them in any order:
cat partial-data-1.nt > mydata.nt
cat partial-data-2.nt >> mydata.nt
...
Maybe you can experiment a bit and let us know if there is any export
that would be particularly meaningful for you.
Cheers,
Markus
Thanks,
Eric
https://www.wikidata.org/wiki/User:Emw
[1] http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/wikidata-tax…
[2] http://protege.stanford.edu/
On Tue, Jun 10, 2014 at 4:43 AM, Markus Kroetzsch
<markus.kroetzsch@tu-dresden.de> wrote:
Hi all,
We are now offering regular RDF dumps for the content of Wikidata:
http://tools.wmflabs.org/wikidata-exports/rdf/
RDF is the Resource Description Framework of the W3C that can be
used to exchange data on the Web. The Wikidata RDF exports consist
of several files that contain different parts and views of the data,
and which can be used independently. Details on the available
exports and the RDF encoding used in each can be found in the paper
"Introducing Wikidata to the Linked Data Web" [1].
The available RDF exports can be found in the directory
http://tools.wmflabs.org/wikidata-exports/rdf/exports/. New
exports are generated regularly from current data dumps of Wikidata
and will appear in this directory shortly afterwards.
All dump files have been generated using Wikidata Toolkit [2]. There
are some important differences in comparison to earlier dumps:
* Data is split into several dump files for convenience. Pick
whatever you are most interested in.
* All dumps are generated using the OpenRDF library for Java (better
quality than ad hoc serialization; much slower too ;-)
* All dumps are in N-Triples format, the simplest RDF serialization
format there is.
* In addition to the faithful dumps, some simplified dumps are also
available (one statement = one triple; no qualifiers and references).
* Links to external data sets are added to the data for Wikidata
properties that point to datasets with RDF exports. That's the
"Linked" in "Linked Open Data".
Suggestions for improvements and contributions on github are welcome.
Cheers,
Markus
[1] http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web
[2] https://www.mediawiki.org/wiki/Wikidata_Toolkit
--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l