I fed the Wikidata dump into a JSON profiling tool; in its first stage it
identifies the unique paths one can follow through the JSON data structures.
The table below shows a count of the literal data items that can be found
behind each path. We are not just counting how many P31[] claims have been
made; we are also counting all of the literals inside the node, so the
more qualifying information a claim carries, the bigger this number gets.

  /claims/P31[]    144350720   instance of
  /claims/P625[]    35948377   geographic coordinates
  /claims/P17[]     35165614   country: sovereign state
  /claims/P646[]    31095359   freebase identifier
  /claims/P569[]    30881885   date of birth
  /claims/P21[]     30466476   sex or gender
  /claims/P105[]    29234406   taxon rank
  /claims/P225[]    27808194   taxon name
  /claims/P131[]    27806448   located in administrative div
  /claims/P171[]    25159278   parent taxon

None of those is a surprise: the two great hierarchies (spatial and
biological) are represented, and there are properties about people.
Oddly, though, the most documented property connected with creative works
is P161 (cast member), which ranks in at #20.
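
For anyone who wants to reproduce this kind of count, here is a minimal
sketch of the idea in Python. It is not the profiling tool I used; the
function names (literal_paths, profile) and the sample entity are made up
for illustration, and the entity is a hand-trimmed stand-in rather than a
real record from the dump. It also assumes no JSON key contains a "/".

```python
from collections import Counter


def literal_paths(node, path=""):
    """Yield the path of every literal (non-container) value under node.

    List positions are collapsed to "[]" so that all claims of one
    property end up on the same path.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            yield from literal_paths(value, f"{path}/{key}")
    elif isinstance(node, list):
        for item in node:
            yield from literal_paths(item, f"{path}[]")
    else:
        yield path


def profile(entities, depth):
    """Count literals, grouped by the first `depth` segments of their path."""
    counts = Counter()
    for entity in entities:
        for path in literal_paths(entity):
            segments = path.split("/")[1:depth + 1]
            counts["/" + "/".join(segments)] += 1
    return counts


# Hypothetical, heavily trimmed entity (illustration only).
sample = {
    "id": "Q42",
    "type": "item",
    "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
    "claims": {
        "P31": [
            {
                "mainsnak": {
                    "property": "P31",
                    "datavalue": {"value": {"id": "Q5"}},
                },
                "rank": "normal",
            }
        ]
    },
}

by_property = profile([sample], depth=2)   # e.g. /claims/P31[]
by_top_level = profile([sample], depth=1)  # e.g. /claims, /labels
print(by_property["/claims/P31[]"], by_top_level["/labels"])
```

Every literal nested under a claim (property id, value, rank, and any
qualifiers) adds one to that claim's path, which is why heavily qualified
claims inflate the numbers.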
Anyhow, it is not all claims. If you look at the highest level you see:

  /datatype            1328
  /id              16647896
  /type            16647896
  /aliases         17824992
  /sitelinks       82865847
  /descriptions   112796452
  /labels         120721644
  /claims         772152821

Everything above the /claims is part of what I have been calling the
"taxonomic core". There are quite a few reasons to treat this data
specially, and I'd guess this solved a chicken-vs-egg problem for WD.
In Freebase the taxonomic core is roughly half the mass of the whole
thing. The claims are certainly bulked up in Wikidata because of the
qualifying information.
If anything is weak about the fundamental data model, it is that aliases
and labels are not reified the way the claims are. This is a big deal if
you want a usable lexical database. For instance, labels should be
taggable as to:
* being potentially offensive (e.g. insults that start with "N")
* being the generic name or a brand name for a drug
* script: Japanese labels should be available in kanji, hiragana, and
romanized form, and should be identifiable as such
* in English we have it easy: you can generate "Mad Lib" style texts
correctly if you know (1) which article to use and (2) how to make both
the plural and singular forms. (1) is easy to guess if you have semantic
data, and you can get away with being imperfect at (2).
* for German, however, you need to tag by grammatical gender, and the
choice of the article is a function of said gender and of grammatical
case (the relationship between the concept and the predicate)
* similar things exist for most of the other languages...
* various organizations have defined viewpoints on terminology; for
instance, firefighters want you to say "flammable" because people might
get the morphology wrong on "inflammable"; in the army you could be
sexually harassed if you call your "Rifle" your "Gun"
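
To make the list above concrete, here is a rough sketch of what a reified
label could look like. None of this is part of the Wikidata data model:
the Label class, its fields, the tag names, and both helper functions are
assumptions invented for illustration, and the vowel test for the English
article is a deliberately crude fallback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Label:
    """A label that carries its own metadata (hypothetical structure)."""
    text: str
    language: str
    script: Optional[str] = None    # e.g. "kanji", "hiragana", "romanized"
    article: Optional[str] = None   # explicit article, if one is known
    plural: Optional[str] = None    # explicit plural form, if one is known
    tags: frozenset = frozenset()   # e.g. {"offensive"}, {"brand-name"}


def usable_labels(labels, language, avoid=frozenset({"offensive"})):
    """Return labels in `language` that carry none of the `avoid` tags."""
    return [lab for lab in labels
            if lab.language == language and not (lab.tags & avoid)]


def with_article(label):
    """Prefix an English label with its indefinite article.

    Uses the stored article when there is one; otherwise falls back to a
    first-letter guess, which is wrong for words like "hour".
    """
    if label.article is not None:
        article = label.article
    else:
        article = "an" if label.text[0].lower() in "aeiou" else "a"
    return f"{article} {label.text}"


drug_labels = [
    Label("acetaminophen", "en", tags=frozenset({"generic-name"})),
    Label("Tylenol", "en", tags=frozenset({"brand-name"})),
]
generic_only = usable_labels(drug_labels, "en",
                             avoid=frozenset({"brand-name"}))
print([lab.text for lab in generic_only])
print(with_article(Label("apple", "en")))
print(with_article(Label("hour", "en", article="an")))
```

The point is only that once each label is an object rather than a bare
string, filtering by tag or script and picking an article become simple
lookups instead of guesswork.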
--
Paul Houle
(607) 539 6254 paul.houle on Skype ontology2(a)gmail.com
http://legalentityidentifier.info/lei/lookup