I fed the Wikidata dump into a JSON profiling tool; in the first stage it identified unique paths one could follow through the JSON data structures.

The table below shows a count of the literal data items that can be found behind each path. We're not just counting how many P31[] claims have been made; we are also counting all of the literals inside each claim node, so the more information there is qualifying a claim, the bigger this number gets.

/claims/P31[]	144350720 instance of
/claims/P625[]	35948377 geographic coordinates
/claims/P17[]	35165614 country: sovereign state
/claims/P646[]	31095359 freebase identifier
/claims/P569[]	30881885 date of birth
/claims/P21[]	30466476 sex or gender
/claims/P105[]	29234406 taxon rank
/claims/P225[]	27808194 taxon name
/claims/P131[]	27806448 located in administrative div
/claims/P171[]	25159278 parent taxon

None of those are a surprise at all: the two great hierarchies (spatial and biological) are represented, and there are properties about people. Oddly, though, the most heavily documented property connected with creative works is P161 (cast member), which ranks at #20.
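
The counting rule behind the table above can be sketched roughly as follows. This is not the profiling tool itself, only a minimal illustration: the function names, the treatment of null as a literal, and the heavily simplified sample entity are all assumptions made for the example. It does rely on the dump being laid out as one JSON entity per line inside one large array.

```python
import json
from collections import Counter


def count_literals(node, counts, ancestors=()):
    """Credit every literal under `node` to each path that leads to it."""
    parent = ancestors[-1] if ancestors else ""
    if isinstance(node, dict):
        for key, value in node.items():
            count_literals(value, counts, ancestors + (f"{parent}/{key}",))
    elif isinstance(node, list):
        # Array members share one path, written with a trailing "[]".
        for item in node:
            count_literals(item, counts, ancestors + (f"{parent}[]",))
    else:
        # A literal (string, number, boolean, or null): bump every ancestor path.
        for path in ancestors:
            counts[path] += 1


def profile(lines):
    """Profile a dump stored as one JSON entity per line inside a big array."""
    counts = Counter()
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        count_literals(json.loads(line), counts)
    return counts


# Hypothetical, heavily simplified entity; real entities carry far more detail.
entity = {
    "id": "Q42",
    "type": "item",
    "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
    "claims": {
        "P31": [
            {
                "mainsnak": {"property": "P31", "datavalue": {"value": {"id": "Q5"}}},
                "rank": "normal",
            }
        ]
    },
}

counts = profile(["[", json.dumps(entity) + ",", "]"])
for path in ("/id", "/labels", "/claims", "/claims/P31[]"):
    print(path, counts[path])
```

Because each literal is credited to every path above it, a single P31 claim with many qualifiers and references adds many counts to /claims/P31[], which is why that number is so much larger than the number of claims.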

Anyhow, it is not all claims. If you look at the top level you see

/datatype	1328
/id	16647896
/type	16647896
/aliases	17824992
/sitelinks	82865847
/descriptions	112796452
/labels	120721644
/claims	772152821

Everything above /claims is part of what I have been calling the "taxonomic core". There are quite a few reasons to treat this data specially, and I'd guess this solved a chicken-vs-egg problem for Wikidata.

In Freebase the taxonomic core is roughly half the mass of the whole thing. The claims are certainly bulked up in Wikidata because of the qualifying information.
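
For comparison, the Wikidata share can be worked out from the top-level counts listed earlier. This is only a rough sketch: it assumes literal counts are a fair proxy for "mass" and that every top-level key other than /claims belongs to the taxonomic core.

```python
# Top-level literal counts, copied from the listing in this post.
top_level = {
    "/datatype": 1328,
    "/id": 16647896,
    "/type": 16647896,
    "/aliases": 17824992,
    "/sitelinks": 82865847,
    "/descriptions": 112796452,
    "/labels": 120721644,
    "/claims": 772152821,
}

# Assumption: the "taxonomic core" is everything except /claims.
core = sum(n for path, n in top_level.items() if path != "/claims")
total = sum(top_level.values())
core_share = core / total

print(f"core literals:  {core}")
print(f"total literals: {total}")
print(f"core share:     {core_share:.1%}")
```

By this measure the core comes to a little under a third of the literals in Wikidata, against roughly half in Freebase.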

If anything is weak about the fundamental data model it is that aliases and labels are not reified the way the claims are. This is a big deal if you want a usable lexical database. For instance, labels should be taggable as to

* being potentially offensive (e.g. insults that start with "N")

* being the generic name of a drug versus one of its brand names

* Japanese labels should be available in kanji, hiragana, and romanized form and should be identifiable that way

* in English we have it easy: you can generate "Mad Lib" style texts correctly if you know (1) which article to use and (2) how to make both the singular and plural forms. (1) is easy to guess if you have semantic data, and you can get away with being imperfect at (2).

* for German, however, you need to tag by grammatical gender, and the choice of article is a function of that gender, the number, and the grammatical case (that is, the relationship between the concept and the predicate)

* similar things exist for most of the other languages...

* various organizations have defined viewpoints on terminology; for instance firefighters want you to say 'flammable' because people might get the morphology wrong on 'inflammable'; in the army you could be sexually harassed if you call your "Rifle" your "Gun"
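
To make the list above concrete, here is a minimal sketch of what a reified label might look like. Nothing here is part of the Wikidata data model: `ReifiedLabel`, its fields, the tag names, and the helper functions are all hypothetical, and the German table covers only singular definite articles in three cases while ignoring noun inflection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReifiedLabel:
    """Hypothetical label that carries metadata the way a claim does."""
    text: str
    language: str
    script: Optional[str] = None   # e.g. "kanji", "hiragana", "romaji"
    gender: Optional[str] = None   # grammatical gender, where the language has one
    tags: frozenset = frozenset()  # e.g. {"offensive"}, {"brand-name"}


def select(labels, language, script=None, exclude=frozenset()):
    """Return label texts for a language, optionally by script, minus excluded tags."""
    return [
        label.text
        for label in labels
        if label.language == language
        and (script is None or label.script == script)
        and not (label.tags & frozenset(exclude))
    ]


# Singular definite articles only; genitive and plural are left out of this sketch.
GERMAN_DEFINITE_ARTICLE = {
    ("masculine", "nominative"): "der",
    ("feminine", "nominative"): "die",
    ("neuter", "nominative"): "das",
    ("masculine", "accusative"): "den",
    ("feminine", "accusative"): "die",
    ("neuter", "accusative"): "das",
    ("masculine", "dative"): "dem",
    ("feminine", "dative"): "der",
    ("neuter", "dative"): "dem",
}


def german_noun_phrase(label, case):
    """Pick the article from the label's gender tag and the requested case."""
    return f"{GERMAN_DEFINITE_ARTICLE[(label.gender, case)]} {label.text}"


tokyo = [
    ReifiedLabel("東京", "ja", script="kanji"),
    ReifiedLabel("とうきょう", "ja", script="hiragana"),
    ReifiedLabel("Tōkyō", "ja", script="romaji"),
]
drug = [
    ReifiedLabel("ibuprofen", "en", tags=frozenset({"generic-name"})),
    ReifiedLabel("Advil", "en", tags=frozenset({"brand-name"})),
]
hund = ReifiedLabel("Hund", "de", gender="masculine")

print(select(tokyo, "ja", script="hiragana"))
print(select(drug, "en", exclude={"brand-name"}))
print(german_noun_phrase(hund, "dative"))
```

With flat label strings none of these selections is possible without guessing; once the label is a node of its own, each of the bullets above becomes one more field or tag on it.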

Paul Houle