I fed the Wikidata dump into a JSON profiling tool; in its first stage it identified the unique paths one could follow through the JSON data structures. The table below shows a count of the literal data items that can be found behind each path. Note that we are not just counting how many P31 claims have been made; we are also counting all of the literals inside each claim node, so the more information there is qualifying the claim, the bigger this number gets.
/claims/P31[]    144350720  instance of
/claims/P625[]    35948377  geographic coordinates
/claims/P17[]     35165614  country: sovereign state
/claims/P646[]    31095359  freebase identifier
/claims/P569[]    30881885  date of birth
/claims/P21[]     30466476  sex or gender
/claims/P105[]    29234406  taxon rank
/claims/P225[]    27808194  taxon name
/claims/P131[]    27806448  located in administrative div
/claims/P171[]    25159278  parent taxon
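For anyone who wants to reproduce this kind of count, here is a minimal sketch of the idea in Python. It is not the profiling tool I used; the function names and the toy entity are made up for illustration, and the entity only loosely imitates the shape of a Wikidata item. It walks each entity, collapses array positions into a "[]" step, and counts every literal under the path prefix of a chosen depth.

```python
from collections import Counter


def literal_paths(node, path=()):
    """Yield the path (a tuple of segments) of every literal under node."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from literal_paths(value, path + (key,))
    elif isinstance(node, list):
        # All array positions share one "[]" marker on the last segment.
        last = (path[-1] + "[]") if path else "[]"
        for value in node:
            yield from literal_paths(value, path[:-1] + (last,))
    else:
        # Strings, numbers, booleans and null count as literals.
        yield path


def profile(entities, depth):
    """Count literals per path prefix of the given depth."""
    counts = Counter()
    for entity in entities:
        for path in literal_paths(entity):
            counts["/" + "/".join(path[:depth])] += 1
    return counts


# Toy entity (hypothetical, heavily simplified).
sample = {
    "id": "Q42",
    "type": "item",
    "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
    "claims": {
        "P31": [
            {
                "mainsnak": {
                    "property": "P31",
                    "datavalue": {"value": {"id": "Q5"}},
                },
                "rank": "normal",
            }
        ]
    },
}

for path, count in profile([sample], depth=2).most_common():
    print(path, count)
```

With depth=2 the single P31 claim above contributes three literals to /claims/P31[], which is why heavily qualified claims inflate the numbers. With depth=1 the same function gives top-level totals per key.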
None of those are a surprise at all: the two great hierarchies (spatial and biological) are represented, and there are properties about people. Oddly, though, the most documented property connected with creative works is P161, which ranks in at #20.
Anyhow, it is not all claims. If you look at the highest level you see:
/datatype            1328
/id              16647896
/type            16647896
/aliases         17824992
/sitelinks       82865847
/descriptions   112796452
/labels         120721644
/claims         772152821
Everything above the /claims is part of what I have been calling the "taxonomic core". There are quite a few reasons to treat this data specially, and I'd guess this solved a chicken-vs-egg problem for WD.
In Freebase the taxonomic core is roughly half the mass of the whole thing. The claims are certainly bulked up in Wikidata because of the qualifying information.
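As a rough check on the proportions, the top-level counts quoted above can be summed directly. This sketch assumes that "taxonomic core" means every top-level key except /claims, which is my reading of the listing and not a formal definition.

```python
# Top-level literal counts as quoted in the listing above.
top_level = {
    "/datatype": 1328,
    "/id": 16647896,
    "/type": 16647896,
    "/aliases": 17824992,
    "/sitelinks": 82865847,
    "/descriptions": 112796452,
    "/labels": 120721644,
    "/claims": 772152821,
}

total = sum(top_level.values())
claims = top_level["/claims"]
core = total - claims  # assumption: everything that is not a claim

claims_share = claims / total
core_share = core / total

print(f"total literals: {total}")
print(f"claims:         {claims_share:.1%}")
print(f"non-claims:     {core_share:.1%}")
```

By this count the claims account for roughly two thirds of the literals in Wikidata and everything else for about one third, compared with the roughly half I see for the taxonomic core in Freebase.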
If anything is weak about the fundamental data model it is that aliases and labels are not reified the way the claims are. This is a big deal if you want a usable lexical database. For instance, labels should be taggable as to
* being potentially offensive (e.g. insults that start with "N")
* generic name for drug/brand names for drug
* Japanese labels should be available in kanji, hiragana, and romanized form, and should be identifiable that way
* in English we have it easy: you can generate "Mad Lib" style texts correctly if you (1) know which article to use and (2) know how to make both the plural and singular forms. (1) is easy to guess if you have semantic data, and you can get away with being imperfect at (2).
* for German, however, you need to tag by grammatical gender, and the choice of the article is a function of said gender and the relationship between the concept and the predicate as well as the verb tense
* similar things exist for most of the other languages...
* various organizations have defined viewpoints on terminology; for instance firefighters want you to say 'flammable' because people might get the morphology wrong on 'inflammable'; in the army you could be sexually harassed if you call your "Rifle" your "Gun"
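To make the idea concrete, here is a minimal sketch of what a reified label could look like. This is purely illustrative and not anything Wikidata provides: the Label class, its field names, the tag strings, and both helper functions are hypothetical. The German table covers only definite articles in the singular for three cases, so it is far from a full grammar.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Label:
    """A label carrying its own metadata (hypothetical model)."""
    text: str
    language: str                      # e.g. "en", "ja", "de"
    script: Optional[str] = None       # e.g. "kanji", "hiragana", "romanized"
    gender: Optional[str] = None       # grammatical gender, where relevant
    tags: FrozenSet[str] = frozenset() # e.g. {"brand-name"}, {"offensive"}


def select(labels: Iterable[Label], language: str,
           script: Optional[str] = None,
           exclude: FrozenSet[str] = frozenset()) -> List[Label]:
    """Pick labels by language and script, skipping any excluded tag."""
    return [
        label for label in labels
        if label.language == language
        and (script is None or label.script == script)
        and not (label.tags & exclude)
    ]


# German definite articles, singular only, keyed by (gender, case).
GERMAN_DEFINITE = {
    ("masculine", "nominative"): "der",
    ("feminine", "nominative"): "die",
    ("neuter", "nominative"): "das",
    ("masculine", "accusative"): "den",
    ("feminine", "accusative"): "die",
    ("neuter", "accusative"): "das",
    ("masculine", "dative"): "dem",
    ("feminine", "dative"): "der",
    ("neuter", "dative"): "dem",
}


def german_noun_phrase(label: Label, case: str) -> str:
    """Prefix a German label with the article its gender and case require."""
    return f"{GERMAN_DEFINITE[(label.gender, case)]} {label.text}"


labels = [
    Label("東京", "ja", script="kanji"),
    Label("とうきょう", "ja", script="hiragana"),
    Label("Tōkyō", "ja", script="romanized"),
    Label("ibuprofen", "en", tags=frozenset({"generic-name"})),
    Label("Advil", "en", tags=frozenset({"brand-name"})),
    Label("Hund", "de", gender="masculine"),
]

print(select(labels, "ja", script="hiragana")[0].text)
print([l.text for l in select(labels, "en", exclude=frozenset({"brand-name"}))])
print(german_noun_phrase(labels[-1], "accusative"))
```

The point is only that once each label is an object with its own attributes, filtering by script, suppressing brand names or offensive terms, and choosing an article become simple lookups.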
Nice data generation. We also have https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/T...
Paul Houle, 11/01/2015 18:11:
> None of those are a surprise at all: the two great hierarchies (spatial and biological) are represented, and there are properties about people. Oddly, though, the most documented property connected with creative works is P161, which ranks in at #20.
Those probably come from Wikipedia articles on films etc. Having 90k such articles is a lot. You can find work in this area (and contribute) at https://www.wikidata.org/wiki/Category:Cultural_WikiProjects
On 11.01.2015 17:11, Paul Houle wrote:
> If anything is weak about the fundamental data model it is that aliases and labels are not reified the way the claims are. This is a big deal if you want a usable lexical database.
The plan is to support even more fine-grained modeling of grammatical properties, translations, etc., not on "data items" (which model concepts) but using "lexical entities", which model words, phrases or expressions. This is still only a rough draft, though; see https://www.wikidata.org/wiki/Wikidata:Wiktionary for details. It seems prudent to have lexical entities as first-class citizens, but separate from concepts. Basically, lexical WD would relate to Wiktionary like the conceptual WD relates to Wikipedia.
Until then, you can always go ahead and create properties like "official name" (P1448) and qualify it in whatever way. Having labels, descriptions and aliases separate does not keep you from modeling them again as statements. One remaining obstacle is the fact that WD does not have a data type for multilingual text yet, but that will be added soon(ish).
-- daniel