Hello everybody,
Since I am working on the conversion from the dump files to the wdtk data model, I will have to take apart the "refs" section of the JSON representing the stored items.
Now a "refs"-section most likely looks like this: (Tried to format it for readability)
"refs": [
  [
    [ "value", 248, "wikibase-entityid",
      {"entity-type":"item","numeric-id":15241312} ],
    [ "value", 577, "time",
      {"time":"+00000002013-10-28T00:00:00Z","timezone":0,"before":0,"after":0,"precision":11,"calendarmodel":"http://www.wikidata.org/entity/Q1985727"} ]
  ]
]
So I figured out the following: The outer array groups all references. The second-level array groups the information about one reference (so if we had multiple references, we would also have multiple second-level arrays), and the inner arrays each group one specific piece of information about one specific reference (as determined by the second-level array they are nested in). Am I correct so far?
The integer following the "value"-string denotes what the following information is about. So if I read that the value is 577, I know that there must be a specification of time, don't I?
Is there any specific reason why this is done in array form and not as a JSON object? (Since the "value" key is always there, one could know from its value which other keys must be available.) If yes, why mention the type of information (e.g. "time") again? Am I overlooking something?
Thanks so far, -- Fredo Erxleben
Hi Fredo,
On 20/02/14 19:59, Fredo Erxleben wrote:
Hello everybody,
Since I am working on the conversion from the dump files to the wdtk data model, I will have to take apart the "refs" section of the JSON representing the stored items.
Now a "refs"-section most likely looks like this: (Tried to format it for readability)
"refs": [
  [
    [ "value", 248, "wikibase-entityid",
      {"entity-type":"item","numeric-id":15241312} ],
    [ "value", 577, "time",
      {"time":"+00000002013-10-28T00:00:00Z","timezone":0,"before":0,"after":0,"precision":11,"calendarmodel":"http://www.wikidata.org/entity/Q1985727"} ]
  ]
]
So I figured out the following: The outer array groups all references. The second-level array groups the information about one reference (so if we had multiple references, we would also have multiple second-level arrays), and the inner arrays each group one specific piece of information about one specific reference (as determined by the second-level array they are nested in). Am I correct so far?
Yes, where each inner array is the encoding of one snak.
The integer following the "value"-string denotes what the following information is about. So if I read that the value is 577, I know that there must be a specification of time, don't I?
The values in the snak arrays are as follows in the case of ValueSnaks:
[0]: snak type ("value" for ValueSnaks; other possible values are "novalue" and "somevalue")
[1]: snak property ("577" is for "P577")
[2]: primitive type of datavalue (these correspond to the ...Value classes)
[3]: encoding of the primitive datavalue
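As a rough illustration of this layout, here is a minimal Python sketch that takes a "refs" array apart by position. The names Snak, parse_snak and parse_refs are hypothetical and not part of Wikibase or wdtk. The sketch also assumes that "novalue" and "somevalue" snaks carry only the snak type and the property, which the description above does not spell out.

```python
from typing import Any, List, NamedTuple, Optional


class Snak(NamedTuple):
    snak_type: str             # "value", "novalue" or "somevalue"
    property_id: str           # e.g. "P577"
    value_type: Optional[str]  # e.g. "time"; None if the snak carries no value
    value: Any                 # still-encoded primitive datavalue, or None


def parse_snak(raw: list) -> Snak:
    """Read one inner array by position: [0] type, [1] property, [2] value type, [3] value."""
    if len(raw) < 2:
        raise ValueError(f"snak array too short: {raw!r}")
    snak_type, numeric_property = raw[0], raw[1]
    property_id = f"P{numeric_property}"
    if snak_type == "value":
        if len(raw) != 4:
            raise ValueError(f"value snak needs 4 entries, got {len(raw)}")
        return Snak(snak_type, property_id, raw[2], raw[3])
    if snak_type in ("novalue", "somevalue"):
        # Assumption: these snaks have no datavalue entries to read.
        return Snak(snak_type, property_id, None, None)
    raise ValueError(f"unknown snak type: {snak_type!r}")


def parse_refs(refs: list) -> List[List[Snak]]:
    """Outer array: all references. Second level: one reference. Inner arrays: snaks."""
    return [[parse_snak(raw) for raw in reference] for reference in refs]


refs = [
    [
        ["value", 248, "wikibase-entityid",
         {"entity-type": "item", "numeric-id": 15241312}],
        ["value", 577, "time",
         {"time": "+00000002013-10-28T00:00:00Z", "timezone": 0, "before": 0,
          "after": 0, "precision": 11,
          "calendarmodel": "http://www.wikidata.org/entity/Q1985727"}],
    ]
]
references = parse_refs(refs)
```

The datavalue in position [3] is deliberately left undecoded here; how to interpret it depends on the type name in position [2].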
If you know the datatype of P577, then you could indeed infer that the primitive value used here must be "time". However, the datatype is not given in this place in the dump, so it would be impossible to interpret the dump of one entity without knowing external context information. This is why the type of the primitive value is explicitly specified.
Is there any specific reason why this is done in array form and not as a JSON object? (Since the "value" key is always there, one could know from its value which other keys must be available.)
"value" is not a key but an entry that denotes a snak type.
If yes, why mention the type of information (e.g. "time") again? Am I overlooking something?
Answered above. Another important point is that each primitive value can decide on its encoding locally, without depending on the encoding of other value types. Therefore, a tool that reads this data cannot "guess" the primitive type by looking at the encoding only. It seems obvious that the map
{"time":"+00000002013-10-28T00:00:00Z","timezone":0,"before":0,"after":0,"precision":11,"calendarmodel":"http://www.wikidata.org/entity/Q1985727"}
encodes a time value, but it should not be assumed that one can always do this. There could even be values of different types that have the same encoding.
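To make this concrete, a reader can select the decoder from the explicit type name instead of inspecting the shape of the encoded value. The following Python sketch is only an illustration: the names DECODERS, decode_entity_id, decode_time and decode_value are hypothetical, only the two value types from the example are covered, and the return formats are arbitrary choices for the sketch.

```python
from typing import Any, Callable, Dict, Tuple


def decode_entity_id(data: dict) -> str:
    """Turn an entity id encoding into an id string such as "Q15241312"."""
    prefixes = {"item": "Q", "property": "P"}
    return prefixes[data["entity-type"]] + str(data["numeric-id"])


def decode_time(data: dict) -> Tuple[str, int]:
    """Keep just the timestamp string and its precision (a simplification)."""
    return data["time"], data["precision"]


# The decoder is chosen by the explicit type name, never by looking at the keys.
DECODERS: Dict[str, Callable[[dict], Any]] = {
    "wikibase-entityid": decode_entity_id,
    "time": decode_time,
}


def decode_value(value_type: str, data: dict) -> Any:
    try:
        decoder = DECODERS[value_type]
    except KeyError:
        raise ValueError(f"unsupported primitive type: {value_type!r}") from None
    return decoder(data)


entity = decode_value("wikibase-entityid",
                      {"entity-type": "item", "numeric-id": 15241312})
```

With this design, two value types that happened to share an encoding would still be handled correctly, because the dispatch never depends on the encoding itself.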
Cheers,
Markus
wikidata-tech@lists.wikimedia.org