Dear all,
the main discussion Denny proposes is not the one that we have had on the lists so far. Denny said that the use of the IRI datavalue type would require us to use a specific serialisation format that shows, e.g., the protocol as a separate string. This is a detail of the internal structure of the IRI datatype that we had not talked about yet.
Just to get this out of the way, let me explain briefly why IRIs are considered to consist of multiple strings in some data models (esp. in SMW). The main reason for representing IRIs internally as several strings (protocol, ...) is to aid validation, since the individual parts allow different characters (also, the protocol is case-insensitive while the rest is case-sensitive). This is why the SMW dataitem object for URIs takes multiple strings in its constructor.
However, this does not mean that you have to store the value as a compound object that contains many strings. In fact, this strikes me as a rather cumbersome approach that would make it harder to use the data. In SMW we store URIs as one string. Splitting this string into parts (under the assumption that it was a well-formed URL to start with) is quite easy, if this is needed (SMW does this). Conclusion: the use of a datatype for IRIs is in no way tied to the use of an impractical serialisation; reference implementations exist.
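To illustrate how little is needed to split a stored string on demand, here is a minimal sketch. It uses Python's standard urllib.parse rather than SMW's actual (PHP) code, and the function name split_iri is made up for this example; the key names simply follow the structure quoted in Denny's mail below.

```python
from urllib.parse import urlsplit


def split_iri(iri: str) -> dict:
    """Split a single stored IRI string into its parts on demand.

    Assumes the input was a well-formed URL to start with.
    """
    parts = urlsplit(iri)
    return {
        # urlsplit lowercases the scheme, matching its case-insensitivity
        "protocol": parts.scheme,
        # authority and path joined, as in the structure quoted below
        "hierarchicalpart": parts.netloc + parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# The value is stored and serialised as one plain string ...
stored = "http://www.ietf.org/rfc/rfc1738.txt"
# ... and only taken apart where a consumer actually needs the parts.
components = split_iri(stored)
```

The stored form stays a single string; the compound view is derived only when it is needed.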
So back to my original concern. The point of my email was to insist that URIs need to be treated differently from strings in many important applications, and that it therefore makes sense to keep the knowledge about this difference in the data model. This only requires us to write "iri" instead of "string" as the datavalue type in the serialisation. That's all I was arguing for. This should also address Denny's one point not related to internal data structures (diffing could use the same code as for strings).
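As a rough sketch of what this buys a consumer of the dumps: the two serialisations below differ only in the type tag, yet an RDF-style export can only pick the right kind of term if that tag is there. The JSON field names "type" and "value" and the function to_rdf_term are assumptions made for this example, not the actual Wikidata serialisation or any existing export code.

```python
import json


def to_rdf_term(datavalue: dict) -> str:
    """Render a datavalue as an N-Triples-style term (illustrative only)."""
    value = datavalue["value"]
    if datavalue["type"] == "iri":
        # IRIs become resource references
        return f"<{value}>"
    # everything else is treated as a plain string literal here
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Same payload, different type tag (hypothetical field names).
as_string = json.loads(
    '{"type": "string", "value": "http://www.ietf.org/rfc/rfc1738.txt"}'
)
as_iri = json.loads(
    '{"type": "iri", "value": "http://www.ietf.org/rfc/rfc1738.txt"}'
)

literal_term = to_rdf_term(as_string)
resource_term = to_rdf_term(as_iri)
```

In both cases the value itself remains one string, so code that only cares about the text (diffing, display) can treat the two alike.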
The other discussion items that Denny brought up might be interesting at some point, but I would rather focus on the immediate questions for now. Especially if we need to make a decision by Monday, we should narrow the discussion down as much as possible. In particular, introducing the property datatype as additional information into the external JSON format would be a much more complex change, and at the same time would not solve the problem (which was related to processing the JSON dumps).
I agree with Daniel that it would be better if the so-called "internal" format were really internal, but this is not the reality of Wikidata today. Even if we intend to replace the current dumps with new dumps that use "external" formats, we should make sure that our internal format is at least as specific as the basic external formats. In other words: the internal format may contain auxiliary "internal" information and maybe "unofficial" values (like "bad", though this was not intended); but it should also contain all the information that the most basic external formats require. I strongly feel that the internal serialisation is (a representation of) the de facto data model, whatever we may write elsewhere. Code is more powerful than words. Turning IRIs into strings there will turn IRIs into strings everywhere. I don't think this would be a good design for a data model today.
Cheers,
Markus
P.S. I also do not agree that the "IRI vs. string" question is equally relevant or equally clear as the "commons media vs. string vs. IRI" question. Commons media is an application-level datatype specific to Wikimedia, while IRI and string are fundamental types in formats like XML, RDF and OWL. Most programming languages have special handling for IRIs, comparable to special handling for times, even if neither is a fundamental machine-level type. The question of Commons media is clearly much less important and should not be intertwined here.
On 29/08/13 16:41, Denny Vrandečić wrote:
We are planning to deploy URLs as data values rather soon (i.e. September 9, if all goes well).
There was a discussion on wikidata-l mailing list: http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg02664.html
The current implementation for URLs uses a string data value. There was also an IRI data value developed (for this use case), but in a previous (internal) discussion it was decided to use the string value instead.
The above thread included a few strong arguments by Markus for using the IRI data value. If we want to do this, we need to decide that very quickly, and change it accordingly.
Let's see if we can make the decision here on this list. We need to make the decision by Monday at the latest, preferably earlier.
Here are my current thoughts (please also check the above-mentioned thread if you have not already). Currently I have a preference for using the string value, just to point out my current bias, but I want wider input.
- I do not see the advantage of representing 'http://www.ietf.org/rfc/rfc1738.txt' as a structured data value of the form { protocol : 'http', hierarchicalpart : 'www.ietf.org/rfc/rfc1738.txt', query : '', fragment : '' }.
- If we use string value, a number of necessary features come for free, like the diffing, displaying it in the diffs, etc. Sure, there is the argument that we can use the getString method for these, but then what is the use case that we actually serve by using the structured data?
- I understood the advantages of being able to *identify* whether the value of a snak is a string or a URL, but those seem to be the same advantages as for knowing whether the value of a snak is a Commons media file name or a string. None of the use cases, though, explains why using the above data structure is advantageous over a simple string value.
Please let us collect the arguments for and against using the IRI data value *structure* here (not for being able to *identify* whether a string is an IRI or a string).
Not completely independent of that, there are a few questions that need to be answered but that are not as immediate, i.e. do not have to be decided by next week:
- should, in the external JSON structure, for every snak the data value type be listed (as it currently is)? I.e. should it state "string" instead of "Commons media filename"?
- should, in the external JSON structure, for every snak the data type of the property used be listed? This would then say URL, and this would solve all the use cases mentioned by Markus, which rely on *identifying* this distinction, not on the actual IRI data structure.
- should, in the internal JSON structure, something be changed?
The external JSON structure is the one used when communicating through the API. The internal JSON structure is the one that you get when using the dumps.
We need to have an export of the whole Wikidata knowledge base in the external JSON format, sooner rather than later, and hopefully also in RDF. The lack of these dumps should not influence our decision right now, imho :)
Cheers, Denny
-- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de