We are planning to deploy URLs as data values rather soon (i.e. September 9, if all goes well).

There was a discussion on wikidata-l mailing list:

The current implementation for URLs uses a string data value. There was also a IRI data value developed (for this use case), but in a previous (internal) discussion it was decided to use string value instead.

The above thread included a few strong arguments by Markus for using the IRI data value. If we want to do this, we need to decide that very quickly, and change it accordingly.

Let's see if we can make the decision here on this list. We need to make the decision by Monday latest, better earlier.

Here are my current thoughts (check also the above mentioned thread if you did not have already). Currently I have a preference to using the string value, just to point out my current bias, but I want wider input.

* I do not see the advantage of representing 'http://www.ietf.org/rfc/rfc1738.txt' as a structured data value of the form { protocol : 'http', hierarchicalpart : 'www.ietf.org/rfc/rfc1738.txt', query : '', fragment : '' }. 

* If we use string value, a number of necessary features come for free, like the diffing, displaying it in the diffs, etc. Sure, there is the argument that we can use the getString method for these, but then what is the use case that we actually serve by using the structured data?

* I understood the advantages of being able to *identify* whether the value of a snak is a string or a URL, but that seems to be the same advantages as for knowing whether the value of a snak is a Commons media file name or a string. None of the the use cases though have been explaining why using the above data structure is advantageous over a simple string value.

Please let us collect the arguments for and against using the IRI data value *structure* here (not for being able to *identify* whether a string is an IRI or a string). 

Not completely independent of that, there are a few questions that need to be answered but that are not as immediate, i.e. do not have to be decided by next week:

* should, in the external JSON structure, for every snak the data value type be listed (as it currently is)? I.e. should it state "string" instead of "Commons media filename"?

* should, in the external JSON structure, for every snak the data type of the property used be listed? This would then say URL, and this would solve all the use cases mentioned by Markus, which rely on *identifying* this distinction, not on the actual IRI data structure.

* should, in the internal JSON structure, something be changed?

The external JSON structure is the one used when communicating through the API.
The internal JSON structure is the one that you get when using the dumps.

We need to have an export of the whole Wikidata knowledge base in the external JSON format, rather sooner than later, and hopefully also in RDF. The lack of these dumps should not influence our decision right now, imho :)


Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.