We are planning to deploy URLs as data values rather soon (i.e. September
9, if all goes well).
There was a discussion on the wikidata-l mailing list:
<http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg02664.html>
The current implementation for URLs uses a string data value. An IRI data
value was also developed (for this use case), but in a previous
(internal) discussion it was decided to use the string value instead.
The above thread included a few strong arguments by Markus for using the
IRI data value. If we want to do this, we need to decide that very quickly,
and change it accordingly.
Let's see if we can make the decision here on this list. We need to make
the decision by Monday at the latest, preferably earlier.
Here are my current thoughts (see also the above-mentioned thread if you
have not already). I currently prefer using the string value, to state my
bias up front, but I want wider input.
* I do not see the advantage of representing
'http://www.ietf.org/rfc/rfc1738.txt' as a structured data value of the form
{ protocol : 'http', hierarchicalpart : 'www.ietf.org/rfc/rfc1738.txt',
query : '', fragment : '' }.
* If we use the string value, a number of necessary features come for free,
such as diffing and displaying the value in diffs. Sure, there is the
argument that we can use the getString method for these, but then what is
the use case that we actually serve by using the structured data?
* I understand the advantage of being able to *identify* whether the value
of a snak is a string or a URL, but that seems to be the same advantage as
knowing whether the value of a snak is a Commons media file name or a
string. None of the use cases, though, explains why using the
above data structure is advantageous over a simple string value.
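
To make the comparison concrete, here is a minimal Python sketch of the two
representations. The class IriValue and its method names are hypothetical
and only mirror the field names from the first bullet; this is not the
actual data value implementation. It only handles IRIs of the form
scheme://..., which is enough for the example.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class IriValue:
    """Hypothetical structured value; field names follow the example above."""
    protocol: str
    hierarchicalpart: str
    query: str = ""
    fragment: str = ""

    @classmethod
    def from_string(cls, text: str) -> "IriValue":
        parts = urlsplit(text)
        # Authority and path together form the hierarchical part.
        return cls(parts.scheme, parts.netloc + parts.path,
                   parts.query, parts.fragment)

    def get_string(self) -> str:
        # Assumes a "scheme://" style IRI; other schemes are out of scope here.
        out = f"{self.protocol}://{self.hierarchicalpart}"
        if self.query:
            out += "?" + self.query
        if self.fragment:
            out += "#" + self.fragment
        return out


url = "http://www.ietf.org/rfc/rfc1738.txt"   # plain string value
value = IriValue.from_string(url)             # structured value

# Diffing and display need a string either way, so the structured
# value has to be flattened again before it can be used for them.
round_trip = value.get_string()
```

In this sketch the structured form carries no information that the string
does not, which is the point of the first two bullets.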
Please let us collect the arguments for and against using the IRI data
value *structure* here (not for being able to *identify* whether a value
is an IRI or a plain string).
Not completely independent of that, there are a few questions that need to
be answered but that are not as immediate, i.e. do not have to be decided
by next week:
* should, in the external JSON structure, for every snak the data value
type be listed (as it currently is)? I.e. should it state "string" instead
of "Commons media filename"?
* should, in the external JSON structure, for every snak the data type of
the property used be listed? This would then say URL, and this would solve
all the use cases mentioned by Markus, which rely on *identifying* this
distinction, not on the actual IRI data structure.
* should, in the internal JSON structure, something be changed?
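
As a rough illustration of the second question, the following Python
sketch shows a snak that lists both the data value type and the data type
of its property. All key names ("datatype", "datavalue", "type", "value")
and the property IDs are made up for this example and are not meant to
describe the current API output.

```python
import json


def make_snak(property_id, property_datatype, text):
    """Build a hypothetical external-JSON snak; key names are assumptions."""
    return {
        "property": property_id,
        # Data type of the property, e.g. "url" or "commonsMedia".
        "datatype": property_datatype,
        # Data value type: a plain string in both cases.
        "datavalue": {"type": "string", "value": text},
    }


def is_url_snak(snak):
    # Identification relies on the property data type only,
    # not on any IRI structure inside the value.
    return snak.get("datatype") == "url"


# Property IDs below are placeholders, not real Wikidata properties.
url_snak = make_snak("P-example-url", "url",
                     "http://www.ietf.org/rfc/rfc1738.txt")
file_snak = make_snak("P-example-file", "commonsMedia", "Example.jpg")

print(json.dumps(url_snak, indent=2))
```

With the property data type listed, a consumer can tell a URL from a
Commons media file name even though both values are plain strings.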
The external JSON structure is the one used when communicating through the
API.
The internal JSON structure is the one that you get when using the dumps.
We need to have an export of the whole Wikidata knowledge base in the
external JSON format, sooner rather than later, and hopefully also in RDF.
The lack of these dumps should not influence our decision right now, imho :)
Cheers,
Denny
--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 |
http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as a non-profit by
the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.