IRI-value or string-value for URLs?

Denny Vrandečić

29 Aug 2013

12:41 p.m.

We are planning to deploy URLs as data values rather soon (i.e. September 9, if all goes well).

There was a discussion on wikidata-l mailing list: http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg02664.html

The current implementation for URLs uses a string data value. An IRI data value was also developed (for this use case), but in a previous (internal) discussion it was decided to use the string value instead.

The above thread included a few strong arguments by Markus for using the IRI data value. If we want to do this, we need to decide that very quickly, and change it accordingly.

Let's see if we can make the decision here on this list. We need to make the decision by Monday at the latest, preferably earlier.

Here are my current thoughts (see also the above-mentioned thread if you have not already). I currently prefer the string value, just to point out my bias, but I want wider input.

* I do not see the advantage of representing 'http://www.ietf.org/rfc/rfc1738.txt' as a structured data value of the form { protocol : 'http', hierarchicalpart : 'www.ietf.org/rfc/rfc1738.txt', query : '', fragment : '' }.

* If we use string value, a number of necessary features come for free, like the diffing, displaying it in the diffs, etc. Sure, there is the argument that we can use the getString method for these, but then what is the use case that we actually serve by using the structured data?

* I understand the advantages of being able to *identify* whether the value of a snak is a string or a URL, but those seem to be the same advantages as for knowing whether the value of a snak is a Commons media file name or a string. None of the use cases, though, explains why using the above data structure is advantageous over a simple string value.

Please let us collect the arguments for and against using the IRI data value *structure* here (not for being able to *identify* whether a string is an IRI or a string).

Not completely independent of that, there are a few questions that need to be answered but that are not as immediate, i.e. do not have to be decided by next week:

* should, in the external JSON structure, for every snak the data value type be listed (as it currently is)? I.e. should it state "string" instead of "Commons media filename"?

* should, in the external JSON structure, for every snak the data type of the property used be listed? This would then say URL, and this would solve all the use cases mentioned by Markus, which rely on *identifying* this distinction, not on the actual IRI data structure.

* should, in the internal JSON structure, something be changed?

The external JSON structure is the one used when communicating through the API. The internal JSON structure is the one that you get when using the dumps.
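To make the two questions above concrete, here is a minimal sketch of what a snak might look like in the external JSON with and without the property data type. The field names ("snaktype", "datatype", "datavalue") and the property id "P123" are assumptions for illustration, not a confirmed format.

```python
import json

# Hypothetical snak as a consumer sees it when only the data value type is
# listed: a URL cannot be told apart from any other string.
snak_without_datatype = {
    "snaktype": "value",
    "property": "P123",  # hypothetical property id
    "datavalue": {"type": "string", "value": "http://www.ietf.org/rfc/rfc1738.txt"},
}

# The same snak with the data type of the property listed as well
# (the field name "datatype" is an assumption).
snak_with_datatype = dict(snak_without_datatype, datatype="url")


def is_url_snak(snak):
    """Identify a URL snak by the property data type, not by the value structure."""
    return snak.get("datatype") == "url"


print(json.dumps(snak_with_datatype, indent=2))
print(is_url_snak(snak_without_datatype), is_url_snak(snak_with_datatype))
```

In this sketch the data value stays a plain string in both cases; only the extra field lets a consumer identify the value as a URL.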

We need to have an export of the whole Wikidata knowledge base in the external JSON format, sooner rather than later, and hopefully also in RDF. The lack of these dumps should not influence our decision right now, imho :)

Cheers, Denny

-- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.

Daniel Kinzler

29 Aug

12:52 p.m.

My two cents:

...

Please let us collect the arguments for and against using the IRI data value *structure* here (not for being able to *identify* whether a string is an IRI or a string).

I see no advantage in using the IriValue internally, as we have no use case for accessing parts of the URL individually.

I see some slight disadvantages (a little extra code to be written for plain text display and validation of IriValues).

...

should, in the external JSON structure, for every snak the data value type be listed (as it currently is)? I.e. should it state "string" instead of "Commons media filename"?

I think it's useless and even misleading to have that info in the external format. It should never have been there.

However, now that it is there, it may be prudent to keep it. The biggest problem with it is that people mistake/misuse it for identifying the type of the snak value - which is *not* what this represents.

...

should, in the external JSON structure, for every snak the data type of the property used be listed? This would then say URL, and this would solve all the use cases mentioned by Markus, which rely on *identifying* this distinction, not on the actual IRI data structure.

Yes, I think that would be very helpful for various reasons.

I would even go so far as to say that Snaks should also have this information internally (in php and in the internal JSON format).

...

We need to have an export of the whole Wikidata knowledge base in the external JSON format, sooner rather than later, and hopefully also in RDF. The lack of these dumps should not influence our decision right now, imho :)

Oh yes. It's really bad that people are relying on the internal format found in the XML dumps.

-- daniel

Markus Krötzsch

8:17 p.m.

Dear all,

the main discussion Denny proposes is not the one that we have had on the lists so far. Denny said that the use of the IRI datavalue type would require us to use a specific serialisation format that shows, e.g., the protocol as a separate string. This is a detail of the internal structure of the IRI datatype that we had not talked about yet.

Just to get this out of the way, let me explain briefly why IRIs are considered to consist of multiple strings in some data models (esp. in SMW). The main reason for representing IRIs as several strings (protocol, ...) internally is to aid validation, since these strings allow different characters (also, protocol is case-insensitive while the rest is case sensitive). This is why the SMW dataitem object for URIs takes multiple strings in its constructor.

However, this does not mean that you have to store the value as a compound object that contains many strings. In fact, this strikes me as a rather cumbersome approach that would make it harder to use the data. In SMW we store URIs as one string. Splitting this string into parts (under the assumption that it was a well-formed URL to start with) is quite easy, if this is needed (SMW does this). Conclusion: the use of a datatype for IRIs is in no way tied to the use of an impractical serialisation; reference implementations exist.
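As a rough illustration of this point (not SMW's actual code), the sketch below keeps the IRI as one string and splits it only when the parts are needed, using Python's standard urllib.parse. The helper same_iri is hypothetical; it follows the simplification stated above, comparing the protocol case-insensitively and everything else exactly.

```python
from urllib.parse import urlsplit, urlunsplit

# The value is stored as a single string ...
stored = "http://www.ietf.org/rfc/rfc1738.txt"

# ... and split into parts only on demand (assuming it is well-formed).
parts = urlsplit(stored)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Joining the parts again gives back the stored string for this example.
print(urlunsplit(parts) == stored)


def same_iri(a, b):
    """Hypothetical comparison: protocol case-insensitive, the rest case-sensitive."""
    pa, pb = urlsplit(a), urlsplit(b)
    return pa.scheme.lower() == pb.scheme.lower() and pa[1:] == pb[1:]


print(same_iri("HTTP://www.ietf.org/rfc/rfc1738.txt", stored))
print(same_iri("http://www.ietf.org/RFC/rfc1738.txt", stored))
```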

So back to my original concern. The point of my email was to insist that URIs need to be treated differently from strings in many important applications, and that it therefore makes sense to keep the knowledge about this difference in the data model. This only requires us to write "iri" instead of "string" as the datavalue type in the serialisation. That's all I was arguing for. This should also address Denny's one point not related to internal data structures (diffing could use the same code as for strings).

The other discussion items that Denny brought up might be interesting at some point, but I would rather focus on the immediate questions for now. Especially if we need to make a decision by Monday, we should narrow the discussion down as much as possible. In particular, introducing the property datatype as additional information in the external JSON format would be a much more complex change, and at the same time would not solve the problem (which was related to processing the JSON dumps).

I agree with Daniel that it would be better if the so-called "internal" format were really internal, but this is not the reality of Wikidata today. Even if we intend to replace the current dumps by new dumps that use "external" formats, we should make sure that our internal format is at least as specific as the basic external formats. In other words: the internal format may contain auxiliary "internal" information and maybe "unofficial" values (like "bad", though this was not intended); but it should also contain all the information that the most basic external formats require. I strongly feel that the internal serialisation is (a representation of) the de facto data model, whatever we may write elsewhere. Code is more powerful than words. Making strings into IRIs there will make strings into IRIs everywhere. I don't think this would be a good design for a data model today.

Cheers,

Markus

P.S. I also do not agree that the "IRI vs. string" question is equally relevant or equally clear as the "commons media vs. string vs. IRI" question. Commons media is an application-level datatype specific to Wikimedia, while IRI and string are fundamental types in formats like XML, RDF and OWL. Most programming languages have special handling for IRIs, comparable to special handling for times, even if neither is a fundamental machine-level type. The question of Commons media is clearly much less important and should not be intertwined here.

Jeroen De Dauw

30 Aug

10:39 a.m.

Hey Markus,

Thanks for the writeup. This clarified some things, at least for me.

However, this does not mean that you have to store the value as a compound object that contains many strings. In fact, this strikes me as a rather cumbersome approach that would make it harder to use the data. In SMW we store URIs as one string. Splitting this string into parts (under the assumption that it was a well-formed URL to start with) is quite easy, if this is needed (SMW does this). Conclusion: the use of a datatype for IRIs is in no way tied to the use of an impractical serialisation; reference implementations exist.

Agreed. The IriValue implementation is based on the SMW one, and retains this capability. Using serialize and unserialize concatenates the parts into one string, and then splits that string back into its parts before they are passed to the constructor.

We are currently not using this though. Instead we are using the two last methods here: https://github.com/wikimedia/mediawiki-extensions-DataValues/blob/eea0d0e194...

So the solution to the problem at hand seems to be either to change these two methods to do the same as serialize and unserialize, or to simply not use these methods in our serialization process. The former approach is the most local and easy to implement, and given the urgency of this, the one I suggest going with.
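The two serialization styles can be sketched roughly as follows. This is an illustration under assumptions, not the DataValues code: the class IriValueSketch and the names to_array_current and to_array_proposed are hypothetical stand-ins for the two methods mentioned above, and the hierarchical part here keeps its leading "//".

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IriValueSketch:
    """Hypothetical value object holding the parts of an IRI."""

    scheme: str
    hierarchical_part: str
    query: str = ""
    fragment: str = ""

    def serialize(self) -> str:
        """Concatenate the parts into one IRI string."""
        iri = f"{self.scheme}:{self.hierarchical_part}"
        if self.query:
            iri += f"?{self.query}"
        if self.fragment:
            iri += f"#{self.fragment}"
        return iri

    @classmethod
    def unserialize(cls, iri: str) -> "IriValueSketch":
        """Split a well-formed IRI string into parts and pass them to the constructor."""
        scheme, _, rest = iri.partition(":")
        rest, _, fragment = rest.partition("#")
        hierarchical_part, _, query = rest.partition("?")
        return cls(scheme, hierarchical_part, query, fragment)

    def to_array_current(self) -> dict:
        """Structured form: every part is exposed as a separate field."""
        return asdict(self)

    def to_array_proposed(self) -> str:
        """Suggested change: emit the same single string as serialize()."""
        return self.serialize()


value = IriValueSketch.unserialize("http://www.ietf.org/rfc/rfc1738.txt?x=1#top")
print(value.to_array_current())
print(value.to_array_proposed())
```

With the proposed variant, the serialized form of an IRI value is a single string, so code that diffs or displays strings could be reused for it.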

(Somewhat different topic, mainly directed at the WD team itself:) It is however an indication that having these two methods in the DataValue implementations is not the best idea to begin with. This has been clear for some time, though in order to fix this, we effectively need to go with the second approach and implement proper serialization infrastructure for DataValues. That'd also fix a number of other problems and awkwardness the current approach is causing.

Cheers

-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --

Denny Vrandečić

12:21 p.m.

Just following up on some discussion I had with DanielK and Jeroen today on this, and summarizing it for the mailing list.

I still fail to see what the advantage would be to use the IRI datavalue - especially when it is basically stripped down to be a string datavalue, as Jeroen suggests in the last mail here.

I do see an advantage of stating the property datatype in a snak in the external JSON representation, and am trying to understand what prevents us from doing so. If we would do so, we would enable all use cases that were mentioned, or am I missing something?

I also recognize the importance of having soon an official JSON dump, the lack of which currently forces people to rely on the internal representation in the available dumps. As said, the format of the internal dumps is expected to keep changing. I will bump this up on the priority list accordingly.

(Re Commons media file being different: Commons media file is, in the end, also just a URI represented by a string, I do not see why it is so different to URIs).

Daniel Kinzler

1 Sep

6:52 p.m.

On 30.08.2013 17:21, Denny Vrandečić wrote:

...

I do see an advantage of stating the property datatype in a snak in the external JSON representation, and am trying to understand what prevents us from doing so.

Not much, the SnakSerializer would need access to the PropertyDataTypeLookup service, injected via the SerializerFactory. SnakSerializer already has: // TODO: we might want to include the data type of the property here as well
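A minimal sketch of that injection, written in Python rather than the PHP of the actual code base. Only the names SnakSerializer and PropertyDataTypeLookup come from this thread; the class and method names below, the output field names and the property id "P123" are assumptions made for illustration.

```python
class InMemoryPropertyDataTypeLookup:
    """Hypothetical stand-in for the PropertyDataTypeLookup service."""

    def __init__(self, data_types):
        self._data_types = dict(data_types)

    def get_data_type_id(self, property_id):
        return self._data_types[property_id]


class SnakSerializerSketch:
    """Serializes a snak and adds the data type of its property."""

    def __init__(self, data_type_lookup):
        # The lookup is injected, so the serializer never loads properties itself.
        self._data_type_lookup = data_type_lookup

    def serialize(self, property_id, value):
        return {
            "snaktype": "value",
            "property": property_id,
            # The data type belongs to the property, not to the data value.
            "datatype": self._data_type_lookup.get_data_type_id(property_id),
            "datavalue": {"type": "string", "value": value},
        }


lookup = InMemoryPropertyDataTypeLookup({"P123": "url"})  # hypothetical property id
serializer = SnakSerializerSketch(lookup)
serialized = serializer.serialize("P123", "http://www.ietf.org/rfc/rfc1738.txt")
print(serialized)
```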

-- daniel

Jeroen De Dauw

7:57 p.m.

Hey,

(Denny: you can consider this a reply to your question from last Friday)

SnakSerializer already has: // TODO: we might want to include the data type of the property here as well

This was written nearly a year ago, and some things have changed since then. In particular, it is now clear that people want this, while there was no real interest in the question back then. Of at least equal importance is the change in how we define what a DataType is and how it should be used. It started off as the enabler of an SMWDataValue-like interface, and has now been reduced to a set of validation rules on top of a (Wikibase) DataValue. While in the former case there are many reasons not to have any logical dependency from DataModel on DataType, there are no such concerns in the latter (and current) case. Given what we want a DataType to be (something that still needs to be reflected in its actual PHP definition), I think it makes more sense to have it inside DataModel than in its own component, as it is now. It can then be linked to (logically or physically) from PropertySnak, or from newly introduced containers, as we desire.

Putting DT ids in DVs remains problematic, for the same reasons as before. Luckily there are better alternatives than this approach.

Thought needs to be put into how to best move DT into DataModel, and how to best make use of it there. Before any concrete action is taken, we might want to have a close look at ValueValidators as well. This component has some open design issues, and will be introduced as a new dependency of DM if we move DTs there. Both of these concerns deserve their own discussions and are quite distinct from the current one (which is more about "what?" than "how?").

Cheers

-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --

Daniel Kinzler

2 Sep

4:20 a.m.

On 02.09.2013 00:57, Jeroen De Dauw wrote:

...

Putting DT ids in DVs remains problematic, for the same reasons as before. Luckily there are better alternatives than this approach.

Yes, I agree. The DT belongs to the Snak, not the DV. There are three models where we may want this:

* the canonical JSON (I say yes!)

* the PHP object (I say yes!)

* the internal JSON (probably yes, but let's think about it)

...

Thought needs to be put into how to best move DT into DataModel, and how to best make use of it there. Before any concrete action is taken, we might want to have a close look at ValueValidators as well. This component has some open design issues, and will be introduced as a new dependency of DM if we move DTs there. Both of these concerns deserve their own discussions and are quite distinct from the current one (which is more about "what?" than "how?").

Note that we currently use the ValueValidator interface, but not the base class and none (?) of the default implementations. Wikibase/lib has its own validators. This may make it easier to find a good way to avoid dependencies.

-- daniel

Denny Vrandečić

6:56 a.m.

OK, based on the discussion so far, we will add the data type to the snak in the external export, and keep the string data value for the URL data type. That should satisfy all use cases that have been brought up.

Lydia Pintscher

3 Sep

6:50 a.m.

On Mon, Sep 2, 2013 at 11:56 AM, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:

...

OK, based on the discussion so far, we will add the data type to the snak in the external export, and keep the string data value for the URL data type. That should satisfy all use cases that have been brought up.

Just so I know what's coming: Is this doable for the deployment in a week?

Cheers Lydia

-- Lydia Pintscher - http://about.me/lydia.pintscher Community Communications for Technical Projects Wikimedia Deutschland e.V. Obentrautstr. 72 10963 Berlin www.wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.

Daniel Kinzler

7:17 a.m.

On 03.09.2013 11:50, Lydia Pintscher wrote:

Just so I know what's coming: Is this doable for the deployment in a week?

If we push back something else, yes. But I think this is mainly useful in JSON dumps - which we don't have yet. Not hard to do, but won't happen in a week.

-- daniel

Jeroen De Dauw

12:14 p.m.

Hey,

...

Just so I know what's coming: Is this doable for the deployment in a week?

We would need to do some evil stuff to have this done so soon. I'd rather have it done properly. The DataType id has not been part of the snak serialization since we provided such serializations. Keeping it like that for another week or two is not going to make a huge difference as far as I can see.

Cheers

-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --

Lydia Pintscher

12:22 p.m.

On Tue, Sep 3, 2013 at 5:14 PM, Jeroen De Dauw jeroendedauw@gmail.com wrote:

...

We would need to do some evil stuff to have this done so soon. I'd rather have it done properly. The DataType id has not been part of the snak serialization since we provided such serializations. Keeping it like that for another week or two is not going to make a huge difference as far as I can see.

It will make a large difference.

Cheers Lydia

Jeroen De Dauw

1:26 p.m.

Hey,

...

We would need to do some evil stuff to have this done so soon. I'd rather have it done properly. The DataType id has not been part of the snak serialization since we provided such serializations. Keeping it like that for another week or two is not going to make a huge difference as far as I can see.

It will make a large difference.

I suspect you are talking about the URL data type itself? I'm talking about adding a DataType id field to the serialization of snaks. The former can be done quickly without doing the latter, and thus would not introduce much evilness.

Cheers

-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --
