On 26.08.2013 12:41, Markus Krötzsch wrote:
Hi Daniel,
if I understand you correctly, you are in favour of equating datavalue types and property types. This would indeed solve the problems at hand.
The reason why both kinds of types are distinct in SMW and also in Wikidata is that property types are naturally more extensible than datavalue types. CommonsMedia is a good example of this: all you need is a custom UI and you can handle "new" data without changing the underlying data model. This makes it easy for contributors to add new types without far-reaching ramifications in the backend (think of numbers, which could be decimal, natural, positive, range-restricted, etc. but would still be treated as a "number" in the backend).
This could be solved using polymorphism: CommonsMedia, IRI, etc. could simply derive from StringValue. Similarly, Percentage could derive from NumberValue, and so on.
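To make the idea concrete, here is a rough sketch of such a hierarchy in Python. All class and function names (DataValue, IRIValue, CommonsMediaValue, PercentageValue, render_plain) are made up for illustration and do not correspond to the actual Wikibase code; the point is only that backend code written against the base type keeps working for the derived types.

```python
class DataValue:
    """Hypothetical common base class for all data values."""

    def __init__(self, value):
        self.value = value


class StringValue(DataValue):
    """Plain string value."""


class NumberValue(DataValue):
    """Plain numeric value."""


# More specific types derive from the generic ones (names are assumptions).
class IRIValue(StringValue):
    """An IRI; still usable wherever a StringValue is accepted."""


class CommonsMediaValue(StringValue):
    """A Commons file name; also a string as far as the backend is concerned."""


class PercentageValue(NumberValue):
    """A percentage; still usable wherever a NumberValue is accepted."""


def render_plain(value: DataValue) -> str:
    """Basic renderer that only knows about the two base types."""
    if isinstance(value, StringValue):
        return value.value
    if isinstance(value, NumberValue):
        return str(value.value)
    raise TypeError(f"unsupported value type: {type(value).__name__}")
```

Code that needs the distinction can still check for IRIValue explicitly, while generic code only ever sees a StringValue or a NumberValue.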
This is largely academic though: I don't see a good way to transition from the current system to what I have in mind.
Using fewer datavalue types also improves interoperability. E.g., you want to compare two numbers, even if one is a natural number and another one is a decimal.
Indeed. Which is why I'm reluctant to add more, like the IRI type.
There is no simple rule for deciding how many datavalue types there should be. The general guideline is to decide on datavalue types based on use cases. I am arguing for diversifying IRIs and strings since there are many contexts and applications where this is a crucial difference. Conversely, I don't know of any application where it makes sense to keep the two similar (this would have to be something where we compare strings and IRIs on a data level, e.g., if you were looking for all websites with URLs that are alphabetically greater than the postcode of a city in England :-p).
Currently, my primary concerns are validators and simple renderers, to be used e.g. in diffs. For validation against a max length as well as regular expressions, it would be useful to be able to treat URLs as strings. The same is true for basic rendering in diffs.
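As an illustration of what I mean, a minimal sketch in Python follows. The names (StringValue, UrlValue, validate_max_length, validate_pattern) are hypothetical and not taken from our code base; the sketch assumes a URL value that is-a string value, so the generic validators need no URL-specific handling.

```python
import re


class StringValue:
    """Hypothetical string value wrapper."""

    def __init__(self, text: str):
        self.text = text


class UrlValue(StringValue):
    """Hypothetical URL value; treated as a string for validation."""


def validate_max_length(value: StringValue, limit: int) -> bool:
    """Return True if the value is at most `limit` characters long."""
    return len(value.text) <= limit


def validate_pattern(value: StringValue, pattern: str) -> bool:
    """Return True if the entire value matches the regular expression."""
    return re.fullmatch(pattern, value.text) is not None


url = UrlValue("http://example.org/page")
length_ok = validate_max_length(url, 400)
pattern_ok = validate_pattern(url, r"https?://\S+")
```

Both validators are written once against StringValue and apply to URLs unchanged; the pattern shown is only a placeholder, not a real URL grammar.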
As for the possible confusion, I think some naming discipline would clarify this. In SMW, there is a stronger difference between both kinds of types, and a fixed schema for property type ids that makes it easy to recognise them.
I try to use "data value type" vs. "property type", but whenever "data type" is used, it's unclear what is meant.
In any case, using string for IRIs does not seem to solve any problem. It does not simplify the type system in general and it does not help with the use cases that I mentioned.
Well, for my use cases mentioned above, URLs should be strings :)
What I do not agree with are your arguments about all of this being "internal". We would not have this discussion if it were. The data model of Wikidata is the primary conceptual model that specifies what Wikidata stores. You might still be right that some of the implementation is internal, but the arguments we both exchange are not really on the implementation level ;-).
I do not see why it is useful for a property value to expose two types. That's the situation we currently have, and it's confusing. For a canonical representation, there should be only one type, namely the one that is needed to be able to fully interpret the value given. Whether a URL can be treated as a string or not depends on the use case and should be determined by the respective code. It seems a bad idea to me to try and provide an arbitrary set of base types with an arbitrary mapping to concrete/semantic types. If anything, a type hierarchy would make sense.
-- daniel