Hello all,
I was looking at the Wikidata dump, specifically this one: wikidatawiki-20130818-pages-meta-hist-incr.xml.bz2
Then I came across this statement:
{
    "m": [ "value", 158, "string", "Great Seal of the United States (obverse).svg" ],
    "q": [],
    "g": "q30$D680D948-C2C1-493F-88AC-E4E2FB3764D2",
    "rank": 1,
    "refs": []
},
The property P158 is the seal image property (http://www.wikidata.org/wiki/Property:P158), and its DataType should be "Commons media file", not "string", shouldn't it? I'm not sure if it's always this way and I just don't get it, or whether the statement data is inconsistent with the property datatypes.
Another question: should I usually rely on the datatypes written in the JSON dumps, or should I build an index of Wikidata properties and their datatypes to avoid such situations?
Thanks, regards
-------------------------------------------------
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University
http://nileuniversity.edu.eg/
On 21-08-2013 19:45, Hady elsahar wrote:
The property P158 is the seal image property (http://www.wikidata.org/wiki/Property:P158), and its DataType should be "Commons media file", not "string", shouldn't it? I'm not sure if it's always this way and I just don't get it, or whether the statement data is inconsistent with the property datatypes.
The values for properties of type commonsMedia are always stored as strings. That's why there is no row for commonsMedia in my table at http://www.wikidata.org/wiki/User:Byrial/Statement_statistics#Properties_aft...
Another question: should I usually rely on the datatypes written in the JSON dumps, or should I build an index of Wikidata properties and their datatypes to avoid such situations?
You need both the datatype of the property (to distinguish ordinary string values from commonsMedia strings) and the datatype for each stored value (to know if it is an ordinary value or novalue or somevalue).
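To illustrate, here is a minimal Python sketch of combining the two kinds of type information when reading a snak like the one quoted above. The helper and the property index are hypothetical: the `property_types` dict is something you would build yourself from the P entities in the dump.

```python
# Hypothetical sketch: interpreting a main snak ("m") from the old-style
# JSON dumps. The snak array starts with the snak kind; only "value"
# snaks carry a datavalue type and a payload.
property_types = {158: "commonsMedia"}  # assumed index, built from the dump


def interpret_snak(snak):
    kind = snak[0]  # "value", "novalue" or "somevalue"
    if kind != "value":
        return (kind, None, None)
    prop_id, value_type, payload = snak[1], snak[2], snak[3]
    # The datavalue type alone is ambiguous: a "string" may be an
    # ordinary string or a Commons media file name, so we also need
    # the property's own datatype from our index.
    prop_type = property_types.get(prop_id, "unknown")
    return (kind, prop_type, payload)


print(interpret_snak(["value", 158, "string",
                      "Great Seal of the United States (obverse).svg"]))
print(interpret_snak(["novalue", 158]))
```

This is only a sketch of the lookup logic; the real dump format has more fields, but the point stands that both type levels are needed.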
Regards, - Byrial
Hello Byrial,
Taking a look at the low-level datatypes list (http://www.wikidata.org/wiki/Special:ListDatatypes): is representing Commons media files as strings considered an inconsistency that will be fixed in future releases, or will it stay this way forever?
Another question: if I wanted to get all the datatypes and map them to the corresponding XSD types, how can I get an up-to-date list of all Wikidata properties and their datatypes, not just the low-level ones?
Thanks, regards
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
On 21-08-2013 21:38, Hady elsahar wrote:
Hello Byrial,
Taking a look at the low-level datatypes list (http://www.wikidata.org/wiki/Special:ListDatatypes): is representing Commons media files as strings considered an inconsistency that will be fixed in future releases, or will it stay this way forever?
I am not a Wikidata developer, so I cannot say. But I see no problem needing fixing, and would therefore not expect this to be changed.
Another question: if I wanted to get all the datatypes and map them to the corresponding XSD types, how can I get an up-to-date list of all Wikidata properties and their datatypes, not just the low-level ones?
You started this thread by saying that you are looking at a Wikidata database dump. You can find all properties in the dump. You can probably also query the server to get the properties.
Regards, - Byrial
Hey,
its DataType should be "Commons media file", not "string"?
The DataType is not specified in the JSON segment you pasted; it is not stored in entity pages. The "string" indicates the type of DataValue, which is a more low-level concept. We have a limited set of these DataValue types, and a potentially much bigger set of DataTypes built on top of them. For instance, the DataTypes "integer", "positive integer", "percentage" and "probability" would presumably all use the "number" DataValue.
Another question: should I usually rely on the datatypes written in the JSON dumps, or should I build an index of Wikidata properties and their datatypes to avoid such situations?
If you need the actual DataType, you will indeed need to build an index with the properties.
Cheers
--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil. ~=[,,_,,]:3
--
Hello Jeroen,
Do I take from your words that this page, http://www.wikidata.org/wiki/Special:ListDatatypes, is not up to date? If so, how can I get all the datatypes in Wikidata?
What I can get there is a list of the available datatypes, but these seem to describe the representation of the data (the lower level), not the semantic datatype of the thing. So a value is either an item, a string, a Commons media file, a time, or a geocoordinate.
A string could be anything (so a time could be a string), but there is a defined lower-level representation for Commons media files. So is it wrong to represent them as strings?
Thanks, regards
On 21-08-2013 21:09, Hady elsahar wrote:
Hello Jeroen,
Do I take from your words that this page, http://www.wikidata.org/wiki/Special:ListDatatypes, is not up to date? If so, how can I get all the datatypes in Wikidata?
Pages in the virtual Special namespace are generated by MediaWiki on demand, and are therefore always (in principle - there can be caching in some cases) up to date.
A string could be anything (so a time could be a string), but there is a defined lower-level representation for Commons media files. So is it wrong to represent them as strings?
Time cannot be a string, as there are several components in a time value (time, timezone, precision, calendar model, before and after precisions).
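For illustration, a rough sketch of the components such a time value carries. The field names follow the JSON dumps as far as I can tell, but treat them as illustrative rather than authoritative:

```python
# Rough sketch of a Wikidata time datavalue; a plain string could not
# carry all of these components at once. Field names are illustrative.
time_value = {
    "time": "+00000002013-08-21T00:00:00Z",  # the timestamp itself
    "timezone": 0,       # offset from UTC in minutes
    "before": 0,         # uncertainty interval before, in precision units
    "after": 0,          # uncertainty interval after
    "precision": 11,     # e.g. 11 = day precision, 9 = year precision
    "calendarmodel": "http://www.wikidata.org/entity/Q1985727",  # Gregorian
}
print(sorted(time_value))
```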
I see nothing wrong in storing commonsMedia values as string values. You will know from the property's datatype that the string is a CommonsMedia string.
Regards, - Byrial
Hi all,
I think one source of confusion here is the overlapping names of property datatypes and datavalue types. Basically, the mapping is as follows right now:
[Format: property type => datavalue type occurring in current dumps]
'wikibase-item'    => 'wikibase-entityid'
'string'           => 'string'
'time'             => 'time'
'globe-coordinate' => 'globecoordinate'
'commonsMedia'     => 'string'
The point is that "string" on the left is not the same as "string" on the right. (Also note the lack of a consistent naming scheme for these ids :-/ ...) In most cases, however, you can infer the property type from the datavalue type, but not in all. Unfortunately, you do not generally find the property type in a dump before you find its first use.
The wda script's RDF export has code for dealing with this. It remembers all types that it finds (from P entities in the dump), it infers types from values where possible, and it uses the API to find out the type of a property if all else fails (typically, if you find a string value but don't know yet if the property is of type string or commonsMedia). In addition, the script has a hardcoded list of known types that can be extended (there are not so many properties and their types never change, hence one can do this quite easily). You can find all the code at [1].
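The strategy can be sketched roughly like this. This is a simplified reimplementation, not the actual wda code, and the helper names are made up:

```python
# Sketch of the three-step lookup described above: remember types seen
# in the dump, infer from values where unambiguous, ask the API last.
UNAMBIGUOUS = {
    "wikibase-entityid": "wikibase-item",
    "time": "time",
    "globecoordinate": "globe-coordinate",
    # "string" is ambiguous: could be 'string' or 'commonsMedia'
}

known_types = {}  # filled while scanning P entities in the dump


def fetch_type_from_api(prop_id):
    # Placeholder for the API fallback (querying the property entity).
    raise NotImplementedError


def property_type(prop_id, value_type):
    if prop_id in known_types:
        return known_types[prop_id]
    if value_type in UNAMBIGUOUS:
        known_types[prop_id] = UNAMBIGUOUS[value_type]
        return known_types[prop_id]
    return fetch_type_from_api(prop_id)


known_types[158] = "commonsMedia"    # e.g. learned from the P158 entity
print(property_type(158, "string"))  # -> commonsMedia
print(property_type(31, "wikibase-entityid"))  # inferred: wikibase-item
```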
Cheers,
Markus
[1] https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py (esp. see __getPropertyType() and __fetchPropertyType())
Hello Markus,
Thanks for pointing to the wda code, it's very useful. From the Wikidata glossary I had guessed that property datatypes and datavalue types were the same thing: http://www.wikidata.org/wiki/Wikidata:Glossary#Datatypes
This may be a little shallow, but what I see is that (correct me if I'm mistaken): they don't use the same names when you look up the datatype of a property and the value type of an item that uses this property.
Another problem is that they decided to represent commonsMedia as strings, for some reason I don't know; that's why I didn't get it and thought it was some sort of inconsistency.
In most cases, however, you can infer the property type from the datavalue
type, but not in all. Unfortunately, you do not generally find the property type in a dump before you find its first use.
Could you point out why depending on such a mapping doesn't always work? Is it only for Commons media files?
'wikibase-item'    => 'wikibase-entityid'
'string'           => 'string'
'time'             => 'time'
'globe-coordinate' => 'globecoordinate'
'commonsMedia'     => 'string'
Thanks, regards
Dear Hady,
The key is to understand that property types and value types are *not* the same. They match in many cases, but not in all. In the future, there might be more property types that use the same value type. Property types are what the user sees; they define every detail of user interaction and UI. Value types are part of the underlying data model; they define what the content of the data is. For most data processing, you should not need to know the property type.
The situation with commonsMedia is a bit bad because it should be a URL rather than a string. What I do in wda is effectively a type conversion from string to URI in this particular case. Maybe we can fix this somehow in the future when URIs are supported as a value datatype.
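Such a conversion might look like this minimal sketch, assuming the usual Commons file-page URL scheme (the real title escaping rules are more subtle than shown here):

```python
# Hypothetical helper: turn a commonsMedia string value into a URL for
# the file page on Wikimedia Commons. Simplified escaping.
from urllib.parse import quote


def commons_file_url(file_name):
    # Commons page titles use underscores instead of spaces.
    title = file_name.replace(" ", "_")
    return "http://commons.wikimedia.org/wiki/File:" + quote(title)


print(commons_file_url("Great Seal of the United States (obverse).svg"))
```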
Markus
Hey,
The situation with commonsMedia is a bit bad because it should be a URL
rather than a string. What I do in wda is effectively a type conversion from string to URI in this particular case. Maybe we can fix this somehow in the future when URIs are supported as a value datatype.
Ok, this makes me somewhat concerned. We do have an IriValue DV [0], which we've had for nearly a year. It is indeed not used for commonsMedia; I'm not sure why. What concerns me is that we are now introducing a "url" data type which will also just use the string DV, rather than the IRI DV. I'm not very happy with this, though it is what most of the team wants. If there is a problem with this approach, it should be outlined _soon_, since this is not far from deployment if I understand it correctly.
[0] https://github.com/wikimedia/mediawiki-extensions-DataValues/blob/master/Dat...
Cheers
--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil. ~=[,,_,,]:3
--
Hi,
If we have an IRI DV, considering that URLs are special IRIs, it seems clear that IRI would be the best way of storing them. For any Web-based format (esp. OWL and RDF), there is a big difference between "some arbitrary string" and an IRI. Similarly, many tools that use data will naturally treat URLs in a different way than other strings when displaying them to users. If this difference is not captured in the data, then applications have to look it up, use some kind of hard-coded handling for certain properties, or apply heuristics to decide which strings are supposed to be URLs. Using IRI DVs would solve the problem in a cleaner way with less effort.
Of course, you could just use "string" for all types of datavalue without losing datavalue information. However, this would make the Wikidata data model inadequate for some important uses. The exported RDF will fix this in a sense, so people using it will get the important information from there. However, RDF has other problems that make it difficult to use as a primary data dump format (esp. heavy normalisation), and it is not available from Wikibase yet. Therefore, I think it would be problematic if the Wikidata data model were simplified to such an extent that practically important information is no longer easy to get for external users.
I appreciate that there might be split opinions about this among the developers (who see the immediate technical consequences, esp. for their piece of work). However, this decision has important long-term consequences beyond current engineering aspects. Luckily, Wikidata has a recognized expert in Web data technologies as its technical director ;-) -- the team should trust his judgement here.
Cheers,
Markus
On 25.08.2013 19:19, Markus Krötzsch wrote:
If we have an IRI DV, considering that URLs are special IRIs, it seems clear that IRI would be the best way of storing them.
The best way of storing them really depends on the storage platform. It may be a string or something else.
I think the real issue here is that we are exposing something that is really an internal detail (the data value type) instead of the high level information we actually should be exposing, namely property type.
I think splitting the two was a mistake, and I think exposing the DV type while making the property type all but inaccessible makes things a lot worse.
In my opinion, data should be self-descriptive, so the *semantic* type of the property should be included along with the value. People expect this, and assume that this is what the DV type is. But it's not, and should not be used or abused for this purpose.
Ideally, it should not matter at all to any 3rd party whether we use a string or IRI DV internally. The (semantic) property type would be URL, and that's all that matters.
I'm quite unhappy about the current situation; we are beginning to see the backlash of the decision not to include the property type inline. If we don't do anything about this now, I fear the confusion is going to get worse.
-- daniel
Hi Daniel,
if I understand you correctly, you are in favour of equating datavalue types and property types. This would indeed solve the problems at hand.
The reason why both kinds of types are distinct in SMW and also in Wikidata is that property types are naturally more extensible than datavalue types. CommonsMedia is a good example of this: all you need is a custom UI and you can handle "new" data without changing the underlying data model. This makes it easy for contributors to add new types without far-reaching ramifications in the backend (think of numbers, which could be decimal, natural, positive, range-restricted, etc. but would still be treated as a "number" in the backend).
Using fewer datavalue types also improves interoperability. E.g., you want to compare two numbers, even if one is a natural number and another one is a decimal.
There is no simple rule for deciding how many datavalue types there should be. The general guideline is to decide on datavalue types based on use cases. I am arguing for diversifying IRIs and strings since there are many contexts and applications where this is a crucial difference. Conversely, I don't know of any application where it makes sense to keep the two similar (this would have to be something where we compare strings and IRIs on a data level, e.g., if you were looking for all websites with URLs that are alphabetically greater than the postcode of a city in England :-p).
In general, however, it will be good to keep the set of basic datavalue types small, while allowing the set of property types to grow. The set of base datavalue types that we use is based on the experience in SMW as well as on existing formats like XSD (which also has many derived types but only a few base types).
As for the possible confusion, I think some naming discipline would clarify this. In SMW, there is a stronger difference between both kinds of types, and a fixed schema for property type ids that makes it easy to recognise them.
In any case, using string for IRIs does not seem to solve any problem. It does not simplify the type system in general and it does not help with the use cases that I mentioned. What I do not agree with are your arguments about all of this being "internal". We would not have this discussion if it were. The data model of Wikidata is the primary conceptual model that specifies what Wikidata stores. You might still be right that some of the implementation is internal, but the arguments we both exchange are not really on the implementation level ;-).
Best wishes
Markus, offline soon for travelling
On 26.08.2013 12:41, Markus Krötzsch wrote:
Hi Daniel,
if I understand you correctly, you are in favour of equating datavalue types and property types. This would solve indeed the problems at hand.
The reason why both kinds of types are distinct in SMW and also in Wikidata is that property types are naturally more extensible than datavalue types. CommonsMedia is a good example of this: all you need is a custom UI and you can handle "new" data without changing the underlying data model. This makes it easy for contributors to add new types without far-reaching ramifications in the backend (think of numbers, which could be decimal, natural, positive, range-restricted, etc. but would still be treated as a "number" in the backend).
This could be solved using polymorphism: CommonsMedia, IRI, etc could simply derive from StringValue. Similarly, Percentage could derive from NumberValue, etc.
This is largely academic though, I don't see a good way to transition from the current system to what I have in mind.
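For what it's worth, the idea could be sketched like this. The class names are purely illustrative, not actual Wikibase code:

```python
# Illustrative sketch of the polymorphism idea: derived value types
# reuse a base representation, so code that only needs "a string"
# (e.g. a length validator) can treat them uniformly.
class StringValue:
    def __init__(self, text):
        self.text = text


class CommonsMediaValue(StringValue):
    pass  # same representation, more specific semantics


class IriValue(StringValue):
    pass


def max_length_ok(value, limit):
    # A validator that works for any string-based value.
    return len(value.text) <= limit


v = CommonsMediaValue("Great Seal of the United States (obverse).svg")
print(max_length_ok(v, 255))  # True
```

Whether such subclassing is a good design is debated right below, so take this only as a picture of the proposal, not an endorsement.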
Using fewer datavalue types also improves interoperability. E.g., you want to compare two numbers, even if one is a natural number and another one is a decimal.
Indeed. Which is why I'm reluctant to add more, like the IRI type.
There is no simple rule for deciding how many datavalue types there should be. The general guideline is to decide on datavalue types based on use cases. I am arguing for diversifying IRIs and strings since there are many contexts and applications where this is a crucial difference. Conversely, I don't know of any application where it makes sense to keep the two similar (this would have to be something where we compare strings and IRIs on a data level, e.g., if you were looking for all websites with URLs that are alphabetically greater than the postcode of a city in England :-p).
Currently, my primary concern are validators and simple renderers to be used e.g. in diffs. For validation against a max length as well as regular expressions, it would be useful to be able to treat URLs as strings. The same is true for basic rendering in diffs.
As for the possible confusion, I think some naming discipline would clarify this. In SMW, there is a stronger difference between both kinds of types, and a fixed schema for property type ids that makes it easy to recognise them.
I try to use "data value type" vs. "property type", but whenever "data type" is used, it's unclear what is meant.
In any case, using string for IRIs does not seem to solve any problem. It does not simplify the type system in general and it does not help with the use cases that I mentioned.
Well, for my use cases mentioned above, URLs should be strings :)
What I do not agree with are your arguments about all of this being "internal". We would not have this discussion if it were. The data model of Wikidata is the primary conceptual model that specifies what Wikidata stores. You might still be right that some of the implementation is internal, but the arguments we both exchange are not really on the implementation level ;-).
I do not see why it is useful for a property value to expose two types. That's the situation we currently have, and it's confusing. For a canonical representation, there should be only one type, namely the one that is needed to fully interpret the given value. Whether a URL can be treated as a string or not depends on the use case and should be determined by the respective code. It seems a bad idea to me to try and provide an arbitrary set of base types with an arbitrary mapping to concrete/semantic types. If anything, a type hierarchy would make sense.
-- daniel
Hey,
This could be solved using polymorphism: CommonsMedia, IRI, etc could
simply derive from StringValue. Similarly, Percentage could derive from NumberValue, etc.
That violates Liskov Substitution [0] ;) If we go with this approach, we also should have a SquareValue that derives from RectangleValue, to have the basics covered :D
[0] https://en.wikipedia.org/wiki/Liskov_substitution_principle
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --
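For illustration, the subclassing idea under discussion might look like the following hypothetical sketch (the class names and the `url()` helper are assumptions for this example, not Wikibase's actual classes). The Liskov concern is that code expecting a plain StringValue may not be prepared for the extra interpretation a subclass attaches:

```python
class StringValue:
    """A plain string datavalue, as stored in the dumps."""

    def __init__(self, text: str):
        self.text = text

    def serialize(self) -> str:
        return self.text


class CommonsMediaValue(StringValue):
    """Hypothetical subtype: the same stored string, interpreted as a
    Commons file name rather than arbitrary text."""

    def url(self) -> str:
        # Derived view of the underlying string value.
        return ("https://commons.wikimedia.org/wiki/File:"
                + self.text.replace(" ", "_"))


seal = CommonsMediaValue("Great Seal of the United States (obverse).svg")
print(seal.serialize())  # the raw string, as in the dump
print(seal.url())        # the Commons-specific interpretation
```

Any code that only cares about the stored string can treat both classes alike, which is exactly what makes the two types easy to confuse in the dumps.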
Den 22-08-2013 11:33, Markus Krötzsch skrev:
Hi all,
I think one source of confusion here is the overlapping names of property datatypes and datavalue types. Basically, the mapping is as follows right now:
[Format: property type => datavalue type occurring in current dumps]
'wikibase-item' => 'wikibase-entityid'
'string' => 'string'
'time' => 'time'
'globe-coordinate' => 'globecoordinate'
'commonsMedia' => 'string'
Note that in the 2013-08-27 database dump you will also find:
'globe-coordinate' => 'bad'
for values which were accepted before stricter format checking was introduced in the latest software revision, but cannot be accepted now (values without indication of globe or precision).
Regards, - Byrial
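A dump consumer can encode the mapping above directly; a minimal sketch in Python (the function name and return strings are assumptions for illustration, only the listed type identifiers come from the dumps):

```python
# Property type => datavalue type, as observed in current dumps.
EXPECTED_VALUE_TYPE = {
    "wikibase-item": "wikibase-entityid",
    "string": "string",
    "time": "time",
    "globe-coordinate": "globecoordinate",
    "commonsMedia": "string",
}


def check_snak(property_type: str, value_type: str) -> str:
    """Classify a (property type, datavalue type) pair from a dump."""
    if value_type == "bad":
        return "bad value (failed stricter format checks)"
    expected = EXPECTED_VALUE_TYPE.get(property_type)
    if expected is None:
        return "unknown property type"
    return "ok" if value_type == expected else "mismatch"


print(check_snak("commonsMedia", "string"))   # ok: Commons media stored as string
print(check_snak("globe-coordinate", "bad"))  # flagged as bad
```

Note that the property type must come from the property's own page or a prebuilt index, since the dump's snaks only carry the datavalue type.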
Den 29-08-2013 08:58, Byrial Jensen skrev:
[snip]
I just found that there also are cases with
'time' => 'bad'
See for example https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q7415505&format=xml. It has the time given as "+00000001984-23-01T00:00:00Z"; note that the month number is 23.
Regards, - Byrial
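A value like the one above can be flagged with a simple plausibility check; a minimal sketch (the regular expression is an assumption based on the timestamp shown, not the official Wikibase time parser):

```python
import re

# Wikidata-style timestamp: sign, zero-padded year, month, day, time.
TIME_RE = re.compile(r"^([+-])(\d+)-(\d{2})-(\d{2})T\d{2}:\d{2}:\d{2}Z$")


def is_plausible_time(value: str) -> bool:
    """Reject timestamps whose month or day fields are out of range."""
    m = TIME_RE.match(value)
    if not m:
        return False
    month, day = int(m.group(3)), int(m.group(4))
    # Month 00 / day 00 can occur for low-precision dates, so only an
    # upper bound is checked here.
    return month <= 12 and day <= 31


print(is_plausible_time("+00000001984-23-01T00:00:00Z"))  # False: month 23
print(is_plausible_time("+00000001984-01-23T00:00:00Z"))  # True
```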
On Thu, Aug 29, 2013 at 11:48 AM, Byrial Jensen <byrial@vip.cybercity.dk> wrote:
[snip]
I just found that there also are cases with
'time' => 'bad'
See for example https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q7415505&format=xml. It has the time given as "+00000001984-23-01T00:00:00Z"; note that the month number is 23.
A list of "bad" time values would also be very helpful. The "bad" value type is used to flag values that can't be parsed into one of the valid types. These were likely added when Wikidata had less strict validation of API input, so they are still in the database.
A bot would be able to fix them or they can be removed/re-added.
Cheers, Katie
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Den 29-08-2013 11:58, Katie Filbert skrev:
On Thu, Aug 29, 2013 at 11:48 AM, Byrial Jensen <byrial@vip.cybercity.dk> wrote:

[snip]

A list of "bad" time values would also be very helpful. The "bad" value type is used to flag values that can't be parsed into one of the valid types.
I will make a list, but I had to modify my database dump parser program once more after also finding the "bad" time values and restart reading the database dump, so it will take some extra time to be ready.
Regards, - Byrial
On Thu, Aug 29, 2013 at 12:14 PM, Byrial Jensen <byrial@vip.cybercity.dk> wrote:
[snip]
I will make a list, but I had to modify my database dump parser program once more after also finding the "bad" time values and restart reading the database dump, so it will take some extra time to be ready.
No hurry.
Cheers, Katie
I just sent an email to the wikidata-tech mailing list on this topic of using data value types and property data types, and on using the string value or IRI value for the URL datatype.
I would very much appreciate input there. The latter decision has to happen rather quickly (if we are not to move the URL datatype once more), the former decision has a bit more time but should be made soonish.
Thank you for your input so far, and sorry for changing mailing lists, but I think wikidata-tech is more appropriate.
Cheers, Denny
2013/8/29 Katie Filbert katie.filbert@wikimedia.de
[snip]
-- Katie Filbert Wikidata Developer
Wikimedia Germany e.V. | NEW: Obentrautstr. 72 | 10963 Berlin Phone (030) 219 158 26-0
Wikimedia Germany - Society for the Promotion of Free Knowledge e.V. Registered at Amtsgericht Berlin-Charlottenburg under number 23 855, recognized as charitable by the tax office for corporations I Berlin, tax number 27/681/51985.
Den 29-08-2013 14:49, Katie Filbert skrev:
On Thu, Aug 29, 2013 at 12:14 PM, Byrial Jensen <byrial@vip.cybercity.dk> wrote:

[snip]
It turned out that Q7415505 mentioned above is the only case in the 2013-08-27 database dump where a time value is given with "bad" value data type.
So "bad" is used 1 time for a time value, and 79 times for coordinate values (list at https://www.wikidata.org/wiki/User:Byrial/Globes).
Regards, - Byrial
On Thu, Aug 29, 2013 at 7:14 PM, Byrial Jensen <byrial@vip.cybercity.dk> wrote:
[snip]
It turned out that Q7415505 mentioned above is the only case in the 2013-08-27 database dump where a time value is given with "bad" value data type.
So "bad" is used 1 time for a time value, and 79 times for coordinate values (list at https://www.wikidata.org/wiki/User:Byrial/Globes).
Thanks Byrial.
Cheers, Katie
Am 29.08.2013 08:58, schrieb Byrial Jensen:
Note that in the 2013-08-27 database dump you will also find:
'globe-coordinate' => 'bad'
for values which were accepted before stricter format checking was introduced in the latest software revision, but cannot be accepted now (values without indication of globe or precision).
This was caused by a bug: these values are indeed "bad" by some internal definition, but that should not be recorded in the database and dumps. The bug has been fixed [1], but sadly, we now have some revisions where the "bad" type appears.
-- daniel