As Phase 2 is progressing, we have to decide on how to represent data values.
I have created a draft for representing numbers and units, points in time, and locations, which can be found here:
https://meta.wikimedia.org/wiki/Wikidata/Development/Representing_values
including a first suggestion on the functionality of the UI which we would be aiming at eventually.
The draft is unfortunately far from perfect, and I would very much welcome comments and discussion.
We probably will implement them in the following order: geolocation, date and time, numbers.
Cheers, Denny
Thanks for the input so far. Here are a few explicit questions that I have:
* Time: right now the data model assumes that the precision is given at the level of "decade / year / month" etc., which means you can enter a date of birth like 1435 or May 1918. But is this sufficient? We cannot enter a value like 2nd-5th century AD (we could enter 1st millennium AD, which would be a loss of precision).
* Geo: the model assumes latitude, longitude and altitude, and defines altitude as over mean sea level (simplified). Is altitude at all useful? Should it be removed from Geolocation and instead be moved to a property called "height" or "altitude" that is dealt with outside of the geolocation?
* Units are currently planned to be defined on the property page (as is done in SMW). So you say that height is measured in metres, where one metre corresponds to 3.28084 feet, etc. Wikidata would allow such linear conversions to be defined within the wiki, so this could be done by the community. This makes everything a bit more complicated -- one could also imagine defining all dimensions and units in PHP and then having the properties reference the dimensions. Since there are only a few hundred units and dimensions, this could be viable.
(Non-linear transformations -- most notoriously temperature -- will get their own implementation anyway)
Opinions?
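For illustration, a minimal sketch in Python of what such community-defined linear conversions boil down to (structure and names are invented, not the actual Wikidata model): each unit stores a factor to its dimension's base unit, so any-to-any conversion is two multiplications.

# Sketch: each unit maps to (dimension, factor to the dimension's base unit).
UNITS = {
    "metre":    ("length", 1.0),
    "foot":     ("length", 0.3048),   # 1 ft = 0.3048 m, hence 1 m = 3.28084 ft
    "kilogram": ("mass",   1.0),
}

def convert(value, from_unit, to_unit):
    dim_a, factor_a = UNITS[from_unit]
    dim_b, factor_b = UNITS[to_unit]
    if dim_a != dim_b:
        raise ValueError("incompatible dimensions")
    return value * factor_a / factor_b

print(convert(1, "metre", "foot"))  # 3.2808... feet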
Hello,
On 2012-12-18 15:29, Denny Vrandečić wrote:
Thanks for the input so far. Here are a few explicit questions that I have:
- Time: right now the data model assumes that the precision is given at the level of "decade / year / month" etc., which means you can enter a date of birth like 1435 or May 1918. But is this sufficient? We cannot enter a value like 2nd-5th century AD (we could enter 1st millennium AD, which would be a loss of precision).
Sometimes we do not have exact dates for events, because they are in the future, like the opening of the Berlin/Brandenburg airport ;-)
So in this case it would be great to have property values like N/A for "not available" for some parts. A date could then look like May, N/A 1435.
Expressions like 2nd to 5th century could easily be expressed by using tolerances. This would then look like 300 AD ± 200 years, which is (max+min)/2 ± |max - (max+min)/2|.
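As a worked sketch of that formula in Python (helper name invented): the range 2nd-5th century, taken as the years 100-500, becomes a midpoint with a symmetric tolerance.

def range_to_tolerance(min_year, max_year):
    # (max+min)/2 ± |max - (max+min)/2|
    mid = (max_year + min_year) / 2
    return mid, abs(max_year - mid)

print(range_to_tolerance(100, 500))  # (300.0, 200.0) -> 300 AD ± 200 years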
- Geo: the model assumes latitude, longitude and altitude, and defines altitude as over mean sea level (simplified). Is altitude at all useful? Should it be removed from Geolocation and instead be moved to a property called "height" or "altitude" that is dealt with outside of the geolocation?
Altitudes are for sure very important. Maybe not in Germany, but for Austria I know several altitudes. It's very interesting for passes, mountains, cities, villages, sources of brooks, etc.
How will you create a list of the highest mountains without the altitude of their highest peaks (Phase 3)? Or even one of the lowest passes for crossing the Alps?
Look at the articles about living creatures. It is often mentioned between which altitudes they can be found, so there we even have the value twice. But here we do not have geolocations.
IMHO it would make sense to have something hybrid. The datatype for geolocation should accept something like a NaN value for optional altitudes. But it should also be possible to use altitudes without longitude and latitude.
So we can store something like
http://www.grassyknolltv.com/2012/tour-de-france/resources/maps/ETAPE%2007%202012.jpg http://www.grassyknolltv.com/2012/tour-de-france/resources/profiles/profile-07.jpg
which are both about the 7th stage of the Tour de France 2012, just in different views. Providing the possibility to store separate altitudes allows us to store properties like "grows at altitudes from 1800 to 2200 meters above sea level".
Separate altitudes are possibly very similar to the lengths of snakes, which are also measured in meters and also have two values, just in another dimension.
- Units are currently planned to be defined on the property page (as is done in SMW). So you say that height is measured in metres, where one metre corresponds to 3.28084 feet, etc. Wikidata would allow such linear conversions to be defined within the wiki, so this could be done by the community. This makes everything a bit more complicated -- one could also imagine defining all dimensions and units in PHP and then having the properties reference the dimensions. Since there are only a few hundred units and dimensions, this could be viable.
(Non-linear transformations -- most notoriously temperature -- will get their own implementation anyway)
Opinions?
IMHO it would make sense to use the [[International System of Units]] for internal storage. It is not consistently used in other realms, not even in German-speaking countries (PS vs. kW for cars). Maybe it would be possible to use small scripts (such as WP templates) to convert values, which could easily be developed by the community.
Cheers
Marco
2012/12/17 Denny Vrandečić <denny.vrandecic@wikimedia.de mailto:denny.vrandecic@wikimedia.de>
As Phase 2 is progressing, we have to decide on how to represent data values. I have created a draft for representing numbers and units, points in time, and locations, which can be found here: <https://meta.wikimedia.org/wiki/Wikidata/Development/Representing_values> including a first suggestion on the functionality of the UI which we would be aiming at eventually. The draft is unfortunately far from perfect, and I would very welcome comments and discussion. We probably will implement them in the following order: geolocation, date and time, numbers. Cheers, Denny -- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 <tel:%2B49-30-219%20158%2026-0> | http://wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985 <tel:27%2F681%2F51985>.
-- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Thank you for your comments, Marco.
2012/12/18 Marco Fleckinger marco.fleckinger@wikipedia.at
On 2012-12-18 15:29, Denny Vrandečić wrote:
- Time: right now the data model assumes that the precision is given at the level of "decade / year / month" etc., which means you can enter a date of birth like 1435 or May 1918. But is this sufficient? We cannot enter a value like 2nd-5th century AD (we could enter 1st millennium AD, which would be a loss of precision).
Sometimes we do not have exact dates for events, because they are in the future, like the opening of the Berlin/Brandenburg airport ;-)
So in this case it would be great to have property values like N/A for "not available" for some parts. A date could then look like May, N/A 1435.
The current model allows setting the precision to a specific level. In that case you would set it to "month" and say May 1435.
Expressions like 2nd to 5th century could easily be expressed by using tolerances. This would then look like 300 AD ± 200 years, which is (max+min)/2 ± |max - (max+min)/2|.
This, on the other hand, would be a completely different system for precision (the one that we are using for numbers, actually), where we have an uncertainty. The problem with uncertainty and time is that neither years nor months nor almost anything else has a uniform length. So we would need to save both the uncertainty and the unit of the uncertainty, which would make the model more complex. This would be a viable solution, I guess.
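To make the two systems concrete, a rough sketch with invented field names (not the actual Wikidata model): precision as a level versus uncertainty carrying its own unit.

# Sketch only; field names are invented.
# (a) precision as a level: "May 1435"
born = {"time": "1435-05", "precision": "month"}
# (b) precision as an uncertainty, which needs its own unit because
#     years, months and days have no uniform length:
dated = {"time": "0300", "uncertainty": 200, "uncertainty_unit": "year"}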
- Geo: the model assumes latitude, longitude and altitude, and defines altitude as over mean sea level (simplified). Is altitude at all useful? Should it be removed from Geolocation and instead be moved to a property called "height" or "altitude" that is dealt with outside of the geolocation?
Altitudes are for sure very important. Maybe not in Germany, but for Austria I know several altitudes. It's very interesting for passes, mountains, cities, villages, sources of brooks, etc.
How will you create a list of the highest mountains without the altitude of their highest peaks (Phase 3)? Or even one of the lowest passes for crossing the Alps?
The question is not whether we need an altitude at all. The question is whether it should be part of the Geolocation datavalue or a property of its own. If you want a list of the highest mountains, you would actually sort them by the height property.
Look at the articles about living creatures. It is often mentioned between which altitudes they can be found, so there we even have the value twice. But here we do not have geolocations.
IMHO it would make sense to have something hybrid. The datatype for geolocation should accept something like a NaN value for optional altitudes. But it should also be possible to use altitudes without longitude and latitude.
Why would it make sense to have both? What is the altitude in the geolocation good for? Is there an example in Wikipedia where it is or would be used, and where a property of its own would not be better?
So we can store something like
http://www.grassyknolltv.com/2012/tour-de-france/resources/maps/ETAPE%2007%202012.jpg http://www.grassyknolltv.com/2012/tour-de-france/resources/profiles/profile-07.jpg
which are both about the 7th stage of the Tour de France 2012, just in different views. Providing the possibility to store separate altitudes allows us to store properties like "grows at altitudes from 1800 to 2200 meters above sea level".
Separate altitudes are possibly very similar to the lengths of snakes, which are also measured in meters and also have two values, just in another dimension.
Exactly.
- Units are currently planned to be defined on the property page (as is done in SMW). So you say that height is measured in metres, where one metre corresponds to 3.28084 feet, etc. Wikidata would allow such linear conversions to be defined within the wiki, so this could be done by the community. This makes everything a bit more complicated -- one could also imagine defining all dimensions and units in PHP and then having the properties reference the dimensions. Since there are only a few hundred units and dimensions, this could be viable.
(Non-linear transformations -- most notoriously temperature -- will get their own implementation anyway)
Opinions?
IMHO it would make sense to use the [[International System of Units]] for internal storage. It is not consistently used in other realms, not even in German-speaking countries (PS vs. kW for cars). Maybe it would be possible to use small scripts (such as WP templates) to convert values, which could easily be developed by the community.
Internally we would translate it, yes, otherwise the numbers would not be comparable. But for editing we need to keep the unit of the source / of the editor, or else we will lose precision.
On 2012-12-18 16:52, Denny Vrandečić wrote:
Thank you for your comments, Marco.
NP
2012/12/18 Marco Fleckinger <marco.fleckinger@wikipedia.at>
On 2012-12-18 15:29, Denny Vrandečić wrote:
* Time: right now the data model assumes that the precision is given at the level of "decade / year / month" etc., which means you can enter a date of birth like 1435 or May 1918. But is this sufficient? We cannot enter a value like 2nd-5th century AD (we could enter 1st millennium AD, which would be a loss of precision).
Sometimes we do not have exact dates for events, because they are in the future, like the opening of the Berlin/Brandenburg airport ;-)
So in this case it would be great to have property values like N/A for "not available" for some parts. A date could then look like May, N/A 1435.
The current model allows setting the precision to a specific level. In that case you would set it to "month" and say May 1435.
Expressions like 2nd to 5th century could easily be expressed by using tolerances. This would then look like 300 AD ± 200 years, which is (max+min)/2 ± |max - (max+min)/2|.
This, on the other hand, would be a completely different system for precision (the one that we are using for numbers, actually), where we have an uncertainty. The problem with uncertainty and time is that neither years nor months nor almost anything else has a uniform length. So we would need to save both the uncertainty and the unit of the uncertainty, which would make the model more complex. This would be a viable solution, I guess.
I'd use the same unit as the base -- it's 300 years AD, so the uncertainty is also measured in years.
* Geo: the model assumes latitude, longitude and altitude, and defines altitude as over mean sea level (simplified). Is altitude at all useful? Should it be removed from Geolocation and instead be moved to a property called "height" or "altitude" that is dealt with outside of the geolocation?
Altitudes are for sure very important. Maybe not in Germany, but for Austria I know several altitudes. It's very interesting for passes, mountains, cities, villages, sources of brooks, etc.
How will you create a list of the highest mountains without the altitude of their highest peaks (Phase 3)? Or even one of the lowest passes for crossing the Alps?
The question is not whether we need an altitude at all. The question is whether it should be part of the Geolocation datavalue or a property of its own. If you want a list of the highest mountains, you would actually sort them by the height property.
A list of the highest mountains with their peaks within a specific area?
Look at the articles about living creatures. It is often mentioned between which altitudes they can be found, so there we even have the value twice. But here we do not have geolocations.
IMHO it would make sense to have something hybrid. The datatype for geolocation should accept something like a NaN value for optional altitudes. But it should also be possible to use altitudes without longitude and latitude.
Why would it make sense to have both?
The answer is below.
What is the altitude in the geolocation good for?
Peaks of mountains, e.g., are geolocated with an altitude. Also passes, some events, and places where something scientific was found or took place.
Is there an example in Wikipedia where it is or would be used, and where a property of its own would not be better?
E.g. it prevents having one geolocation and multiple altitudes. A list of points on one stage of the Tour de France should contain points that each have all three components, because they belong together. The place is always mentioned together with the altitude.
So we can store something like
http://www.grassyknolltv.com/2012/tour-de-france/resources/maps/ETAPE%2007%202012.jpg
http://www.grassyknolltv.com/2012/tour-de-france/resources/profiles/profile-07.jpg
which are both about the 7th stage of the Tour de France 2012, just in different views. Providing the possibility to store separate altitudes allows us to store properties like "grows at altitudes from 1800 to 2200 meters above sea level".
Separate altitudes are possibly very similar to the lengths of snakes, which are also measured in meters and also have two values, just in another dimension.
Exactly.
* Units are currently planned to be defined on the property page (as is done in SMW). So you say that height is measured in metres, where one metre corresponds to 3.28084 feet, etc. Wikidata would allow such linear conversions to be defined within the wiki, so this could be done by the community. This makes everything a bit more complicated -- one could also imagine defining all dimensions and units in PHP and then having the properties reference the dimensions. Since there are only a few hundred units and dimensions, this could be viable.
(Non-linear transformations -- most notoriously temperature -- will get their own implementation anyway)
Opinions?
IMHO it would make sense to use the [[International System of Units]] for internal storage. It is not consistently used in other realms, not even in German-speaking countries (PS vs. kW for cars). Maybe it would be possible to use small scripts (such as WP templates) to convert values, which could easily be developed by the community.
Internally we would translate it, yes, otherwise the numbers would not be comparable. But for editing we need to keep the unit of the source / of the editor, or else we will lose precision.
+1
Cheers
Marco
IMHO it would make sense to use the [[International System of Units]] for internal storage. It is not consistently used in other realms, not even in German-speaking countries (PS vs. kW for cars). Maybe it would be possible to use small scripts (such as WP templates) to convert values, which could easily be developed by the community.
Internally we would translate it, yes, otherwise the numbers would not be comparable. But for editing we need to keep the unit of the source / of the editor, or else we will lose precision.
I propose to use a system which allows storing an SI prefix separately from the number and unit. Otherwise we would have to express the proton (about 1.6–1.7 fm) as 0.0000000000000016-0.0000000000000017 m and a fungus (2.8 km) as 2800 m. Especially the small values would be very error-prone.
Allowing fixed order-of-magnitude prefixes would ease both error control and presentation.
(ASIDE: Regarding presentation, it is not always algorithmically easy to decide whether to present 0.00000000000001 m as 1 × 10^-14 m or as 10 fm = 10 × 10^-15 m. In a scientific context, only the SI steps should be used; in another context the closest decimal may be appropriate.)
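A sketch in Python of what storing the prefix separately buys (the structure is invented for illustration): the value round-trips exactly as entered, while a canonical base-unit number remains available for comparison.

SI_PREFIXES = {"k": 1e3, "": 1.0, "m": 1e-3, "n": 1e-9, "f": 1e-15}

def to_base(number, prefix):
    # canonical value in the base unit, for sorting and comparison
    return number * SI_PREFIXES[prefix]

proton = (1.6, "f", "m")   # stored exactly as entered: 1.6 fm
fungus = (2.8, "k", "m")   # stored exactly as entered: 2.8 km
print(to_base(1.6, "f"))   # 1.6e-15 (metres)
print(to_base(2.8, "k"))   # 2800.0 (metres)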
Gregor
On 2012-12-18 17:49, Gregor Hagedorn wrote:
IMHO it would make sense to use the [[International System of Units]] for internal storage. It is not consistently used in other realms, not even in German-speaking countries (PS vs. kW for cars). Maybe it would be possible to use small scripts (such as WP templates) to convert values, which could easily be developed by the community.
Internally we would translate it, yes, otherwise the numbers would not be comparable. But for editing we need to keep the unit of the source / of the editor, or else we will lose precision.
I propose to use a system which allows storing an SI prefix separately from the number and unit. Otherwise we would have to express the proton (about 1.6–1.7 fm) as 0.0000000000000016-0.0000000000000017 m and a fungus (2.8 km) as 2800 m. Especially the small values would be very error-prone.
Allowing fixed order-of-magnitude prefixes would ease both error control and presentation.
(ASIDE: Regarding presentation, it is not always algorithmically easy to decide whether to present 0.00000000000001 m as 1 × 10^-14 m or as 10 fm = 10 × 10^-15 m. In a scientific context, only the SI steps should be used; in another context the closest decimal may be appropriate.)
But floating-point numbers are handled by the implementation of the [[IEEE floating-point standard]].
Displaying the numbers is another question. There I have to agree that it always makes sense to also store a typically used unit for that type of data.
Distances between towns are usually displayed in km and the height of mountains in metres, if we're talking about the metric system. So I assume that this is a user or language preference, also called a locale setting.
Cheers
Marco
(ASIDE: Regarding presentation, it is not always algorithmically easy to decide whether to present 0.00000000000001 m as 1 × 10^-14 m or as 10 fm = 10 × 10^-15 m. In a scientific context, only the SI steps should be used; in another context the closest decimal may be appropriate.)
But floating-point numbers are handled by the implementation of the [[IEEE floating-point standard]].
Displaying the numbers is another question. There I have to agree that it always makes sense to also store a typically used unit for that type of data.
I agree. What I propose is that the user interface supports entering and proofreading "10.6 nm" as "10.6" plus "n" (= nano) plus "meter". How the value is stored in the data property, whether as 10.6 floating point or as 1.06e-8, is a second issue -- the latter is probably preferable. I only intend to show that scientific values are not always trivial to reverse-engineer from a floating-point value back to the intended value.
In addition to a storage option for the desired unit prefix (this may be considered an original prefix, since naturally re-users may wish to reformat this), it is probably necessary to store the number of significant decimals. I believe in the user interface this need not be any visible setting; simply the number of digits can be preserved. Without this it is impossible to store and reproduce information like "10.20 nm"; it would be returned as 1.02 × 10^-8 m. A complex heuristic may "guess" when to use the scientific SI prefixes instead. The trailing zero cannot be reproduced, however, when relying completely on IEEE floating point.
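A sketch of that round-trip problem (field names invented for illustration): keeping the entered digits alongside the canonical float preserves the trailing zero that the float alone cannot.

# "10.20 nm": the bare float 1.02e-8 cannot reproduce the trailing zero.
length = {
    "amount":  1.020e-8,  # canonical value in metres
    "display": "10.20",   # digits exactly as entered, significant zeros intact
    "prefix":  "n",
    "unit":    "m",
}
print(length["display"] + " " + length["prefix"] + length["unit"])  # 10.20 nm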
Gregor
On 19/12/12 08:53, Gregor Hagedorn wrote:
I agree. What I propose is that the user interface supports entering and proofreading "10.6 nm" as "10.6" plus "n" (= nano) plus "meter". How the value is stored in the data property, whether as 10.6 floating point or as 1.06e-8, is a second issue -- the latter is probably preferable. I only intend to show that scientific values are not always trivial to reverse-engineer from a floating-point value back to the intended value.
Perhaps both should be stored. 1.06e-8 is necessary for sorting and comparison. But 10.6 nm is how the user entered it, presumably how it was written in the source that the user used, how it is preferably used in the given field, and how other users would want to see and edit it.
As an example, human height is commonly given in centimetres, while building height is commonly given in metres. So, users will probably prefer to edit the tallest person as 282 cm and the lowest building as 2.1 m even though the absolute values are similar.
I don't understand why 1.06e-8 is absolutely necessary for sorting and comparison. PHP allows for the definition of custom sorting functions. If a custom datatype is defined, a custom sorting/comparison function can be defined too. Or am I missing some performance points?
On 19.12.2012 11:56, Friedrich Röhrs wrote:
I don't understand why 1.06e-8 is absolutely necessary for sorting and comparison. PHP allows for the definition of custom sorting functions. If a custom datatype is defined, a custom sorting/comparison function can be defined too. Or am I missing some performance points?
We are talking about searching and sorting millions of data entries - doing that in PHP would be extremely slow and would take far more memory than we have. It has to be done natively in the database. So we have to use a data representation that can be natively compared and sorted by the database (at the very least by MySQL, but ideally by many different database systems).
-- daniel
On 19 December 2012 11:56, Friedrich Röhrs f.roehrs@mis.uni-saarland.de wrote:
I don't understand why 1.06e-8 is absolutely necessary for sorting and comparison. PHP allows for the definition of custom sorting functions. If a custom datatype is defined, a custom sorting/comparison function can be defined too. Or am I missing some performance points?
I believe that for performance reasons, in many cases sorting needs to be performed already by the database and supported by an index. -- gregor
Hi,
Sorry for my ignorance if this is common knowledge: What is the use case for sorting millions of different measures from different objects? The only use case I can think of where the sorting would be necessary is lists like "cities in France by size" or "list of cars by fuel per 100 km". Those lists do not have millions of entries, and even there you would need some further logic, since the size is not just one single value; there could be multiple values for the size (or fuel consumption) with different sources and for different times. For cars there could be entries by the manufacturer, by some car-testing magazine, etc. I don't see how this could be adequately represented/sorted by a database-only query.
If however this is necessary, I still don't understand why it must affect the datavalue structure. If an index is necessary, it could be done over a serialized representation of the value. This needs to be done anyway, since the values are saved in a specific unit (which is just a Wikidata item). To compare them on a database level they must all be saved in the same unit, or some sort of procedure must be used to compare them (or am I missing something again?).
Friedrich
On 19.12.2012 14:34, Friedrich Röhrs wrote:
Hi,
Sorry for my ignorance if this is common knowledge: What is the use case for sorting millions of different measures from different objects?
Finding all cities with more than 100000 inhabitants requires the database to look through all values for the property "population" (or even all properties with countable values, depending on implementation and query planning), compare each value with "100000", and return those with a greater value. To speed this up, an index sorted by this value would be needed.
For cars there could be entries by the manufacturer, by some car-testing magazine, etc. I don't see how this could be adequately represented/sorted by a database-only query.
If this cannot be done adequately on the database level, then it cannot be done efficiently, which means we will not allow it. So our task is to come up with an architecture that does allow this.
(One way to allow "scripted" queries like this to run efficiently is to do this in a massively parallel way, using a map/reduce framework. But that's also not trivial, and would require a whole new server infrastructure).
If however this is necessary, I still don't understand why it must affect the datavalue structure. If an index is necessary, it could be done over a serialized representation of the value.
"Serialized" can mean a lot of things, but an index on some data blob is only useful for exact matches, it can not be used for greater/lesser queries. We need to map our values to scalar data types the database can understand directly, and use for indexing.
This needs to be done anyway, since the values are saved in a specific unit (which is just a Wikidata item). To compare them on a database level they must all be saved in the same unit, or some sort of procedure must be used to compare them (or am I missing something again?).
If they measure the same dimension, they should be saved using the same unit (probably the SI base unit for that dimension). Saving values using different units would make it impossible to run efficient queries against these values, thereby defeating one of the major reasons for Wikidata's existence. I don't see a way around this.
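A toy illustration of the point in Python, using SQLite with invented table and column names (the production store would differ, but the principle is the same): values normalized to one base unit in a scalar column can be range-queried through an index.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE claims (item TEXT, prop TEXT, value_base REAL)")
con.execute("CREATE INDEX idx ON claims (prop, value_base)")  # enables range scans
con.execute("INSERT INTO claims VALUES ('Berlin', 'population', 3500000)")
con.execute("INSERT INTO claims VALUES ('Weimar', 'population', 65000)")
# a greater-than query the index can answer without a full scan:
print(con.execute("SELECT item FROM claims WHERE prop='population' "
                  "AND value_base > 100000").fetchall())  # [('Berlin',)]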
-- daniel
Hey wikidatians,
occasionally checking threads in this list like the current one, I get a mixed feeling: on one hand, it is sad to see the efforts and resources wasted as Wikidata tries to reinvent RDF, and now also triplestore design as well as XSD datatypes. What's next, WikiQL instead of SPARQL?
On the other hand, it feels reassuring as I was right to predict this: http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00056.html http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00750.html
Best,
Martynas graphity.org
Martynas,
could you please let me know where RDF or any of the W3C standards covers topics like units, uncertainty, and their conversion. I would be very much interested in that.
Cheers, Denny
On Wed, 19 Dec 2012, Denny Vrandečić wrote:
Martynas, could you please let me know where RDF or any of the W3C standards covers topics like units, uncertainty, and their conversion. I would be very much interested in that.
NIST has created a standard in OWL: "QUDT - Quantities, Units, Dimensions and Data Types in OWL and XML": http://www.qudt.org/qudt/owl/1.0.0/index.html
I fully share Martynas' concerns: most of the problems that are being discussed in this thread (and that are very relevant and interesting) should not be solved with an "object oriented" approach (that is, via properties of objects, and "inheritance") but by semantic modelling (that is, "composition" of knowledge). For example, one single database representation of a unit can have multiple "displays" depending on who wants to see the unit, and in which context; the viewer and the context are rather simple to add via semantic primitives. For example, the "Topic Map" semantic standard would fit here very well, in my opinion: http://en.wikipedia.org/wiki/Topic_map.
Herman
-- KU Leuven, Mechanical Engineering, Robotics Research Group http://people.mech.kuleuven.be/~bruyninc Tel: +32 16 328056 Vice-President Research euRobotics http://www.eu-robotics.net Open RObot COntrol Software http://www.orocos.org Associate Editor JOSER http://www.joser.org, IJRR http://www.ijrr.org
The NIST ontology defines 4 basic classes that are great:
qudt:QuantityKind, qudt:Quantity, qudt:QuantityValue, qudt:Unit
but the properties set leaves me a bit thirsty. Take "Area" as an example. I'd like to reference properties named .ft2 and .m2 so that, for instance, an annotation might be [[Leasable area.ft2::12345]]. To state the precision applicable to that measurement, it might be [[Leasable area.ft2:fractionDigits::0]] to indicate, say, rounding. However, in the NIST ontology there is no "ft2" property at all -- this is an SI unit though, so it seems identifying first the system of measurement units and then the specific measurement unit is not a great idea, because these notations are then divorced from the property name itself, a scenario guaranteed to cause more user errors & omissions, I think.
Someone's mentioned uncertainty facets, so I suggest these from the qudt ontology:
Property: .anyType:relativeStandardUncertainty
Property: .anyType:standardUncertainty
Other facets noted might include
Property: .anyType:abbreviation
Property: .anyType:description
Property: .anyType:symbol
-john
Denny,
you're sidestepping the main issue here -- every sensible architecture should build on as many previous standards as possible, and build its own custom solution only if a *very* compelling reason is found to do so instead of finding a compromise between the requirements and the standard. Wikidata seems to be constantly doing the opposite -- building a custom solution with whatever reason, or even without one. This drives compatibility and reuse towards zero.
This thread originally discussed datatypes for values such as numbers, dates and their intervals -- semantics for all of those are defined in XML Schema Datatypes: http://www.w3.org/TR/xmlschema-2/ All the XML and RDF tools are compatible with XSD, however I don't think there is even a single mention of it in this thread? What makes Wikidata so special that its datatypes cannot build on XSD? And this is only one of the issues, I've pointed out others earlier.
Martynas graphity.org
Martynas,
I think you misinterpret the thread. There is no discussion about not building on the datatypes defined in http://www.w3.org/TR/xmlschema-2/
What we are doing is discussing compositions of elements, all typed to XML datatypes, that shall be able to express scientific and engineering requirements as to statistics and significant digits (except perhaps for duration, none of the datatypes in http://www.w3.org/TR/xmlschema-2/ supports that), as well as means to express uncertainty and confidence intervals.
Many existing XML schemata define such compositions, all squarely built on http://www.w3.org/TR/xmlschema-2/ -- Wikidata is certainly not unique in this effort. If you can point the team to further well-reviewed solutions, this would be very useful.
Gregor
Hi Gregor - the root of the misconception I likely have about significant digits and the like is that they are one example of a rendering parameter, not a semantic property. But maybe I've missed the part of the discussion that cleanly separates these things into buckets, and perhaps I have my head way too deep in the sand to understand the discussion altogether, but that's my two cents fwiw.
With regard to other relevant worthy efforts, I suggest http://www.w3.org/TR/vocab-data-cube/.
On 19 December 2012 20:01, jmcclure@hypergrove.com wrote:
Hi Gregor - the root of the misconception I likely have about significant digits and the like is that they are one example of a rendering parameter, not a semantic property.
It is about semantics, not formatting.
In science and engineering, the number of significant digits is not used to right-align numbers, but to semantically indicate the order of magnitude of the accuracy and/or precision of a measurement or quantity. Thus, the weight of a machine can be given as 1.2 t (exact to +/- 50 kg), 1200 kg (+/- 1 kg), or 1200.000 kg.
This is not part of IEEE floating-point numbers, which always have the same type-dependent precision or number of significant digits, regardless of whether this is semantically justified or not. An IEEE 754 standard double always has about 16 significant decimal digits, i.e. the value 1.2 tons will always be given as 1.200000000000000 tons. This is good for calculations, but lacks the information needed for final rounding.
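A tiny Python illustration: the same stored double renders quite differently once the number of significant digits is recorded separately, which is exactly the information IEEE floating point drops.

weight_kg = 1200.0
for sig in (2, 4, 7):
    # 2 -> 1.2e+03, 4 -> 1200, 7 -> 1200.000
    print(f"{weight_kg:.{sig}g} kg")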
Gregor
totally agree - hopefully XSD facets provide a solid start to meeting those concrete requirements - thanks.
totally agree - hopefully XSD facets provide a solid start to meeting those concrete requirements
they don't. They allow defining derived datatypes and thus apply to the datatype, not the measurement. Different measurements of the same datatype may be of different precision. --gregor
For me the question is how to name the precision information. Do not the XSD facets "totalDigits" and "fractionDigits" work well enough? I mean
.number:totalDigits contains a positive power of ten for precision
.number:fractionDigits contains a negative power of ten for precision
The use of the word "datatype" is always interesting, as somehow it's meant to be organically different from "the measurement" to which it's related. Both are resources with named properties - what are those names? Certain property names derived from international standards should be considered "builtin" to whatever foundation the implementing tool provides. I suggest that XSD names be used, at least for concepts that appear to be the same, with or without the xsd: XML-namespace prefix.
But the word "datatype" fascinates me even more, ever since SMW internalized the Datatype namespace. To me, RDF made an error back when the rdf:type property got the range Class, when it should have been Datatype (though politics got in the way!). It gets more twisted now that Category is the chosen implementation of rdfs:Class. The problem that presents is that categories are lists, while a class (that is, an rdf:type value) is for some a singular, and for others a plural, concept or label. Pure semantic mayhem.
I'm happy SMW internalized the datatype namespace to the extent it maps to its software, chiefly because it clarifies that a standard "Type" namespace is needed -- one that contains singular noun phrases and is the value range for rdf:type (if you will) properties. All measurement types (e.g. Feet, Height & Lumens) would be represented there too, like any other "class", with their associated properties, which (in the case of numerics) would include ".totalDigits" and ".fractionDigits".
Going this route -- establishing a standard Type namespace -- would allow wikis to have a separate vocabulary of singular noun phrases not in the Category namespace. The ultimate goal is to associate a given Type with its implementation as a wiki namespace, subpage, or subobject; the Category namespace itself is already overloaded to handle that task.
-john
On 20 December 2012 02:20, jmcclure@hypergrove.com wrote:
For me the question is how to name the precision information. Do not the XSD facets "totalDigits" and "fractionDigits" work well enough? I mean
Yes, that would be one way of modeling it. And I agree with you that, although the XSD attributes were originally devised for datatypes, there is nothing wrong with re-using them for quantities and measurements.
So one way of expressing a measurement with significant digits is:

(Proposal 1)
* normalizedValue
* totalDigits
* fractionDigits
* originalUnit
* normalizedUnit
To recover the original information (e.g. that the original value was in feet with a given number of significant digits), the software must convert normalizedUnit to originalUnit, scale to totalDigits with fractionDigits, calculate the remaining powers of ten, and use some information, stored together with each unit, about whether the result should be expressed using an SI unit prefix (exa, tera, giga, mega, kilo, hecto, deca, centi, etc.). Some units use them, others do not, and some units use only a few: hectoliter is common, hectometer would be very odd. This is slightly complicated by the fact that for some units, prefix usage in lay topics differs from scientific use.
If all numbers were expressed ONLY as total digits with fraction digits and unit prefix, i.e. no power-of-ten exponential, the above would be sufficiently complete. However, without additional information it does not allow recovering the entry:
100,230 * 10^3 tons (value 1.0023e8, 6 total, 3 fractional digits, original unit tons, normalized unit gram)
I had therefore made (on the wiki) the proposal to express it as:
(Proposal 2)
* normalizedValue
* significantDigits (and I am happy with totalDigits instead)
* originalUnit
* originalUnitPrefix
* normalizedUnit
However, I see now that that analysis was wrong: it needs fractionDigits in addition to totalDigits, else a similar problem may occur, i.e. the distribution of the total order of magnitude of the number between non-fractional digits, fractional digits, powers of ten, and powers of ten expressed through SI prefixes is still not unambiguous.
So the minimal representation seems to be:
(Proposal 3)
* normalizedValue (xsd:double or xsd:decimal)
* totalDigits (xsd:smallint)
* fractionDigits (xsd:smallint)
* originalUnit (a wikidata item)
* originalUnitPrefix (a wikidata item)
* normalizedUnit (a wikidata item)
Adding the originalUnitPrefix has the advantage that it gathers knowledge from users and data creators or resources about which unit prefix is appropriate in a given context.
I am very critical of the current Wikidata plan to solve this problem by heuristics; I do not yet see a data set that sufficiently tests the heuristics. Gathering information from entered data and building a formatting heuristics module over the coming years (instead of weeks) will be valuable for reformatting. Proposal 3 allows gathering this information.
Gregor
Note 1: The question of other means to express accuracy or precision, e.g. by error margins, statistical measures of spread such as variance, confidence intervals, percentiles, min/max etc. is not yet covered.
Given the present discussion, this should probably be separately agreed upon.
Note 2: Wikipedia infoboxes may desire to override it; this is for data entry, review, curation, and a default display where no other is defined.
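As a minimal sketch, Proposal 3 could be written as a plain record (Python assumed; the item IDs are illustrative placeholders, not Wikidata content):

from dataclasses import dataclass
from decimal import Decimal

@dataclass
class QuantityValue:
    normalized_value: Decimal   # stored in the normalized unit
    total_digits: int           # xsd:totalDigits of the original entry
    fraction_digits: int        # xsd:fractionDigits of the original entry
    original_unit: str          # a wikidata item, e.g. "Q:ton"
    original_unit_prefix: str   # a wikidata item, e.g. "Q:kilo"
    normalized_unit: str        # a wikidata item, e.g. "Q:gram"

# The ambiguous entry "100,230 * 10^3 tons" from above, stored losslessly:
v = QuantityValue(Decimal("1.0023E+8"), 6, 3, "Q:ton", "Q:kilo", "Q:gram")

Because the prefix and both digit counts are stored explicitly, the original rendering can be recovered rather than guessed at by heuristics.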
(Proposal 3, modified)
* value (xsd:double or xsd:decimal)
* unit (a wikidata item)
* totalDigits (xsd:smallint)
* fractionDigits (xsd:smallint)
* originalUnit (a wikidata item)
* originalUnitPrefix (a wikidata item)
JMc: I rearranged the list a bit and suggested simpler naming
JMc: Is not originalUnitPrefix directly derived from originalUnit?
JMc: It may be more efficient to store, not reconstruct, the original value. It may even be better to store the original value somewhere else entirely, earlier in the process, e.g. within the context that you indicate would be worthwhile to capture, because I wouldn't expect a lot of retrievals - but you certainly anticipate usage patterns better than I.
How about just:
Datatype: .number (Proposal 4)
-----------------------------------------
:value (xsd:double or xsd:decimal)
:unit (a wikidata item)
:totalDigits (xsd:smallint)
:fractionDigits (xsd:smallint)
:original (a wikidata item that is a number object)
I am still trying to catch up with the whole discussion and to distill the results, both here and on the wiki.
In the meanwhile, I have tried to create a prototype of how a complex model can still be entered in a simple fashion. A simple demo can be found here:
The prototype is not i18n.
The user has to enter only the value, in a hopefully intuitive way (try it out), and the full interpretation is displayed here (that, alas, is not intuitive, admittedly).
Cheers, Denny
Thank you very much. I think this visualization will help.
Just tried it and found out that entering 18 and selecting ft results in 18 ft ± 1 ft. Shouldn't it be 18 ± 0.5 ft? So it has to be divided by 2.
Cheers
Marco
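A sketch of the halving rule Marco suggests, assuming the parser infers precision only from the digits the user typed:

from decimal import Decimal

def default_uncertainty(entered):
    # Half a unit in the last place of the entered number.
    exp = Decimal(entered).as_tuple().exponent   # 0 for "18", -1 for "18.3"
    return Decimal(5) * Decimal(10) ** (exp - 1)

print(default_uncertainty("18"))    # 0.5  -> 18 ± 0.5 ft
print(default_uncertainty("324"))   # 0.5  -> 324 ± 0.5 m
print(default_uncertainty("1.2"))   # 0.05 -> 1.2 ± 0.05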
I believe there are a lot of dangerous assumptions on http://simia.net/valueparser/
First: there is no indication in a number that it is _not_ endlessly precise. Apostles = 12 has no uncertainty, representing it as 12 ± 1 is wrong, but also 12 ± 0.5 is wrong.
The same applies to a number like 12.2. The data source and author MAY desire to express significant digits, but we simply don't know. Wikidata should keep this at the don't know level and not force-convert a number of unknown measurement precision to a number with explicitly stated (but potentially totally wrong) precision or accuracy limits.
For example, in science it is quite common to give light-microscopic measurements to one decimal place beyond the micrometer, even though the precision is 0.2 µm. The latter is simply known and therefore not constantly repeated, unless specific circumstances justify it.
As discussed above: plus/minus 1 s.d. does not give you a confidence interval for the mean; it gives you a measure of dispersion.
---------
My proposal: make the default "plus/minus values unknown, only significant digits known". The interpretation of the significant digits is not machine-available unless qualifiers say so. It can, however, be used to produce an estimate of significant digits after conversion.
Make the interval-points an option. If explicitly entered: excellent information. If not: don't try to create (false) knowledge from void.
Gregor
On 20.12.2012 20:52, Gregor Hagedorn wrote:
I believe there are a lot of dangerous assumptions on http://simia.net/valueparser/
First: there is no indication in a number that it is _not_ endlessly precise. Apostles = 12 has no uncertainty, representing it as 12 ± 1 is wrong, but also 12 ± 0.5 is wrong.
It will be possible to explicitly state the level of uncertainty; the demo is just about the mechanism for determining the default.
My proposal: make the default: plus-minus values unknown, only significant digits known.
I don't like "significant digits" because it depends on the writing system (base 10). I'd much rather express this as absolute values.
The interpretation of significant digits is not machine-available unless qualifiers say so. It can however be used to result in an estimate of significant digits after conversion.
That means that the figure is not usable for query answering at all. If we don't know the level of certainty, we cannot use the number.
Make the interval-points an option. If explicitly entered: excellent information. If not: don't try to create (false) knowledge from void.
Yes, it will be an option. Making the default "unknown" would be bad though, I think.
However, we should probably store whether the level of certainty was given explicitly or estimated automatically based on the number of significant digits - then we can still ignore automatic values when desired.
-- daniel
I don't like "significant digits" because it depends on the writing system (base 10). I'd much rather express this as absolute values.
Yes, I would like that too. What I argue is that in 99.99999% of cases (not a researched number, of course) you simply don't know more than that there is a given number of digits in base 10. Whether that is meaningful, just sloppy, or even a willful simplification (probably the vast majority of quantities in current Wikipedia belong to the latter category) is unknown.
That means that the figure is not usable for query answering at all. If we don't know the level of certainty, we cannot use the number.
That will usually be the case. Unless you know which kind of margin the numbers reflect, you cannot use them for answering anyway. What do you do with the two examples 100 +/- 50 and 100 +/- 0.1, which are results from the same dataset and precisely reflect the same quantity? If you know that the first is a 95% measure of dispersion and the second a 95% CI for the mean, you can ask people whether they are looking for the mean (best estimate) or for a single observation.
Make the interval-points an option. If explicitly entered: excellent information. If not: don't try to create (false) knowledge from void.
Yes, it will be an option. Making the default "unknown" would be bad though, I think.
The default has to reflect reality. If you make it complicated to enter the actual default situation, and automatically add a margin of error/dispersion/tolerance, then people will simply let it happen, start ignoring it, fail to understand it, and in the end Wikidata will be known as a bunch of unreliably encoded information.
However, we should probably store whether the level of certainty was given explicitly or estimated automatically based on the number of significant digits - then we can still ignore automatic values when desired.
Which will force all re-users to understand this and to throw away these values prior to any analysis...
Why so complicated?
Gregor
Hi,
I tried to enter the height of the Eiffel Tower: 324 meters. It suggested 324 m ± 100 m. So I clicked on details and emptied the +/- fields, because I do not know the upper or lower bounds. Would that have been the correct way to do it? Or should I set them to 0, even though I don't know at which precision this measurement was made? There should be an N/A value set as the default for the precision, IMHO, because for most items the precision is not known.
Friedrich
On 20.12.2012 20:31, Friedrich Röhrs wrote:
Hi,
I tried to enter the height of the Eiffel Tower: 324 meters. It suggested 324 m ± 100 m.
That's strange. When I enter 324 m, it correctly suggests 324 m ± 1 m for me.
-- daniel
Hi,
It does for me too now. Maybe I played around with the autouncertainty checkbox before trying 324 (I probably had some 4-digit value before). The problem remains, though: the height of the Eiffel Tower is not 324 ± 1 meter. It is given as 324 meters without any further information. So it should either be ± 0 meters or ± unknown meters, no? I think for most items on Wikipedia the upper or lower uncertainty is not known, or at least no source gives it. So in most use cases it is an unknown value. To make it easy for editors, it should be marked as not available from the start.
(Staying with the Eiffel Tower as an example, I don't see what additional information the 0.68 confidence supplies.)
Hi all,
wow! Thanks for all the input. I read it all through, and am currently trying to digest it into a new draft of the data model for the discussed data values. I will try to address some questions here. Please be kind if I refer to the wrong person at one place or the other.
Whenever I refer to the "current model", I mean the version as it was during this discussion: <http://meta.wikimedia.org/w/index.php?title=Wikidata/Development/Representing_values&oldid=4859586>
The term "updated model" refers to the new one, which is not published yet. I hope I can do that soon.
== General comments ==
I want to remind everyone of the Wikidata requirements: <http://meta.wikimedia.org/wiki/Wikidata/Notes/Requirements>
Here especially:
* The expressiveness of Wikidata will be limited. There will always be examples of knowledge that Wikidata will not be able to convey. We hope that this expressiveness can increase over time.
* The first goal of Wikidata is to serve actual use cases in Wikipedia, not to enable some form of hypothetical perfection in knowledge representation.
* Wikidata has to balance ease of use and expressiveness of statements. The user interface should not get complicated to merely cover a few exceptional edge cases.
* What is an exceptional case, and what is not, will be defined by how often they appear in Wikipedia. Instead of anecdotal evidence or hypothetical examples we will analyse Wikipedia and see how frequent specific cases are.
In general this means that we cannot express everything that is expressible. A statement should not be intended to reflect the source as closely as possible, but rather to be *supported* by the source. I.e. if the source says "He died during the early days of 1876", this would also support a statement like "died in - 19th century". It does not have to be more exact than that.
Martynas, there is no mention here of XSD etc. because it is not relevant on this level of discussion. For exporting the data we will obviously use XSD datatypes. This is so obvious that I didn't think it needed to be explicitly stated.
Tom, thanks for the links to EDTF and the Freebase work, this was certainly very enlightening.
Friedrich, the term "query answering" simply means the ability to answer queries against the database in Phase 3, e.g. the list of cities located in Ghana with a population over 25,000 ordered by population.
A query system that deals well with intervals -- I would need a pointer for that. For now I was assuming we would use a single value internally to answer such queries. If the value is 90±20, then the query ">100?" would not contain that result. Sucks, but I don't know of any better system.
We do not rely on floats anywhere (besides internal representations), but always use decimals. Floats have some inherent problems representing some numbers that could be interesting for us.
== Time ==
Marco suggested allowing N/A for some parts of dates. This is partially the idea of the "precision" attribute in the current model: anything below the precision would be N/A. It would not be possible to N/A the year when the month or day is known, though, as Friedrich suggested.
Friedrich also suggested to use a value like April-July 1567 for uncertain time instead of the current precision model. I prefer his suggestion to the current one and will include that in the updated model.
The accuracy, though, has to be in the unit given by the precision; we cannot just take seconds, since there is no well-defined number of seconds in a month or a year -- or almost anything, actually.
Note though that the intervals that Sven mentioned -- useful e.g. for reigns or office periods -- are different beasts and should have uncertainty entries both for the start and the end date. We have intervals in the data model, and plan to implement them later -- it is just that they are not such a high priority (dates appear 2.5 million times in infoboxes, intervals only 80,000 times).
I am completely unsure what to do with a value like "about 1850", if not to interpret it as something like 1850 ± 50, but Sven seems to dislike that.
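A sketch of a time value along these lines, with uncertainty counted in units of the precision (field names are assumptions, not the published model):

from dataclasses import dataclass

PRECISION = {"millennium": 0, "century": 1, "decade": 2,
             "year": 3, "month": 4, "day": 5}

@dataclass
class TimeValue:
    time: str        # e.g. "+1850-00-00"
    precision: int   # index into PRECISION
    before: int = 0  # uncertainty before/after, in units of the precision
    after: int = 0

# "about 1850", read as 1850 +- 50 years:
about_1850 = TimeValue("+1850-00-00", PRECISION["year"], before=50, after=50)

# Friedrich's "April-July 1567": month precision with an asymmetric window,
# instead of rounding down to a year-only statement.
spring_1567 = TimeValue("+1567-04-00", PRECISION["month"], after=3)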
== Location ==
After the discussion, I decided to drop altitude / elevation from the Geolocation. It can still be expressed through a property, and will have all the flexibility of a normal property (including qualifiers, etc.).
In a Geolocation, neither the lat nor the long is optional (sorry Nikola). The Geolocation as a whole can be optional, though (i.e. unknown), but not only one of them.
For the geolocation's uncertainty I would like to use the same uncertainty model as for quantity values and now for time. I know that "meters" has been suggested instead of degrees, but that would be kind of ugly, considering that the biggest reason we need the uncertainty is for converting units, in this case from decimal degrees to degrees-minutes-seconds.
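A toy sketch of that conversion, assuming the uncertainty is expressed in degrees and decides how many DMS components are worth printing:

def to_dms(deg, uncertainty_deg):
    sign = "-" if deg < 0 else ""
    deg = abs(deg)
    d = int(deg)
    m = int((deg - d) * 60)
    s = (deg - d - m / 60) * 3600
    if uncertainty_deg >= 1 / 60:     # not precise to a minute: drop it
        return f"{sign}{d}°"
    if uncertainty_deg >= 1 / 3600:   # not precise to a second
        return f"{sign}{d}°{m}'"
    return f"{sign}{d}°{m}'{s:.0f}\""

print(to_dms(48.8583, 0.0001))   # 48°51'30" (roughly the Eiffel Tower)
print(to_dms(48.8583, 0.01))     # 48°51'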
== Quantity values ==
Sorry to disagree with Daniel here, but we will definitely store a quantity value in the unit that the editor used for input. We will then internally normalize it for indexing etc., but the editor won't be bothered with that as long as they do not ask for a conversion. Storing it with the original unit is important for a number of reasons, most of which Gregor already alluded to.
I very much like Gregor's suggestion: rename the lower uncertainty and upper uncertainty to something with less semantic baggage. What about upper and lower bound? Or just upper and lower? And then leave the interpretation to others.
Gregor, an infinitely precise number (the number of apostles, e.g.) would be handled trivially by ± 0.
Also, I am taking the hint from Avenue and others and dropping confidence. I don't think it is useful to have it so deeply embedded in the data model; it should properly be handled through qualifiers.
Regarding the height of the Eiffel Tower: 324 m ± 1 m is exactly what I would like to see here if the source states 324 meters. I know the source doesn't say ± 1 m, but this is certainly supported by the source. Think about why we need this ± 1 m: it is simply so we can give a useful transformation into feet. Otherwise we cannot convert units. The ± 1 m would usually not be displayed.
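A sketch of that argument, assuming the display precision of the converted value is chosen from the converted uncertainty:

import math

M_TO_FT = 3.28084

def to_feet(value_m, plus_minus_m):
    v = value_m * M_TO_FT
    u = plus_minus_m * M_TO_FT
    # Keep decimal places only down to the order of the uncertainty:
    digits = max(0, -math.floor(math.log10(u)))
    return f"{v:.{digits}f} ft (± {u:.2g} ft)"

print(to_feet(324, 1))      # 1063 ft (± 3.3 ft), not 1062.99216 ft
print(to_feet(324, 0.01))   # 1062.99 ft (± 0.033 ft)

Without the ± 1 m there is no principled way to decide where to cut off the converted digits.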
== Units ==
I sense consensus that we should allow the declaration of units in the wiki, and not have them hardcoded in the software. Having discussed the various options, and in light of the discussion here, the current suggestion is to create a page for every quantity unit, including the appropriate factors (for linear translations). This is similar to the way Freebase does it, as sent around by Tom, and what John McClure suggested.
Each property then points to a quantity unit and furthermore lists the "usual units" for the given property (pointing to the given items), which is used for display.
Internally, for indexing, sorting, and query answering, we would always transform the input to the quantity unit so that values are comparable. But this is usually neither exposed nor a useful number (e.g. it might have too many significant digits, etc.).
This would allow the use of historic units like Li or historic miles, even though we do not know how to translate them to other units (but not by the same property).
This would also allow for other units, like Avenue has pointed out. Those are important.
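A sketch of the normalization step described above, assuming each unit page carries a linear factor to the property's quantity unit (units and factors here are illustrative):

FACTOR_TO_METRE = {"metre": 1.0, "foot": 0.3048, "mile": 1609.344}

def normalize(value, unit):
    # Convert to the base quantity unit so that values are comparable
    # for indexing, sorting, and query answering.
    return value * FACTOR_TO_METRE[unit]

heights = [(324, "metre"), (1063, "foot"), (0.2, "mile")]
print(sorted(normalize(v, u) for v, u in heights))
# [321.8688, 324.0, 324.0024] -- comparable, though not pretty to display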
Nikola, we will not have special handling for money for now. This would require a whole different spec, I am afraid. Currencies appear 200,000 times in Wikipedia -- that is often, but not often enough to be high priority.
I hope that I managed to digest the whole discussion and bring it together.
Cheers, Denny
(If I knew Denny's private email, I'd send this there.)
"Martynas, there is no mention here of XSD etc. because it is not relevant on this level of discussion. For exporting the data we will obviously use XSD datatypes. This is so obvious that I didn't think it needed to be explicitly stated."
Maybe you don't realize that snarky comments such as the last sentence are infinitely tiresome. It certainly demonstrates your lack of understanding of what Martynas is saying: build on others' work to the extent that you can and, if you can't, explain why not.
On 21 December 2012 18:14, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
What about upper and lower bound?
I did like these, but they have a hard semantics as well (which I was not acquainted with, so I did like them until reading http://en.wikipedia.org/wiki/Upper_and_lower_bounds).
Or just upper and lower? And then leave the interpretation to others.
I like plus and minus better, then. For the limited application that this value will have (i.e. not expressing statistics, min/max, ranges, etc.), it should not be a full value but an addition/subtraction value.
Making it a full value will throw the quantity type into the muddy waters of being abused as a (so far missing) "range-interval with optional central value" type. I think this should be avoided.
By your own priorities, I am somewhat confused why you want to cover this case at all. Can you show the Wikipedia occurrence statistics for values for which a margin of error is given? I can remember plenty of cases where an interval type is needed, but I cannot recall more than 1 or 2 times I have encountered a plus/minus number.
I propose to drop it altogether, and use the XSD-based totalDigits + fractionDigits attributes instead, as per the proposals in the discussion.
Gregor, an infinitively precise number (the number of apostles, e.g.) would be handled trivially by +- 0.
I understand that; my problem is the default behaviour. By default (unless a Wikidata editor reads this thread first :-P) you make it "11 to 13".
Regarding the height of the Eiffel tower: 324 m +- 1m is exactly what I would like to see here if the source states 324 meter. I know the source doesn't say +-1m, but this is certainly supported by the source. Think about why we need this +-1m: it is simply so we can give a useful transformation into feet. Otherwise we cannot convert units. The +-1m would not be displayed usually.
If Wikidata converts all data entered into something that is supported by the data but much less informative, the value will be drastically diminished. Your argument would hold for "I know the source doesn't say ± 100 m, but this is certainly supported by the source" as well.
The margin of error is as valuable information as the value itself. I see no reason or justification to fabricate it. Yes, you can add an additional property "fabricated margin of error" (a somewhat derogatory, but I think fair, rendering of "autouncertainty"). Why does Wikidata want to introduce this level of complexity?
It seems you intend to use it for dealing with conversions and returning converted values with meaningful significant digits, is that correct? According to my analysis, the error introduced in applying the digits after conversion is at most 0.5 in the original unit. This is less than the error introduced by the unsupported assumption made when the software invents somewhat arbitrary margins of error (like the ± 1 m for the Eiffel Tower), and identical to the error introduced by the more conventional bracketing of +/- 0.5:
1.234 mm -> 0.001234 m (significant digits are lossless)
1.234 m -> 4.04856 ft -> round to significant digits -> 4.049 ft
1.234 ft -> 0.376123 m -> round to significant digits -> 0.3761 m
Gregor
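A sketch of the rounding Gregor describes, assuming significant digits are counted in base 10:

from decimal import Decimal

def round_sig(x, sig):
    d = Decimal(repr(x))
    # Decimal.adjusted() is the exponent of the leading digit.
    return float(round(d, sig - d.adjusted() - 1))

M_TO_FT = 3.28084
print(round_sig(1.234 * M_TO_FT, 4))   # 4.049  (matches the example above)
print(round_sig(1.234 / M_TO_FT, 4))   # 0.3761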
Hi,
On Fri, Dec 21, 2012 at 6:14 PM, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
Friedrich, the term "query answering" simply means the ability to answer queries against the database in Phase 3, e.g. the list of cities located in Ghana with a population over 25,000 ordered by population.
A query system that deals well with intervals -- I would need a pointer for that. For now I was assuming we would use a single value internally to answer such queries. If the value is 90±20, then the query ">100?" would not contain that result. Sucks, but I don't know of any better system.
So taking the current data model and our Eiffel Tower example: we have the entity "Eiffel Tower" and want to represent "The Eiffel Tower is 324 meters high", so we would create the statement below. (Assumptions: Wikipedia pages as IDs; I understood the Wikidata object notation correctly; Number(upperlimit lowerlimit unit) is the object notation for numbers. I omitted the quantity field because I think it is redundant, at least in this example, and the confidence could be added as an additional PropertySnak(?).)

Statement(
  http://en.wikipedia.org/wiki/Eiffel_Tower
  PropertyValueSnak(
    http://en.wikipedia.org/wiki/Height
    Number(324 324 http://en.wikipedia.org/wiki/Metre)
  )
  {reference and rank omitted}
)
(Note: I thought I read somewhere that it was decided all statements on Wikidata should have at least one reference, but in the object notation definition the {references} seems to imply this is an optional argument. Also, the visual has a 0..* relation; did I miss something?)
Assuming a system of tables by unit and multiple (i.e. meter is m0) is used, the table could look like this:
Table m0:
propertyid | property | max | min | other information
1234       | Height   | 324 | 324 | ... (or 323 | 325)
An index could be put over min, max, and property, and the query for all buildings higher than 300 meters could start with:
SELECT propertyid FROM m0 WHERE property = 'Height' AND max > 300;
This would allow a query for things with a "Height" greater than 300, and it would even include things defined as 290+-20, since the max value would be over 300. The much harder thing imho is the "located in Ghana" part of the query, but I think there are such things as spatial queries for the big databases (e.g. http://dev.mysql.com/doc/refman/5.0/en/spatial-extensions.html ).
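A small self-contained illustration of this min/max table idea, using Python's built-in sqlite3 (the table layout and names are assumptions for the sketch, not the actual Wikidata schema):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE m0 (propertyid INTEGER, property TEXT, min REAL, max REAL)")
db.execute("CREATE INDEX m0_range ON m0 (property, min, max)")
db.executemany("INSERT INTO m0 VALUES (?, ?, ?, ?)",
               [(1234, "Height", 324, 324),    # stated as exactly 324 m
                (5678, "Height", 270, 310)])   # stated as 290 +- 20 m
# "higher than 300 m": anything whose range reaches above 300 matches
rows = db.execute(
    "SELECT propertyid FROM m0 WHERE property = 'Height' AND max > 300 "
    "ORDER BY propertyid").fetchall()
print(rows)  # [(1234,), (5678,)] -- the 290 +- 20 entry matches via its max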
This would also work for queries on dates, with tables that have mindate and maxdate. (Apropos dates: here is a short but interesting discussion on how they might be saved in databases if arbitrary dates are needed: http://stackoverflow.com/a/2487792 )
Representation of "temporal knowledge" seems to be a huge research topic anyway (e.g. http://www.math.unipd.it/~kvenable/RT/corso2009/Allen.pdf or http://www.cs.ox.ac.uk/boris.motik/pubs/nm03fuzzy.pdf, ...), where the problems of time intervals vs. time points, uncertainty and "vagueness", and their representation are discussed.
hope this helps,
Friedrich
Hi,
"Denny Vrandečić" denny.vrandecic@wikimedia.de schrieb:
- Wikidata has to balance ease of use and expressiveness of statements.
The user interface should not get complicated to merely cover a few exceptional edge cases.
So your test UI -- was it a proposal, or just something to try typing into? As a user I'd like to enter values as a whole, with unit and tolerance, absolute as well as relative, and in range notation -- everything with just one <input type="text" />. That also makes it easier to copy and paste values from sources.
After the discussion, I decided to drop altitude / elevation from the Geolocation.
I cannot understand this decision, but yes, it can be expressed outside.
I sense consensus that we should allow declaration of units in the wiki, and not have them hardcoded in the software. Having discussed the various options and in light of the discussion here, the current suggestion would be to create a page for every quantity unit, including the appropriate factors (for linear conversions). This is similar to the way Freebase does it, as sent around by Tom, and what John McClure suggested.
Why not use JavaScript, Lua or something else? Then one could also provide non-linear conversions, like decibels for example.
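To make the point concrete, a tiny sketch of the kind of non-linear conversion meant here, using the decibel example (Python purely for illustration; it is not a proposal for which scripting language the wiki should use):

from math import log10

def power_ratio_to_db(ratio):
    # decibels are ten times the base-10 logarithm of a power ratio
    return 10 * log10(ratio)

def db_to_power_ratio(db_value):
    return 10 ** (db_value / 10)

print(power_ratio_to_db(1000))  # 30.0
print(db_to_power_ratio(3))     # ~1.995: 3 dB is roughly a doubling

No single multiplication factor can express this mapping, which is why a page of linear factors cannot cover it.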
Nikola, we will not have special handling for money for now. This would require a whole different spec, I am afraid. Currencies appear some 200,000 times in Wikipedia -- that is often, but not often enough to be high priority.
Here the only thing that could help would be a bot updating the exchange rates. BTW, this is also a kind of statistical data, so one could query it on WM as well. But sometimes conversions are wanted only for a particular moment, so such values would also require a timestamp.
Cheers
Marco
Thanks, the prototype helps make some of this more concrete.
I am increasingly wondering if "uncertainty" will be overloaded here. People seem to want to use it for various types of measurement uncertainty (e.g. the standard error), ranges with no defined central value, and distributional summaries (e.g. max and min), as well as for the precision with which a value is entered (as in the "auto-certainty" value in the prototype). These are all quite different beasts, and conflating them will probably lead to problems - particularly for precision versus the rest. Which do we choose, if both apply? How will we know which is meant? Maybe marking "auto-certainty" values somehow would mitigate the latter problem, at least.
Avenue
On Thu, Dec 20, 2012 at 4:10 PM, Denny Vrandečić < denny.vrandecic@wikimedia.de> wrote:
I am still trying to catch up with the whole discussion and to distill the results, both here and on the wiki.
In the meanwhile, I have tried to create a prototype of how a complex model can still be entered in a simple fashion. A simple demo can be found here:
The prototype is not i18n.
The user has to enter only the value, in a hopefully intuitive way (try it out), and the full interpretation is displayed here (that, alas, is not intuitive, admittedly).
Cheers, Denny
2012/12/20 jmcclure@hypergrove.com
(Proposal 3, modified)

- value (xsd:double or xsd:decimal)
- unit (a wikidata item)
- totalDigits (xsd:smallint)
- fractionDigits (xsd:smallint)
- originalUnit (a wikidata item)
- originalUnitPrefix (a wikidata item)
JMc: I rearranged the list a bit and suggested simpler naming
JMc: Is not originalUnitPrefix directly derived from originalUnit?
JMc: It may be more efficient to store, not reconstruct, the original value. It may even be better to store the original value somewhere else entirely, earlier in the process, e.g. within the context that you indicate would be worthwhile to capture, because I wouldn't expect a lot of retrievals; but you certainly anticipate usage patterns better than I do.
How about just:
Datatype: .number (Proposal 4)
:value (xsd:double or xsd:decimal)
:unit (a wikidata item)
:totalDigits (xsd:smallint)
:fractionDigits (xsd:smallint)
:original (a wikidata item that is a number object)
On 20.12.2012 03:08, Gregor Hagedorn wrote:
On 20 December 2012 02:20, jmcclure@hypergrove.com wrote:
For me the question is how to name the precision information. Do not the XSD facets "totalDigits" and "fractionDigits" work well enough? I mean
Yes, that would be one way of modeling it. And I agree with you that, although the xsd attributes were originally devised for datatypes, there is nothing wrong with re-using them for quantities and measurements.
So one way of expressing a measurement with significant digits is: (Proposal 1)
- normalizedValue
- totalDigits
- fractionDigits
- originalUnit
- normalizedUnit
To recover the original information (e.g. that the original value was in feet with a given number of significant digits), the software must convert normalizedUnit to originalUnit, scale to totalDigits with fractionDigits, calculate the remaining powers of ten, and use information -- which must be stored together with each unit -- about whether the result should be expressed using an SI unit prefix (Exa, Tera, Giga, Mega, kilo, hekto, deka, centi, etc.). Some units use them, others do not, and some units use only some: hektoliter is common, hektometer would be very odd. This is slightly complicated by the fact that for some units, prefix usage in lay topics differs from scientific use.
If all numbers were expressed ONLY as total digits with fraction digits and unit prefix, i.e. with no power-of-ten exponential, the above would be sufficiently complete. However, without additional information it does not allow recovering the entry:
100,230 * 10^3 tons (value 1.0023e8, 6 total, 3 fractional digits, original unit tons, normalized unit gram)
I had therefore made (on the wiki) the proposal to express it as:
(Proposal 2)
- normalizedValue
- significantDigits (I am happy with totalDigits instead)
- originalUnit
- originalUnitPrefix
- normalizedUnit
However, I see now that the analysis was wrong: it indeed needs fractionDigits in addition to totalDigits, else a similar problem may occur, i.e. the distribution of the total order of magnitude of the number between non-fractional digits, fractional digits, powers of ten, and powers of ten expressed through SI prefixes is still not unambiguous.
So the minimal representation seems to be:
(Proposal 3)
- normalizedValue (xsd:double or xsd:decimal)
- totalDigits (xsd:smallint)
- fractionDigits (xsd:smallint)
- originalUnit (a wikidata item)
- originalUnitPrefix (a wikidata item)
- normalizedUnit (a wikidata item)
Adding the originalUnitPrefix has the advantage that it gathers knowledge from users and data creators or resources about which unit prefix is appropriate in a given context.
I view the current Wikidata plan to solve this problem with heuristics very critically; I do not yet see the data set that sufficiently tests the heuristics. Gathering information from the data entered and building formatting heuristics modules over the coming years (instead of weeks) will be valuable for reformatting. Proposal 3 allows gathering this information.
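As a small illustration, here is how Proposal 3's fields could round-trip an entry (a sketch only; the unit items and the conversion factor table are stand-ins, and totalDigits handling is elided):

# originalUnit -> factor to the normalized unit (metre), illustrative only
FACTOR_TO_NORMALIZED = {"foot": 0.3048}

def render_original(normalized_value, total_digits, fraction_digits, original_unit):
    value = normalized_value / FACTOR_TO_NORMALIZED[original_unit]
    # fractionDigits fixes the printed decimal places; totalDigits would
    # additionally control leading digits and any remaining power of ten
    return f"{value:.{fraction_digits}f} {original_unit}"

# stored: normalizedValue = 0.3761232 (metre), totalDigits = 4,
# fractionDigits = 3, originalUnit = foot
print(render_original(0.3761232, 4, 3, "foot"))  # "1.234 foot"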
Gregor
Note 1: The question of other means to express accuracy or precision, e.g. by error margins, statistical measures of spread such as variance, confidence intervals, percentiles, min/max etc. is not yet covered.
Given the present discussion, this should probably be separately agreed upon.
Note 2: Wikipedia infoboxes may desire to override it; this is for data entry, review, curation, and a default display where no other is defined.
Not quite on topic, but on the subject of uncertainty around dates: I've worked with a couple of data sets where birth and death dates were unknown but activity periods [1] were known. These have either had a separate flag called is_flourished (or similar) used to modify born/died, or kept flourished dates separate from birth/death dates.
Some flourished dates here: http://en.wikipedia.org/wiki/List_of_British_architects http://en.wikipedia.org/wiki/Template:Clan_Maclean_Chiefs
[1] http://en.wikipedia.org/wiki/Floruit
On Thu, Dec 20, 2012 at 3:55 PM, Michael Smethurst michael.smethurst@bbc.co.uk wrote:
Not quite on topic, but on the subject of uncertainty around dates: I've worked with a couple of data sets where birth and death dates were unknown but activity periods [1] were known. These have either had a separate flag called is_flourished (or similar) used to modify born/died, or kept flourished dates separate from birth/death dates.
Flourished dates are a useful stake in the ground if the birth & death dates aren't known, and are sometimes used in addition to birth/death dates as well, but I think they should be represented by a separate property. Eventually one would be able to derive them automatically from the dates of all known works.
Similarly baptism dates are often used as a proxy for birth dates if the birth date is unknown, but again I think they should be stored in a separate property so that the display is "bapt. 2012-12-21" not "born ca. 2012-12-21," which may be true, but isn't necessarily.
If the date encoding supports open ended ranges, one could map fl. 1340-1360 to born before 1330, died after 1360, to support querying.
Tom
On 20.12.2012 20:59, Avenue wrote:
Thanks, the prototype helps make some of this more concrete.
I am increasingly wondering if "uncertainty" will be overloaded here. People seem to want to use it for various types of measurement uncertainty (e.g. the standard error), ranges with no defined central value, and distributional summaries (e.g. max and min), as well as for the precision with which a value is entered (as in the "auto-certainty" value in the prototype). These are all quite different beasts, and conflating them will probably lead to problems -
The idea is to allow for the detailed information as qualifiers, while using the "conflated" uncertainty for query answering. Ideally, for queries, we need a single value. The current solution would be to use a single value plus a range of uncertainty for answering queries, so "Primates taller than 1 m" would include a species said to be 90+/-20 cm, or something like that.
-- daniel
On Fri, Dec 21, 2012 at 2:47 PM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
The idea is to allow for the detailed information as qualifiers, while using the "conflated" uncertainty for query answering. Ideally, for queries, we need a single value. The current solution would be to use a single value plus a range of uncertainty for answering queries, so "Primates taller than 1 m" would include a species said to be 90+/-20 cm, or something like that.
Well, the problem with that approach is that the query results you get when the various "uncertainty" values are conflated will make little sense. For example, suppose monkey species A is recorded as having an average height of 90 cm, converted to 90 +/- 0.5 cm by the "autocertainty" rule; species B is recorded as having an average height of 90 +/- 6 cm, based on a small sample; and species C is recorded as having heights in the range 90 +/- 20 cm, with a footnote in Wikipedia saying that males' heights are in the range 100 +/- 10 cm and females 80 +/- 10 cm (based on a larger sample). Queries for "Primates taller than 1 m", "Primates taller than 95 cm", and "Primates taller than 90 cm" would return C, B+C, and A+B+C respectively. Do you think this is sensible? What if I tell you that these three species were actually all just the same species, only with its height distribution summarised in different ways?
If queries are going to incorporate uncertainty values, I think they need to do it more carefully than this.
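A small demonstration of that effect, treating all three summaries with the same naive "max above threshold" rule (a sketch using the numbers from the species example above; the matching rule is an assumption):

# three summaries of the *same* species under a conflated uncertainty model
species = {
    "A": (89.5, 90.5),   # 90 cm, auto-widened to +- 0.5
    "B": (84.0, 96.0),   # 90 +- 6 cm (small-sample standard error)
    "C": (70.0, 110.0),  # 90 +- 20 cm (observed range)
}

def taller_than(threshold_cm):
    return sorted(name for name, (lo, hi) in species.items() if hi > threshold_cm)

print(taller_than(100))  # ['C']
print(taller_than(95))   # ['B', 'C']
print(taller_than(90))   # ['A', 'B', 'C']

Three different answers for what is, underneath, one distribution.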
Avenue
(Proposal 3, modified)
- value (xsd:double or xsd:decimal)
- unit (a wikidata item)
- totalDigits (xsd:smallint)
- fractionDigits (xsd:smallint)
- originalUnit (a wikidata item)
- originalUnitPrefix (a wikidata item)
JMc: I rearranged the list a bit and suggested simpler naming
We can happily drop "Normalized". I like it better with it, but that is simply a matter of taste.
JMc: Is not originalUnitPrefix directly derived from originalUnit?
No: for kilogram, the originalUnit would be gram and the originalUnitPrefix "k" (kilo).
But yes, if we rather want to give each combination of SI prefix and unit a wikidata page, we could do that.
JMc: It may be more efficient to store, not reconstruct, the original value. It may even be better to store the original value somewhere else entirely, earlier in the process, e.g. within the context that you indicate would be worthwhile to capture, because I wouldn't expect a lot of retrievals; but you certainly anticipate usage patterns better than I do.
Happily. I just try to minimize the number of attributes.
How about just:

Datatype: .number (Proposal 4)

:value (xsd:double or xsd:decimal)
:unit (a wikidata item)
:totalDigits (xsd:smallint)
:fractionDigits (xsd:smallint)
:original (a wikidata item that is a number object)
Slight correction:
Datatype: measurement (Proposal 5)
:value (xsd:double or xsd:decimal)
:unit (a wikidata item)
:totalDigits (xsd:smallint)
:fractionDigits (xsd:smallint)
Datatype: normalizedMeasurement (inherits measurement)
:original (a wikidata item that is a number object)
Possibly, yes. As said above, this implies that all combinations of units and SI prefixes get their own Wikidata item. This is possible to create and could even reflect some of the knowledge about applicable prefixes. The only drawback is that for normalizedMeasurement, fractionDigits is redundant: it will always be totalDigits minus 1, since the value is normalized. But that is a small price for a nicer model.
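A sketch of Proposal 5 as plain data structures (Python dataclasses; the field names follow the proposal, while the types and the nesting of :original are simplifying assumptions):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Measurement:
    value: float          # xsd:double or xsd:decimal
    unit: str             # a wikidata item, here just an identifier string
    total_digits: int     # xsd:smallint
    fraction_digits: int  # xsd:smallint

@dataclass
class NormalizedMeasurement(Measurement):
    # ":original" -- the measurement as entered, simplified here to a
    # nested Measurement rather than a wikidata item reference
    original: Optional[Measurement] = None

entered = Measurement(1.234, "foot", 4, 3)
stored = NormalizedMeasurement(0.3761232, "metre", 4, 3, original=entered)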
Gregor
My philosophy is this: We should do whatever works best for Wikidata and Wikidata's needs. If people want to reuse our content, and the choices we've made make existing tools unworkable, they can build new tools themselves. We should not be clinging to "what's been done already" if it gets in the way of "what will make Wikidata better". Everything that we make and do is open, including the software we're going to operate the database on. Every WMF project has done things differently from the standards of the time, and people have developed tools to use our content before. Wikidata will be no different in that regard.
Sven
On Wed, Dec 19, 2012 at 12:27 PM, Martynas Jusevičius martynas@graphity.org wrote:
Denny,
you're sidestepping the main issue here -- every sensible architecture should build on as many existing standards as possible, and build its own custom solution only if a *very* compelling reason is found to do so, instead of finding a compromise between the requirements and the standard. Wikidata seems to be constantly doing the opposite -- building custom solutions with whatever reason, or even without one. This drives compatibility and reuse towards zero.
This thread originally discussed datatypes for values such as numbers, dates and their intervals -- semantics for all of those are defined in XML Schema Datatypes: http://www.w3.org/TR/xmlschema-2/ All the XML and RDF tools are compatible with XSD, however I don't think there is even a single mention of it in this thread? What makes Wikidata so special that its datatypes cannot build on XSD? And this is only one of the issues, I've pointed out others earlier.
Martynas graphity.org
On Wed, Dec 19, 2012 at 5:58 PM, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
Martynas,
could you please let me know where RDF or any of the W3C standards covers topics like units, uncertainty, and their conversion. I would be very much interested in that.
Cheers, Denny
2012/12/19 Martynas Jusevičius martynas@graphity.org
Hey wikidatians,
occasionally checking threads in this list like the current one, I get a mixed feeling: on one hand, it is sad to see the efforts and resources wasted as Wikidata tries to reinvent RDF, and now also triplestore design as well as XSD datatypes. What's next, WikiQL instead of SPARQL?
On the other hand, it feels reassuring as I was right to predict this:
http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00056.html
http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00750.html
Best,
Martynas graphity.org
On Wed, Dec 19, 2012 at 4:11 PM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
On 19.12.2012 14:34, Friedrich Röhrs wrote:
Hi,
Sorry for my ignorance, if this is common knowledge: what is the use case for sorting millions of different measures from different objects?

Finding all cities with more than 100000 inhabitants requires the database to look through all values for the property "population" (or even all properties with countable values, depending on implementation and query planning), compare each value with "100000" and return those with a greater value. To speed this up, an index sorted by this value would be needed.

For cars there could be entries by the manufacturer, by some car-testing magazine, etc. I don't see how this could be adequately represented/sorted by a database-only query.

If this cannot be done adequately on the database level, then it cannot be done efficiently, which means we will not allow it. So our task is to come up with an architecture that does allow this.

(One way to allow "scripted" queries like this to run efficiently is to do this in a massively parallel way, using a map/reduce framework. But that's also not trivial, and would require a whole new server infrastructure.)

If however this is necessary, I still don't understand why it must affect the datavalue structure. If an index is necessary, it could be done over a serialized representation of the value.

"Serialized" can mean a lot of things, but an index on some data blob is only useful for exact matches; it cannot be used for greater/lesser queries. We need to map our values to scalar data types the database can understand directly, and use for indexing.

This needs to be done anyway, since the values are saved at a specific unit (which is just a wikidata item). To compare them on a database level they must all be saved at the same unit, or some sort of procedure must be used to compare them (or am I missing something again?).

If they measure the same dimension, they should be saved using the same unit (probably the SI base unit for that dimension). Saving values using different units would make it impossible to run efficient queries against these values, thereby defying one of the major reasons for Wikidata's existence. I don't see a way around this.
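A minimal sketch of that normalization step (the factor table is illustrative, not any real unit registry):

TO_SI_BASE = {  # unit -> (factor to the SI base unit, base unit)
    "metre": (1.0, "metre"),
    "foot": (0.3048, "metre"),
    "kilometre": (1000.0, "metre"),
}

def normalize(value, unit):
    # convert at save time, so one indexed column answers range queries
    factor, base = TO_SI_BASE[unit]
    return value * factor, base

print(normalize(324, "foot"))       # (98.7552, 'metre')
print(normalize(1.5, "kilometre"))  # (1500.0, 'metre')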
-- daniel
I suspect what Martynas is driving at is that XMLS defines **FACETS** for its datatypes -- accepting those as a baseline, and then extending them to your requirements, is a reasonable, community-oriented process. However, wrapping oneself in the flag of "open development" is to me unresponsive to a simple plea to stand on the shoulders of the giants who went before, and to act in a responsible manner, cognizant of the interests of the broader community.
And personally I have to say I don't like the word "clinging" -- clearly a red flag meant to inflame if not insult. This is no place for that!
Wow, what a long thread. I was just about to chime in to agree with Sven's point about units when he interjected his comment about blithely ignoring history, so I feel compelled to comment on that first. It's fine to ignore standards *for good reasons*, but doing it out of ignorance or gratuitously is just silly. Thinking that WMF is so special it can create a better solution without even knowing what others have done before is the height of arrogance.
Modeling time and units can basically be made arbitrarily complex, so the trick is in achieving the right balance of complexity vs utility. Time is complex enough that I think it deserves its own thread. The first thing I'd do is establish some definitions to cover basics like durations/intervals, uncertain dates, unknown dates, imprecise dates, etc., so that everyone is using the same terminology and concepts. Much of the time discussion is difficult for me to follow because I have to guess at what people mean. In addition to the ability to handle circa/about dates already mentioned, it's also useful to be able to represent before/after dates, e.g. he died before 1 Dec 1792, when his will was probated. Long term I suspect you'll need support for additional calendars rather than converting everything to a common calendar, but only supporting Gregorian is a good way to limit complexity to start with. Geologic times may (probably?) need to be modeled differently.
Although I disagree strongly with Sven's sentiments about the appropriateness of reinventing things, I believe he's right about the need to support more units than just SI units and to know what units were used in the original measurement. It's not just a matter of aesthetics but of being able to preserve the provenance. Perhaps this gets saved for a future iteration, but you may find that you need both display and computable versions of things stored separately.
Speaking of computable versions, don't underestimate the issues with using floating-point numbers. There are numbers they just can't represent, and their range is not infinite.
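Two of those pitfalls in plain Python, for anyone who has not run into them (standard behaviour of IEEE 754 doubles, not anything Wikidata-specific):

print(0.1 + 0.2 == 0.3)        # False: 0.1 and 0.2 have no exact binary form
print(10**16 + 1.0 == 10**16)  # True: above 2**53 a double skips integers

from decimal import Decimal
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True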
Historians and genealogists have many interminable discussions about date/time representation which can be found in various list archives, but one recent spec worth reviewing is Extended Date/Time Format (EDTF) http://www.loc.gov/standards/datetime/pre-submission.html
Another thing worth looking at is the Freebase schema since it not only represents a bunch of this stuff already, but it's got real world data stored in the schema and user interface implementations for input and rendering (although many of the latter could be improved). In particular, some of the following might be of interest:
http://www.freebase.com/view/measurement_unit / http://www.freebase.com/schema/measurement_unit http://www.freebase.com/schema/time http://www.freebase.com/schema/astronomy/celestial_object_age http://www.freebase.com/schema/time/geologic_time_period http://www.freebase.com/schema/time/geologic_time_period_uncertainty
If you rummage around, you can probably find lots of interesting examples and decide for yourself whether or not that's a good way to model things. I'm reasonably familiar with the schema and happy to answer questions.
There are probably lots of other example vocabularies one could review, such as the Pleiades project's: http://pleiades.stoa.org/vocabularies
You're not going to get it right the first time, so I would just start with a small core that you're reasonably confident in and iterate from there.
Tom
Here's a suggestion. Property names for numeric information seem to be on the table -- these should be viewed systematically, not haphazardly.
If all text properties had a "dotted" lower-case name, life would be simpler in SMW land all around, and maybe in Wikidata land too. All page names have an initial capital as a consequence of requiring all text properties to be named with an initial period followed by a lower-case letter. The SMW tool mandates the properties from which all others derive: .text, .string and .number are basic (along with others like .page). Then, strings have language-based subproperties and number-expression subproperties, and numbers have XSD datatype subproperties, which in turn have SI unit type subproperties, and so on.
Here's a "Consolidated Listing of ISO 639, ISO 4217, SI Measurement Symbols, and World Time Zones" [1] to illustrate that it is possible to create a unified string- and numeric-type property name dictionary across a wide swath of the standards world. The document lists a few overlapping symbols that were then re-assigned to other symbols.
Adopting a "dotted name" text-property naming convention can also lead to simpler user interfaces, at least for query forms, plus optimizations an SMW query engine could exploit. What is meant by these expressions seems pretty natural to most people:
Property: Height - the value is a wiki pagename or objectname for a "height" numeric object
Property: .text - (on Height) the value is text markup associated with the Height object
Property: .string - (on Height) the value is text non-markup data for the Height object
Property: .ft - (on Height) the value is number of feet associated with the Height object
Property: Height.text - the value is text markup associated with an anonymous Height object
Property: Height.string - the value is a "string" property of an anonymous Height object
Property: Height.ft - the value is a "feet" property of an anonymous Height object
[1] http://www.hypergrove.com/Publications/Symbols.html
Using the dotted notation, XSD datatype facets such as those below can be specified easily as properties, using a simple colon:
Property: .anyType:equal - (sameAs equivalent) redirect to page/object with actual numeric value
Property: .anyType:ordered - a boolean property
Property: .anyType:bounded - a boolean property
Property: .anyType:cardinality - a boolean property
Property: .anyType:numeric - a boolean property
Property: .anyType:length - number of chars allowed for value
Property: .anyType:minLength - min nbr of chars for value
Property: .anyType:maxLength - max nbr of chars for value
Property: .anyType:pattern - regex string
Property: .anyType:enumeration - specified values comprising value space
Property: .anyType:whiteSpace - preserve or replace or collapse
Property: .anyType:maxExclusive - number for an upper bound
Property: .anyType:maxInclusive - number for an upper bound
Property: .anyType:minExclusive - number for a lower bound
Property: .anyType:minInclusive - number for a lower bound
Property: .anyType:totalDigits - number of total digits
Property: .anyType:fractionDigits - number of digits in the fractional part of a number
An anonymous object is used to represent namespace-qualified (text & url) values, e.g. rdf:about:
Property: .:rdf:about - this is a .url value for an RDF "about" property for a page/object
Property: .:skos:prefLabel - this is a .name value for a page/object
I suggest that properties for "precision" can be found in XSD facets above. - john
I think that Tom Morris tragically misunderstood my point, although that was likely helped by the fact that, as I've insinuated already, the many standards and acronyms being thrown about are largely lost on me.
My point is not "We can just throw everything out because we're big and awesome and have name brand power". My point was "We're going to reach a point where some of the existing standards and tools just don't work, because when they were built, things like Wikidata weren't envisioned. We need to have the mindset that developing new pieces that work for us is better than trying to force a square peg into a round hole just because something is already widely used. If what exists doesn't work, we're going to do more harm than good if we have to start cutting corners or cutting features to try and get it to work. We have an infrastructure that would allow third parties to come along later and build tools that bridge whatever we create and whatever exists already".
Sven
On Wed, Dec 19, 2012 at 2:40 PM, Tom Morris tfmorris@gmail.com wrote:
Wow, what a long thread. I was just about to chime in to agree with Sven's point about units when he interjected his comment about blithely ignoring history, so I feel compelled to comment on that first. It's fine to ignore standards *for good reasons*, but doing it out of ignorance or gratuitously is just silly. Thinking that WMF is so special it can create a better solution without even know what others have done before is the height of arrogance.
Modeling time and units can basically be made arbitrary complex, so the trick is in achieving the right balance of complexity vs utility. Time is complex enough that I think it deserves it's own thread. The first thing I'd do is establish some definitions to cover some basics like durations/intervals, uncertain dates, unknown dates, imprecise dates, etc to that everyone is using the same terminology and concepts. Much of the time discussion is difficult for me to follow because I have to guess at what people mean. In addition to the ability to handle circa/about dates already mentioned, it's also useful to be able to represent before/after dates e.g. he died before 1 Dec 1792 when his will was probated. Long term I suspect you'll need support for additional calendars rather than converting everything to a common calendar, but only supporting Gregorian is a good way to limit complexity to start with. Geologic times may (probably?) need to be modeled differently.
Although I disagree strongly with Sven's sentiments about the appropriateness of reinventing things, I believe he's right about the need to support more units than just SI units and to know what units were used in the original measurement. It's not just a matter of aesthetics but of being able to preserve the provenance. Perhaps this gets saved for a future iteration, but you may find that you need both display and computable versions of things stored separately.
Speaking of computable versions don't underestimate the issues with using floating points numbers. There are numbers that they just can't represent and their range is not infinite.
Historians and genealogists have many interminable discussions about date/time representation which can be found in various list archives, but one recent spec worth reviewing is Extended Date/Time Format (EDTF) http://www.loc.gov/standards/datetime/pre-submission.html
Another thing worth looking at is the Freebase schema since it not only represents a bunch of this stuff already, but it's got real world data stored in the schema and user interface implementations for input and rendering (although many of the latter could be improved). In particular, some of the following might be of interest:
http://www.freebase.com/view/measurement_unit / http://www.freebase.com/schema/measurement_unit http://www.freebase.com/schema/time http://www.freebase.com/schema/astronomy/celestial_object_age http://www.freebase.com/schema/time/geologic_time_period http://www.freebase.com/schema/time/geologic_time_period_uncertainty
If you rummage around, you can probably find lots of interesting examples and decide for yourself whether or not that's a good way to model things. I'm reasonably familiar with the schema and happy to answer questions.
There are probably lots of other example vocabularlies that one could review such as the Pleiades project's: http://pleiades.stoa.org/vocabularies
You're not going to get it right the first time, so I would just start with a small core that you're reasonably confident in and iterate from there.
Tom
On Wed, Dec 19, 2012 at 12:47 PM, Sven Manguard svenmanguard@gmail.comwrote:
My philosophy is this: We should do whatever works best for Wikidata and Wikidata's needs. If people want to reuse our content, and the choices we've made make existing tools unworkable, they can build new tools themselves. We should not be clinging to "what's been done already" if it gets in the way of "what will make Wikidata better". Everything that we make and do is open, including the software we're going to operate the database on. Every WMF project has done things differently from the standards of the time, and people have developed tools to use our content before. Wikidata will be no different in that regard.
Sven
On Wed, Dec 19, 2012 at 12:27 PM, Martynas Jusevičius <martynas@graphity.org> wrote:
Denny,
you're sidestepping the main issue here -- every sensible architecture should build on as many existing standards as possible, and build its own custom solution only if a *very* compelling reason is found to do so, instead of finding a compromise between the requirements and the standard. Wikidata seems to be constantly doing the opposite -- building a custom solution with whatever reason, or even without one. This drives compatibility and reuse towards zero.
This thread originally discussed datatypes for values such as numbers, dates and their intervals -- semantics for all of those are defined in XML Schema Datatypes: http://www.w3.org/TR/xmlschema-2/ All the XML and RDF tools are compatible with XSD, yet I don't think there is even a single mention of it in this thread? What makes Wikidata so special that its datatypes cannot build on XSD? And this is only one of the issues; I've pointed out others earlier.
Martynas graphity.org
On Wed, Dec 19, 2012 at 5:58 PM, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
Martynas,
could you please let me know where RDF or any of the W3C standards covers topics like units, uncertainty, and their conversion? I would be very much interested in that.
Cheers, Denny
2012/12/19 Martynas Jusevičius martynas@graphity.org
Hey wikidatians,
occasionally checking threads on this list like the current one, I get a mixed feeling: on one hand, it is sad to see the efforts and resources wasted as Wikidata tries to reinvent RDF, and now also triplestore design as well as XSD datatypes. What's next, WikiQL instead of SPARQL?
On the other hand, it feels reassuring as I was right to predict this:
http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00056.html
http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00750.html
Best,
Martynas graphity.org
On Wed, Dec 19, 2012 at 4:11 PM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
On 19.12.2012 14:34, Friedrich Röhrs wrote:

Hi,

Sorry for my ignorance, if this is common knowledge: What is the use case for sorting millions of different measures from different objects?

Finding all cities with more than 100000 inhabitants requires the database to look through all values for the property "population" (or even all properties with countable values, depending on implementation and query planning), compare each value with "100000", and return those with a greater value. To speed this up, an index sorted by this value would be needed.
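To make that concrete, a toy version of the query in SQLite (the schema is invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE claim (item TEXT, property TEXT, value REAL)")
    con.executemany("INSERT INTO claim VALUES (?, ?, ?)",
                    [("Berlin", "population", 3500000),
                     ("Reykjavik", "population", 119000),
                     ("Vaduz", "population", 5400)])

    # Without this index, the range query below scans every row; with it,
    # the database can seek straight into the sorted values.
    con.execute("CREATE INDEX claim_prop_value ON claim (property, value)")

    big = con.execute("SELECT item FROM claim "
                      "WHERE property = 'population' AND value > 100000").fetchall()
    print(sorted(r[0] for r in big))   # ['Berlin', 'Reykjavik']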
For cars there could be entries by the manufacturer, by some car-testing magazine, etc. I don't see how this could be adequately represented/sorted by a database-only query.

If this cannot be done adequately on the database level, then it cannot be done efficiently, which means we will not allow it. So our task is to come up with an architecture that does allow this.

(One way to allow "scripted" queries like this to run efficiently is to do this in a massively parallel way, using a map/reduce framework. But that's also not trivial, and would require a whole new server infrastructure.)
If however this is necessary, I still don't understand why it must affect the datavalue structure. If an index is necessary, it could be done over a serialized representation of the value.

"Serialized" can mean a lot of things, but an index on some data blob is only useful for exact matches; it cannot be used for greater/lesser queries. We need to map our values to scalar data types the database can understand directly, and use for indexing.
This needs to be done anyway, since the values are saved at a specific unit (which is just a wikidata item). To compare them on a database level they must all be saved at the same unit, or some sort of procedure must be used to compare them (or am I missing something again?).

If they measure the same dimension, they should be saved using the same unit (probably the SI base unit for that dimension). Saving values using different units would make it impossible to run efficient queries against these values, thereby defeating one of the major reasons for Wikidata's existence. I don't see a way around this.
-- daniel
-- Daniel Kinzler, Softwarearchitekt Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
If one has time to read prior art, I'd suggest giving the Health Level 7 v3.0 Data Types Specification http://amisha.pragmaticdata.com/v3dt/report.html a look.
Of course HL7 has a lot of things to worry about which are off topic for us, starting with a prior, completely different version of the standard. And much emphasis goes to coded values (enums) and coding systems, but it is a nice review of issues found and solved by many eyeballs over many years.
Peter
On 2012-12-19 15:11, Daniel Kinzler wrote:
On 19.12.2012 14:34, Friedrich Röhrs wrote:
Hi,
Sorry for my ignorance, if this is common knowledge: What is the use case for sorting millions of different measures from different objects?
Finding all cities with more than 100000 inhabitants requires the database to look through all values for the property "population" (or even all properties with countable values, depending on implementation and query planning), compare each value with "100000", and return those with a greater value. To speed this up, an index sorted by this value would be needed.
Add to that the possibility of multiple simultaneous sorting operations.
For cars there could be entries by the manufacturer, by some car-testing magazine, etc. I don't see how this could be adequately represented/sorted by a database-only query.
If this cannot be done adequately on the database level, then it cannot be done efficiently, which means we will not allow it. So our task is to come up with an architecture that does allow this.
(One way to allow "scripted" queries like this to run efficiently is to do this in a massively parallel way, using a map/reduce framework. But that's also not trivial, and would require a whole new server infrastructure).
Software developers cannot just think of the status quo; they also have to think of the use cases the solution might eventually be put to.
There is e.g. the idea of pushing the monuments lists into Wikidata. In Austria alone there are 36,000-37,000 of those. Germany is much bigger, but has a similar history, with probably an equal number per square kilometer. Sorting these by distance to a specific place needs to be done by the database; everything else will be too inefficient.
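For illustration, a toy version of such a distance query in SQLite (table layout and coordinates invented; the planar distance formula is only a crude stand-in for proper spatial indexing):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE monument (name TEXT, lat REAL, lon REAL)")
    con.executemany("INSERT INTO monument VALUES (?, ?, ?)",
                    [("Stephansdom", 48.2086, 16.3731),
                     ("Goldenes Dachl", 47.2685, 11.3933),
                     ("Schloss Eggenberg", 47.0745, 15.3905)])

    # Order by squared planar distance to a reference point (Vienna).
    # Crude: it ignores that longitude degrees shrink with latitude; a real
    # deployment would use a spatial index (R-tree, geohash, ...).
    here = {"lat": 48.2082, "lon": 16.3738}
    rows = con.execute(
        "SELECT name FROM monument "
        "ORDER BY (lat - :lat) * (lat - :lat) + (lon - :lon) * (lon - :lon)",
        here).fetchall()
    print([r[0] for r in rows])   # nearest first: ['Stephansdom', ...]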
If however this is necessary, I still don't understand why it must affect the datavalue structure. If an index is necessary, it could be done over a serialized representation of the value.
"Serialized" can mean a lot of things, but an index on some data blob is only useful for exact matches; it cannot be used for greater/lesser queries. We need to map our values to scalar data types the database can understand directly, and use for indexing.
+1
This needs to be done anyway, since the values are saved at a specific unit (which is just a wikidata item). To compare them on a database level they must all be saved at the same unit, or some sort of procedure must be used to compare them (or am I missing something again?).
If they measure the same dimension, they should be saved using the same unit (probably the SI base unit for that dimension). Saving values using different units would make it impossible to run efficient queries against these values, thereby defeating one of the major reasons for Wikidata's existence. I don't see a way around this.
IMHO this should be part of a model. E.g. altitudes are usually measured in metres or feet, never in km or yards. Distances have the same SI base unit but are also measured in km, depending on the use case.
Maybe we should make a difference between internal usage and visualization. Comparing metres with kilometres and feet is quite difficult; rescaling everything at visualization time is not.
Cheers
Marco
On Wed, Dec 19, 2012 at 2:32 PM, Marco Fleckinger <marco.fleckinger@wikipedia.at> wrote:
IMHO this should be part of a model. E.g. altitudes are usually measured in metres or feet, never in km or yards. Distances have the same SI base unit but are also measured in km, depending on the use case.
No, altitudes are sometimes measured in km, at least once you get beyond the Earth's surface.
Orbit height 559 km (347 mi)
Peak 21 km (69,000 ft) above datum
No, altitudes are sometimes measured in km, at least once you get beyond the Earth's surface.
From http://en.wikipedia.org/wiki/Hubble_Space_Telescope: Orbit height 559 km (347 mi)
From http://en.wikipedia.org/wiki/Olympus_Mons: Peak 21 km (69,000 ft) above datum
Yes, but the first is an altitude, the second more precisely termed an elevation. Good to have examples for both...
Altitude may be above Mean Sea Level (MSL) or above local ground level (Above Ground Level, AGL), and even further definitions exist. Elevation is always ground level above Mean Sea Level. I believe Wikidata should only deal with the latter.
Alternatively, if sticking with altitude, a required qualifier would be needed to distinguish which kind of altitude is meant.
Otherwise the geolocation of a radio tower with altitude = 400 m could mean that the ground is 400 m above MSL or that the top is 400 m above ground level.
Gregor
On 19.12.2012 15:32, Marco Fleckinger wrote:
Maybe we should make a difference between internal usage and visualization. Comparing metres with kilometres and feet is quite difficult; rescaling everything at visualization time is not.
Not maybe. Definitely. Visualization is based on user preference, interface language, and heuristics for picking a decent unit based on dimension and accuracy. The internal representation should use the same unit for all quantities of a given dimension.
-- daniel
On 19 December 2012 15:11, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
If they measure the same dimension, they should be saved using the same unit (probably the SI base unit for that dimension). Saving values using different units would make it impossible to run efficient queries against these values, thereby defeating one of the major reasons for Wikidata's existence. I don't see a way around this.
Daniel confirms (in separate mail) that Wikidata indeed intends to convert any derived SI units to a common formula of base units.
Example: a quantity like "1013 hectopascal", the common unit for meteorological barometric pressure (this used to be millibar), would be stored and re-displayed as 1.013·10^5 kg·m⁻¹·s⁻².
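(Mechanically, that conversion is just a linear factor per unit; a sketch, with illustrative factors:)

    # Linear unit conversion to the SI base unit, as described above.
    # The factor table is illustrative; Wikidata would define one per unit.
    TO_PASCAL = {"hPa": 100.0, "mbar": 100.0, "atm": 101325.0, "Pa": 1.0}

    def to_base(value, unit):
        """Convert a pressure to Pa (= kg·m⁻¹·s⁻²)."""
        return value * TO_PASCAL[unit]

    print(to_base(1013, "hPa"))   # 101300.0, i.e. 1.013e5 kg·m⁻¹·s⁻²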
I see several problems with this approach:
1. Many base units are little known. "kg·m²·s⁻³·A⁻²" for the ohm... It breaks communication with the humans curating data on Wikidata. It will make it very difficult to check data entered into Wikidata for correctness, because the data displayed after saving will bear little relation to the data entered. This makes Wikidata inherently unsuitable for an effort like Wikipedia, with its many authors and reliance on fact checking.
2. Even for standard base units, there is often a 1:n relation, e.g. both gray and sievert have the same base unit. The base unit for lumen is candela (because the steradian is not a unit, but part of the derived unit's applicability definition).
Gregor
On 19.12.2012 16:47, Gregor Hagedorn wrote:
Daniel confirms (in separate mail) that Wikidata indeed intends to convert any derived SI units to a common formula of base units.
Example: a quantity like "1013 hectopascal", the common unit for meteorological barometric pressure (this used to be millibar), would be stored and re-displayed as 1.013·10^5 kg·m⁻¹·s⁻².
Converted and stored, yes, but not displayed. For display, it would be converted to a customary/convenient unit according to the user's (or client site's) locale, using a bit of heuristics to get the scale (order of magnitude) right.
Of course, in wikitext, the desired output unit can be specified.
-- daniel
On 2012-12-19 16:56, Daniel Kinzler wrote:
Converted and stored, yes, but not displayed. For display, it would be converted to a customary/convenient unit according to the user's (or client site's) locale, using a bit of heuristics to get the scale (order of magnitude) right.
Of course, in wikitext, the desired output unit can be specified.
Actually we have 3 different use cases of values:
1. Internally, in the database
2. On wikidata.org
3. On other projects like WP, and also WM-external projects

SI shall be used internally (1).

For 2., the user can decide what he wants.

For 3., either some standard setting of the MW project or the article's author determines what is desired.

Via the API (also 3.), you should be able to choose:
* precision
* display of accuracy
* unit
-- Marco
When we speak about dimensions, we are talking about properties, right?
So when I define the property "height of a person" as an entity, I would supply the SI unit (m) and the SI multiple (-2, i.e. cm) that it should be saved in (in the database).
When someone then inputs the height in metres (e.g. 1.86 m), it would be converted to the matching SI multiple before being saved (i.e. 186 (cm)).
On the database side, each SI multiple would get its own table so that indexes can easily be made. Depending on which multiple we choose in the property, the datavalue would be saved to a different table. Did I get the idea correctly?
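In code, my understanding would be roughly this (a sketch, all names invented):

    from dataclasses import dataclass

    @dataclass
    class PropertyDef:
        name: str
        si_unit: str        # base unit, e.g. "m"
        si_exponent: int    # power of ten to store in: -2 -> centi

    HEIGHT_OF_PERSON = PropertyDef("height of a person", "m", -2)

    def to_storage(value_in_base_unit, prop):
        """Scale a value in the base unit to the property's storage multiple."""
        return value_in_base_unit / (10 ** prop.si_exponent)

    print(to_storage(1.86, HEIGHT_OF_PERSON))   # 186.0 (centimetres)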
On Wed, Dec 19, 2012 at 4:47 PM, Gregor Hagedorn g.m.hagedorn@gmail.com wrote:
Daniel confirms (in separate mail) that Wikidata indeed intends to convert any derived SI units to a common formula of base units. [...]
On 19.12.2012 08:53, Gregor Hagedorn wrote:
Displaying the numbers is another question. There I have to agree that it always makes sense to also store a typically used unit for that type of data.
I agree. What I propose is that the user interface supports entering and proofreading "10.6 nm" as "10.6" plus "n" (= nano) plus "meter".
Yes, absolutely.
How the value is stored in the data property, whether as the floating-point number 10.6 or as 1.06e-8, is a second issue -- the latter is probably preferable.
I think neither is sufficient: we need a representation that allows for arbitrary (or at least very great) precision, and can still be indexed and compared natively by (different!) database systems. Fixed-length strings can easily do that, if they are long enough. That's pretty inefficient, though.
IEEE floats work natively, but don't guarantee enough precision (well, maybe 128-bit floats come close?). The SQL "decimal" type might be sufficient: in MySQL, it allows up to 65 significant digits, though at most 30 of them after the decimal point. But even that only barely covers measuring the extent of the observable universe in Planck lengths (on the order of 10^62).
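(For comparison, arbitrary-precision decimal arithmetic outside SQL, e.g. Python's decimal module, has no such fixed limit:)

    from decimal import Decimal, getcontext

    getcontext().prec = 80   # 80 significant digits, far beyond any IEEE float

    # A count on the order of 10^62 (universe diameter in Planck lengths)
    # is exactly representable here, where a 64-bit float can no longer
    # distinguish neighbouring integers.
    n = Decimal(10) ** 62 + 1
    print(n - Decimal(10) ** 62)    # 1
    print(float(1e62 + 1) == 1e62)  # True: the float collapses the two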
In addition to a storage option for the desired unit prefix (this may be considered an original prefix, since naturally re-users may wish to reformat it),
I see no point in storing the unit used for input.
it is probably necessary to store the number of significant decimals.
That's how Denny proposed to calculate the default accuracy. If the accuracy is given by a complex model (e.g. a gamma distribution), then it might be handy to have a simple value that tells us the significant digits.
Hm... perhaps it's best to always express accuracy as "+/-n", and allow for more detailed information (standard deviation, whatever) as *additional* information about the accuracy (could be modelled as a qualifier internally).
I believe this need not be any visible setting in the user interface; the number of digits can simply be preserved. Without it, it is impossible to store and reproduce information like "10.20 nm"; it would be returned as 1.02·10^-8 m.
No, it would return using whatever system of measurement the user has selected in their preferences.
A complex heuristic may "guess" when to use scientific SI prefixes instead. The trailing zero cannot be reproduced, however, when relying entirely on IEEE floating point.
We'll need heuristics to pick the correct secondary unit (e.g. nm or km). The general rule could be to pick a unit so that the actual value is between 1 and 10, with some additional rules for dealing with cultural specialities (the decimetre is rarely used, the hectolitre however is pretty common, the decagram is commonly used in Austria only, etc.).
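Such a heuristic could be as simple as a per-dimension prefix table; a sketch (the table is invented, and deliberately omits Mm, dm, etc.):

    # Prefixes commonly used for lengths; Mm/Gm are deliberately absent,
    # so the circumference of the earth comes out in km, not Mm.
    LENGTH_PREFIXES = [(3, "k"), (0, ""), (-3, "m"), (-6, "µ"), (-9, "n")]

    def humanize(value_si, unit="m"):
        """Pick the largest prefix that leaves a mantissa >= 1."""
        for exp, prefix in LENGTH_PREFIXES:
            scaled = value_si / 10 ** exp
            if abs(scaled) >= 1:
                return f"{scaled:g} {prefix}{unit}"
        return f"{value_si:g} {unit}"

    print(humanize(4.0e7))    # '40000 km'
    print(humanize(1.06e-8))  # '10.6 nm'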
Note that for rendering of values in infoboxes, the desired unit and precision can always be given explicitly.
Note "precision" vs "accuracy" here: the precision controls how many digits are shown, while the accuracy indicates how exact our knowledge is. The Precision can be derived from the accuracy and vice versa, using appropriate heuristics. But they are not the same. IMHO, the accuracy should always be stored with the value, the precision never.
-- daniel
In addition to a storage option for the desired unit prefix (this may be considered an original prefix, since naturally re-users may wish to reformat it),
I see no point in storing the unit used for input.
I think you plan to store the unit (which would be meter), so you don't want to store prefixes, correct?
Please argue why you don't see a point. You want to store the size of the universe, the distance to New York, and the size of the proton all in "meter"? If not, with which algorithm will you restore the SI prefix -- or rather, recognize which SI prefix is usable? We do not use Mm in common language: we give the circumference of the earth as roughly 40 000 km, not as 40 Mm. We don't write 4·10^7 m either.
it is probably necessary to store the number of significant decimals.
That's how Denny proposed to calculate the default accuracy. If the accuracy is given by a complex model (e.g. a gamma distribution), then it might be handy to have a simple value that tells us the significant digits.
Hm... perhaps it's best to always express accuracy as "+/-n", and allow for more detailed information (standard deviation, whatever) as *additional* information about the accuracy (could be modelled as a qualifier internally).
I fear those are two separate levels of giving a measure of measurement _precision_ (I believe "accuracy" is the wrong term here; precision and accuracy are related but distinct concepts). So 4.10 means that the last digit is significant, i.e. the best estimate is at least between 4.095 and 4.105 (but it may be better). 4.10 +/- 0.005 means it is precisely between 4.095 and 4.105, as opposed to 4.10 +/- 0.004, 4.10 +/- 0.003, 4.10 +/- 0.002, etc.
Furthermore, a quantity may be given as 4.10-4.20-4.35. The precision of measurement and the measure of variance and dispersion are separate concepts.
I believe this need not be any visible setting in the user interface; the number of digits can simply be preserved. Without it, it is impossible to store and reproduce information like "10.20 nm"; it would be returned as 1.02·10^-8 m.
No, it would return using whatever system of measurement the user has selected in their preferences.
then you have lost the information. There is no "user selection" about this in science.
A complex heuristic may "guess" when to use scientific SI prefixes instead. The trailing zero cannot be reproduced, however, when relying entirely on IEEE floating point.
We'll need heuristics to pick the correct secondary unit (e.g. nm or km). The
(I believe there is no such thing as a "secondary unit" -- did you make that term up? Only "m" is a unit of measurement; the n or k are prefixes, see http://en.wikipedia.org/wiki/SI_prefix)
general rule could be to pick a unit so that the actual value is between 1 and 10, with some additional rules for dealing with cultural specialities (decimeter is rarely used, hectoliter however is pretty common. The decagram is commonly used in Austria only, etc).
You would also need to know which prefix is applicable to which unit in which context. In a scientific context different prefixes are used than in a lay context. In a lay context astronomical temperatures may be given as degrees Celsius, in a scientific one as kelvin. This is not just a user preference.
I agree that the system should allow explicit conversion in infoboxes. I disagree that you should create an artificial intelligence system for Wikidata that knows more about unit usage than the authors. To store the wisdom of authors, storing both the unit and the original unit prefix is necessary.
You write "The Precision can be derived from the accuracy and vice versa, using appropriate heuristics."
I _terribly strongly_ doubt that. Can you give any proof of that? For precision I can use statistics; for accuracy I need an indirect, separate and precise method to estimate it. If you have a laser distance-measurement device, you can estimate the precision yourself by repeated measurements at various times, temperatures, etc. But unless you have an objective distance standard, you have no means to determine whether the accuracy of the device is always off by 10 cm because someone screwed up the software inside the device.
But they are not the same. IMHO, the accuracy should always be stored with the value, the precision never.
I fear that is a view of how data in a perfect world should be known, not a reflection of the kind of data that people need to store in Wikidata. Very often only the precision will be known or available to its authors, or worse, the source may not say which it is.
Gregor
On 19.12.2012 15:12, Gregor Hagedorn wrote:
I see no point in storing the unit used for input.
I think you plan to store the unit (which would be meter), so you don't want to store prefixes, correct?
Please argue why you don't see a point. You want to store the size of the universe, the distance to New York, and the size of the proton all in "meter"?
Yes. Otherwise, they would not be comparable in a database query. It's generally a good idea to normalize all values of a given dimension to use the same unit, for the same reason that it's best to convert dates in a database to UTC, instead of storing time zone info for each entry.
If not, with which algorithm will you restore the SI prefix -- or rather, recognize which SI prefix is usable? We do not use Mm in common language: we give the circumference of the earth as roughly 40 000 km, not as 40 Mm. We don't write 4·10^7 m either.
This would be done based on the accuracy. If the accuracy is ~1000m, the heuristic would decide that km is probably a good unit, with no digits after the decimal point. Besides that, the desired unit and precision can be specified when rendering values on a wiki page.
it is probably necessary to store the number of significant decimals.
Yes, that *is* the accuracy value I mean.
That's how Denny proposed to calculate the default accuracy. If the accuracy is given by a complex model (e.g. a gamma distribution), then it might be handy to have a simple value that tells us the significant digits.
Hm... perhaps it's best to always express accuracy as "+/-n", and allow for more detailed information (standard deviation, whatever) as *additional* information about the accuracy (could be modelled as a qualifier internally).
I fear that is two separate levels of precision of giving a measure of measurement _precision_ (I believe "accuracy" is the wrong term here, precision and accuracy are related but distinct concepts).
Ok, there's some terminology confusion here. I'm using "accuracy" to refer to the accuracy of measurement (e.g. standard deviation), and "precision" to refer to the precision of presentation (e.g. significant digits). We need these two things at least, and words for them. I don't care much which words we use.
So 4.10 means that the last digit is significant, i.e. the best estimate is at least between 4.095 and 4.105 (but it may be better). 4.10 +/- 0.005 means it is precisely between 4.095 and 4.105, as opposed to 4.10 +/- 0.004, 4.10 +/- 0.003, 4.10 +/- 0.002, etc.
Yes, all this should be handled by the component responsible for parsing user input for quantity values.
Furthermore, a quantity may be given as 4.10-4.20-4.35. The precision of measurement and the measure of variance and dispersion are separate concepts.
True, though I think that we can lump them together for the sake of comparison, picking appropriate units, etc. The finer points can be handled by qualifiers (of the value, or even of the accuracy).
No, it would return using whatever system of measurement the user has selected in their preferences.
then you have lost the information. There is no "user selection" about this in science.
I have lost the information which unit was used when originally entering the data. That's it.
A complex heuristic may "guess" when to use scientific SI prefixes instead. The trailing zero cannot be reproduced, however, when relying entirely on IEEE floating point.
We'll need heuristics to pick the correct secondary unit (e.g. nm or km). The
(I believe there is no such thing as a "secondary unit" -- did you make that term up? Only "m" is a unit of measurement; the n or k are prefixes, see http://en.wikipedia.org/wiki/SI_prefix)
I made up the term, yes :) Whatever you call "km", if "m" is the real unit and "k" is the prefix.
I agree that the system should allow explicit conversion in infoboxes. I disagree that you should create an artificial intelligence system for Wikidata that knows more about unit usage than the authors. To store the wisdom of authors, storing both the unit and the original unit prefix is necessary.
It can be stored as an auxiliary data point, that is, as a qualifier ("measured in feet"). It should not IMHO be part of the data value as such, because that would make it extremely hard to use the values in a database.
You write "The Precision can be derived from the accuracy and vice versa, using appropriate heuristics."
I _terribly strongly_ doubt that. Can you give any proof of that? For precision I can use statistics; for accuracy I need an indirect, separate and precise method to estimate it. If you have a laser distance-measurement device, you can estimate the precision yourself by repeated measurements at various times, temperatures, etc. But unless you have an objective distance standard, you have no means to determine whether the accuracy of the device is always off by 10 cm because someone screwed up the software inside the device.
Sorry, I didn't mean to say that you can calculate one from the other, but that you can make a decent guess that is useful for parsing and display. E.g. if a value is given as 5 km, we can assume that it wasn't measured with millimeter precision. And if something is declared to have millimeter accuracy, it probably makes sense to use a precision of 3 or 4 digits after the decimal point when converting to feet.
This is just heuristics, and yes, we shouldn't try to be too smart here. Just smart enough to be useful.
But they are not the same. IMHO, the accuracy should always be stored with the value, the precision never.
I fear that is a view of how data in a perfect world should be known, not a reflection of the kind of data that people need to store in Wikidata. Very often only the precision will be known or available to its authors, or worse, the source may not say which it is.
Ok, so my use of precision is not precise :) We are stumbling over words here. What would you call the level of detail used for presentation, if not "precision"?
-- daniel
it is probably necessary to store the number of significant decimals.
Yes, that *is* the accuracy value i mean.
Daniel, please use correct terms. Accuracy is a defined concept and although by convention it may be roughly expressed by using the number of significant figures, that is not the same concept. Without additional information you cannot infer backwards whether usage of significant figures expresses accuracy or precision. See http://en.wikipedia.org/wiki/Accuracy_and_precision
Ok, there's some terminology confusion here. I'm using "accuracy" to refer to the accuracy of measurement (e.g. standard deviation), and "precision" to refer to the precision of presentation (e.g. significant digits). We need these two things at least, and words for them. I don't care much which words we use.
I do. And I think it is important for WIkidata to precisely express what it wants to achieve.
Accuracy has nothing to do with s.d., which is a measure of dispersion. You can have an accuracy of +/- 10 measured with a precision of +/- 0.1 (and a standard deviation for the population of objects that you have measured of 2).
-----
So 4.10 means that the last digit is significant, i.e. the best estimate is at least between 4.095 and 4.105 (but it may be better). 4.10 +/- 0.005 means it is precisely between 4.095 and 4.105, as opposed to 4.10 +/- 0.004, 4.10 +/- 0.003, 4.10 +/- 0.002, etc.
Yes, all this should be handled by the component responsible for parsing user input for quantity values.
But it cannot be, because you have lost the information. I don't know whether +/- 0.005 indicates significant figures/digits or whether it is an exact precision-or-accuracy interval.
I think this may become clearer if you consider a value entered in inches:
1.20 inches. You convert: 1.20 +/- 0.05 in = 3.048·10^-2 m +/- 1.27·10^-3 m.
If this is the only information stored, I have no information left about whether I should display 3.048 or 3.0480, and whether the "+/- 1.27·10^-3 m" is meaningful (no) or an artifact of conversion (yes).
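In code, the information loss looks like this (sketch; 0.0254 m per inch):

    # "1.20 in" carries three significant digits and the unit "inch".
    entered = "1.20"
    value_m = float(entered) * 0.0254   # normalized to metres: ~0.03048

    # Later, rendering from the normalized value alone:
    back = value_m / 0.0254
    print(back)   # ~1.2 -- but should it print as "1.2", "1.20" or "1.200"?
    # Nothing in value_m says: the significant-digit count and the original
    # unit were both lost at normalization time.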
It can be stored as an auxiliary data point, that is, as a qualifier ("measured in feet"). It should not IMHO be part of the data value as such, because that would make it extremely hard to use the values in a database.
You are correct insofar as I propose you need to store two units: the normalized one (SI units only, and no prefix -- and even though the SI base unit is kg, I would store gram) and the original one plus the original unit prefix.
If you do that, you can store the value in a single normalized unit, provided you back-convert it prior to display in Wikidata.
I don't think the original unit is a meaningless qualifier, it is vital information for context.
Gregor
Ok, so my use of precision is not precise :) We are stumbling over words here. What would you call the level of detail used for presentation, if not "precision"?
I would call them significant digits; Wikipedia seems to prefer the lemma http://en.wikipedia.org/wiki/Significant_figures -- both will do and denote the same concept. But accuracy and precision are interpretations of this, and there are more flexible ways to express them than through significant digits/figures, such as an upper or lower value. The two things are not the same. I think it is not an overcomplication to provide both significant digits and a precise upper and lower interval.
Gregor
Hi,
On 2012-12-19 15:12, Gregor Hagedorn wrote:
In addition to a storage option for the desired unit prefix (this may be considered an original prefix, since naturally re-users may wish to reformat it),
I see no point in storing the unit used for input.
I think you plan to store the unit (which would be meter), so you don't want to store prefixes, correct?
Please argue why you don't see a point. You want to store the size of the universe, the distance to New York, and the size of the proton all in "meter"? If not, with which algorithm will you restore the SI prefix -- or rather, recognize which SI prefix is usable? We do not use Mm in common language: we give the circumference of the earth as roughly 40 000 km, not as 40 Mm. We don't write 4·10^7 m either.
I assume there's a table of usual units for different purposes. E.g. altitudes are displayed in m and ft. One of those is chosen by the user's locale setting. My locale setting would be kind of "metric system", so it would be displayed in m on my Wikidata surface. On enwiki it would probably be displayed in ft.
it is probably necessary to store the number of significant decimals.
That's how Denny proposed to calculate the default accuracy. If the accuracy is given by a complex model (e.g. a gamma distribution), then it might be handy to have a simple value that tells us the significant digits.
Hm... perhaps it's best to always express accuracy as "+/-n", and allow for more detailed information (standard deviation, whatever) as *additional* information about the accuracy (could be modelled as a qualifier internally).
I fear those are two separate levels of giving a measure of measurement _precision_ (I believe "accuracy" is the wrong term here; precision and accuracy are related but distinct concepts). So 4.10 means that the last digit is significant, i.e. the best estimate is at least between 4.095 and 4.105 (but it may be better). 4.10 +/- 0.005 means it is precisely between 4.095 and 4.105, as opposed to 4.10 +/- 0.004, 4.10 +/- 0.003, 4.10 +/- 0.002, etc.
My suggestion would be:
* Somebody types in 4.10, so 4.10 will be saved. There is no accuracy available, so n/a is saved for the accuracy -- or even the JavaScript way could be used, which would be undefined (because not mentioned). Retrieving this will result in 4.10 or {value: 4.10}.
* Somebody types in 4.1 with an accuracy of 0.05. So 4.1 will be saved, with an accuracy of 0.05. Anybody who retrieves this will get 4.1 or {value: 4.1, accuracy: 0.05}. Retrieving it with precision 3 will result in 4.100 or {value: 4.100, accuracy: 0.05}.
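As a data structure, the two cases might look like this (a rough sketch; field names invented, and the typed digits kept as a string so that the trailing zero survives):

    from dataclasses import dataclass, asdict
    from typing import Optional

    @dataclass
    class QuantityValue:
        value: str                        # keep the typed digits: "4.10", not 4.1
        accuracy: Optional[float] = None  # None = not stated, rather than 0

    print(asdict(QuantityValue("4.10")))
    # {'value': '4.10', 'accuracy': None}
    print(asdict(QuantityValue("4.1", 0.05)))
    # {'value': '4.1', 'accuracy': 0.05}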
Furthermore, a quantity may be given as 4.10-4.20-4.35. The precision of measurement and the measure of variance and dispersion are separate concepts.
Hm, in mechanical engineering there are also ±-values where the tolerances up and down differ from each other. E.g. it should be 11.2, but it may be as low as 11.1 or as high as 11.35.
I believe this need not be any visible setting in the user interface; the number of digits can simply be preserved. Without it, it is impossible to store and reproduce information like "10.20 nm"; it would be returned as 1.02·10^-8 m.
No, it would return using whatever system of measurement the user has selected in their preferences.
then you have lost the information. There is no "user selection" about this in science.
Lengths, distances, sizes, etc. are measured in meters; that's how science does it. Display is a separate matter.
A complex heuristic may "guess" when to use scientific SI prefixes instead. The trailing zero cannot be reproduced, however, when relying entirely on IEEE floating point.
We'll need heuristics to pick the correct secondary unit (e.g. nm or km). The
(I believe there is no such thing as a "secondary unit" -- did you make that term up? Only "m" is a unit of measurement; the n or k are prefixes, see http://en.wikipedia.org/wiki/SI_prefix)
(Actually it's not a real unit -- k is a factor of 1000 -- so let's call it internally a "secondary unit", or perhaps a scaling factor.)
general rule could be to pick a unit so that the actual value is between 1 and 10, with some additional rules for dealing with cultural specialities (decimeter is rarely used, hectoliter however is pretty common. The decagram is commonly used in Austria only, etc).
You would also need to know which prefix is applicable to which unit in which context. In a scientific context different prefixes are used than in a lay context. In a lay context astronomical temperatures may be given as degrees Celsius, in a scientific one as kelvin. This is not just a user preference.
I agree that the system should allow explicit conversion in infoboxes. I disagree that you should create an artificial intelligence system for Wikidata that knows more about unit usage than the authors. To store the wisdom of authors, storing both the unit and the original unit prefix is necessary.
If somebody enters 32 degrees Fahrenheit, it should be stored as 273.15 kelvin. On the German Wikipedia it would be displayed as 0 degrees Celsius, on the English one as 32 degrees Fahrenheit, and on Wikidata as whatever the user wants.
You write "The Precision can be derived from the accuracy and vice versa, using appropriate heuristics."
I _terribly strongly_ doubt that. Can you give any proof of that? For precision I can use statistics; for accuracy I need an indirect, separate and precise method to estimate it. If you have a laser distance-measurement device, you can estimate the precision yourself by repeated measurements at various times, temperatures, etc. But unless you have an objective distance standard, you have no means to determine whether the accuracy of the device is always off by 10 cm because someone screwed up the software inside the device.
But they are not the same. IMHO, the accuracy should always be stored with the value, the precision never.
I fear that is a view of how data in a perfect world should be known, not a reflection of the kind of data that people need to store in Wikidata. Very often only the precision will be known or available to its authors, or worse, the source may not say which it is.
I think this is a matter of Wikidata definitions. For years now, "precision" has been used for the number of digits after the decimal point. Now we need another word for expressing how accurate a value is. Therefore: do we have a glossary?
Cheers
Marco
On 19.12.2012 16:41, Marco Fleckinger wrote:
I assume there's a table of usual units for different purposes. E.g. altitudes are displayed in m and ft. One of those is chosen by the user's locale setting. My locale setting would be kind of "metric system", so it would be displayed in m on my Wikidata surface. On enwiki it would probably be displayed in ft.
I'd have thought that we'd have one such table per dimension (such as "length" or "weight"). It may make sense to override that on a per-property basis, so 2300m elevation isn't shown as 2.3km. Or that can be done in the template that renders the value.
My suggestion would be:
- Somebody types in 4.10, so 4.10 will be saved. There is no accuracy available, so n/a is saved for the accuracy -- or even the JavaScript way could be used, which would be undefined (because not mentioned). Retrieving this will result in 4.10 or {value: 4.10}.
What is saved would depend on unit conversion; the value actually stored in the database would be in a base unit. In addition, the input's precision would be used to derive the value's accuracy: entering 4.1 m would make the accuracy default to 0.1 m (i.e. +/- 5 cm).
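The defaulting rule would be something like this sketch (it deliberately ignores trailing zeros of integers):

    def default_accuracy(text):
        """Half of the last typed digit's place value: "4.1" -> 0.05."""
        decimals = len(text.split(".")[1]) if "." in text else 0
        return 0.5 * 10 ** -decimals

    print(default_accuracy("4.1"))    # 0.05
    print(default_accuracy("4.10"))   # 0.005
    print(default_accuracy("400"))    # 0.5 (trailing integer zeros not handled)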
Furthermore, a quantity may be given as 4.10-4.20-4.35. The precision of measurement and the measure of variance and dispersion are separate concepts.
Hm, in mechanical engineering there are also ±-values where the tolerances up and down differ from each other. E.g. it should be 11.2, but it may be as low as 11.1 or as high as 11.35.
I'd suggest to store such additional information in a Qualifier instead of the Data Value itself.
I fear that is a view of how data in a perfect world should be known, not a reflection of the kind of data that people need to store in Wikidata. Very often only the precision will be known or available to its authors, or worse, the source may not say which it is.
I think this is a matter of Wikidata definitions. For years now, "precision" has been used for the number of digits after the decimal point. Now we need another word for expressing how accurate a value is. Therefore: do we have a glossary?
Indeed we do: https://wikidata.org/wiki/Wikidata:Glossary
I use "precision" exactly like that: significant digits when rendering output or parsing intput. It can be used to *guess* at the values accuracy, but is not the same.
-- daniel
On 19 December 2012 17:03, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
I'd have thought that we'd have one such table per dimension (such as "length" or "weight"). It may make sense to override that on a per-property basis, so 2300m elevation isn't shown as 2.3km. Or that can be done in the template that renders the value.
Here and in the entire discussion, I fear that the need to support curating Wikidata's data for correctness is not sufficiently in focus.
If someone enters the height of a mountain in feet and I see the converted value in meters in my preferences-converted Wikidata view, I will "correct" the seemingly senseless and unjustified precision to three digits after the meter. Only if we understand in which unit the data were originally valid will we be able to communicate and collaborate successfully.
Yes, Wikidata shall store a normalized version of the value, but it also needs to store an original one. Whether it needs to store the value twice I am not sure; I believe not. If it stores the original prefix, original unit and original significant digits, it can generally recreate the original form. I know that there are some pitfalls with IEEE numbers in this, and it may be safer to store the original number as well initially (and perhaps drop it later, when enough data are available to test the effects).
Of course, Wikipedias can use the API to display the value in any other form, just as they like, but that does not solve the problem of data curation on wikidata (which includes the data curation by wikipedia authors).
Gregor
On 19.12.2012 18:00, Gregor Hagedorn wrote:
Yes, Wikidata shall store a normalized version of the value, but it also needs to store an original one. Whether it needs to store the value twice I am not sure; I believe not. If it stores the original prefix, original unit and original significant digits, it can generally recreate the original form. I know that there are some pitfalls with IEEE numbers in this, and it may be safer to store the original number as well initially (and perhaps drop it later, when enough data are available to test the effects).
I was trying to avoid storing the original input value and unit, but perhaps that was a bad idea. Wikidata will have to store such values twice anyway: once as structured data in the primary data record, and once as an index value.
The index value has to be normalized and "dumb" - we can't put structure there, it's going to be a single value, possibly augmented by some kind of accuracy indicator. The index value needs to have a form that can be easily sorted, compared, indexed and queried by a wide variety of database systems. The value we store in the primary data record however can be as complex as we need it, and can be in any unit we like.
The rendering of a value will be based on the primary data record, so the conversion and rendering logic has access to all the additional information it may want to use.
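Schematically, something like this (table and record layout invented for illustration):

    import json, sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE quantity_claim (
        item TEXT,
        property TEXT,
        record TEXT,        -- full structured value, as complex as needed
        index_value REAL    -- normalized "dumb" scalar for sorting/ranges
    )""")

    record = {"amount": "1013", "unit": "hPa", "accuracy": 1,
              "normalized": {"amount": 1.013e5, "unit": "Pa"}}
    con.execute("INSERT INTO quantity_claim VALUES (?, ?, ?, ?)",
                ("Q_example", "P_pressure", json.dumps(record), 1.013e5))
    con.execute("CREATE INDEX qc_prop_value "
                "ON quantity_claim (property, index_value)")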
-- daniel
On 19 December 2012 17:03, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
Indeed we do: https://wikidata.org/wiki/Wikidata:Glossary
I use "precision" exactly like that: significant digits when rendering output or parsing intput. It can be used to *guess* at the values accuracy, but is not the same.
(I cannot see that definition there, the word "precision" does not exist on that page.)
There is an issue here: we can speak of the precision of a number in this sense (number of significant digits).
However, since the concept of the precision of a data value or a measurement is both narrower (it distinguishes precision and accuracy, http://en.wikipedia.org/wiki/Accuracy_and_precision) and wider (significant digits/figures are only one way to express precision or accuracy), I suggest following the definitions on the English Wikipedia and using "significant digits" for what you call precision.
Gregor
PS: (I am still unsure how you define the term "accuracy" in "Wikidata speak"; Daniel gave a definition but I could not follow it yet.)
On 19.12.2012 18:13, Gregor Hagedorn wrote:
On 19 December 2012 17:03, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
Indeed we do: https://wikidata.org/wiki/Wikidata:Glossary
I use "precision" exactly like that: significant digits when rendering output or parsing intput. It can be used to *guess* at the values accuracy, but is not the same.
(I cannot see that definition there, the word "precision" does not exist on that page.)
Oh, I didn't mean to imply it's there. But we do have a glossary, and that's where any definition should go, once we have them.
There is an issue here that we can speak of a the precision of a number in this sense (number of significant digits).
I'd prefer a concept that is independent of representation; the number of significant digits depends on the unit and also on a base-10 representation.
PS: (I am still unsure how you define the term "accuracy" in "Wikidata speak"; Daniel gave a definition but I could not follow it yet.)
I was using "accuracy" to describe how sure we are about a given value, and "precision" to describe how exactly we represent (render) the value. I now see that this choice of terminology may be confusing, and that there are also multiple aspects of certainty.
So, please suggest terms to use for at least these two things:
1) value certainty (ideally, not using "digits", but something that is independent of unit and rendering)
2) output exactness (here, the number of digits is actually what we want to talk about)
Perhaps we need additional distinctions, but I'm sure we at least need these two concepts.
-- daniel
On 2012-12-20 12:46, Daniel Kinzler wrote:
So, please suggest terms to use for at least these two things:
- value certainty (ideally, not using "digits", but something that is independent of unit and rendering)
We want to specify the "limits of (possible) variation" of a value, which would be tolerance ([[Engineering tolerance]]).
E.g. the values of electrical resistors, capacitors, etc. are specified in Ω ± % or F ± %. We could also allow/display either absolute or relative values.
- output exactness (here, the number of digits is actually what we want to talk about)
Everywhere in the realm of software development, "precision" is used for this. So the suggestion of "precision" was not that bad here either.
Perhaps we need additional distinctions, but I'm sure we at least need these two concepts.
Marco
So, please suggest terms to use for at least these two things:
- value certainty (ideally, not using "digits", but something that is independent of unit and rendering)
Here we want to say that the true value lies, with a certain probability, within a given interval -- something like "2.3 +/- 0.2 µm".
I am not too sure here myself. Different terms exist depending on whether you talk about an inherent measurement error of a single individual with a single true value, or about statistical measures or estimates.
Marco gives yet another example: "We want to specify the 'limits of (possible) variation' of a value, which would be engineering tolerance. E.g. the values of electrical resistors, capacitors, etc. are specified in Ω ± % or F ± %. We could also allow/display either absolute or relative values." -- In this case, it is actually not an uncertainty of the actual sample of resistors, but a design specification, i.e. the specification that resistors must be (all, or only 95%?) within _at least_ these limits.
So what to do here?
List the different use cases of a value plus-minus other values?
* measurement-method limited precision range of single measurements (e.g. small structures in a light microscope, limited by the resolution capability of blue light, approx. 0.2 µm)
* measurement-method limited accuracy range (or accuracy plus precision)
* confidence interval for the mean (or other statistical parameters: mode, variance, etc.) of the population as estimated based on a sample
* one of potentially several percentiles (incl. +/- s.d.) measuring spread, but giving no information about the probability that the true mean is between these values
* engineering design specifications that a given (unknown) fraction of individuals must be within these limits
I believe for the moment you don't want to go into certainty in the sense that a number is an estimate of a
All these different concepts have, rightly, different names. There can be:
* precision +/- 0.2
* accuracy +/- 0.2
* tolerance +/- 0.2
* error margin +/- 0.2
* +/- 1 or 2 s.d. +/- 0.2
* 95% confidence interval (CI) +/- 0.2
* 10 to 90% percentile +/- 0.2
* uncertainty (of what?) +/- 0.2
(ASIDE: +/- 2 s.d. defines roughly a 95% probability that the next value from a random sample is in the interval; the 95% CI says that the true value of the mean is in that interval. These are completely different things -- for the same measurements you can validly report 100 +/- 50 for the first and 100 +/- 0.001 for the second. That is, with probability 95% the next randomly sampled measurement will be between 50 and 150, and with probability 95% the true mean is known to be between 99.999 and 100.001. Semantics matter, not just the "pattern" of plus-minus a value.)
Because of the widely varying use cases listed above, I believe we need very neutral labels for the plus-minus values if the data type shall simply provide two "variables" in a generic sense, the true semantics of which are then provided by qualifier information.
I could think of something like:
* lower range (lowerRange) and upper range (upperRange)
* lower/upper interval value/endpoint
but I don't much like this, because it would force people to abandon the plus/minus notation and calculate actual values.
Better may be something like:
* upwardsAbsolute
* downwardsAbsolute
* upwardsPercent
* downwardsPercent
or
* plusValueAbsolute
* minusValueAbsolute
* plusValuePercent
* minusValuePercent
as neutral terms -- but I would be glad if someone comes up with other neutral terms.
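As a sketch of what such a neutral structure could look like (field names are straw-man proposals; the semantics label is the qualifier):

    from dataclasses import dataclass

    @dataclass
    class ValueSpread:
        plus_value: float            # upward extent, in the value's own unit
        minus_value: float           # downward extent
        relative: bool = False       # True -> the extents are percentages
        semantics: str = "unstated"  # e.g. "tolerance", "95% CI", "precision"

    # Marco's resistor case: 470 Ohm +/- 5 %
    spread = ValueSpread(5, 5, relative=True, semantics="tolerance")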
However, I hope we start realizing that all of us seem to look at this primarily from only one of the use cases listed above (me included; I usually have cases with variance spread or CI of the mean). We should stop using terms that are specific to one but not the other of the cases. The assumption "these things are all more or less the same" is not true. A confidence interval is neither a manufacturing tolerance nor a measurement precision. And precision is not accuracy, etc.
- output exactness (here, the number of digits is actually what we want to talk about)
xsd:totalDigits or Wikipedia: significantDigits or significantFigures
that is one way to express value exactness, albeit a coarse one.
Marco writes: "Everywhere in the realm of software development precision is used for this. Therefore also here the suggestion of precision was not that bad."
-> In software development, the term is about the precision of the numeric data type, i.e. the precision of the storage mechanism. The term precision is correctly applied there. However, we are talking about the actually significant digits of a measurement, which are part of the potential information on the precision and accuracy of the value. A value measured with e.g. 6 digits may be stored in a data type which has a precision of 16 digits. I think applying "precision" to significant digits produces a fundamental misunderstanding of what precision is; see the Wikipedia article on precision and accuracy.
Gregor
Gregor Hagedorn g.m.hagedorn@gmail.com wrote:
So, please suggest terms to use for at least these two things:
- value certainty (ideally, not using "digits", but something that
is
independent of unit and rendering)
Here we want to talk about something that the true value is with a certain probability within a given interval, something like: "2.3 +/-0.2 µm"
I am not too sure here myself. Different terms exist whether you talk about an inherent measurement error of a single individual with a single true value, or whether you speak of statistical measures or estimates.
Marco gives yet another example: "We want to specify the "limits of (possible) variation" of a value, which would be Engineering tolerance. E.g. the value of electrical resistances, capacitors, etc. are measured in Ω ± % or F ± %. We could also either use/allow/display absolute or relative values." -- In this case, it is actually not a uncertainty of the actual sample of resistors, but a design specification, i.e. the specification that resistors must be (all or only 95%?) within _at least_ these limits.
So what to do here?
List the different use cases of a value plus-minus other values?
- measurement-method limited precision range of single measurements
(e.g.small structures in light microscope, limited by resolution capability of blue light, approx. 0.2 µm)
- measurement-method limited accuracy range (or accuracy plus
precision)
- Confidence interval for mean (or other statistical parameters: mode,
variance, etc.) of the population as estimated based on a sample
- one of potentially several percentiles (incl. +- s.d.) measuring
spread, but giving no information about the probability that the true mean is between these values
- engineering design specifications that a given (unknown) fraction of
individuals must be within these limits I believe for the moment you don't want to go into certainty in the sense that a number is an estimate of a
All these different concepts have rightly so different names. There can be:
- precision +/- 0.2
- accuracy +/- 0.2
- tolerance +/- 0.2
- error margin +/- 0.2
- +/- 1 or 2 s.d. +/- 0.2
- 95% confidence interval (CI) +/- 0.2
- 10 to 90% percentile +/- 0.2
- uncertainty (of what?) +/- 0.2
(ASIDE: the +/1 2 s.d. defines roughly a 95% probability that the next value from a random sample is in the interval, the 95% CI that the true value of the mean is in that interval. These are completely different things -- for the same measurements you can report validly 100 +/- 50 for the first and 100 +-0.001 for the second. That is, with probability 95% the next randomly sampled measurement will be between 50 and 150, and with probability 95% it is known that the true mean is between 99.999 and 100.001. Semantic matters, not only the "pattern" of plus-minus a value.)
Because of the widely varying use cases listed above, I believe we need very neutral labels for the plus-minus values if the data type shall simple provide two "variables" in a generic sense, the true semantics of which are then provided by qualifier information.
I could think of something:
- lower range (lowerRange) and upper range (upperRange).
- lower/upper interval value/endpoint
but I don't very much like this because it would force people to abandon the plus/minus notation and calculate actual values.
Better may be something like:
- upwardsAbsolute
- downwardsAbsolute
- upwardsPercent
- downwardsPercent
or
- plusValueAbsolute
- minusValueAbsolute
- plusValuePercent
- minusValuePercent
as neutral terms - but I would be glad if someone comes up with other neutral terms.
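As an illustration of what such a neutral record could look like, here is a hedged sketch in Python; the class name, field names, and the bracket_semantics qualifier are hypothetical, not an agreed Wikidata schema:

```python
# Sketch only: neutral plus/minus fields whose meaning comes from a qualifier.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class QuantityValue:
    amount: float
    unit: str
    plus_value_absolute: Optional[float] = None   # "upwards" extent
    minus_value_absolute: Optional[float] = None  # "downwards" extent
    # The semantics of the bracket come from a qualifier, not the field names:
    # e.g. "standard deviation", "95% CI", "engineering tolerance", ...
    bracket_semantics: Optional[str] = None

    def interval(self) -> Tuple[float, float]:
        """Absolute endpoints, derived on demand, so editors can keep the
        familiar plus/minus notation instead of calculating endpoints."""
        lo = self.amount - (self.minus_value_absolute or 0.0)
        hi = self.amount + (self.plus_value_absolute or 0.0)
        return lo, hi

# "2.3 +/- 0.2 µm", with the bracket declared as a measurement precision:
v = QuantityValue(2.3, "µm", 0.2, 0.2, bracket_semantics="precision")
print(v.interval())  # -> (2.1, 2.5), up to float rounding
```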
However, I hope we start realizing that all of us seem to look at this primarily from only one of the use cases listed above (me included, I usually have cases with variance spread or CI of mean). We should stop using terms that are specific to one but not the other of the cases. The assumption "these things are all more or less the same" is not true. A confidence interval is neither a manufacturing tolerance nor a measurement precision. And precision is not accuracy, etc.
- output exactness (here, the number of digits is actually what we want to talk about)
xsd:totalDigits or Wikipedia: significantDigits or significantFigures
That is one way to express value exactness, albeit a coarse one.
Marco writes: "Everywhere in the realm of software development precision is used for this. Therefore also here the suggestion of precision was not that bad."
-> In software development, the term is about the precision of the numeric data type, i.e. the precision of the storage mechanism. The term precision is correctly applied here. However, we talk about the actual significant digits of a measurement, which are part of the potential information on precision and accuracy of the value. The measured value with e.g. 6 digits may be stored in a data type which has a precision of 16 digits. I think applying "precision" to significant digits produces a fundamental misunderstanding of what precision is; see the Wikipedia topic on precision and accuracy.
Hm, the second one is only relevant for output. Why not use the term "output format" as a pattern, just like Excel, OpenOffice, and LibreOffice do? This could include the number of digits after the decimal separator, the optional accuracy/whatever, and the unit. This will be fine for the API and the MW syntax.
Cheers
Marco
On 21.12.2012 11:29, Marco Fleckinger wrote:
Hm, the second one is only relevant for output. Why not use the term "output format" as a pattern, just like Excel, OpenOffice, and LibreOffice do? This could include the number of digits after the decimal separator, the optional accuracy/whatever, and the unit. This will be fine for the API and the MW syntax.
Well, like you said, the output format includes a lot of things, like the decimal separator, the thousands separator, location of the negative sign, whether to put a zero before the decimal point, and digits after the decimal point. We still need a name for the latter :)
-- daniel
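A small illustration of how the pieces Daniel lists vary independently; Python's format mini-language stands in here for a spreadsheet-style pattern, and none of this is proposed Wikidata or MediaWiki syntax:

```python
# Sketch: "output format" bundles several independent choices.
value = -1234567.8912

# Digits after the decimal point (the piece still lacking a name) ...
print(f"{value:.2f}")    # -1234567.89
# ... are independent of the thousands separator ...
print(f"{value:,.2f}")   # -1,234,567.89
# ... and of locale conventions such as ',' as the decimal separator
# (crude swap, just to show these are separate concerns):
print(f"{value:,.2f}".replace(",", " ").replace(".", ","))  # -1 234 567,89
```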
Hm, the second one is only relevant for output.
I think this is a fundamental misunderstanding: The "original" one is not for output but is the primary value for interpretation, for understanding whether a value in Wikidata is correct or fake, or a software conversion error, or what. If I want to learn something in Wikipedia, I have to have access to this information. Having access to this I can understand whether two seemingly different values from the same source are justified or an error (like in the example from the previous mail, where 100 +/- 50 and 100 +/- 0.1 can both be valid for the same quantity and the same source observations).
I view the "roughly, unreliably, with lots of heuristics, normalized and converted" version as the secondary one. It has its uses, but I would in fact put it second and show it only with a large warning banner that this version contains lots of unwarranted assumptions which may or may not hold. But I don't care which is primary or secondary; I only want to encourage you not to forget the "data" in Wikidata over implementing the essential search, retrieval, conversion, etc. functionality.
:-)
In an ideal world all data would be in a fully convertible state and no-one would simply use significant digits to express margins of error, reliability, tolerance etc. But I have not encountered this world yet.
Why not use the term "output format" as a pattern, just like Excel, OpenOffice, and LibreOffice do? This could include the number of digits after the decimal separator, the optional accuracy/whatever, and the unit. This will be fine for the API and the MW syntax.
I don't care how the information is encoded; if you develop your own language to encode information in a string and provide a syntax for that, that is fine. But even within Microsoft products the "formatting strings" are only similar, not fully compatible, and I have doubts that this is a good way towards global interoperability.
Gregor
The xsd:minInclusive, xsd:maxInclusive, xsd:minExclusive and xsd:maxExclusive facets are absolute expressions, not relative +/- expressions, in order to accommodate fast queries. These four facets permit specification of ranges with an unspecified median and ranges with a specified mode, inclusive or exclusive of endpoints, a six-fer. For these reasons I believe the XSD approach is superior for specifying value sets when compared to storing the dispersion factors themselves, e.g. the "3" of +/- 3.
On 21 December 2012 19:36, jmcclure@hypergrove.com wrote:
The xsd:minInclusive, xsd:maxInclusive, xsd:minExclusive and xsd:maxExclusive facets are absolute expressions, not relative +/- expressions, in order to accommodate fast queries. These four facets permit specification of ranges with an unspecified median and ranges with a specified mode, inclusive or exclusive of endpoints, a six-fer. For these reasons I believe the XSD approach is superior for specifying value sets when compared to storing the dispersion factors themselves, e.g. the "3" of +/- 3.
Yes, provided they are actually tied to the semantics of minimum and maximum, which the xsd examples are. As long as the semantics of the proposed "value bracketing" in Wikidata are unknown, their use is questionable if not impossible. If I know something is plus/minus 2 s.d., or plus/minus 2 s.e., or the 10 to 90% percentile ... then I again can use them to the benefit of the query system. But not without.
Gregor
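A sketch of the point in plain Python, with the four XSD facets mirrored as hypothetical fields: absolute, individually inclusive/exclusive endpoints make a range-membership query a direct comparison, with no need to first reconstruct endpoints from a stored "+/- 3":

```python
# Sketch only: absolute endpoints in the style of the XSD facets.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Facets:
    min_inclusive: Optional[float] = None
    max_inclusive: Optional[float] = None
    min_exclusive: Optional[float] = None
    max_exclusive: Optional[float] = None

    def contains(self, x: float) -> bool:
        if self.min_inclusive is not None and x < self.min_inclusive:
            return False
        if self.max_inclusive is not None and x > self.max_inclusive:
            return False
        if self.min_exclusive is not None and x <= self.min_exclusive:
            return False
        if self.max_exclusive is not None and x >= self.max_exclusive:
            return False
        return True

# "100 +/- 3" stored as absolute endpoints:
f = Facets(min_inclusive=97, max_inclusive=103)
print(f.contains(103), f.contains(103.1))  # True False
```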
I detect a need to characterize the range expression - the most important aspect being whether the range is complete, or whether it excludes (equal) tails on each end. XSD presumes a complete range is being specified, not a subset; is that the issue you're raising?
Could an additional facet for "percentage-tails-excluded" effectively communicate this estimate?
On 18/12/12 16:52, Denny Vrandečić wrote:
Thank you for your comments, Marco.
2012/12/18 Marco Fleckinger marco.fleckinger@wikipedia.at
IMHO it would make sense to have something hybrid. The datatype for geolocation should accept something like a NaN value for optional altitudes. But it should also be possible to use altitudes without longitude and latitude.
Why would it make sense to have both? What is the altitude in the geolocation good for? Is there an example in Wikipedia where it is or would be used, and where a property of its own would not be better?
While at it, why not separate longitude and latitude? There are items that only have one, e.g. http://en.wikipedia.org/wiki/Prime_meridian
IMHO it would make sense to use the [[International System of Units]] for internal storage. It is not used consistently in other realms, not even in the German-speaking countries (PS vs. kW for cars). Maybe it would be possible to use small scripts (such as WP templates) to convert values, which could easily be developed by the community.
Internally we would translate it, yes, otherwise the numbers would not be comparable. But for editing we need to keep the unit of the source / of the editor, or else we will lose precision.
What I wanted to say. Additionally, in some cases historical units are not accurate or accurately known, so possibly we won't even be able to make the conversion.
On 19.12.2012 08:34, Nikola Smolenski wrote:
Why would it make sense to have both? What is the altitude in the geolocation good for? Is there an example in Wikipedia where it is or would be used, and where a property of its own would not be better?
While at it, why not separate longitude and latitude? There are items that only have one, e.g. http://en.wikipedia.org/wiki/Prime_meridian
I'd argue that the prime meridian does not have a location. And I think it would be bad to make longitude or latitude optional in a geo-coordinate data type.
There is however nothing that would keep you from defining "longitude" and "latitude" as separate properties, and assigning them a 1D scalar value. But that's unrelated to the geo-coordinate data type.
Internally we would translate it, yes, otherwise the numbers would not be comparable. But for editing we need to keep the unit of the source / of the editor, or else we will lose precision.
What I wanted to say. Additionally, in some cases historical units are not accurate or accurately known, so possibly we won't even be able to make the conversion.
I don't think we can sensibly support historical units with unknown conversions, because they cannot be compared directly to SI units. So, they couldn't be used to answer queries, can't be converted for display, etc - they aren't units in any sense the software can understand. This is a solvable problem, but would add a tremendous amount of complexity.
I think we should always store a value's accuracy (which, if not given explicitly, can be derived from the input's unit and number of digits). I think it would be a bad idea to store the original input unit, though. In order to make values comparable (and thus usable in queries), they need to be converted to internal units, with very good precision; for output, they need to be converted to whatever the user prefers. I see no useful place for the original input units.
-- daniel
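A minimal sketch of the "derive accuracy from the input" idea, assuming a simple half-unit-in-the-last-place heuristic (illustrative only, not the implemented behaviour):

```python
# Sketch: the trailing decimal places of the entered string imply a
# default +/- half-unit-in-the-last-place accuracy.
from decimal import Decimal

def implied_accuracy(entered: str) -> Decimal:
    d = Decimal(entered)
    # The exponent is negative for digits after the decimal point.
    return Decimal(1).scaleb(d.as_tuple().exponent) / 2

print(implied_accuracy("1.20"))  # 0.005 -> value known to +/- 0.005
print(implied_accuracy("1.2"))   # 0.05
print(implied_accuracy("12"))    # 0.5 (trailing zeros of integers stay ambiguous)
```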
On Wed, Dec 19, 2012 at 11:23 AM, Daniel Kinzler < daniel.kinzler@wikimedia.de> wrote:
On 19.12.2012 08:34, Nikola Smolenski wrote:
While at it, why not separate longitude and latitude? There are items that only have one, e.g. http://en.wikipedia.org/wiki/Prime_meridian
I'd argue that the prime meridian does not have a location. And I think it would be bad to make longitude or latitude optional in a geo-coordinate data type.
What about the North and South Poles?
What I wanted to say. Additionally, in some cases historical units are not accurate or accurately known, so possibly we won't even be able to make the conversion.
I don't think we can sensibly support historical units with unknown conversions, because they cannot be compared directly to SI units. So, they couldn't be used to answer queries, can't be converted for display, etc - they aren't units in any sense the software can understand. This is a solvable problem, but would add a tremendous amount of complexity.
Won't we need lots of units that are not SI units (e.g. base pairs, IQ points, Scoville heat units, $ and €) and can't readily be translated into them? Why would historical units with unknown conversions pose any more problem than these?
Avenue
On 19.12.2012 15:26, Avenue wrote:
What about the North and South Poles?
I'm sure standard coordinate systems have a convention for representing them.
Won't we need lots of units that are not SI units (e.g. base pairs, IQ points, Scoville heat units, $ and €) and can't readily be translated into them? Why would historical units with unknown conversions pose any more problem than these?
These all pose the same problems, correct. At the moment, I'm very unsure about how to accommodate these at all. Maybe we can have them as "custom units", which are fixed for a given property and cannot be converted.
-- daniel
These all pose the same problems, correct. At the moment, I'm very unsure about how to accommodate these at all. Maybe we can have them as "custom units", which are fixed for a given property and cannot be converted.
I think the proposal to use Wikidata items for the units (that is, both base and derived SI units as well as Imperial/US customary units) is the most sensible.
Let people use the units they need. Then write software that picks up the units that people use (after verifying common and correct use) by means of their Wikidata item ID. With successive versions of Wikidata, pick up more and more of these and make them available for conversion.
This way Wikidata will become what is needed.
I fear the present discussion is about anticipating the needs of the coming years and not allowing any data into Wikidata that has not been foreseen.
There may be a way for Wikidata to have enough artificial intelligence to predict which unit prefixes are usable in common versus scientific topics, and which units shall be used: where megaton is used (TNT of atomic bombs) and where "10^x ton" is preferred (shipping); and that the base unit for weight is the kilogram, but for gold in a certain value range the ounce may be preferred, and gemstones and pearls in carat (http://en.wikipedia.org/wiki/Carat_(unit) ).
But I believe forcing Wikidata to solve that problem first and ignoring the wisdom of the users is the wrong path.
Modelling Wikidata on the feet-versus-meter and Fahrenheit-versus-Celsius problem, where US citizens have a different personal preference, is misleading. The issue is much more complex.
Gregor
On 19/12/12 12:23, Daniel Kinzler wrote:
On 19.12.2012 08:34, Nikola Smolenski wrote:
What I wanted to say. Additionally, in some cases historical units are not accurate or accurately known, so possibly we won't even be able to make the conversion.
I don't think we can sensibly support historical units with unknown conversions, because they cannot be compared directly to SI units. So, they couldn't be used to answer queries, can't be converted for display, etc - they aren't units in any sense the software can understand. This is a solvable problem, but would add a tremendous amount of complexity.
Ah, but they could still be meaningfully compared to each other. And if an approximate conversion is known, this could still be used to make the conversion, so that the measure is converted and its uncertainty increased.
Just throwing more info here: there might also be cases where we could have multiple competing conversions. Somewhat similar to units, something that I would very much like to see is comparison of various monetary values, adjusted for inflation or exchange rate. But then you would have various estimates of inflation by various bodies and you might want to compare by either of them (or a combination of them?).
On 19/12/12 15:33, Nikola Smolenski wrote:
On 19/12/12 12:23, Daniel Kinzler wrote:
I don't think we can sensibly support historical units with unknown conversions, because they cannot be compared directly to SI units. So, they couldn't be used to answer queries, can't be converted for display, etc - they aren't units in any sense the software can understand. This is a solvable problem, but would add a tremendous amount of complexity.
Ah, but they could still be meaningfully compared to each other. And if an approximate conversion is known, this could still be used to make the conversion, so that the measure is converted and its uncertainty increased.
Just throwing more info here: there might also be cases where we could have multiple competing conversions. Somewhat similar to units, something that I would very much like to see is comparison of various monetary values, adjusted for inflation or exchange rate. But then you would have various estimates of inflation by various bodies and you might want to compare by either of them (or a combination of them?).
Appropriate conversion might also depend on the item in question. For example, old censuses sometimes measure population not in people but in households. In some cases we might have the idea of how large a household is to give estimate of the population.
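A sketch of how such an approximate conversion could widen the uncertainty, with invented numbers for the household example:

```python
# Sketch: converting a census count with an approximate factor, widening
# the uncertainty in the process. All numbers are illustrative.
def approx_convert(value: float, rel_uncertainty: float,
                   factor: float, factor_rel_uncertainty: float):
    new_value = value * factor
    # To first order, relative uncertainties add when multiplying.
    new_rel = rel_uncertainty + factor_rel_uncertainty
    return new_value, new_value * new_rel

# Census: 10,000 households known to 1%; roughly 4.5 people per household, +/- 20%.
people, margin = approx_convert(10_000, 0.01, 4.5, 0.20)
print(f"{people:.0f} +/- {margin:.0f} people")  # 45000 +/- 9450 people
```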
I don't think we can sensibly support historical units with unknown conversions, because they cannot be compared directly to SI units. So, they couldn't be used to answer queries, can't be converted for display, etc - they aren't units in any sense the software can understand. This is a solvable problem, but would add a tremendous amount of complexity.
I get the feeling that I might be the only person on this thread who doesn't have a maths/sciences/computers background. I'm going to be frank here: We need to snap out of the mindset that all of the data we're collecting is going to be easily expressible using modern scientific units and methodologies. If we try to cram everything into a small number of common units, without giving users some method of expressing non-standard/uncommon/non-scientific values, we're going to have a massive database that is at best cumbersome and at worst useless for a great deal of information. Traditional Chinese units of measurement [1] have changed their actual value over time. A "li" in one century is not as long as it is in another century, and while there is a li-to-SI conversion, it's artificial; when we try to use the modern li to measure something, we get a different value for that thing than the historically documented li value states it should be.
There is a balance. The more flexible the parameters, the easier it is to put data in, but the harder it is for computers to make useful connections with it. I'm not sure how to handle this, but I am sure that we can't just keep pretending that all of the data we're going to collect falls nicely into the metric system. Reality just doesn't work that way, and for Wikidata to be useful, we can't discount data that doesn't fit in the mold of modern units.
Sven
[1] http://en.wikipedia.org/wiki/Chinese_units_of_measurement
On 19.12.2012 16:57, Sven Manguard wrote:
There is a balance. The more flexible the parameters, the easier it is to put data in, but the harder it is for computers to make useful connections with it. I'm not sure how to handle this, but I am sure that we can't just keep pretending that all of the data we're going to collect falls nicely into the metric system. Reality just doesn't work that way, and for Wikidata to be useful, we can't discount data that doesn't fit in the mold of modern units.
First off: our target use case is Wikipedia infoboxes. Do you have examples and numbers about the usage of such ancient units in infoboxes on wikipedia? If they are not in main stream use there, I don't see why Wikidata would have to support them.
Though, of course, it would be nice to support them.
Basically, what I envision is this: for every property with the data type "quantity", you can specify a default unit. That can be kilogram or pascal or kilometer, or li or ancient Egyptian foot (I'll leave the question of how to define such units for later). The default unit implicitly defines the dimension that is measured by the property (length, weight, charge, whatever) and thereby implies the base unit to be used internally (kg, m, etc).
Some units have derived units (km derives from m, g derives from kg, etc) and support automatic conversion to different units of the same dimension (meter to foot, kg to stone, whatever).
Some, like li, will not have a conversion to the metric system defined. You can still use them, but you cannot compare them to values in another system of measurement.
-- daniel
PS: the above reflects my personal ideas on how to best do this, this is not a finished design and wasn't discussed with the rest of the wikidata team.
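To illustrate the design sketched above (with the same caveat as the PS: this is not an agreed implementation), a unit could carry a dimension and, where known, a linear factor to the base unit; units like li simply omit the factor and therefore cannot be converted:

```python
# Sketch: units with a dimension and an optional linear factor to the base unit.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Unit:
    name: str
    dimension: str                   # "length", "mass", ...
    to_base: Optional[float] = None  # factor to the base unit; None if unknown

METRE = Unit("metre", "length", 1.0)
FOOT = Unit("foot", "length", 0.3048)
LI = Unit("li", "length", None)      # no agreed conversion defined

def convert(value: float, src: Unit, dst: Unit) -> float:
    if src.dimension != dst.dimension:
        raise ValueError("different dimensions")
    if src.to_base is None or dst.to_base is None:
        raise ValueError(f"no conversion defined for {src.name} or {dst.name}")
    return value * src.to_base / dst.to_base

print(convert(100, FOOT, METRE))  # 30.48
# convert(1, LI, METRE) raises: values in li stay comparable only to other li values.
```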
On 20/12/12 11:26, Daniel Kinzler wrote:
First off: our target use case is Wikipedia infoboxes. Do you have examples and numbers about the usage of such ancient units in infoboxes on wikipedia? If they are not in main stream use there, I don't see why Wikidata would have to support them.
I don't think old units are used a lot in Wikipedia infoboxes. But one use case is digitization and importing of various datasets, and in some of them old units are used and will be needed.
I really, really hope that this isn't the mindset of the development team as a whole. If so, my confidence in the viability of Wikidata would take a major hit.
Yes, collecting the information that goes into infoboxes is going to be important, and yes, centralizing that information so that it can be used by all projects is a worthwhile initial goal. It's not the only thing this project is ever going to be used for, though. To say that things that aren't currently in infoboxes aren't worth supporting is, quite frankly, a really awful artificial limit on the usefulness of the project.
First off, 'what is and is not a field in a Wikipedia infobox' is a metric that changes over time, and often in large or unpredictable ways. Entire infoboxes have been created and deprecated, to say nothing of individual fields in those infoboxes. Someone might come along tomorrow and say "Yes, in fact, we should include the dimensions of historical buildings in their historical units" or "Yes, in fact, we should list both Old Style and New Style dates [1] where applicable". We're then going to be in the position where Wikidata doesn't have the information that people want to include. Had we allowed those things in from the beginning, and properly supported them, we might have had the information ready when it was asked for. If we only add new fields when people request them, those fields won't be ready until long after they're needed.
But more importantly, Wikidata is eventually going to be used for things other than Wikipedia infoboxes. Those uses are going to happen both on Wikipedia and off, and some of those uses are impossible to envision now. We should focus on collecting as much useful data as we can; not just what's in an infobox today, but what might be in an infobox, or the body text of an article, tomorrow.
Please don't sell out Wikidata's future utility for today's convenience.
Sven
[1] http://en.wikipedia.org/wiki/Old_Style_and_New_Style_dates
On Thu, Dec 20, 2012 at 5:26 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
First off: our target use case is Wikipedia infoboxes. Do you have examples and numbers about the usage of such ancient units in infoboxes on wikipedia? If they are not in main stream use there, I don't see why Wikidata would have to support them.
Hi,
* Time:
Would it make sense to use time periods instead of partial datetimes with lower precision levels? Instead of using May 1918 as birth date, it would be something like "birth date in the interval 01.05.1918 - 31.05.1918". This does not necessarily need to be reflected in the UI, of course; it could still allow the "leave the field you don't know blank" way. This would allow the value to be 2nd-5th century AD (01.01.100 - 31.12.400). Going with this idea all the way, datetimes at the highest precision level could be handled as periods too, just as zero-length periods.
Another question that popped up is how to represent (or whether it should be represented) times where, for example, the hour is known but the day isn't. If it is known someone was killed at noon, but not the specific day.
Friedrich
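A minimal sketch of this interval reading of partial dates, using Python's datetime purely for illustration (not a proposed storage format):

```python
# Sketch: a partial date expands to a closed interval; a fully specified
# date is just a zero-length period.
import calendar
from datetime import date
from typing import Optional, Tuple

def as_interval(year: int, month: Optional[int] = None,
                day: Optional[int] = None) -> Tuple[date, date]:
    if month is None:                 # "1918"
        return date(year, 1, 1), date(year, 12, 31)
    if day is None:                   # "May 1918"
        last = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last)
    d = date(year, month, day)        # exact date: zero-length period
    return d, d

print(as_interval(1918, 5))  # (1918-05-01, 1918-05-31)
```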
Thank you for your comments, Friedrich.
It would be possible and very flexible, and certainly more powerful than the current system. But we would lose the convenience of having one date, which we need for query answering (or we could default to the lower or upper bound, or the middle, but all of these are a bit arbitrary).
The other option would be, as discussed in the answer to Marco, to use one date and an uncertainty, probably an uncertainty with a unit (and probably different lower and upper bounds). This would make it more consistent with the way numbers are treated.
I am starting to think that the additional complexity of this solution might be warranted.
How about this:
- Values default to a non-range value
- You can click a checkbox that says "range" to turn the input into a range value instead
- An entry can only be represented by either a non-range or a range number, not both
This resolves our issue with query answering:
Query: When was XXX invented?
Non-range answer: June 1988
Range answer: sometime between May 1988 and October 1989
Does that work?
Sven M
Denny,
could you maybe elaborate on what you mean by query answering? Are you talking about some technical aspect of the wiki software?
thanks,
On 18.12.2012 17:08, Sven Manguard wrote:
How about this:
- Values default to a non-range value
- You can click a checkbox that says "range" to turn the input into a range value instead
- An entry can only be represented by either a non-range or a range number, not both
Please be careful not to mix ranges as actual data values with ranges as accuracies:
There could be an "in-office" property which takes a range. The beginning and end of the range may each have an accuracy (for example, for some medieval king, we may have an exact date of death (and end of office), but only a vague idea of the coronation date).
This is conceptually different from a property that has a specific date/time as a value, which is only known with some given level of accuracy, which may be expressed as a range.
Now, I don't think we need or want ranges as a data type at all (better have separate properties for the beginning and end). But I still think the above distinction is an important one.
-- daniel
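A sketch of this distinction with hypothetical record shapes: a reign genuinely has two date values, each of which may individually be uncertain, while an uncertain date is one value whose accuracy happens to be expressed as an interval (the spreads below are invented for illustration):

```python
# Sketch: range-as-accuracy versus two separate, possibly uncertain values.
from dataclasses import dataclass
from datetime import date

@dataclass
class UncertainDate:   # ONE value, imprecisely known
    value: date
    earliest: date
    latest: date

@dataclass
class Reign:           # TWO values, each possibly uncertain
    start: UncertainDate
    end: UncertainDate

# Exact death date, vaguer coronation date (spread invented):
coronation = UncertainDate(date(1154, 12, 19), date(1154, 12, 1), date(1154, 12, 31))
death = UncertainDate(date(1189, 7, 6), date(1189, 7, 6), date(1189, 7, 6))
reign = Reign(start=coronation, end=death)
```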
Now, I don't think we need or want ranges as a data type at all (better have separate properties for the beginning and end).
I am afraid this will then put a heavy burden on users to enter, proofread, and output values. Data input becomes dispersed, because a value like "18-25 cm length" has to be split and entered separately. You then have to write a custom output for each property, and do all the query logic (> lower, < upper) for each property in each Wikipedia client.
I believe this is something that is healthy to do centrally. I believe the concept of intervals exists because of that.
Gregor
Denny,
Thanks for this.
Are there ways to structure this geolocation data now to anticipate more 'fluid' uses of it, say 5 or 10 years from now, or beyond, in representing water, or astronomical processes, in something like interactive, realistic models of the earth or the universe, which would also be useful to Wikipedia's developing goals / mission?
Scott
The great thing about MediaWiki is that we don't have to anticipate new features; we can build them in later, when we discover that they're possible and that they're wanted. In fact, there's no requirement that the Wikidata developers are even the ones who develop said hypothetical future modules. If you can code, you could build them and offer them up for integration.
All that being said, there are already websites that map out astronomical features in a geolocation-like way. It's worthwhile to consider supporting that type of geolocation data on Wikidata.
Sven
Sven,
To complement your thought, and for a historical perspective influencing this discussion: UNIX has many plusses and minuses informed by the computer science of the 1950s and 1960s, and it would be great to anticipate as many 'plusses' as possible, and as sophisticatedly as possible, in the development of Wikidata et al. in the 2010s, especially in terms of values such as geolocation - for example, with both virtual worlds and brainwave headsets (so, neural interfaces ... e.g. see Tan Le's TED Talk), but in many other ways as well, as I see it - which is why Denny is asking.
Scott
On 18.12.2012 17:57, Gregor Hagedorn wrote:
Now, I don't think we need or want ranges as a data type at all (better have separate properties for the beginning and end).
I am afraid this will then put a heavy burden on users to enter, proofread, and output values. Data input becomes dispersed, because a value like "18-25 cm length" has to be split and entered separately.
No, that's exactly *not* what I was talking about - that's not a range, that's an uncertain single value, specified using a range. That is, the value isn't a range; the value's accuracy is represented by a range. That's fine!
What I don't want is having a data type "range" for a value. That would make things complicated, because each "edge" of the range may have an accuracy. In fact, in some cases, the different edges may have different sources and qualifiers. So it makes more sense to have separate properties for them. I can't think of an example where that wouldn't feel natural.
-- daniel
It would be possible and very flexible, and certainly more powerful than the current system. But we would lose the convenience of having one date, which we need for query answering (or we could default to the lower or upper bound, or the middle, but all of these are a bit arbitrary).
I believe it would be more profitable to build a query system which always queries for the range. This would work for interval-only values (see my comment on the wiki page) as well as for values with an interval.
I don't see this as a big overhead. It is more a problem for ordering, but internally, wikidata could store a "midpoint" value for intervals where no explicit central value is given, and use these for ordering purposes.
I think it would be great if the system is consistent for quantities, dates, geographical longitude/latitude, etc.
Gregor
On 18.12.2012 17:52, Gregor Hagedorn wrote:
It would be possible and very flexible, and certainly more powerful than the current system. But we would lose the convenience of having one date, which we need for query answering (or we could default to the lower or upper bound, or the middle, but all of these are a bit arbitrary).
I believe it would be more profitable to build a query system which always queries for the range. This would work for interval-only values (see my comment on the wiki page) as well as for values with an interval.
I'm getting the impression that we mean the same thing, but are talking about it in different terms.
I think it would be bad to have ranges/intervals as values. They are hard to handle, misleading and easy to abuse.
I think however it would be great to have values (magnitudes) with manifest accuracy values, which can be represented as a range (or, ideally, a gamma distribution). Reading your comments, I'm getting the impression that this is basically what you want.
I don't see this as a big overhead. It is more a problem for ordering, but internally, wikidata could store a "midpoint" value for intervals where no explicit central value is given, and use these for ordering purposes.
Well, I would call that "midpoint" simply "the value", and the range would be the accuracy. There's an important conceptual distinction here to having ranges as actual values.
I think it would be great if the system is consistent for quantities, dates, geographical longitude/latitude, etc.
Indeed.
-- daniel
I don't see this as a big overhead. It is more a problem for ordering, but internally, wikidata could store a "midpoint" value for intervals where no explicit central value is given, and use these for ordering purposes.
Well, I would call that "midpoint" simply "the value", and the range would be the accuracy. There's an important conceptual distinction here to having ranges as actual values.
Can this conceptually distinguish between a meaningful midpoint value, and one that is useful for ordering, but has no meaning and should not be displayed as a result value? See the examples on
https://meta.wikimedia.org/wiki/Talk:Wikidata/Development/Representing_value...
Gregor
PS: With accuracy you introduce a new concept here which was not in the representing values paper (see http://en.wikipedia.org/wiki/Accuracy_and_precision). This is different from a confidence interval ("uncertainty"), where it is not yet decided whether the value indicates accuracy or dispersion. A confidence interval is a measure of accuracy only if the sample measurements are normally distributed and no systematic bias exists. --- I believe it is important that Wikidata is flexible enough that it can capture both, especially because in many cases dispersion is used as a rough estimate for otherwise unknown accuracy, and since in many cases there is no "true single value" and the dispersion is systematic (see e.g. the car model length example).
Thanks for this Denny.
Time:
Historians **need** to be able to have date ranges of some sort. They also need to express confidence in non-numerical terms. Take, for example, the invention of gunpowder in China. Not only do several major historians give entirely different ranges (which would, of course, be treated as different line items anyway), but the premier authorities all give date ranges. Some will say things like "between XXX and YYY", which requires date ranges, while others say "around ZZZ", which requires some way to represent "about". As to the first issue, we could try pairing entries, as we are already likely to do for things like "reign start date/reign end date", but that would be clumsy and very easily broken. As for the latter, I'm really not sure what the proper solution is. I am sure, though, that if a historian says "about 850" and we put in "850", we're going to be **wrong** and that's going to be **bad data**. Additionally, unless the historian gives a range himself, we can't say "850 +/- 25" or some such thing. That would also be wrong.
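One hypothetical way to cover both cases ("between X and Y" with a source-given range, and "around Z" with no range at all), in the same PHP sketch style as above, so that "about 850" is never silently flattened into a bare 850 or a fabricated +/- range:

<?php
// Hypothetical sketch: a date value that records *how* the source
// qualified it, instead of inventing precision the source never gave.
class HistoricalDate {
    public const EXACT   = 'exact';   // source gives a plain date
    public const BETWEEN = 'between'; // source gives an explicit range
    public const CIRCA   = 'circa';   // source says "about", no range given

    public function __construct(
        public string $kind,
        public int $year,              // central or sole year
        public ?int $lowerYear = null, // only set for BETWEEN
        public ?int $upperYear = null
    ) {
    }
}

$gunpowderA = new HistoricalDate( HistoricalDate::BETWEEN, 850, 800, 900 );
$gunpowderB = new HistoricalDate( HistoricalDate::CIRCA, 850 ); // no fabricated +/- 25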
Geo:
I can definitely see how altitude would be good for things like a rest lodge halfway up a mountain or a shipwreck below sea level. I'm not sure any of the map makers can handle altitude right now; as far as I know, things like OpenStreetMap and Google Maps are two-dimensional maps with 'fake' three-dimensional protrusions. That being said, I think we should build the feature in and trust that the map-making companies will eventually figure out what to do with it. Google is probably crazy enough to mount cameras and GPS software on sherpas and send them up mountains if they think that maps accounting for altitude are something that could, erm, sell.
Units:
Not sure I understand the post, but I might. I advocate storing the unit translations on some page in the (already automatically fully protected) MediaWiki namespace, and handling conversions on this project before the data is sent out to client projects. The reason is that it makes adoption by (non-WMF) end users much, much easier. It's not like the conversions are a subject of debate.
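What such a community-maintained conversion table could look like, sketched hypothetically in PHP (the page name and the parsing step are left out; only the data shape and the server-side conversion are shown):

<?php
// Hypothetical sketch: linear unit conversions kept as data (e.g. parsed
// from a protected MediaWiki-namespace page) and applied on the server
// before values are sent out to client projects.
$conversions = [
    // unit => factor to the base unit of its dimension (metre for length)
    'metre' => 1.0,
    'foot'  => 0.3048,
    'mile'  => 1609.344,
];

function convert( float $amount, string $from, string $to, array $table ): float {
    return $amount * $table[$from] / $table[$to];
}

echo convert( 1.0, 'metre', 'foot', $conversions ); // 3.2808398950131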
On Tue, Dec 18, 2012 at 9:29 AM, Denny Vrandečić <denny.vrandecic@wikimedia.de> wrote:
Thanks for the input so far. Here are a few explicit questions that I have:
- Time: right now the data model assumes that the precision is given on
the level "decade / year / month" etc., which means you can enter a date of birth like 1435 or May 1918. But is this sufficient? We cannot enter a value like 2nd-5th century AD (we could enter 1st millenium AD, which would be a loss of precision).
- Geo: the model assumes latitude, longitude and altitude, and defines
altitude as over mean sea level (simplified). Is altitude at all useful? Should it be removed from Geolocation and be moved instead to a property called "height" or "altitude" which is dealt with outside of the geolocation?
- Units are currently planned to be defined on the property page (as it is
done in SMW). So you say that the height is measured in Meter, which corresponds to 3.28084 feet, etc. Wikidata would allow defining linear translations within the wiki, and this can thus be done by the community. This makes everything a bit more complicated -- one could also imagine defining all dimensions and units in PHP and then having the properties reference the dimensions. Since there are only a few hundred units and dimensions, this could be viable.
(Non-linear transformations -- most notoriously temperature -- will get its own implementation anyway)
Opinions?
2012/12/17 Denny Vrandečić denny.vrandecic@wikimedia.de
As Phase 2 is progressing, we have to decide on how to represent data values.
I have created a draft for representing numbers and units, points in time, and locations, which can be found here:
https://meta.wikimedia.org/wiki/Wikidata/Development/Representing_values
including a first suggestion on the functionality of the UI which we would be aiming at eventually.
The draft is unfortunately far from perfect, and I would very welcome comments and discussion.
We probably will implement them in the following order: geolocation, date and time, numbers.
Cheers, Denny
my 2¢
On 18.12.2012 15:29, Denny Vrandečić wrote:
Thanks for the input so far. Here are a few explicit questions that I have:
- Time: right now the data model assumes that the precision is given on the
level "decade / year / month" etc., which means you can enter a date of birth like 1435 or May 1918. But is this sufficient? We cannot enter a value like 2nd-5th century AD (we could enter 1st millenium AD, which would be a loss of precision).
We need a system for expressing accuracy anyway, for empirical values (measurements). I suggest applying the same concepts (maybe even the same code) to dates. We'll need to be able to express accuracy in several ways, e.g. as a standard deviation, a +/- % range, a gamma distribution, etc. Simply specifying a range such as "+/- 3600 seconds" to indicate a precision level in the hour range would fit in, I think.
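Encoding the precision level as such a symmetric tolerance might look like this (hypothetical PHP sketch, same caveats as above):

<?php
// Hypothetical sketch: one accuracy mechanism shared by measurements and
// dates; "precise to the hour" becomes a +/- 3600 second tolerance.
class TimeValue {
    public function __construct(
        public int $timestamp,        // point in time, in seconds (e.g. 2012-12-18 16:00 UTC)
        public int $toleranceSeconds  // +/- range expressing the precision level
    ) {
    }
}

$hourPrecision = new TimeValue( 1355846400, 3600 );     // known to the hour
$yearPrecision = new TimeValue( 1355846400, 15778800 ); // +/- half a Julian year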
- Geo: the model assumes latitude, longitude and altitude, and defines altitude
as over mean sea level (simplified). Is altitude at all useful? Should it be removed from Geolocation and be moved instead to a property called "height" or "altitude" which is dealt with outside of the geolocation?
I'd include altitude, but make it optional (while long/lat are mandatory). Expressing accuracy here can and should work similarly to "normal" (1D) measurements and dates, usually given as a +/- range.
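Making altitude optional while keeping latitude/longitude mandatory is straightforward to express (hypothetical sketch; the coordinates below are merely illustrative):

<?php
// Hypothetical sketch: latitude/longitude mandatory, altitude optional,
// each with its own +/- accuracy as for "normal" 1D measurements.
class GeoValue {
    public function __construct(
        public float $latitude,
        public float $longitude,
        public float $accuracy,                // +/- degrees
        public ?float $altitude = null,        // metres above mean sea level
        public ?float $altitudeAccuracy = null // +/- metres
    ) {
    }
}

$restLodge = new GeoValue( 27.96, 86.93, 0.0005, 5364.0, 10.0 ); // altitude known
$cityOnly  = new GeoValue( 52.52, 13.405, 0.01 );                // no altitude recorded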
- Units are currently planned to be defined on the property page (as it is done
in SMW). So you say that the height is measured in Meter, which corresponds to 3.28084 feet, etc. Wikidata would allow defining linear translations within the wiki, and this can thus be done by the community. This makes everything a bit more complicated -- one could also imagine defining all dimensions and units in PHP and then having the properties reference the dimensions. Since there are only a few hundred units and dimensions, this could be viable.
I don't like the idea of defining units on the property pages at all. For one thing, it means a lot of overhead when introducing new units, or at least when standardizing the names of units (should that be "ton" or "metric ton"?...). It also makes it hard to compare values of different properties (say, comparing average life expectancy to maximum life span or some such).
At least the base units (SI units) should be implemented in software. Derived units could be defined on the wiki, but on their own pages, not for each property.
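That split could be sketched like this (hypothetically): the SI base units fixed in software, and derived units defined as data on their own pages, each referencing a base unit and a linear factor.

<?php
// Hypothetical sketch: SI base units hard-coded; derived units are data
// (editable on their own wiki pages) pointing at a base unit and a factor.
const BASE_UNITS = [ 'metre', 'kilogram', 'second', 'ampere', 'kelvin', 'mole', 'candela' ];

// Community-defined derived units: name => [ base unit, linear factor ]
$derivedUnits = [
    'foot'       => [ 'metre', 0.3048 ],
    'metric ton' => [ 'kilogram', 1000.0 ],
    'hour'       => [ 'second', 3600.0 ],
];

function toBase( float $amount, string $unit, array $derived ): array {
    if ( in_array( $unit, BASE_UNITS, true ) ) {
        return [ $amount, $unit ];
    }
    [ $base, $factor ] = $derived[$unit];
    return [ $amount * $factor, $base ];
}

print_r( toBase( 2.5, 'metric ton', $derivedUnits ) ); // Array ( [0] => 2500 [1] => kilogram )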
(Non-linear transformations -- most notoriously temperature -- will get its own implementation anyway)
A far trickier conversion would be between different reference globes for coordinates. But that wouldn't be handled as a "unit", I suppose.
-- daniel