On 19.12.2012 15:12, Gregor Hagedorn wrote:
I see no point in storing the unit used for input.
I think you plan to store the unit (which would be meter), so you don't want to store prefixes, correct?
Please argue why you don't see a point. You want to store the size of the universe, the distance to New York, and the size of the proton all in "meter"?
Yes. Otherwise, they would not be comparable in a database query. It's generally a good idea to normalize all values of a given dimension to use the same unit, for the same reason that it's best to convert dates in a database to UTC, instead of storing time zone info for each entry.
If not, with which algorithm will you restore the SI prefix, or rather, recognize which SI prefix is usable? We do not use Mm in common language, so we do give the circumference of the earth as roughly 40 000 km and not as 40 Mm. We don't write 4*10^7 m either.
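To illustrate the normalization argument, here is a minimal sketch. The conversion table and the name to_metres are made up for this example; nothing here is an existing Wikidata API.

```python
from decimal import Decimal

# Hypothetical conversion table: factor from each input unit to metres.
TO_METRES = {
    "nm": Decimal("1e-9"),
    "mm": Decimal("1e-3"),
    "m": Decimal("1"),
    "km": Decimal("1e3"),
    "ft": Decimal("0.3048"),
}

def to_metres(value: str, unit: str) -> Decimal:
    """Convert a value entered in `unit` to metres (the one stored unit)."""
    return Decimal(value) * TO_METRES[unit]

# Values entered in different units...
rows = [
    ("Earth circumference", "40000", "km"),
    ("a door", "2000", "mm"),
    ("a tower", "1000", "ft"),
]

# ...become directly comparable once they all use the same unit.
normalised = sorted((to_metres(value, unit), name) for name, value, unit in rows)
for metres, name in normalised:
    print(name, metres, "m")
```

Once everything is in metres, sorting and range queries work without looking at the unit at all, which is the same benefit you get from storing timestamps in UTC.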
This would be done based on the accuracy. If the accuracy is ~1000m, the heuristic would decide that km is probably a good unit, with no digits after the decimal point. Besides that, the desired unit and precision can be specified when rendering values on a wiki page.
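Roughly, such a heuristic could look like the sketch below. The prefix list and the function name render are assumptions for illustration only, and a real implementation would need more care (other dimensions, rounding policy, user overrides).

```python
# Prefixes in common use for lengths, largest first. "Mm" is left out on
# purpose, since nobody writes 40 Mm for the circumference of the earth.
PREFIXES = [(1e3, "km"), (1.0, "m"), (1e-3, "mm"), (1e-6, "µm"), (1e-9, "nm")]

def render(value_m: float, accuracy_m: float) -> str:
    """Pick a display unit for a length stored in metres.

    Heuristic: use the largest common prefix whose scale does not exceed
    the accuracy, and show no digits after the decimal point.
    """
    for scale, symbol in PREFIXES:
        if scale <= accuracy_m:
            return f"{value_m / scale:.0f} {symbol}"
    # Accuracy finer than the smallest prefix listed: fall back to it.
    scale, symbol = PREFIXES[-1]
    return f"{value_m / scale:g} {symbol}"

print(render(40_075_017, 1000))  # accuracy ~1000 m, so km with no decimals
print(render(1.8, 0.001))        # millimetre accuracy, so mm
```

This only covers the default; an explicit unit and precision given on the wiki page would bypass the heuristic entirely.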
It is probably necessary to store the number of significant decimals.
Yes, that *is* the accuracy value I mean.
That's how Denny proposed to calculate the default accuracy. If the accuracy is given by a complex model (e.g. a gamma distribution), then it might be handy to have a simple value that tells us the significant digits.
Hm... perhaps it's best to always express accuracy as "+/-n", and allow for more detailed information (standard deviation, whatever) as *additional* information about the accuracy (could be modelled as a qualifier internally).
I fear those are two separate levels of precision for expressing measurement _precision_ (I believe "accuracy" is the wrong term here; precision and accuracy are related but distinct concepts).
Ok, there's some terminology confusion here. I'm using "accuracy" to refer to the accuracy of measurement (e.g. standard deviation), and "precision" to refer to the precision of presentation (e.g. significant digits). We need these two things at least, and words for them. I don't care much which words we use.
So 4.10 means that the last digit is significant, i.e. the best estimate is at least between 4.095 and 4.105 (but it may be better). 4.10 +/- 0.005 means it is precisely between 4.095 and 4.105, as opposed to 4.10 +/- 0.004, 4.10 +/- 0.003, 4.10 +/- 0.002 etc.
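The reading of "4.10" as "somewhere between 4.095 and 4.105" can be sketched as follows. The function name implied_interval is hypothetical; the point is only that the interval has to be derived from the input string, before the trailing zero is lost.

```python
from decimal import Decimal

def implied_interval(text: str) -> tuple[Decimal, Decimal]:
    """Return the (low, high) interval implied by the digits as written.

    The default uncertainty is half a unit in the last written digit,
    so "4.10" implies +/- 0.005 and "4.1" implies +/- 0.05.
    """
    value = Decimal(text)
    exponent = value.as_tuple().exponent   # -2 for "4.10", -1 for "4.1"
    half_unit = Decimal(5).scaleb(exponent - 1)
    return value - half_unit, value + half_unit

print(implied_interval("4.10"))
print(implied_interval("4.1"))
```

Note that this cannot tell whether the trailing zeros of an integer such as "40000" are significant; that still needs an explicit "+/-n" from the user.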
Yes, all this should be handled by the component responsible for parsing user input for quantity values.
Furthermore, a quantity may be given as 4.10-4.20-4.35. The precision of measurement and the measure of variance and dispersion are separate concepts.
True, though I think that we can lump them together for the sake of comparison, picking appropriate units, etc. The finer points can be handled by qualifiers (of the value, or even of the accuracy).
No, it would return using whatever system of measurement the user has selected in their preferences.
Then you have lost the information. There is no "user selection" in this in science.
I have lost the information which unit was used when originally entering the data. That's it.
Complex heuristics may "guess" when to use the scientific SI prefixes instead. The trailing zero cannot be reproduced, however, when relying completely on IEEE floating-point.
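The trailing-zero problem is easy to demonstrate. The snippet below uses only Python's standard library; storing the number of decimals next to the float is just one possible workaround, shown here as an assumption rather than a decided design.

```python
from decimal import Decimal

as_float = float("4.10")
as_decimal = Decimal("4.10")

# A binary float has no notion of how many digits were written.
print(as_float)     # 4.1
# A decimal type keeps the trailing zero.
print(as_decimal)   # 4.10

# Workaround sketch: keep the float, but store the number of
# significant decimals next to it and reapply it when rendering.
stored = (as_float, 2)
print(f"{stored[0]:.{stored[1]}f}")   # 4.10
```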
We'll need heuristics to pick the correct secondary unit (e.g. nm or km). The
(I believe there is no such thing as a "secondary unit"; did you make that term up? Only "m" is a unit of measurement, the n or k are prefixes; see http://en.wikipedia.org/wiki/SI_prefix )
I made up the term, yes :) Whatever you call "km", if "m" is the real unit and "k" is the prefix.
I agree that the system should allow explicit conversion in infoboxes. I disagree that you should create an artificial intelligence system for wikidata that knows more about unit usage than the authors. To store the wisdom of authors, storing both unit and original unit prefix is necessary.
It can be stored as an auxiliary data point, that is, as a qualifier ("measured in feet"). It should not IMHO be part of the data value as such, because that would make it extremely hard to use the values in a database.
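As a rough sketch of what I mean by keeping the entered unit out of the data value: the class and field names below are invented for this mail and are not a proposal for the actual data model.

```python
from dataclasses import dataclass, field
from decimal import Decimal

@dataclass
class Quantity:
    value: Decimal        # always in the base unit, here metres
    accuracy: Decimal     # "+/-n", in the same base unit
    unit: str = "m"
    # Auxiliary information; ignored when comparing two quantities.
    qualifiers: dict = field(default_factory=dict, compare=False)

tower = Quantity(
    value=Decimal("304.8"),
    accuracy=Decimal("0.15"),
    qualifiers={"measured in": "ft"},
)
bridge = Quantity(value=Decimal("1000"), accuracy=Decimal("1"))

# Queries only look at the normalized value...
print(tower.value < bridge.value)
# ...while the original unit is still available for display.
print(tower.qualifiers["measured in"])
```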
You write "The Precision can be derived from the accuracy and vice versa, using appropriate heuristics."
I _very strongly_ doubt that. Can you give any proof of that? For precision I can use statistics; for accuracy I need an indirect, separate and precise method to estimate accuracy. If you have a laser-distance measurement device, the precision can be estimated by yourself by repeated measurements at various times, temperatures, etc. But unless you have an objective distance standard, you have no means to determine whether the accuracy of the device is always off by 10 cm because someone screwed up the software program inside the device.
Sorry, I didn't mean to say that you can calculate one from the other, but that you can make a decent guess that is useful for parsing and display. E.g. if a value is given as 5km, we can assume that it wasn't measured with millimeter precision. And if something is declared to have millimeter accuracy, it probably makes sense to use a precision of 3 or 4 digits after the decimal point when converting to feet.
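For the feet example, the guess could be computed like this. It is a sketch only: the function name metres_to_feet and the rounding rule (enough decimals to resolve the accuracy once it is expressed in feet) are my assumptions.

```python
import math

M_PER_FT = 0.3048  # exact by definition of the international foot

def metres_to_feet(value_m: float, accuracy_m: float) -> str:
    """Convert to feet, guessing the display precision from the accuracy."""
    accuracy_ft = accuracy_m / M_PER_FT
    # Number of decimals needed to resolve the accuracy, never negative.
    digits = max(0, math.ceil(-math.log10(accuracy_ft)))
    return f"{value_m / M_PER_FT:.{digits}f} ft"

print(metres_to_feet(2.0, 0.001))    # millimetre accuracy gives 3 decimals
print(metres_to_feet(5000, 1000))    # "5km" gives no decimals at all
```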
This is just heuristics, and yes, we shouldn't try to be too smart here. Just smart enough to be useful.
But they are not the same. IMHO, the accuracy should always be stored with the value, the precision never.
I fear that is a view of how data in a perfect world should be known, not a reflection of the kind of data that people need to store in Wikidata. Very often only the precision will be known or available to its authors, or worse, the source may not say which it is.
Ok, so my use of precision is not precise :) We are stumbling over words here. What would you call the level of detail used for presentation, if not "precision"?
-- daniel